Method and device for speech recognition decoding

Document No. 36621; published 2021-09-24.

Abstract (一种语音识别解码的方法及装置, "Method and device for speech recognition decoding"; created by 程高峰, 李鹏, 缪浩然, 石瑾, 张鹏远, 孙晓晨 and 颜永红 on 2021-05-28): The invention provides a method and a device for speech recognition decoding. The speech recognition decoding method comprises: determining the log-Mel spectral feature sequence corresponding to N subframes of the speech to be recognized; processing the log-Mel spectral feature sequence with a trained neural network encoder to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes; and searching for the word sequence with the highest score using a beam search algorithm, according to a predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes. Compared with a traditional speech recognition system, the present application omits the frame-level alignment step and simplifies training and decoding; compared with an end-to-end speech recognition system, it uses the weighted finite state transducer during beam search to speed up decoding and to exploit text data beyond the training audio efficiently, so that speech recognition systems can be deployed rapidly in many domains.

1. A method of speech recognition decoding, comprising:

determining the log-Mel spectral feature sequence corresponding to N subframes of the speech to be recognized;

processing the log-Mel spectral feature sequence with a trained neural network encoder to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes;

and searching for the word sequence with the highest score using a beam search algorithm, according to a predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

2. The method of claim 1, further comprising:

establishing a second weighted finite state transducer according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols;

training a statistical language model based on N-gram rules on a text corpus training set, and establishing a third weighted finite state transducer from the statistical language model;

and composing the second weighted finite state transducer with the third weighted finite state transducer to obtain the first weighted finite state transducer.

3. The method according to claim 1 or 2, wherein searching for the word sequence with the highest score using a beam search algorithm according to the predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes comprises: searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

4. The method according to claim 3, wherein searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes specifically comprises:

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, binding a preset first token to the initial node, and adding the first token to the first set corresponding to the 1st subframe;

A2: when 1 < t < N, taking out a second token stored in the first set corresponding to the (t-1)-th frame; copying the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulating the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, deleting the third token; otherwise, saving the third token in a second set;

A3: judging whether the input label on the current transition edge is the epsilon symbol; if so, executing A2; otherwise, executing A4;

A4: accumulating the emission probability P corresponding to the t-th frame into the third token, and deleting the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, executing A5; otherwise, executing A2;

A5: selecting the top-K third tokens by score from the tokens stored in the second set, and storing them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame;

A6: when t = N, selecting the token with the highest score in the first set, and backtracking the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

5. An apparatus for speech recognition decoding, comprising:

a feature extraction module, configured to determine the log-Mel spectral feature sequence corresponding to N subframes of the speech to be recognized;

a neural network encoder module, configured to process the log-Mel spectral feature sequence to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes;

and a recognition module, configured to search for the word sequence with the highest score using a beam search algorithm according to the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

6. The apparatus of claim 5, further comprising an obtaining module configured to:

acquire a second weighted finite state transducer according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols; train a statistical language model based on N-gram rules on a text corpus training set, and acquire a third weighted finite state transducer from the statistical language model; and compose the second weighted finite state transducer with the third weighted finite state transducer to obtain the first weighted finite state transducer.

7. The apparatus according to claim 5 or 6, wherein the recognition module being configured to search for the word sequence with the highest score using a beam search algorithm according to a predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes comprises: searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

8. The apparatus of claim 7, wherein the recognition module is configured to perform the following steps:

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, binding a preset first token to the initial node, and adding the first token to the first set corresponding to the 1st subframe;

A2: when 1 < t < N, taking out a second token stored in the first set corresponding to the (t-1)-th frame; copying the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulating the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, deleting the third token; otherwise, saving the third token in a second set;

A3: judging whether the input label on the current transition edge is the epsilon symbol; if so, executing A2; otherwise, executing A4;

A4: accumulating the emission probability P corresponding to the t-th frame into the third token, and deleting the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, executing A5; otherwise, executing A2;

A5: selecting the top-K third tokens by score from the tokens stored in the second set, and storing them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame;

A6: when t = N, selecting the token with the highest score in the first set, and backtracking the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

9. A speech recognition system comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the speech recognition method of any one of claims 1 to 4.

10. A storage medium on which program instructions are stored, the program instructions, when executed, performing the speech recognition method of any one of claims 1 to 4.

Technical Field

The present application relates to the field of artificial intelligence, and more particularly, to a method and an apparatus for speech recognition decoding.

Background

Spoken language is one of the most natural modes of human communication. Research on computer speech covers speech coding and decoding, speech recognition, speech synthesis, speaker recognition, wake-up words, speech enhancement, and more; among these areas, speech recognition is the most actively studied. Automatic speech recognition was proposed soon after the invention of the computer, and early vocoders can be regarded as rudimentary forms of speech recognition and synthesis. After decades of research, speech recognition technology has penetrated many aspects of daily life, with applications covering smart homes, smart speakers, in-vehicle interaction, national security, and other fields.

The traditional speech recognition system is based on the classical source-channel model and consists of an acoustic model, a pronunciation dictionary, and a language model, which model phonemes, words, and sentences respectively. During decoding, a weighted finite state transducer integrates the probability distributions of the acoustic model, the pronunciation dictionary, and the language model, and the resulting network is searched for the text content with the highest probability for a given speech signal. Traditional speech recognition systems can achieve high accuracy with hundreds of thousands of hours of training data and have been widely applied in industry; but as the scale of training data has grown to tens of millions of hours, their performance has reached a bottleneck.

In recent years, sequence-to-sequence models based on deep neural networks have developed rapidly in speech and natural language processing, and end-to-end speech recognition frameworks based on encoders and decoders have been proposed and extensively validated. Research reports show that with ten million hours of training data, end-to-end speech recognition systems can outperform the traditional hybrid of deep neural networks (DNN) and hidden Markov models (HMM). End-to-end systems omit the frame-level alignment step and the pronunciation dictionary, simplifying training and decoding. However, encoder-decoder end-to-end systems have two disadvantages: first, the beam search used in decoding relies on a neural network with an autoregressive structure, so decoding is markedly slower than traditional speech recognition decoding; second, the text corpus available for training is limited to the transcripts of the labeled audio, additional text corpora cannot be used directly, and cross-domain recognition performance degrades significantly.

Disclosure of Invention

In order to solve the above problems, the present application provides a speech recognition decoding method and apparatus.

In a first aspect, the present invention provides a method for speech recognition decoding, including:

determining the log-Mel spectral feature sequence corresponding to N subframes of the speech to be recognized;

processing the log-Mel spectral feature sequence with a trained neural network encoder to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes;

and searching for the word sequence with the highest score using a beam search algorithm, according to a predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Preferably, a second weighted finite state transducer is established according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols;

a statistical language model based on N-gram rules is trained on a text corpus training set, and a third weighted finite state transducer is established from the statistical language model;

and the second weighted finite state transducer is composed with the third weighted finite state transducer to obtain the first weighted finite state transducer.

Preferably, searching for the word sequence with the highest score using a beam search algorithm according to the predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes includes: searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Preferably, searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes specifically includes:

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, binding a preset first token to the initial node, and adding the first token to the first set corresponding to the 1st subframe;

A2: when 1 < t < N, taking out a second token stored in the first set corresponding to the (t-1)-th frame; copying the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulating the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, deleting the third token; otherwise, saving the third token in a second set;

A3: judging whether the input label on the current transition edge is the epsilon symbol; if so, executing A2; otherwise, executing A4;

A4: accumulating the emission probability P corresponding to the t-th frame into the third token, and deleting the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, executing A5; otherwise, executing A2;

A5: selecting the top-K third tokens by score from the tokens stored in the second set, and storing them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame;

A6: when t = N, selecting the token with the highest score in the first set, and backtracking the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

In a second aspect, the present invention provides an apparatus for speech recognition decoding, comprising:

a feature extraction module, configured to determine the log-Mel spectral feature sequence corresponding to N subframes of the speech to be recognized;

a neural network encoder module, configured to process the log-Mel spectral feature sequence to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes;

and a recognition module, configured to search for the word sequence with the highest score using a beam search algorithm according to the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Preferably, the apparatus further comprises an obtaining module, configured to:

acquire a second weighted finite state transducer according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols; train a statistical language model based on N-gram rules on a text corpus training set, and acquire a third weighted finite state transducer from the statistical language model; and compose the second weighted finite state transducer with the third weighted finite state transducer to obtain the first weighted finite state transducer.

Preferably, the recognition module being configured to search for the word sequence with the highest score using a beam search algorithm according to a predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes includes: searching for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Preferably, the recognition module is configured to perform the following steps:

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, binding a preset first token to the initial node, and adding the first token to the first set corresponding to the 1st subframe;

A2: when 1 < t < N, taking out a second token stored in the first set corresponding to the (t-1)-th frame; copying the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulating the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, deleting the third token; otherwise, saving the third token in a second set;

A3: judging whether the input label on the current transition edge is the epsilon symbol; if so, executing A2; otherwise, executing A4;

A4: accumulating the emission probability P corresponding to the t-th frame into the third token, and deleting the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, executing A5; otherwise, executing A2;

A5: selecting the top-K third tokens by score from the tokens stored in the second set, and storing them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame;

A6: when t = N, selecting the token with the highest score in the first set, and backtracking the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

In a third aspect, the invention provides a speech recognition system comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the speech recognition method of any one of claims 1 to 4.

In a fourth aspect, the present invention provides a storage medium on which program instructions are stored, the program instructions, when executed, performing the speech recognition method of any one of claims 1 to 4.

The technical solution provided by the application omits the frame-level alignment step of the speech recognition system; the weighted finite state transducer replaces a neural network during the beam search, which speeds up decoding; and, based on a statistical language model, text data beyond the training audio is exploited efficiently, improving the performance of the end-to-end speech recognition system.

Drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present disclosure; other drawings can be derived from them by those skilled in the art without creative effort.

Fig. 1 is a schematic view of an application scenario of the technical solution provided in the embodiment of the present application;

FIG. 2 is a schematic diagram of the method provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of determining the log-Mel spectral feature sequence provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of the process of obtaining the first weighted finite state transducer provided in an embodiment of the present application;

FIG. 5 is a schematic flow chart of searching for the word sequence with the highest score provided in an embodiment of the present application;

fig. 6 is a schematic diagram of a speech recognition apparatus provided in an embodiment of the present application.

Detailed Description

The technical solution provided by the present invention is further described in detail below with reference to the accompanying drawings and embodiments.

Fig. 1 is a schematic view of an application scenario of the technical solution provided in an embodiment of the present application. As shown in fig. 1, the solution can be applied to a scenario in which a piece of speech is recognized as a word sequence by the speech recognition method.

Fig. 2 is a schematic diagram of the method of the technical solution provided in an embodiment of the present application. As shown in fig. 2, the speech recognition method proceeds as follows:

S201: determine the log-Mel spectral feature sequence corresponding to the N subframes of the speech to be recognized.

Fig. 3 is a schematic flow chart of determining the log-Mel spectral feature sequence provided in an embodiment of the present application. Referring to fig. 3, the process of determining the log-Mel spectral feature sequence includes:

S2011: pre-emphasize the audio to be recognized to boost its high-frequency part.

Specifically, pre-emphasizing the audio removes the influence of lip radiation and increases the high-frequency resolution of the speech.

S2012: frame the pre-emphasized audio, with a frame length of 25 ms and a frame shift of 10 ms.

Specifically, the speech must be Fourier-transformed, and the Fourier transform requires a stationary input signal, so the audio is divided into frames; framing ensures the short-time stationarity of the speech.

S2013: window each frame, using a Hamming window as the window function.

Specifically, each frame is multiplied by the window function before the Fourier expansion, which makes the signal more continuous at the frame boundaries and avoids the Gibbs effect. After windowing, a speech signal that is not originally periodic exhibits some of the characteristics of a periodic function.

S2014: apply a fast Fourier transform to each frame to obtain the spectrum of each frame, and obtain the energy spectrum of each frame from its spectrum.

Specifically, the fast Fourier transform converts the time-domain signal of each frame into a frequency-domain signal. Stacking the frequency-domain signals of the frames in time yields the spectrogram, and the energy spectrum of each frame is obtained from its spectrum.

S2015: pass the energy spectrum of each frame through the Mel filters and take the logarithm to obtain the log-Mel spectrum, where the number of Mel filters is 80.

Specifically, 80 Mel filters are used: the energy spectrum of each frame is passed through the 80 Mel filters to obtain 80-dimensional filtered energies; the logarithm of each frame's 80-dimensional filtered energies is taken, and the frames are stacked, yielding the 80-dimensional log-Mel spectral feature sequence of the speech to be recognized.
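Illustratively (not part of the claimed embodiments), the following is a minimal Python sketch of steps S2011-S2015 using numpy; the pre-emphasis coefficient 0.97, the 16 kHz sampling rate, and the 512-point FFT are illustrative assumptions rather than values fixed by the present application.

```python
import numpy as np

def log_mel_features(audio, sr=16000, n_fft=512, n_mels=80):
    # S2011: pre-emphasis (coefficient 0.97 is a common choice, assumed here)
    audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # S2012: 25 ms frames with a 10 ms frame shift
    flen, fshift = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(audio) - flen) // fshift
    frames = np.stack([audio[i * fshift:i * fshift + flen] for i in range(n_frames)])
    # S2013: Hamming window
    frames = frames * np.hamming(flen)
    # S2014: FFT per frame, then the energy (power) spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # S2015: 80 triangular Mel filters, then the logarithm
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, 80)
```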

S202: process the log-Mel spectral feature sequence with the trained neural network encoder to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

In some possible embodiments, the training process of the neural network encoder may be as follows:

An audio training set with labels is established. The Mel-spectral features of the audio training set are extracted by the feature extraction module and input into the neural network to obtain the training result. The neural network comprises X layers of convolutional neural networks followed by Y layers of self-attention neural networks, where X is a natural number greater than 1 and Y is a natural number greater than 1.

Specifically, an audio training set with labels is established, and the neural network encoder is trained on this training set. The Mel-spectral feature sequence of the audio training set is extracted by the signal processing and feature extraction module; illustratively, an 80-dimensional log-Mel spectral feature sequence of the training audio is extracted, denoted X = [x1, x2, …, xN]. This sequence is input into the neural network encoder.

Illustratively, the neural network encoder comprises a 2-layer convolutional neural network and a 12-layer self-attention network. The convolution kernels in the convolutional neural network have a stride of 2, and the output is a 256-dimensional convolutional feature sequence, denoted C = [c1, c2, …, cT]; the length of the convolved feature sequence is 1/4 of that of the original input feature sequence. The convolutional feature sequence is input into the 12-layer self-attention network, which outputs a 256-dimensional self-attention feature sequence, denoted H = [h1, h2, …, hT]. The self-attention feature sequence is passed through a fully connected network and a softmax network to obtain the emission probability of the character or the blank symbol corresponding to each frame's feature.
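Illustratively, the following is a minimal PyTorch sketch of an encoder with this shape; it is not the trained encoder of the present application, and the number of attention heads, the feed-forward width, and the vocabulary size are illustrative assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=5000):
        super().__init__()
        # Two convolution layers, each with stride 2: time length shrinks to 1/4
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * (n_mels // 4), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=1024, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=12)  # 12 self-attention layers
        self.out = nn.Linear(d_model, vocab_size + 1)  # characters plus one blank symbol

    def forward(self, x):                     # x: (batch, N, 80) log-Mel features
        c = self.conv(x.unsqueeze(1))         # (batch, d_model, T = N/4, 20)
        c = c.permute(0, 2, 1, 3).flatten(2)  # (batch, T, d_model * 20)
        h = self.attn(self.proj(c))           # (batch, T, 256) self-attention features
        return self.out(h).log_softmax(-1)    # per-frame emission log-probabilities
```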

The transcript of the training audio is converted into a character sequence, which is shorter than the self-attention feature sequence; the character sequence is padded to the same length as the self-attention feature sequence by repeating characters or inserting blank symbols. The original character sequence can be recovered by the collapse rule: merge consecutive identical characters into a single character, then remove all blank symbols.
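Illustratively, a minimal sketch of this collapse rule follows; the blank marker "<b>" is an assumed placeholder for the blank symbol.

```python
def collapse(labels, blank="<b>"):
    # Merge consecutive identical labels, then remove all blank symbols
    merged = []
    for lab in labels:
        if not merged or lab != merged[-1]:
            merged.append(lab)
    return [lab for lab in merged if lab != blank]

# e.g. collapse(["h", "h", "<b>", "e", "l", "<b>", "l", "o"]) -> ["h", "e", "l", "l", "o"]
```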

Fig. 4 is a schematic diagram of the process of obtaining the first weighted finite state transducer provided in an embodiment of the present application.

In some possible embodiments, referring to fig. 4, the first weighted finite state transducer may be obtained by the following steps. That is, the method provided in the embodiments of the present application may further include:

Step 401: establish a second weighted finite state transducer according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols.

Specifically, after the training of the neural network encoder is completed, the second weighted finite state transducer is established from the network for spelling words from characters and the collapsing network for consecutive identical characters and blank symbols.

Illustratively, first, a weighted finite state transducer is established for each character, and the transducers of all characters are combined into a new weighted finite state transducer by a union algorithm; this transducer serves as the collapsing network.

Specifically, the collapsing network is defined by an 8-tuple, denoted (Σ, Δ, Q1, I1, F1, E1, λ1, ρ1), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the set of characters, Q1 is a finite set of states, I1 ⊆ Q1 is the set of initial states, F1 ⊆ Q1 is the set of final states, λ1 is the initial-weight function, ρ1 is the final-weight function, E1 ⊆ Q1 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q1 is a finite set of transitions, ε is the epsilon label denoting an empty input or output, and K is the set of log-domain weights. The collapsing network is assembled from the per-character weighted finite state transducers by a standard union algorithm, and the log-domain weight is 0 throughout the collapsing network.

Secondly, the spelling rules of all words in the speech recognition system are specified, and a spelling network is constructed.

Specifically, words of a Chinese system are spelled from single Chinese characters, and words of an English system are spelled from single letters; a weighted finite state transducer, the spelling network, is established according to these spelling rules. The spelling network is likewise defined by an 8-tuple, denoted (Σ, Δ, Q2, I2, F2, E2, λ2, ρ2), where the input set Σ is the character set, the output set Δ is the word set, Q2 is a finite set of states, I2 ⊆ Q2 is the set of initial states, F2 ⊆ Q2 is the set of final states, λ2 is the initial weight, ρ2 is the final weight, E2 ⊆ Q2 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q2 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. Since each word corresponds to exactly one spelling, the log-domain weight is 0.

Next, the spelling dictionary, i.e., the second weighted finite state transducer, is built from the spelling network (characters to words) and the collapsing network (consecutive identical characters and blank symbols) by a standard composition algorithm. The second weighted finite state transducer is defined by an 8-tuple, denoted (Σ, Δ, Q3, I3, F3, E3, λ3, ρ3), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the word set, Q3 is a finite set of states, I3 ⊆ Q3 is the set of initial states, F3 ⊆ Q3 is the set of final states, λ3 is the initial weight, ρ3 is the final weight, E3 ⊆ Q3 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q3 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. The second weighted finite state transducer is optimized using standard determinization and minimization algorithms.

Step 402: train a statistical language model based on N-gram rules on the text corpus training set, and establish a third weighted finite state transducer from the statistical language model.

Specifically, a text corpus training set is established. Chinese corpora are segmented into words according to the word set used by the Chinese system; in English corpora, words that do not belong to the word set are mapped to a single symbol. The processed text corpus is used to train an N-gram statistical language model, which is converted into a weighted finite state transducer by a standard algorithm, thereby establishing the third weighted finite state transducer.

Illustratively, the third weighted finite state transducer is built from a 3-gram statistical language model and is defined by an 8-tuple, denoted (Δ, Δ, Q4, I4, F4, E4, λ4, ρ4), where the input set and the output set are both the word set Δ, Q4 is a finite set of states, I4 ⊆ Q4 is the set of initial states, F4 ⊆ Q4 is the set of final states, λ4 is the initial weight, ρ4 is the final weight, E4 ⊆ Q4 × (Δ ∪ {ε}) × (Δ ∪ {ε}) × K × Q4 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. The log-domain weight on each transition of the third weighted finite state transducer is the logarithm of the probability of the corresponding 3-gram.
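Illustratively, the following is a minimal sketch of estimating unsmoothed 3-gram log probabilities from a word-segmented corpus; a practical language model would add smoothing and back-off arcs, which are omitted here, and the sentence-boundary symbols are illustrative assumptions.

```python
import math
from collections import Counter

def train_trigram(sentences):
    # sentences: lists of words, already segmented
    tri, hist = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(words) - 2):
            hist[tuple(words[i:i + 2])] += 1   # count of the 2-word history
            tri[tuple(words[i:i + 3])] += 1    # count of the 3-gram
    # log P(w3 | w1, w2); this becomes the log-domain weight of one
    # transition in the third weighted finite state transducer
    return {g: math.log(c / hist[g[:2]]) for g, c in tri.items()}
```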

Step 403: compose the second weighted finite state transducer with the third weighted finite state transducer to obtain the first weighted finite state transducer.

Illustratively, the second weighted finite state transducer and the third weighted finite state transducer are combined into the first weighted finite state transducer by a standard composition algorithm. The first weighted finite state transducer is defined by an 8-tuple, denoted (Σ, Δ, Q5, I5, F5, E5, λ5, ρ5), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the word set, Q5 is a finite set of states, I5 ⊆ Q5 is the set of initial states, F5 ⊆ Q5 is the set of final states, λ5 is the initial weight, ρ5 is the final weight, E5 ⊆ Q5 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q5 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights.

The first weighted finite state transducer is optimized using determinization and minimization algorithms, and the optimized first weighted finite state transducer is used for subsequent decoding.
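Illustratively, the following is a minimal sketch of the composition and optimization steps, assuming OpenFst's Python wrapper pywrapfst is available; the file names are placeholders, and in practice epsilon handling or label encoding may be required before determinization.

```python
import pywrapfst as fst

# T: collapsing network, L: spelling network, G: 3-gram language model,
# each previously saved as a binary FST (the paths are placeholders).
T = fst.Fst.read("T.fst")
L = fst.Fst.read("L.fst")
G = fst.Fst.read("G.fst")

# Second WFST: compose the collapsing network with the spelling network,
# then determinize and minimize (composition expects sorted arcs).
L.arcsort(sort_type="ilabel")
TL = fst.determinize(fst.compose(T, L))
TL.minimize()

# First WFST: compose with the language model and optimize again.
G.arcsort(sort_type="ilabel")
TLG = fst.determinize(fst.compose(TL, G))
TLG.minimize()
TLG.write("TLG.fst")
```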

S203: search for the word sequence with the highest score using a beam search algorithm, according to the predetermined first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

In a more specific embodiment, the beam search algorithm searches for the word sequence with the highest score according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Fig. 5 is a schematic flow chart of searching for the word sequence with the highest score provided in an embodiment of the present application.

In a more specific embodiment, referring to fig. 5, searching for the word sequence with the highest score using the beam search algorithm may include the following steps (an illustrative code sketch follows these steps):

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, bind a preset first token to the initial node, and add the first token to the first set corresponding to the 1st subframe.

When t = 1, i.e., the t-th frame is the first frame, a token is initialized and bound to the initial node, and the initialized token is added to the first set corresponding to the first frame.

A2: when 1 < t < N, take out a second token stored in the first set corresponding to the (t-1)-th frame; copy the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulate the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, delete the third token; otherwise, save the third token in a second set.

A3: judge whether the input label on the current transition edge is the epsilon symbol; if so, execute A2; otherwise, execute A4.

A4: accumulate the emission probability P corresponding to the t-th frame into the third token, and delete the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, execute A5; otherwise, execute A2.

A5: select the top-K third tokens by score from the tokens stored in the second set, and store them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame.

A6: when t = N, select the token with the highest score in the first set, and backtrack the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

When t = N, i.e., the t-th frame is the last frame, the token with the highest score in the first set is selected, and the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer are backtracked to form the word sequence, which serves as the final recognition result.
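Illustratively, the following is a minimal pure-Python sketch of steps A1-A6; it assumes the first weighted finite state transducer is stored as an arc dictionary with no epsilon cycles, and all names and the data layout are illustrative assumptions rather than the implementation of the present application.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Token:
    score: float                                     # accumulated weights plus emission log-probs
    state: int = field(compare=False)                # node the token is bound to (A1)
    words: tuple = field(compare=False, default=())  # outputs for backtracking (A6)

def beam_search(arcs, start, emissions, beam=10.0, K=8):
    # arcs[state] -> list of (in_label, out_word, weight, next_state);
    # in_label None marks an epsilon arc. emissions[t][label] is the
    # log emission probability P of `label` at frame t.
    first_set = [Token(0.0, start)]                        # A1
    for emis in emissions:                                 # one pass per frame
        second_set, best = [], max(tok.score for tok in first_set)
        stack = list(first_set)
        while stack:                                       # A2: expand transitions
            tok = stack.pop()
            for in_lab, out_word, w, nxt in arcs.get(tok.state, ()):
                words = tok.words + ((out_word,) if out_word else ())
                new = Token(tok.score + w, nxt, words)
                if in_lab is None:                         # A3: epsilon input
                    if best - new.score <= beam:
                        stack.append(new)                  # keep expanding, back to A2
                    continue
                new.score += emis.get(in_lab, float("-inf"))  # A4: add emission
                if best - new.score <= beam:               # beam pruning threshold
                    second_set.append(new)
                    best = max(best, new.score)
        first_set = heapq.nlargest(K, second_set)          # A5: keep the top K
    return max(first_set).words                            # A6: best token's outputs
```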

Compared with a traditional speech recognition system, the method and the device omit the frame-level alignment step of the speech recognition system and simplify training and decoding; compared with an encoder-decoder end-to-end speech recognition system, the weighted finite state transducer replaces a neural network during the beam search, which speeds up decoding and efficiently exploits text data beyond the training audio, improving the performance of the end-to-end speech recognition system so that it can be deployed rapidly in many fields.

Fig. 6 is a schematic diagram of a speech recognition apparatus provided in an embodiment of the present application. As shown in fig. 6, the speech recognition apparatus includes a feature extraction module 601, a neural network encoder module 602, and a recognition module 604.

The feature extraction module 601 is configured to determine the log-Mel spectral feature sequence corresponding to the N subframes of the speech to be recognized.

Referring to fig. 3, the process by which the feature extraction module 601 determines the log-Mel spectral feature sequence may include:

S2011: pre-emphasize the audio to be recognized to boost its high-frequency part.

Specifically, pre-emphasizing the audio removes the influence of lip radiation and increases the high-frequency resolution of the speech.

S2012: frame the pre-emphasized audio, with a frame length of 25 ms and a frame shift of 10 ms.

Specifically, the speech must be Fourier-transformed, and the Fourier transform requires a stationary input signal, so the audio is divided into frames; framing ensures the short-time stationarity of the speech.

S2013: window each frame, using a Hamming window as the window function.

Specifically, each frame is multiplied by the window function before the Fourier expansion, which makes the signal more continuous at the frame boundaries and avoids the Gibbs effect. After windowing, a speech signal that is not originally periodic exhibits some of the characteristics of a periodic function.

S2014: apply a fast Fourier transform to each frame to obtain the spectrum of each frame, and obtain the energy spectrum of each frame from its spectrum.

Specifically, the fast Fourier transform converts the time-domain signal of each frame into a frequency-domain signal. Stacking the frequency-domain signals of the frames in time yields the spectrogram, and the energy spectrum of each frame is obtained from its spectrum.

S2015: pass the energy spectrum of each frame through the Mel filters and take the logarithm to obtain the log-Mel spectrum, where the number of Mel filters is 80.

Specifically, 80 Mel filters are used: the energy spectrum of each frame is passed through the 80 Mel filters to obtain 80-dimensional filtered energies; the logarithm of each frame's 80-dimensional filtered energies is taken, and the frames are stacked, yielding the 80-dimensional log-Mel spectral feature sequence of the speech to be recognized.

The neural network encoder module 602 is configured to process the log-Mel spectral feature sequence to obtain the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

Specifically, an audio training set with labels is established, and the neural network encoder is trained on this training set. The Mel-spectral feature sequence of the audio training set is extracted by the feature extraction module 601; illustratively, an 80-dimensional log-Mel spectral feature sequence of the training audio is extracted, denoted X = [x1, x2, …, xN]. This sequence is input into the neural network encoder.

Illustratively, the neural network encoder comprises a 2-layer convolutional neural network and a 12-layer self-attention network. The convolution kernels in the convolutional neural network have a stride of 2, and the output is a 256-dimensional convolutional feature sequence, denoted C = [c1, c2, …, cT]; the length of the convolved feature sequence is 1/4 of that of the original input feature sequence. The convolutional feature sequence is input into the 12-layer self-attention network, which outputs a 256-dimensional self-attention feature sequence, denoted H = [h1, h2, …, hT]. The self-attention feature sequence is passed through a fully connected network and a softmax network to obtain the emission probability of the character or the blank symbol corresponding to each frame's feature.

The transcript of the training audio is converted into a character sequence, which is shorter than the self-attention feature sequence; the character sequence is padded to the same length as the self-attention feature sequence by repeating characters or inserting blank symbols. The original character sequence can be recovered by the collapse rule: merge consecutive identical characters into a single character, then remove all blank symbols.

In some possible embodiments, the speech recognition apparatus further comprises an obtaining module 603. The obtaining module 603 is configured to obtain a second weighted finite state transducer according to the rules for spelling words from characters and the rules for collapsing consecutive identical characters and blank symbols.

Specifically, after the training of the neural network encoder is completed, the obtaining module 603 establishes the second weighted finite state transducer from the network for spelling words from characters and the collapsing network for consecutive identical characters and blank symbols.

Illustratively, first, the obtaining module 603 establishes a weighted finite state transducer for each character and combines the transducers of all characters into a new weighted finite state transducer by a union algorithm; this transducer serves as the collapsing network.

Specifically, the collapsing network is defined by an 8-tuple, denoted (Σ, Δ, Q1, I1, F1, E1, λ1, ρ1), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the set of characters, Q1 is a finite set of states, I1 ⊆ Q1 is the set of initial states, F1 ⊆ Q1 is the set of final states, λ1 is the initial-weight function, ρ1 is the final-weight function, E1 ⊆ Q1 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q1 is a finite set of transitions, ε is the epsilon label denoting an empty input or output, and K is the set of log-domain weights. The collapsing network is assembled from the per-character weighted finite state transducers by a standard union algorithm, and the log-domain weight is 0 throughout the collapsing network.

Second, the obtaining module 603 specifies the spelling rules of all words in the speech recognition system and constructs a spelling network.

Specifically, words of a Chinese system are spelled from single Chinese characters, and words of an English system are spelled from single letters; a weighted finite state transducer, the spelling network, is established according to these spelling rules. The spelling network is likewise defined by an 8-tuple, denoted (Σ, Δ, Q2, I2, F2, E2, λ2, ρ2), where the input set Σ is the character set, the output set Δ is the word set, Q2 is a finite set of states, I2 ⊆ Q2 is the set of initial states, F2 ⊆ Q2 is the set of final states, λ2 is the initial weight, ρ2 is the final weight, E2 ⊆ Q2 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q2 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. Since each word corresponds to exactly one spelling, the log-domain weight is 0.

Next, the obtaining module 603 builds the spelling dictionary, i.e., the second weighted finite state transducer, from the spelling network (characters to words) and the collapsing network (consecutive identical characters and blank symbols) by a standard composition algorithm. The second weighted finite state transducer is defined by an 8-tuple, denoted (Σ, Δ, Q3, I3, F3, E3, λ3, ρ3), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the word set, Q3 is a finite set of states, I3 ⊆ Q3 is the set of initial states, F3 ⊆ Q3 is the set of final states, λ3 is the initial weight, ρ3 is the final weight, E3 ⊆ Q3 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q3 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. The second weighted finite state transducer is optimized using standard determinization and minimization algorithms.

In some possible embodiments, the obtaining module 603 is further configured to train a statistical language model based on N-gram rules on the text corpus training set, and to obtain a third weighted finite state transducer from the statistical language model.

Specifically, a text corpus training set is established by the obtaining module 603. Chinese corpora are segmented into words according to the word set used by the Chinese system; in English corpora, words that do not belong to the word set are mapped to a single symbol. The processed text corpus is used to train an N-gram statistical language model, which is converted into a weighted finite state transducer by a standard algorithm, thereby establishing the third weighted finite state transducer.

Illustratively, the third weighted finite state transducer is built from a 3-gram statistical language model and is defined by an 8-tuple, denoted (Δ, Δ, Q4, I4, F4, E4, λ4, ρ4), where the input set and the output set are both the word set Δ, Q4 is a finite set of states, I4 ⊆ Q4 is the set of initial states, F4 ⊆ Q4 is the set of final states, λ4 is the initial weight, ρ4 is the final weight, E4 ⊆ Q4 × (Δ ∪ {ε}) × (Δ ∪ {ε}) × K × Q4 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights. The log-domain weight on each transition of the third weighted finite state transducer is the logarithm of the probability of the corresponding 3-gram.

In some possible embodiments, the obtaining module 603 is further configured to compose the second weighted finite state transducer with the third weighted finite state transducer to obtain the first weighted finite state transducer.

Illustratively, the second weighted finite state transducer and the third weighted finite state transducer are combined into the first weighted finite state transducer by a standard composition algorithm. The first weighted finite state transducer is defined by an 8-tuple, denoted (Σ, Δ, Q5, I5, F5, E5, λ5, ρ5), where the input set Σ is the set of characters together with the blank symbol, the output set Δ is the word set, Q5 is a finite set of states, I5 ⊆ Q5 is the set of initial states, F5 ⊆ Q5 is the set of final states, λ5 is the initial weight, ρ5 is the final weight, E5 ⊆ Q5 × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q5 is a finite set of transitions, ε is the epsilon label, and K is the set of log-domain weights.

The first weighted finite state transducer is optimized using determinization and minimization algorithms, and the optimized first weighted finite state transducer is used for subsequent decoding.

The recognition module 604 is configured to search for the word sequence with the highest score using a beam search algorithm according to the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

In a more specific example, the recognition module 604 is specifically configured to search for the word sequence with the highest score using a beam search algorithm according to the predetermined weights on the first weighted finite state transducer and the emission probabilities of the characters or the blank symbol corresponding to each of the N subframes.

In a more specific example, referring to fig. 5 and fig. 6, the recognition module 604 searching for the word sequence with the highest score using the beam search algorithm specifically includes:

taking out the emission probability P of the character or the blank symbol corresponding to the t-th subframe, and recording the highest score among all current tokens, where t = 1, 2, …, N; the score of a single token is the sum of the emission probabilities P and the weights accumulated by the token; wherein:

A1: when t = 1, bind a preset first token to the initial node, and add the first token to the first set corresponding to the 1st subframe.

When t = 1, i.e., the t-th frame is the first frame, a token is initialized and bound to the initial node, and the initialized token is added to the first set corresponding to the first frame.

A2: when 1 < t < N, take out a second token stored in the first set corresponding to the (t-1)-th frame; copy the second token, as a third token, to every node of the t-th frame reachable by an outgoing transition from the node to which the second token is bound; accumulate the weight on the transition edge into the third token; if the difference between the currently recorded highest score and the accumulated score of the third token is greater than a threshold, delete the third token; otherwise, save the third token in a second set.

A3: judge whether the input label on the current transition edge is the epsilon symbol; if so, execute A2; otherwise, execute A4.

A4: accumulate the emission probability P corresponding to the t-th frame into the third token, and delete the second token from the first set corresponding to the (t-1)-th frame; if the current first set is empty, execute A5; otherwise, execute A2.

A5: select the top-K third tokens by score from the tokens stored in the second set, and store them in the first set corresponding to the t-th frame, where K is greater than 1 and not greater than the number of nodes of the t-th frame.

A6: when t = N, select the token with the highest score in the first set, and backtrack the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer to form the word sequence.

When t = N, i.e., the t-th frame is the last frame, the token with the highest score in the first set is selected, and the outputs of the transition edges traversed by the highest-scoring token on the first weighted finite state transducer are backtracked to form the word sequence, which serves as the final recognition result.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the above-described embodiments may be implemented by hardware, by software modules executed by a processor, or by a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
