Minnan language voice recognition method, system, equipment and medium

Document No.: 170834    Publication date: 2021-10-29

Note: This technology, a Minnan speech recognition method, system, device and medium (一种闽南语语音识别方法、系统、设备及介质), was created by Ou Zhijian (欧智坚), Liu Yan (刘岩), Xiao Ji (肖吉) and Sun Lei (孙磊) on 2021-06-02. Summary: The invention discloses a Minnan speech recognition method, system, device and medium that uses Mandarin phonemes as the modeling unit for recognizing Minnan. Compared with the traditional approach of using Minnan phonemes as the modeling unit, this greatly reduces the number of phoneme sequences, lowers the complexity of the phoneme-based n-gram language model and reduces the workload, thereby improving modeling efficiency. Meanwhile, a conditional random field (CRF) is introduced into the objective function: the CTC state posteriors can be regarded as the node potentials of the CRF, and dependencies between states can be introduced through edge potentials, which improves the word error rate, enhances the performance of the acoustic model, and thus raises recognition accuracy.

1. A Minnan language voice recognition method is characterized by comprising the following steps:

acquiring Minnan language voice original data, and extracting a voice feature sequence in the original data; inputting the voice feature sequence into a target model to obtain probability distribution of the voice feature sequence corresponding to different prediction phoneme sequences;

mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm, and recording the sequence as T.fst;

acquiring Minnan Chinese character information, and converting the Minnan Chinese character information into a plurality of Chinese phoneme sequences; training the plurality of Chinese phoneme sequences through a denominator LM to obtain a phoneme-based n-gram language model, denoted G.fst;

performing a composition operation on T.fst and G.fst to obtain a denominator graph, denoted fst;

calculating the probability corresponding to the optimal predicted phoneme sequence through an objective function according to the probability distribution and the denominator graph, and performing back-propagation training through the value of the objective function to obtain a trained acoustic model;

during recognition, the speech feature sequence to be recognized is input into the trained acoustic model to obtain the probability that the speech feature sequence to be recognized corresponds to different predicted phoneme sequences, and then the optimal recognition result is obtained by decoding and searching in combination with the language model.

2. The method of claim 1, wherein the extracting the speech feature sequence comprises:

extracting Fbank features from the original data, and obtaining first-order difference features and second-order difference features of the Fbank features;

splicing the Fbank features of the current frame with their first-order and second-order difference features;

performing cepstral mean-variance normalization on the spliced features;

and down-sampling the features after cepstral mean-variance normalization to obtain the speech feature sequence.

3. The method of claim 1, wherein the step of mapping the chinese dictionary to the corresponding phoneme notation sequence through a CTC algorithm comprises:

converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

adding blank symbols in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

4. The Minnan speech recognition method according to any one of claims 1-3, wherein the objective function is defined by maximizing conditional likelihood, and the specific expression of the objective function is as follows:

J_all(θ) = log p(l|x; θ) + α·log Σ_{m=1}^{M} p(π_m|x)

wherein J_all(θ) represents the objective function, α represents a weighting coefficient, x represents the input speech feature sequence, θ represents the model parameters, π_m represents the m-th predicted phoneme sequence corresponding to the speech feature sequence x, M represents the number of different predicted phoneme sequences corresponding to x, l represents the phoneme label sequence corresponding to x, p(π_m|x) represents the probability that the output predicted phoneme sequence is π_m when the input speech feature sequence is x, and p(l|x; θ) represents the probability, under the model parameters θ, that the output phoneme label sequence is l when the input speech feature sequence is x;

in the decoding search, the scoring function adopted by the identification result is as follows:

S = log p(l|x) + β·log p_w(l)

wherein S represents the score of the recognition result, p(l|x) represents the probability, output by the acoustic model, that the phoneme label sequence corresponding to the speech feature sequence x to be recognized is l, β represents a weighting coefficient, and p_w(l) represents the probability of the phoneme label sequence l under the word-level language model.

5. A southern Fujian speech recognition system, comprising:

the probability distribution acquisition module is used for acquiring original Minnan speech data and extracting a speech feature sequence in the original data; inputting the voice feature sequence into a target model to obtain probability distribution of the voice feature sequence corresponding to different prediction phoneme sequences;

the first finite state converter module is used for mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm and recording the phoneme labeling sequence as T.fst;

the second finite state converter module is used for acquiring Chinese character information of Minnan and converting the Chinese character information of Minnan into a plurality of Chinese phoneme sequences; training a plurality of Chinese phoneme sequences through a denominator LM to obtain a phoneme-based n-gram language model, and recording as G.fst;

a denominator graph constructing module, configured to perform a composition operation on T.fst and G.fst to obtain a denominator graph, denoted fst;

the objective function calculation module is used for calculating the probability corresponding to the optimal predicted phoneme sequence through an objective function according to the probability distribution and the denominator graph, and performing back-propagation training through the value of the objective function to obtain a trained acoustic model;

and the recognition module is used for inputting the voice feature sequence to be recognized into the trained acoustic model during recognition to obtain the probability that the voice feature sequence to be recognized corresponds to different predicted phoneme sequences, and then decoding and searching to obtain the optimal recognition result by combining the language model.

6. The Minnan speech recognition system of claim 5, wherein the probability distribution obtaining module is specifically configured to:

extracting Fbank features from the original data, and obtaining first-order difference features and second-order difference features of the Fbank features;

splicing the Fbank features of the current frame with their first-order and second-order difference features;

performing cepstral mean-variance normalization on the spliced features;

and down-sampling the features after cepstral mean-variance normalization to obtain the speech feature sequence.

7. The Minnan Speech recognition system of claim 5, wherein the first finite State transducer module is specifically configured to:

converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

adding blank symbols in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

8. The Minnan speech recognition system of any one of claims 5-7, wherein in the objective function computation module, the specific expression of the objective function is:

J_all(θ) = log p(l|x; θ) + α·log Σ_{m=1}^{M} p(π_m|x)

wherein J_all(θ) represents the objective function, α represents a weighting coefficient, x represents the input speech feature sequence, θ represents the model parameters, π_m represents the m-th predicted phoneme sequence corresponding to the speech feature sequence x, M represents the number of different predicted phoneme sequences corresponding to x, l represents the phoneme label sequence corresponding to x, p(π_m|x) represents the probability that the output predicted phoneme sequence is π_m when the input speech feature sequence is x, and p(l|x; θ) represents the probability, under the model parameters θ, that the output phoneme label sequence is l when the input speech feature sequence is x.

9. An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor when executing the computer program implements the steps of the southern Fujian speech recognition method of any of claims 1-4.

10. A medium having a computer program stored thereon, wherein the computer program, when executed by a processor, carries out the steps of the Minnan speech recognition method according to any one of claims 1 to 4.

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to a Minnan language voice recognition method, system, equipment and medium.

Background

With the development of artificial intelligence technology, speech recognition has made great progress and is entering fields such as household appliances, communications, automobiles and medical care. Speech recognition means that a machine processes and recognizes human speech and converts it into corresponding text, after which semantic analysis and understanding are performed, so that the machine can understand speech as a human does.

An existing speech recognition system mainly comprises an acoustic model, a language model and a decoder. Specifically, after labeled training speech data is input into the speech recognition system, acoustic feature vectors are obtained through feature extraction; the acoustic model maps each acoustic feature vector to its corresponding pronunciation unit, and the decoder obtains the final speech recognition result from the pronunciation units and the language model.

Although Mandarin speech recognition has reached a practical level, there is as yet no mature and reliable solution for speech recognition of Minnan (Southern Min). A working Minnan speech recognition system could provide friendly localized services for Minnan-speaking regions and a more convenient environment for Minnan teaching and the spread of Minnan culture. Minnan has 18 initials, at least 85 finals and 7 tones, which combine into about 2,300 usable syllables, nearly double the roughly 1,300 syllables of Mandarin.

At present, Minnan speech recognition is mainly realized by constructing tables of Minnan initials, finals and tones, i.e., by converting Minnan Chinese characters directly into Minnan phoneme sequences; because Minnan has so many syllables, this conversion involves a heavy workload. Meanwhile, the acoustic model is the main module of a Minnan speech recognition system, and its performance directly determines the system's recognition accuracy. At present, problems such as mislabeled speech data lead to low acoustic-model training precision and thus poor acoustic-model performance, resulting in low recognition accuracy.

Disclosure of Invention

The invention aims to provide a method, a system, equipment and a medium for recognizing Minnan speech, which are used for overcoming the problems of large workload and low efficiency of the conventional Minnan speech recognition system and the problem of low recognition accuracy caused by low performance of an acoustic model.

In a first aspect, the invention provides a southern Fujian speech recognition method, which comprises the following steps:

acquiring Minnan language voice original data, and extracting a voice feature sequence in the original data; inputting the voice feature sequence into a target model to obtain the probability distribution of the voice feature sequence corresponding to different prediction phoneme sequences;

mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm, and recording the sequence as T.fst;

acquiring Minnan Chinese character information, and converting the Minnan Chinese character information into a plurality of Chinese phoneme sequences; training the plurality of Chinese phoneme sequences through a denominator LM to obtain a phoneme-based n-gram language model, denoted G.fst;

performing a composition operation on T.fst and G.fst to obtain a denominator graph, denoted fst;

calculating the probability corresponding to the optimal predicted phoneme sequence through an objective function according to the probability distribution and the denominator graph, and performing back-propagation training through the value of the objective function to obtain a trained acoustic model;

during recognition, the speech feature sequence to be recognized is input into the trained acoustic model to obtain the probability that the speech feature sequence to be recognized corresponds to different predicted phoneme sequences, and then the optimal recognition result is obtained by decoding and searching in combination with the language model.

Further, the extraction process of the voice feature sequence is as follows:

extracting Fbank features from the original data, and obtaining first-order difference features and second-order difference features of the Fbank features;

splicing the Fbank features of the current frame with their first-order and second-order difference features;

performing cepstral mean-variance normalization on the spliced features;

and down-sampling the features after cepstral mean-variance normalization to obtain the speech feature sequence.

Further, the specific step of mapping the chinese dictionary to the corresponding phoneme annotation sequence through the CTC algorithm is as follows:

converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

adding blank characters in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

Further, the objective function is defined by maximizing conditional likelihood, and the specific expression of the objective function is as follows:

J_all(θ) = log p(l|x; θ) + α·log Σ_{m=1}^{M} p(π_m|x)

wherein J_all(θ) represents the objective function, α represents a weighting coefficient, x represents the input speech feature sequence, θ represents the model parameters, π_m represents the m-th predicted phoneme sequence corresponding to the speech feature sequence x, M represents the number of different predicted phoneme sequences corresponding to x, l represents the phoneme label sequence corresponding to x, p(π_m|x) represents the probability that the output predicted phoneme sequence is π_m when the input speech feature sequence is x, and p(l|x; θ) represents the probability, under the model parameters θ, that the output phoneme label sequence is l when the input speech feature sequence is x.
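On toy numbers, the objective can be evaluated directly (a minimal sketch; the probability values are invented, and `log_p_l_given_x` stands in for the CRF term p(l|x; θ) that the acoustic model actually computes):

```python
import math

def j_all(log_p_l_given_x: float, path_probs: list, alpha: float) -> float:
    """J_all(theta) = log p(l|x; theta) + alpha * log sum_m p(pi_m|x)."""
    # Sum over the M predicted phoneme sequences pi_1..pi_M for this utterance.
    ctc_term = math.log(sum(path_probs))
    return log_p_l_given_x + alpha * ctc_term

# Toy numbers: three candidate phoneme sequences for one utterance.
value = j_all(log_p_l_given_x=math.log(0.2),
              path_probs=[0.05, 0.10, 0.02],
              alpha=0.1)
```

During training, back-propagating through this value updates the acoustic model parameters θ.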

Further, in the decoding search, the scoring function adopted by the identification result is as follows:

S = log p(l|x) + β·log p_w(l)

wherein S represents the score of the recognition result, p(l|x) represents the probability, output by the acoustic model, that the phoneme label sequence corresponding to the speech feature sequence x to be recognized is l, β represents a weighting coefficient, and p_w(l) represents the probability of the phoneme label sequence l under the word-level language model.
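The scoring function can likewise be sketched (invented probabilities; in a real decoder p(l|x) comes from the acoustic model and p_w(l) from the word-level language model):

```python
import math

def rescore(p_l_given_x: float, p_w_l: float, beta: float) -> float:
    """S = log p(l|x) + beta * log p_w(l): acoustic score plus weighted LM score."""
    return math.log(p_l_given_x) + beta * math.log(p_w_l)

# Toy comparison of two candidate label sequences: the second is acoustically
# slightly worse but far more probable under the word-level language model.
s1 = rescore(0.30, 0.010, beta=0.8)
s2 = rescore(0.25, 0.050, beta=0.8)
```

In the decoding search the candidate with the highest score S is returned as the recognition result; here the LM term tips the decision toward the second candidate.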

In a second aspect, the present invention provides a southern Fujian speech recognition system, comprising:

the probability distribution acquisition module is used for acquiring original Minnan speech data and extracting a speech feature sequence in the original data; inputting the voice feature sequence into a target model to obtain probability distribution of different prediction phoneme sequences corresponding to the voice feature sequence;

a first finite state converter module, which is used for mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm and recording the sequence as T.fst;

the second finite state converter module is used for acquiring Chinese character information of Minnan and converting the Chinese character information of Minnan into a plurality of Chinese phoneme sequences; training a plurality of Chinese phoneme sequences through a denominator LM to obtain a phoneme-based n-gram language model, and recording the model as G.fst;

a denominator graph constructing module, configured to perform a composition operation on T.fst and G.fst to obtain a denominator graph, denoted fst;

the target function calculation module is used for calculating the probability corresponding to the optimal prediction phoneme sequence through a target function according to the probability distribution and the denominator graph, and performing back propagation training through the value of the target function to obtain a trained acoustic model;

and the recognition module is used for inputting the voice feature sequence to be recognized into the trained acoustic model during recognition to obtain the probability that the voice feature sequence to be recognized corresponds to different predicted phoneme sequences, and then decoding and searching to obtain the optimal recognition result by combining the language model.

Further, the probability distribution obtaining module is specifically configured to:

extracting Fbank features from the original data, and obtaining first-order difference features and second-order difference features of the Fbank features;

splicing the Fbank features of the current frame with their first-order and second-order difference features;

performing cepstral mean-variance normalization on the spliced features;

and down-sampling the features after cepstral mean-variance normalization to obtain the speech feature sequence.

Further, the first finite state transducer module is specifically configured to:

converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

adding blank characters in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

Further, in the objective function calculation module, the specific expression of the objective function is as follows:

J_all(θ) = log p(l|x; θ) + α·log Σ_{m=1}^{M} p(π_m|x)

wherein J_all(θ) represents the objective function, α represents a weighting coefficient, x represents the input speech feature sequence, θ represents the model parameters, π_m represents the m-th predicted phoneme sequence corresponding to the speech feature sequence x, M represents the number of different predicted phoneme sequences corresponding to x, l represents the phoneme label sequence corresponding to x, p(π_m|x) represents the probability that the output predicted phoneme sequence is π_m when the input speech feature sequence is x, and p(l|x; θ) represents the probability, under the model parameters θ, that the output phoneme label sequence is l when the input speech feature sequence is x.

In a third aspect, the present invention provides an apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for speech recognition in southern Fujian as described in the first aspect when executing the computer program.

In a fourth aspect, the present invention provides a medium having stored thereon a computer program that, when being executed by a processor, carries out the steps of the method for speech recognition in southern Fujian language of the first aspect.

Advantageous effects

Compared with the prior art, the invention has the advantages that:

compared with the traditional method of using Minnan phonemes as the modeling unit, the technical scheme of the invention, which uses Mandarin phonemes as the modeling unit, greatly reduces the number of phoneme sequences, lowers the complexity of the phoneme-based n-gram language model and reduces the workload, thereby improving modeling efficiency. Meanwhile, a conditional random field (CRF) is introduced into the objective function: the CTC state posteriors can be regarded as the node potentials of the CRF, and dependencies between states can be introduced through edge potentials, which improves the word error rate, enhances the acoustic model's performance, and thus raises recognition accuracy.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only one embodiment of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

FIG. 1 is a flow chart of acoustic model training in an embodiment of the present invention;

FIG. 2 is a representation of conditional random fields of CTCs in an embodiment of the present invention;

FIG. 3 is a conditional random field used by the CTC-CRF in an embodiment of the present invention;

fig. 4 is a flow chart of southern min speech recognition in an embodiment of the present invention.

Detailed Description

The technical solutions in the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For convenience of understanding, words appearing in the embodiments of the present invention are explained.

Syllable: the basic unit of speech that can be clearly distinguished by hearing; it is the natural pronunciation unit of human speech, and syllables have clearly perceptible boundaries. In Chinese, the pronunciation of one Chinese character is one syllable; illustratively, the syllable corresponding to the Chinese character for "you" (你) is {ni};

Phoneme: the smallest unit, or smallest speech segment, that makes up a syllable; it is the smallest linear speech unit divided from the perspective of sound quality, and the boundaries between phonemes are fuzzy and heavily dependent on context;

State: a phoneme is artificially divided into several states that have no physical meaning; the characteristics of a phoneme within a single state can be considered stable, and the boundaries between states are fuzzy and heavily dependent on context;

CTC: Connectionist Temporal Classification. For one input X, the CTC algorithm can output conditional probabilities for very many label sequences Y; although CTC does not require strict alignment between inputs and outputs, a mapping that aligns them is still needed to facilitate training of the model.

FST: Finite State Transducer. An FST describes a set of regular transductions from one symbol sequence to another.

WFST: Weighted Finite State Transducer. Each state transition carries a weight, each initial state an initial weight and each final state a final weight; the weights are generally probabilities or costs of the transitions or of the initial/final states. Weights are accumulated along each path and combined across different paths.

CRF: conditional Random Field.

The technical solution of the present application will be described in detail below with specific examples. Several of these specific embodiments may be combined, and details of the same or similar concepts or processes may not be repeated in some embodiments.

In the first aspect, for a clearer description of the scheme, before introducing the southern Fujian speech recognition method provided by the embodiment of the present invention, a simple description is first given to a flow of acoustic model training.

As shown in fig. 1, the training process of the acoustic model of this embodiment is as follows:

s110: obtaining original data of Minnan speech, and extracting a speech feature sequence in the original data.

The original Minnan speech data comes from an existing data set, which makes it convenient to train the acoustic model. The extraction of the speech feature sequence is divided into the following steps:

s111: extracting 40-dimensional Fbank features from the original data, and acquiring first-order difference features and second-order difference features of the Fbank features.

S112: and splicing the Fbank feature of the current frame with the first-order difference feature and the second-order difference feature of the Fbank feature.

S113: and performing cepstrum mean variance normalization processing on the spliced features.

S114: and performing 3-time down-sampling on the features subjected to cepstrum mean variance normalization processing to obtain a voice feature sequence.

The main purpose of the down-sampling is to speed up training and decoding; since the speech feature sequence contains the second-order difference features, the down-sampling can be considered to incur no loss of precision. Illustratively, if the Chinese annotation text corresponding to the original Minnan speech data is 我爱北京 ("I love Beijing"), the extracted speech feature sequence may consist of 4 parts, namely the speech feature frames corresponding to 我 ("I"), 爱 ("love"), 北 ("north") and 京 ("capital").
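The feature extraction steps S111–S114 can be sketched as follows (a minimal NumPy sketch; the 40-dimensional Fbank matrix is assumed to have been extracted already by a standard filterbank front end, and the simple one-step difference used here stands in for whatever delta window the actual system uses):

```python
import numpy as np

def extract_feature_sequence(fbank: np.ndarray) -> np.ndarray:
    """fbank: (T, 40) log-mel filterbank features -> (T//3 + bool(T%3), 120)."""
    # S111: first- and second-order difference features.
    delta1 = np.diff(fbank, axis=0, prepend=fbank[:1])
    delta2 = np.diff(delta1, axis=0, prepend=delta1[:1])
    # S112: splice the current-frame Fbank with its delta features.
    spliced = np.concatenate([fbank, delta1, delta2], axis=1)   # (T, 120)
    # S113: cepstral mean-variance normalization (per utterance).
    normalized = (spliced - spliced.mean(axis=0)) / (spliced.std(axis=0) + 1e-8)
    # S114: 3x down-sampling for faster training and decoding.
    return normalized[::3]
```

Each returned row is one 120-dimensional frame of the speech feature sequence fed to the target model.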

S120: and inputting the voice feature sequence output in the step S114 into the target model to obtain the probability distribution of different predicted phoneme sequences corresponding to the voice feature sequence.

The target model can take various forms, including CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) and RNN (Recurrent Neural Network). In this embodiment the target model is an LSTM: to address the vanishing-gradient problem of plain RNNs, the LSTM maintains the gradient by introducing the cell state c, thereby alleviating gradient vanishing. Illustratively, the target model is a 6-layer bidirectional LSTM with 320 hidden units per layer; a Dropout layer with keep probability 0.5 is added between LSTM layers. The bidirectional LSTM model can be built with PyTorch, using Adam as the optimizer for parameter learning. The learning rate is initially set to 0.001; when the objective function stops improving, the learning rate is reduced to 0.0001 and training continues until the objective function no longer changes. No pre-training of the bidirectional LSTM model is used.
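As a reminder of why the cell state c alleviates gradient vanishing, one LSTM time step can be sketched in NumPy (toy weights only; the actual model is the 6-layer bidirectional PyTorch network described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. x: (D,), h/c: (H,), W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    f = sigmoid(z[:H])        # forget gate
    i = sigmoid(z[H:2*H])     # input gate
    o = sigmoid(z[2*H:3*H])   # output gate
    g = np.tanh(z[3*H:])      # candidate cell update
    # The cell state is updated additively (c = f*c + i*g), so gradients can
    # flow through time without repeated squashing -- this is what alleviates
    # the vanishing-gradient problem of a plain RNN.
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

The real 6-layer bidirectional model simply stacks such cells, running them forward and backward over the feature sequence.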

For the conditional random field model CRF, given the observed variable x, the distribution of the state sequence π is defined as:

p(π|x; θ) = exp(φ(π, x; θ)) / Σ_{π′} exp(φ(π′, x; θ))        (1)

wherein π and x have equal length, and θ denotes the parameters the model needs to learn. π is linked to l through the CTC mapping B: S_π^M → S_l^L, which maps the state sequence π to a unique label sequence l. S_π and S_l are the symbol tables corresponding to π and l respectively, and M and L are the lengths of π and l respectively. Given these definitions, p(l|x; θ) is defined as

p(l|x; θ) = Σ_{π ∈ B⁻¹(l)} p(π|x; θ)        (2)

When x is the input speech feature sequence and π ranges over the corresponding predicted phoneme sequences, all predicted phoneme sequences {π_1, π_2, ……, π_M} corresponding to each speech feature sequence, together with their probability distribution, are obtained. As shown in FIG. 2, the states in the sequence are mutually independent; to break this independence between states, edges need to be added to the CRF corresponding to CTC, and language-model-based edge potentials are added to obtain better performance. Adding edge potentials to the CRF corresponding to CTC yields the CTC-CRF model; the conditional random field with edge potentials added is shown in FIG. 3. The edge potential is computed in advance rather than learned through parameters, and the potential function φ(π, x) is defined as:

φ(π, x) = Σ_{t=1}^{T} log p(π_t|x) + log p_LM(l)        (3)

wherein l = B(π) is the label sequence obtained by mapping the state sequence π. The first term in equation (3) represents the node potential and the second term represents the edge potential, where p_LM(l) is defined by the WFST represented by a phoneme-level n-gram.
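Under equation (3), the potential of a toy state sequence can be computed directly (a hedged sketch with invented posterior and language-model tables; the mapping B collapses consecutive repeats and removes blank symbols, per the CTC mapping):

```python
import math

def ctc_collapse(pi, blank="-"):
    """CTC mapping B: remove consecutive repeats, then remove blanks."""
    out, prev = [], None
    for s in pi:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

def path_potential(pi, log_post, log_p_lm, blank="-"):
    """phi(pi, x) = sum_t log p(pi_t|x) + log p_LM(B(pi))   (equation 3)."""
    node = sum(log_post[t][s] for t, s in enumerate(pi))   # node potential
    edge = log_p_lm[ctc_collapse(pi, blank)]               # edge potential
    return node + edge

# Toy example: T = 3 frames, symbol set {"a", "-"}.
log_post = [
    {"a": math.log(0.6), "-": math.log(0.4)},
    {"a": math.log(0.5), "-": math.log(0.5)},
    {"a": math.log(0.7), "-": math.log(0.3)},
]
log_p_lm = {"a": math.log(0.9), "aa": math.log(0.1)}
phi = path_potential("a-a", log_post, log_p_lm)   # B("a-a") = "aa"
```

In the real model, the node term comes from the network's per-frame state posteriors and the edge term from the phoneme-level n-gram WFST; here both are hand-made tables.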

Illustratively, the different predicted phonemes corresponding to the speech feature frame for 我 ("I") are "wo1", "wo3" and "wo4", and the different predicted phonemes corresponding to the speech feature frame for 爱 ("love") are "ai1", "ai2", "ai3" and "ai4", where the digits 1, 2, 3 and 4 denote the first, second, third and fourth Mandarin tones respectively.

S130: and mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm, and recording the sequence as T.fst.

The basic idea of CTC is to align the speech feature sequence with the annotation sequence by introducing a blank symbol, and to establish a mapping from the blank-augmented annotation sequence to the actual annotation sequence; this mapping is denoted B, i.e. the CTC mapping. The specific steps of the CTC mapping are:

s131: converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

s132: adding blank symbols in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

s133: removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

s134: and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

Illustratively, taking the state sequence A-RR-R--A as an example (with "-" denoting the blank symbol), first remove all consecutive repeated characters to obtain A-R-R-A, and then remove all blank symbols to obtain ARRA, i.e. B(A-RR-R--A) = ARRA.
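The mapping steps S131 to S134 and the example above can be sketched as follows (the function name is illustrative):

```python
from itertools import groupby

def ctc_map(state_seq, blank="-"):
    """CTC mapping B: first collapse consecutive duplicate symbols,
    then remove all blank symbols."""
    collapsed = [sym for sym, _ in groupby(state_seq)]
    return "".join(sym for sym in collapsed if sym != blank)
```

Applied to the example, ctc_map("A-RR-R--A") collapses the repeats to A-R-R-A and then drops the blanks, yielding "ARRA".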

S140: performing a composition operation on the T.fst and the G.fst to obtain the denominator graph.

In order to avoid inaccurate estimation of the denominator graph caused by randomly inserted silence phonemes in the annotation sequence, silence phonemes are not used; alternatively, silence is absorbed by the blank symbol.

Daniel Povey proposed lattice-free maximum mutual information (LF-MMI) training in 2016. In LF-MMI, the graph used for path summation does not come from word lattices obtained by decoding, but from a denominator graph prepared in advance. In the denominator graph, T represents the WFST from the Chinese dictionary to the phoneme labeling sequence, and G represents the phoneme-based n-gram language model. LF-MMI implements the estimation over the denominator graph on the GPU.
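The role of the composed denominator graph can be illustrated with a toy analogue (the lexicon, the bigram scores and the function name below are invented for illustration; a real system builds T.fst and G.fst with a WFST toolkit such as OpenFst and composes them there):

```python
import math

# T: mapping from words to phoneme sequences (toy lexicon)
T = {"wo": ("w", "o3"), "ai": ("a", "i4")}

# G: phoneme-based bigram probabilities (toy values)
G = {("<s>", "w"): 0.6, ("w", "o3"): 0.9,
     ("o3", "a"): 0.5, ("a", "i4"): 0.8}

def phoneme_lm_logprob(words):
    """Analogue of the T.fst / G.fst composition: map words to
    phonemes via T, then sum bigram log-probabilities from G."""
    phones = [p for w in words for p in T[w]]
    logp, prev = 0.0, "<s>"
    for ph in phones:
        logp += math.log(G[(prev, ph)])
        prev = ph
    return logp
```

For the word sequence ("wo", "ai") the score is log(0.6 × 0.9 × 0.5 × 0.8) = log 0.216; in the real system this path sum runs over the full composed graph rather than a single path.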

The steps S110 to S120 of acquiring the probability distribution and the steps S130 to S140 of constructing the denominator graph may be performed in parallel; there is no required chronological order.

S150: and calculating to obtain the probability corresponding to the optimal prediction phoneme sequence through an objective function according to the probability distribution and the denominator graph, and performing back propagation training through the value of the objective function to obtain a trained acoustic model.

In the training process, to accelerate convergence and strengthen training stability, the CTC objective function is adopted as an auxiliary objective function. The objective function for training is therefore:

J_all(θ) = -log p(l|x; θ) - α log Σ_{m=1}^{M} p(π_m|x)   (4)

wherein J_all(θ) denotes the objective function, α a weighting coefficient, x the input speech feature sequence, θ the model parameters, π_m the m-th predicted phoneme sequence corresponding to the speech feature sequence x, M the number of different predicted phoneme sequences corresponding to x, and l the phoneme labeling sequence corresponding to x; p(π_m|x) denotes the probability that the output predicted phoneme sequence is π_m when the input speech feature sequence is x, and p(l|x; θ) denotes the probability that the output phoneme labeling sequence is l when the input speech feature sequence is x, under model parameters θ. In this embodiment, α is set to 0.1.
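A minimal sketch of the loss in equation (4) for a single utterance, assuming the CRF probability p(l|x; θ) and the CTC path probabilities p(π_m|x) have already been computed (the function name and the values in the test are illustrative):

```python
import math

def j_all(p_l_given_x, path_probs, alpha=0.1):
    """Eq. (4) as a loss to minimize: CRF negative log-likelihood
    plus alpha times the CTC auxiliary negative log-likelihood,
    the latter summing the path probabilities p(pi_m | x)."""
    j_crf = -math.log(p_l_given_x)
    j_ctc = -math.log(sum(path_probs))
    return j_crf + alpha * j_ctc
```

In training, this value is backpropagated through the network that produced the probabilities; here only the scalar combination is shown.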

Exemplarily, for the speech feature frames corresponding to "I love Beijing", the probability of the predicted phoneme sequence "wo3 ai4 bei3 jing1" is 1, the probability of "wo1 ai4 bei3 jing1" is 0.75, the probability of "wo1 ai2 bei3 jing1" is 0.5, the probability of "wo1 ai2 bei1 jing1" is 0.25, the probability of "wo1 ai2 bei1 jing3" is 0, and so on.

After the trained acoustic model is obtained, it is applied to Minnan speech recognition. As shown in FIG. 4, the specific process of Minnan speech recognition includes:

s210: and during recognition, inputting the speech feature sequence to be recognized into the trained acoustic model to obtain the probability that the speech feature sequence to be recognized corresponds to different prediction phoneme sequences.

Illustratively, the acoustic model outputs a probability of "Tiananmen" of 0.5 and a probability of "Tiandarkmen" of 0.5.

S220: combining with the dictionary, the language model outputs the probability.

S230: and decoding and searching according to the probability output by the acoustic model and the probability output by the language model to obtain an optimal recognition result.

In decoding search, the scoring function adopted by the identification result is as follows:

S = log p(l|x) + β log p_w(l)   (5)

wherein S denotes the score of a recognition result, p(l|x) denotes the probability, output by the acoustic model, that the phoneme labeling sequence is l when the speech feature sequence to be recognized is x, β denotes an influence coefficient, and p_w(l) denotes the probability of the phoneme labeling sequence l in the word-level language model. In this embodiment, β is set to 1. Exemplarily, "Tiananmen" has a score of S1 = 0.5 + 1 × 1 = 1.5, and "Tiandarkmen" has a score of S2 = 0.5 + 1 × 0.5 = 1; the recognition result with the higher score is output as the optimal recognition result.
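The selection of the optimal result in S210 to S230 can be sketched as follows; mirroring the worked example above, the sketch combines the two probabilities directly with the weight β (formula (5) applies the same weighting to log-probabilities), and the function names are illustrative:

```python
def score(p_acoustic, p_language, beta=1.0):
    """Combined score of one candidate: acoustic probability plus
    beta times the language-model probability (as in the example)."""
    return p_acoustic + beta * p_language

def best_result(candidates, beta=1.0):
    """candidates: list of (text, acoustic prob, LM prob) tuples.
    Returns the candidate text with the highest combined score."""
    return max(candidates, key=lambda c: score(c[1], c[2], beta))[0]
```

With candidates ("Tiananmen", 0.5, 1.0) and ("Tiandarkmen", 0.5, 0.5), the scores are 1.5 and 1.0, so "Tiananmen" is returned, matching the example above.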

The Minnan speech recognition engine is encapsulated through the cross-platform multimedia processing framework GStreamer, which is used to construct a complete speech recognition pipeline; it supports common formats such as wav, ogg and mp3, and adapts to different channel counts and sampling rates. The engine provides services externally through the WebSocket or HTTP protocol: it obtains the features of the input audio through feature extraction, submits them to the acoustic model, performs a comprehensive decoding search in combination with the language model, and outputs the recognition result.

In a second aspect, the present embodiment further provides a Minnan speech recognition system, including:

the probability distribution acquisition module is used for acquiring original Minnan speech data and extracting a speech feature sequence in the original data; inputting the voice feature sequence into a target model to obtain probability distribution of different prediction phoneme sequences corresponding to the voice feature sequence;

a first finite state converter module, which is used for mapping the Chinese dictionary to a corresponding phoneme labeling sequence through a CTC algorithm and recording the sequence as T.fst;

the second finite state converter module is used for acquiring Chinese character information of Minnan and converting the Chinese character information of Minnan into a plurality of Chinese phoneme sequences; training a plurality of Chinese phoneme sequences through a denominator LM to obtain a phoneme-based n-gram language model, and recording the model as G.fst;

a denominator graph construction module, configured to perform a composition operation on the T.fst and the G.fst to obtain the denominator graph;

the target function calculation module is used for calculating the probability corresponding to the optimal prediction phoneme sequence through a target function according to the probability distribution and the denominator graph, and performing back propagation training through the value of the target function to obtain a trained acoustic model;

and the recognition module is used for inputting the voice feature sequence to be recognized into the trained acoustic model during recognition to obtain the probability that the voice feature sequence to be recognized corresponds to different predicted phoneme sequences, and then decoding and searching to obtain the optimal recognition result by combining the language model.

Preferably, the probability distribution obtaining module is specifically configured to:

extracting Fbank characteristics from the original data, and acquiring first-order difference characteristics and second-order difference characteristics of the Fbank characteristics;

splicing the Fbank feature of the current frame with the first-order difference feature and the second-order difference feature of the Fbank feature;

performing cepstrum mean variance normalization processing on the spliced features;

and performing down-sampling on the features subjected to the cepstrum mean variance normalization processing to obtain the voice feature sequence.
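The four steps of the probability distribution acquisition module can be sketched as follows (the simple prepend-padded difference, the per-utterance CMVN, and the downsampling factor of 3 are illustrative assumptions, not fixed by the embodiment):

```python
import numpy as np

def delta(feats):
    """First-order difference along the time axis, padded with the
    first frame so the sequence length is unchanged."""
    return np.diff(feats, axis=0, prepend=feats[:1])

def make_features(fbank, subsample=3):
    """fbank: (T, D) filterbank features for one utterance."""
    d1 = delta(fbank)            # first-order difference
    d2 = delta(d1)               # second-order difference
    spliced = np.concatenate([fbank, d1, d2], axis=1)  # (T, 3D)
    # Cepstral mean-variance normalization over the utterance
    mu = spliced.mean(axis=0, keepdims=True)
    sigma = spliced.std(axis=0, keepdims=True) + 1e-8
    normed = (spliced - mu) / sigma
    # Downsample frames to obtain the speech feature sequence
    return normed[::subsample]
```

For a 30-frame, 40-dimensional input this yields a (10, 120) feature matrix: the dimension triples from splicing, and the frame count drops by the subsampling factor.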

Preferably, the first finite state transducer module is specifically configured to:

converting the Chinese dictionary into a plurality of Chinese phoneme labeling sequences;

adding blank characters in the Chinese phoneme labeling sequence to align a feature sequence in a Chinese dictionary with the Chinese phoneme labeling sequence;

removing continuous repeated characters in the Chinese phoneme label sequence added with the blank characters;

and removing all blank symbols to obtain a corresponding phoneme labeling sequence.

Preferably, in the objective function calculation module, the specific expression of the objective function is formula (4).

In a third aspect, this embodiment further provides an apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the Minnan speech recognition method according to the first aspect.

In this embodiment, the processor is a central processing unit, or other programmable general purpose or special purpose microprocessor, digital signal processor, programmable controller, application specific integrated circuit, programmable logic device, other similar processing circuits, or a combination of these.

In this embodiment, the memory is an EEPROM, an embedded multimedia memory card eMMC, a DRAM, a flash memory, a nonvolatile random access memory, or the like.

In a fourth aspect, the present invention provides a medium having stored thereon a computer program that, when executed by a processor, carries out the steps of the Minnan speech recognition method of the first aspect.

The medium is a storage medium, specifically an EEPROM, an embedded multimedia memory card eMMC, a DRAM, a flash memory, or a nonvolatile random access memory, or the like.

The above disclosure is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or modifications within the technical scope of the present invention, and shall be covered by the scope of the present invention.
