A Language Identification Method and Identification System

Document No.: 1757137 Publication date: 2019-11-29

Abstract: This technology, a language identification method and identification system (一种语种识别方法及识别系统), was designed by 张劲松 (Zhang Jinsong), 于嘉威 (Yu Jiawei), and 解焱陆 (Xie Yanlu) on 2019-08-01. The invention provides a language identification method and identification system that can improve the performance of a language identification system. The method comprises: converting each frame of a speech signal into articulatory features; training a time-delay neural network with the articulatory features, wherein the articulatory features are input to the time-delay neural network, which learns from and classifies them to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; and, when performing language identification, inputting the articulatory features of the speech to be identified into the trained time-delay neural network, whose output is the similarity between the speech to be identified and each language-specific model, wherein the language-specific model with the highest similarity gives the language class of the speech to be identified. The invention relates to the technical field of speech recognition.

1. A language identification method, characterized by comprising:

converting each frame of a speech signal into articulatory features;

training a time-delay neural network with the articulatory features, wherein the articulatory features are input to the time-delay neural network, and the time-delay neural network learns from and classifies the input articulatory features to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; and

when performing language identification, inputting the articulatory features of the speech to be identified into the trained time-delay neural network, the output of the time-delay neural network being the similarity between the speech to be identified and each language-specific model, wherein the language-specific model with the highest similarity gives the language class of the speech to be identified.

2. The language identification method according to claim 1, characterized in that, before each frame of the speech signal is converted into articulatory features, the method further comprises:

determining articulatory attribute extractors that identify articulatory attributes from frame-level features.

3. The language identification method according to claim 2, characterized in that determining the articulatory attribute extractors that identify articulatory attributes from frame-level features comprises:

converting a phoneme-based training corpus into articulatory attribute labels according to a preset mapping between phonemes and articulatory attributes, to obtain a training set for a feature extraction module;

training the feature extraction module with the training set;

wherein the feature extraction module comprises M articulatory attribute extractors, each corresponding to one articulatory attribute, and each articulatory attribute comprises several attribute classes; after a frame of the speech signal passes through each articulatory attribute extractor, a posterior probability is obtained that indicates the attribute class to which the frame belongs; the posterior probabilities of the Q different attribute classes of the M different articulatory attributes of the frame are arranged together to obtain the articulatory features, where Q denotes the total number of attribute classes contained in the M articulatory attributes.

4. The language identification method according to claim 3, characterized in that converting each frame of the speech signal into articulatory features comprises:

converting, by the M articulatory attribute extractors, each frame of the speech signal into articulatory features composed of the posterior probabilities of the Q different attribute classes.

5. The language identification method according to claim 1, characterized in that the number of nodes in the softmax layer of the time-delay neural network equals the number of languages to be classified, where softmax denotes multi-class classification;

the final output of each softmax-layer node is at the utterance level: each node sums the softmax-layer outputs of all frames in an utterance and takes the average as the final output of that node.

6. A language identification system, characterized by comprising:

an articulatory attribute extractor, configured to convert each frame of a speech signal into articulatory features; and

a time-delay neural network, configured to be trained with the articulatory features, wherein the articulatory features are input to the time-delay neural network, and the time-delay neural network learns from and classifies the input articulatory features to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; and further configured, when performing language identification, to take the articulatory features of the speech to be identified as input to the trained time-delay neural network, the output of the time-delay neural network being the similarity between the speech to be identified and each language-specific model, wherein the language-specific model with the highest similarity gives the language class of the speech to be identified.

7. The language identification system according to claim 6, characterized in that the system further comprises:

a determining module, configured to determine articulatory attribute extractors that identify articulatory attributes from frame-level features.

8. The language identification system according to claim 7, characterized in that the determining module is configured to convert a phoneme-based training corpus into articulatory attribute labels according to a preset mapping between phonemes and articulatory attributes, to obtain a training set for a feature extraction module, and to train the feature extraction module with the training set;

wherein the feature extraction module comprises M articulatory attribute extractors, each corresponding to one articulatory attribute, and each articulatory attribute comprises several attribute classes; after a frame of the speech signal passes through each articulatory attribute extractor, a posterior probability is obtained that indicates the attribute class to which the frame belongs; the posterior probabilities of the Q different attribute classes of the M different articulatory attributes of the frame are arranged together to obtain the articulatory features, where Q denotes the total number of attribute classes contained in the M articulatory attributes.

9. The language identification system according to claim 8, characterized in that the articulatory attribute extractor is configured to convert each frame of the speech signal into articulatory features composed of the posterior probabilities of the Q different attribute classes.

10. The language identification system according to claim 6, characterized in that the number of nodes in the softmax layer of the time-delay neural network equals the number of languages to be classified, where softmax denotes multi-class classification;

the final output of each softmax-layer node is at the utterance level: each node sums the softmax-layer outputs of all frames in an utterance and takes the average as the final output of that node.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a language identification method and identification system.

Background

Language identification is the process of using a computer to automatically distinguish or confirm the language to which a speech segment belongs. An effective language identification system can be widely used in the front end of multilingual speech recognition systems and automatic translation systems. Many features can be used to distinguish languages, including acoustic features, prosodic features, phonotactic features, morphology, and syntactic features.

Existing language identification methods fall into two classes according to the features they use: (1) spectrum-based methods and (2) token-based methods. Spectrum-based methods exploit differences in how the spectral features of different languages are distributed in acoustic space. The current state-of-the-art language identification models are the total variability factor (i-vector) method and the x-vector method. An x-vector system contains a feed-forward deep neural network that maps variable-length speech segments to a fixed-length embedding layer; the feature vector extracted from that layer is called the x-vector. Both the i-vector and x-vector methods project acoustic spectral parameters into a language-dependent high-dimensional space and then identify the language. Token-based methods usually use phonotactic information, which describes how the phonemes of a language are arranged and combined. A well-known example of this approach combines a phoneme recognizer with a language model: the phoneme recognizer first converts the speech signal into a phoneme sequence, N-gram statistics are then extracted from the phoneme sequence as features, and a language model is finally built for each language from these statistics. The language models produce a language-dependent likelihood score for each test utterance, and the language is identified from these scores.
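The N-gram statistics step described above can be illustrated with a minimal Python sketch. The function name `ngram_counts` and the example phoneme sequence are assumptions made for illustration; they do not come from this document or from any particular phoneme recognizer.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in a sequence."""
    # A sequence shorter than n yields an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical phoneme sequence, standing in for phoneme recognizer output.
phonemes = ["n", "i", "h", "a", "o", "n", "i"]
bigrams = ngram_counts(phonemes, 2)
```

A token-based back end would build one language model per language from counts of this kind and score each test utterance against every model.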

Compared with token-based methods, spectrum-based methods are weaker at modeling temporal information. Token-based methods, in turn, cannot exploit the differences in acoustic feature distributions between languages the way spectrum-based methods can. In addition, the performance of a token-based language identification system depends heavily on the accuracy of the token recognizer, and training such a recognizer requires sufficient labeled data and a complete pronunciation dictionary, which are very difficult to obtain for the low-resource languages in language identification tasks.

In view of this, articulatory features (AFs) were introduced into the language identification task. Articulatory features represent the changes in the vocal tract caused by the articulators when a specific phoneme is produced. Different combinations of articulatory attributes can represent different phonemes. This means that articulatory features are finer-grained than phoneme features and more universal across languages, so their cross-language modeling ability is also stronger. Consequently, when the same single language or set of languages is used to recognize both phonemes and articulatory attributes, the recognition accuracy for articulatory attributes is higher, and the language identification system performs better as a result. However, in language identification methods based on articulatory features, once the articulatory features have been extracted, the back end mostly uses an N-gram language model to model the phonotactic information of the different languages. The drawback is that the back-end language model suffers from data sparsity: the number of N-grams grows as the token sequence is lengthened to capture more phonotactic context, which degrades the performance of the language identification system.

Summary of the Invention

The technical problem to be solved by the invention is to provide a language identification method and identification system that address the performance degradation of existing language identification systems whose back end models the phonotactic information of different languages with an N-gram language model.

To solve the above technical problem, an embodiment of the invention provides a language identification method, comprising:

converting each frame of a speech signal into articulatory features;

training a time-delay neural network with the articulatory features, wherein the articulatory features are input to the time-delay neural network, and the time-delay neural network learns from and classifies the input articulatory features to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; and

when performing language identification, inputting the articulatory features of the speech to be identified into the trained time-delay neural network, the output of the time-delay neural network being the similarity between the speech to be identified and each language-specific model, wherein the language-specific model with the highest similarity gives the language class of the speech to be identified.
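The decision rule in the last step can be sketched as follows. This is a minimal illustration that assumes the trained network's per-language similarities are already available as a dictionary; the function name `identify_language`, the language codes, and the scores are hypothetical.

```python
def identify_language(similarities):
    """Return the language whose model has the highest similarity score."""
    if not similarities:
        raise ValueError("no language scores given")
    return max(similarities, key=similarities.get)


# Hypothetical utterance-level similarities for three candidate languages.
scores = {"zh": 0.7, "en": 0.2, "ja": 0.1}
predicted = identify_language(scores)
```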

Further, before each frame of the speech signal is converted into articulatory features, the method further comprises:

determining articulatory attribute extractors that identify articulatory attributes from frame-level features.

Further, determining the articulatory attribute extractors that identify articulatory attributes from frame-level features comprises:

converting a phoneme-based training corpus into articulatory attribute labels according to a preset mapping between phonemes and articulatory attributes, to obtain a training set for a feature extraction module;

training the feature extraction module with the training set;

wherein the feature extraction module comprises M articulatory attribute extractors, each corresponding to one articulatory attribute, and each articulatory attribute comprises several attribute classes; after a frame of the speech signal passes through each articulatory attribute extractor, a posterior probability is obtained that indicates the attribute class to which the frame belongs; the posterior probabilities of the Q different attribute classes of the M different articulatory attributes of the frame are arranged together to obtain the articulatory features, where Q denotes the total number of attribute classes contained in the M articulatory attributes.
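The label conversion step can be sketched in Python as follows. The mapping table is hypothetical: this section does not list the preset phoneme-to-attribute mapping, so the attribute names ("manner", "place"), the class names, and the function `attribute_labels` are assumptions chosen only to show the shape of the conversion.

```python
# Hypothetical phoneme-to-attribute table; the actual preset mapping is not given here.
PHONEME_TO_ATTRIBUTES = {
    "p": {"manner": "stop", "place": "bilabial"},
    "s": {"manner": "fricative", "place": "alveolar"},
    "m": {"manner": "nasal", "place": "bilabial"},
}


def attribute_labels(frame_phonemes, attribute):
    """Convert frame-level phoneme labels into labels for one articulatory attribute."""
    return [PHONEME_TO_ATTRIBUTES[p][attribute] for p in frame_phonemes]


# One phoneme label per frame, converted into one training target per attribute.
frames = ["p", "p", "s", "m"]
manner_targets = attribute_labels(frames, "manner")
place_targets = attribute_labels(frames, "place")
```

Under this reading, each of the M extractors would be trained on the label sequence for its own attribute.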

Further, converting each frame of the speech signal into articulatory features comprises:

converting, by the M articulatory attribute extractors, each frame of the speech signal into articulatory features composed of the posterior probabilities of the Q different attribute classes.
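A minimal sketch of how one frame's feature vector could be assembled is shown below. It assumes that "composed of" means simple concatenation of each extractor's posterior vector, which is an interpretation rather than something the text states; the function name `articulatory_feature` and the numbers are illustrative.

```python
def articulatory_feature(posteriors_per_extractor):
    """Concatenate the posterior vectors of M extractors into one Q-dimensional feature."""
    feature = []
    for posteriors in posteriors_per_extractor:
        feature.extend(posteriors)
    return feature


# Illustrative frame: M = 2 extractors with 3 and 2 attribute classes, so Q = 5.
frame_posteriors = [[0.7, 0.2, 0.1], [0.4, 0.6]]
feature = articulatory_feature(frame_posteriors)
```

In this sketch Q is simply the sum of the class counts of the M attributes, matching the definition of Q given above.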

Further, the number of nodes in the softmax layer of the time-delay neural network equals the number of languages to be classified, where softmax denotes multi-class classification;

the final output of each softmax-layer node is at the utterance level: each node sums the softmax-layer outputs of all frames in an utterance and takes the average as the final output of that node.
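The utterance-level averaging can be sketched as follows, assuming the frame-level softmax outputs are available as a list with one row per frame and one column per language node. The function name `utterance_scores` and the example values are hypothetical.

```python
def utterance_scores(frame_softmax):
    """Average the frame-level softmax outputs of each node over an utterance."""
    num_frames = len(frame_softmax)
    if num_frames == 0:
        raise ValueError("utterance has no frames")
    num_languages = len(frame_softmax[0])
    return [
        sum(frame[k] for frame in frame_softmax) / num_frames
        for k in range(num_languages)
    ]


# Two frames, two language nodes.
scores = utterance_scores([[0.5, 0.5], [1.0, 0.0]])
```

The averaged vector has one entry per language, so the language with the largest entry can then be taken as the identification result.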

An embodiment of the invention also provides a language identification system, comprising:

an articulatory attribute extractor, configured to convert each frame of a speech signal into articulatory features; and

a time-delay neural network, configured to be trained with the articulatory features, wherein the articulatory features are input to the time-delay neural network, and the time-delay neural network learns from and classifies the input articulatory features to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; and further configured, when performing language identification, to take the articulatory features of the speech to be identified as input to the trained time-delay neural network, the output of the time-delay neural network being the similarity between the speech to be identified and each language-specific model, wherein the language-specific model with the highest similarity gives the language class of the speech to be identified.

Further, the system further comprises:

a determining module, configured to determine articulatory attribute extractors that identify articulatory attributes from frame-level features.

Further, the determining module is configured to convert a phoneme-based training corpus into articulatory attribute labels according to a preset mapping between phonemes and articulatory attributes, to obtain a training set for a feature extraction module, and to train the feature extraction module with the training set;

wherein the feature extraction module comprises M articulatory attribute extractors, each corresponding to one articulatory attribute, and each articulatory attribute comprises several attribute classes; after a frame of the speech signal passes through each articulatory attribute extractor, a posterior probability is obtained that indicates the attribute class to which the frame belongs; the posterior probabilities of the Q different attribute classes of the M different articulatory attributes of the frame are arranged together to obtain the articulatory features, where Q denotes the total number of attribute classes contained in the M articulatory attributes.

Further, the articulatory attribute extractor is configured to convert each frame of the speech signal into articulatory features composed of the posterior probabilities of the Q different attribute classes.

Further, the number of nodes in the softmax layer of the time-delay neural network equals the number of languages to be classified, where softmax denotes multi-class classification;

the final output of each softmax-layer node is at the utterance level: each node sums the softmax-layer outputs of all frames in an utterance and takes the average as the final output of that node.

The above technical solutions of the invention have the following beneficial effects:

In the above solution, each frame of a speech signal is converted into articulatory features; a time-delay neural network is trained with the articulatory features, wherein the articulatory features are input to the time-delay neural network, which learns from and classifies them to obtain the distribution of each language in the articulatory feature space, i.e. the language-specific model; when performing language identification, the articulatory features of the speech to be identified are input to the trained time-delay neural network, whose output is the similarity between the speech to be identified and each language-specific model, and the language-specific model with the highest similarity gives the language class of the speech to be identified. In this way, the cross-language nature of articulatory features is combined with the ability of the time-delay neural network to capture the contextual information of its input, which helps the language identification system better learn the discriminative information in the input articulatory features and improves its performance.
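One common way to picture how a time-delay layer sees context is frame splicing: for each time step, the frames at a fixed set of offsets are gathered into a single input vector, to which the layer's weights are then applied. The sketch below shows only the splicing step. The offsets, the clamping of indices at the utterance edges, and the function name `splice_context` are assumptions for illustration, not details taken from this document.

```python
def splice_context(frames, offsets):
    """For each time step t, concatenate the frames at t + offset for every offset."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        row = []
        for offset in offsets:
            # Assumption: indices outside the utterance are clamped to the edge frame.
            index = min(max(t + offset, 0), last)
            row.extend(frames[index])
        spliced.append(row)
    return spliced


# Three one-dimensional frames with one frame of left and right context.
context = splice_context([[1], [2], [3]], (-1, 0, 1))
```

Stacking several such layers widens the span of frames that influences each output, which is one way a network of this kind can model longer context without enumerating N-grams.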

Brief Description of the Drawings

Fig. 1 is a flowchart of the language identification method provided by an embodiment of the invention;

Fig. 2 is a detailed flowchart of the language identification method based on articulatory features and a time-delay neural network provided by an embodiment of the invention;

Fig. 3 is a structural diagram of the time-delay neural network provided by an embodiment of the invention;

Fig. 4 is a structural diagram of the language identification system provided by an embodiment of the invention.

Detailed Description

To make the technical problem to be solved, the technical solution, and the advantages of the invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.

To address the performance degradation of existing language identification systems whose back end models the phonotactic information of different languages with an N-gram language model, the invention provides a language identification method and identification system.
