Language identification method

Document No.: 1467472 · Publication date: 2020-02-21

Note: this technology, "A language identification method" (一种语种识别方法), was designed and created by 田剑豪, 龚晓峰, and 杨文� on 2019-10-21. Its main content is summarized below.

The invention discloses a language identification method, which comprises the following steps. Step 101: collect voice signals of multiple languages to obtain original voice data. Step 102: convert the format and rate of the voice signal data to unify the data format and sampling rate. Step 103: preprocess the voice signal data, generate a spectrogram database corresponding to a specified time length according to a predetermined regularization method, and divide the data into three parts, namely a training set, a validation set, and a test set, according to predetermined division thresholds. Step 104: train the constructed neural network over multiple iterations using the training and validation sets until the expected language classification accuracy is reached. Step 105: feed the test-set spectrogram data into the neural network to further verify the recognition performance for each language. The method identifies language categories with high accuracy, processes quickly, and is inexpensive to implement, and it can be used in many application scenarios such as friend-or-foe signal reconnaissance at borders, automatic translation, and entry inspection.

1. A language identification method, comprising:

step 101: collecting voice signals of multiple languages to obtain original voice data;

step 102: converting the format and rate of the voice signal data to unify the data format and sampling rate;

step 103: preprocessing the voice signal data, generating a spectrogram database corresponding to the specified time length according to a predetermined regularization method, and dividing the data into three parts, namely a training set, a validation set, and a test set, according to a predetermined division threshold;

step 104: performing multiple training iterations on the constructed neural network using the training set and the validation set until the expected language classification accuracy is reached;

step 105: importing the test-set spectrogram data into the neural network and further verifying the recognition performance for each language.

2. The language identification method of claim 1, wherein in step 102, the unified data format is the WAV format, the sampling rate is 22050 Hz, and the bit width of each data sample is 16 bits.

3. The language identification method of claim 2, wherein in step 103, the training speech data in the training stage is segmented into a plurality of speech segments according to a predetermined truncation length len_wav_section.

4. The language identification method of claim 3, wherein in step 103, each speech segment is divided into 200 sub-segments, each of length 1024, with a 50% overlap between adjacent sub-segments, after which the data of each sub-segment is windowed.

5. The language identification method of claim 4, wherein in step 103, an FFT is applied to each sub-segment to obtain its corresponding frequency-domain data, and the magnitude spectrum is obtained by taking the modulus.

6. The language identification method of claim 5, wherein in step 103, the magnitude spectra of the sub-segments of the same speech segment are spliced, in order, into a two-dimensional matrix in which each column holds the magnitude spectrum of one sub-segment, finally generating the spectrogram matrix of that speech segment; the same operation is applied to every speech segment.

7. The language identification method of claim 6, wherein in step 104, the neural network is trained iteratively with a stochastic gradient optimization algorithm according to preset neural network parameters, and the accuracy is tracked in real time.

8. The language identification method of claim 7, wherein in step 104, the number of pictures trained per iteration is 10, and the validation set is used to test the language classification accuracy once every 200 training iterations.

9. The language identification method of claim 8, wherein in step 104, the convolutional neural network is constructed as follows: convolutional layer 1 has 16 convolution kernels of size 5 × 5, convolutional layer 2 has 32 convolution kernels of size 3 × 3, convolutional layer 3 has 32 convolution kernels of size 3 × 3, and convolutional layer 4 has 32 convolution kernels of size 1 × 1.

10. The language identification method of claim 9, wherein in step 104, a ReLU function is used in all non-linear layers.

Technical Field

The invention relates to the field of voice signal processing and pattern recognition, and in particular to a language identification method.

Background

With the continuing advance of the mobile internet and of communication and information technology, the cost of collecting, recording, and storing voice keeps falling, and the means of doing so are plentiful. However, low-cost, universally deployed automatic speech recognition, automatic translation between languages, and speaker identification have not yet been achieved, and the same is true of automatic recognition and classification of multiple languages.

In international exchanges, the language of a speaker must often be determined automatically and quickly; this is also a precondition for fast, automatic translation between languages. Border entry inspection faces the problem of identifying, in real time, the language spoken by a suspicious entrant. International criminal investigations present many scenarios in which a suspect's language must be determined quickly. In military reconnaissance along national borders, intercepted voice signals must be monitored, analyzed, and classified by language in real time. At present, all of these tasks are performed by manual listening and analysis, relying on specialists who command several languages. This manual approach is time-consuming and labor-intensive, and its labor cost is high. More limiting still, suitable language specialists matched to the actual environment often cannot be found at all; such specialists are rare, and the languages any one individual masters are very limited. It is therefore necessary to identify the language automatically from recorded speech signal data, using machine learning methods on an ordinary computer.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a language identification method.

The technical solution of the invention is as follows. A language identification method comprises the following steps:

step 101: collecting voice signals of multiple languages to obtain original voice data;

step 102: converting the format and rate of the voice signal data to unify the data format and sampling rate;

step 103: preprocessing the voice signal data, generating a spectrogram database corresponding to the specified time length according to a predetermined regularization method, and dividing the data into three parts, namely a training set, a validation set, and a test set, according to a predetermined division threshold;

step 104: performing multiple training iterations on the constructed neural network using the training set and the validation set until the expected language classification accuracy is reached;

step 105: importing the test-set spectrogram data into the neural network and further verifying the recognition performance for each language.

In some embodiments, in step 102, the unified data format is the WAV format, the sampling rate is 22050 Hz, and the bit width of each data sample is 16 bits.

In some embodiments, in step 103, the training speech data in the training phase is segmented into speech segments according to a predetermined truncation length len_wav_section.

In some embodiments, in step 103, each speech segment is divided into 200 sub-segments, each of length 1024, with a 50% overlap between adjacent sub-segments, after which the data of each sub-segment is windowed.

In some embodiments, in step 103, an FFT is applied to each sub-segment to obtain its corresponding frequency-domain data, and the magnitude spectrum is obtained by taking the modulus.

In some embodiments, in step 103, the magnitude spectra of the sub-segments of the same voice segment are spliced, in order, into a two-dimensional matrix in which each column holds the magnitude spectrum of one sub-segment, finally generating the spectrogram matrix of that voice segment; the same operation is applied to every voice segment.

In some embodiments, in step 104, the neural network is trained iteratively with a stochastic gradient optimization algorithm according to preset neural network parameters, and the accuracy is tracked in real time.

In some embodiments, in step 104, the number of pictures trained per iteration is 10, and the validation set is used to test the language classification accuracy once every 200 training iterations.

In some embodiments, in step 104, the convolutional neural network is constructed as follows: convolutional layer 1 has 16 convolution kernels of size 5 × 5, convolutional layer 2 has 32 convolution kernels of size 3 × 3, convolutional layer 3 has 32 convolution kernels of size 3 × 3, and convolutional layer 4 has 32 convolution kernels of size 1 × 1.

In some embodiments, in step 104, all nonlinear layers use the ReLU function.

The advantage of this scheme is that, through voice signal processing and convolutional neural network technology, the recognition speed reaches 10 spectrograms processed and recognized per second, and the entire process of preprocessing and recognizing a section of speech completes within 1 second on an ordinary computer (a third-generation Intel Core i5-3450 CPU from 2013 with 4 GB of DDR3 memory). Language identification of a section of speech is thus achieved in a way that fully meets practical requirements: recognition accuracy is high, processing is fast, and implementation cost is low. Moreover, the method does not depend on features peculiar to specific languages, and it can be used in many application scenarios such as friend-or-foe signal reconnaissance at borders, automatic translation, and entry inspection.

Drawings

FIG. 1 is a flow chart of language identification processing of the present invention.

FIG. 2 is a flow chart of an implementation of the pre-processing of voice data samples of the present invention.

FIG. 3 is a flow chart of an implementation of the sample-expanding preprocessing of voice data according to the present invention.

FIG. 4 is a flow chart of a process implementation of the neural network training phase of the present invention.

FIG. 5 is a diagram of a language identification neural network according to the present invention.

FIG. 6 is a schematic diagram of a time-frequency two-dimensional spectrogram according to the present invention.

Detailed Description

The invention is further described with reference to the following figures and embodiments:

the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

As shown in FIG. 1, a language identification method includes:

step 101: collecting voice signals of multiple languages to obtain original voice data;

step 102: converting the format and rate of the voice signal data to unify the data format and sampling rate;

step 103: preprocessing the voice signal data, generating a spectrogram database corresponding to the specified time length according to a predetermined regularization method, and dividing the data into three parts, namely a training set, a validation set, and a test set, according to a predetermined division threshold;

step 104: performing multiple training iterations on the constructed neural network using the training set and the validation set until the expected language classification accuracy is reached;

step 105: importing the test-set spectrogram data into the neural network and further verifying the recognition performance for each language.

In some embodiments, in step 102, the unified data format is the WAV format, the sampling rate is 22050 Hz, and the bit width of each data sample is 16 bits.

In some embodiments, in step 103, the training speech data in the training phase is segmented into speech segments according to a predetermined truncation length len_wav_section.

In some embodiments, in step 103, each speech segment is divided into 200 sub-segments, each of length 1024, with a 50% overlap between adjacent sub-segments, after which the data of each sub-segment is windowed.

In some embodiments, in step 103, an FFT is applied to each sub-segment to obtain its corresponding frequency-domain data, and the magnitude spectrum is obtained by taking the modulus.

In some embodiments, in step 103, the magnitude spectra of the sub-segments of the same voice segment are spliced, in order, into a two-dimensional matrix in which each column holds the magnitude spectrum of one sub-segment, finally generating the spectrogram matrix of that voice segment; the same operation is applied to every voice segment.

In some embodiments, in step 104, the neural network is trained iteratively with a stochastic gradient optimization algorithm according to preset neural network parameters, and the accuracy is tracked in real time.

In some embodiments, in step 104, the number of pictures trained per iteration is 10, and the validation set is used to test the language classification accuracy once every 200 training iterations.

In some embodiments, in step 104, the convolutional neural network is constructed as follows: convolutional layer 1 has 16 convolution kernels of size 5 × 5, convolutional layer 2 has 32 convolution kernels of size 3 × 3, convolutional layer 3 has 32 convolution kernels of size 3 × 3, and convolutional layer 4 has 32 convolution kernels of size 1 × 1.

In some embodiments, in step 104, all nonlinear layers use the ReLU function.

101. Multi-language voice data collection: voices from multiple sound sources and multiple scenes, spoken by many different speakers, are collected as training samples, with the total recording length for each language being no less than 5 hours. The 13 languages include Chinese, Russian, Japanese, Hindi, Vietnamese, French, English, German, Italian, Spanish, Arabic, and Korean. The set of languages can be extended continuously; the design only requires adjusting the output vector dimension of the fully connected layer at the output end of the convolutional neural network so that it equals the number of language categories.

When acquiring multi-language voice data, voices from as many different individuals as possible should be collected; collection scenes include, but are not limited to, live speakers, car radios, television broadcasts, internet radio stations, and film and television works. A larger number of speakers and richer collection scenes and types improve the generalization ability of the language identification system and the recognition accuracy on unfamiliar voice samples.

102. Unifying the sample data format: during the recording and collection stage, the format of the recorded voice need not be a concern. After collection of the multilingual voice data is finished, the voice sample data must undergo the necessary format and rate conversion. Using mature audio file conversion software, the data can be uniformly converted to and stored in the WAV format, with a uniform sampling rate of 22050 Hz and 16-bit samples.
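
A minimal sketch of this conversion step in Python, assuming the librosa and soundfile libraries (the patent does not name any tooling):

    import librosa
    import soundfile as sf

    def convert_to_wav(src_path: str, dst_path: str) -> None:
        # Decode any supported audio format, resampling to 22050 Hz mono.
        audio, sr = librosa.load(src_path, sr=22050, mono=True)
        # Store as 16-bit PCM WAV, matching the unified sample format.
        sf.write(dst_path, audio, sr, subtype="PCM_16")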

103. Sample data preprocessing: the voice signal data undergoes identical preprocessing and is divided into voice segments of roughly 4.6 seconds each, from which spectrogram pictures of the specified duration are generated; the spectrogram contains rich time-frequency information about the voice signal, as shown in FIG. 6. After the speech in each language has been segmented and its spectrogram pictures generated, the data is divided into a training set, a validation set, and a test set in the ratio 7:2:1. The training set trains the network parameters of the neural network, the validation set monitors the recognition rate during training and guides hyperparameter adjustment, and the test set provides the final verification of recognition accuracy.
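
As an illustration, the 7:2:1 division might be sketched as follows, assuming each spectrogram is stored as one image file; the function name and fixed seed are illustrative, not from the patent:

    import random

    def split_7_2_1(files, seed=0):
        # Shuffle once so the three subsets are drawn uniformly at random.
        files = list(files)
        random.Random(seed).shuffle(files)
        n_train = int(0.7 * len(files))
        n_val = int(0.2 * len(files))
        train = files[:n_train]
        val = files[n_train:n_train + n_val]
        test = files[n_train + n_val:]
        return train, val, test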

104. Constructing and training the neural network: a convolutional neural network is built, training parameters are configured, and the network is trained over repeated iterations on the training set samples until the expected language classification accuracy is reached. The expected effect is usually achieved within 60 epochs, with a recognition accuracy of no less than 95% on the validation set.

105. Based on the trained neural network, the test-set samples are classified, further verifying the recognition performance for each language.

FIG. 2 is a flowchart of the sample data preprocessing implementation; the details are described below with reference to FIG. 2:

201. Read the WAV voice file data and confirm that its sampling rate is 22050 Hz and that the bit width of the sampled data is 16 bits; files that do not match are flagged and discarded.

202. Normalize the voice data waveform: first find the maximum amplitude Audio_max_value of the voice recording, then divide all of the waveform data uniformly by this value.

203. Divide the voice waveform into a number of voice segments according to the predetermined truncation length len_wav_section, with a 50% overlap between adjacent segments. The truncation length len_wav_section is 102912, i.e., each voice segment holds 102912 data samples, corresponding in time to about 4.6 seconds of speech.
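
A NumPy sketch of steps 202 and 203; the helper name is illustrative:

    import numpy as np

    LEN_WAV_SECTION = 102912    # samples per segment, about 4.6 s at 22050 Hz
    HOP = LEN_WAV_SECTION // 2  # 50% overlap between adjacent segments

    def normalize_and_segment(wave: np.ndarray) -> list:
        # Step 202: divide the whole waveform by its maximum amplitude.
        audio_max_value = np.max(np.abs(wave))
        wave = wave / audio_max_value
        # Step 203: cut into overlapping segments of len_wav_section samples.
        segments = []
        for start in range(0, len(wave) - LEN_WAV_SECTION + 1, HOP):
            segments.append(wave[start:start + LEN_WAV_SECTION])
        return segments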

204. Each voice segment is divided into 200 sub-segments of length 1024 with a 50% overlap between adjacent sub-segments (200 × 1024 - 199 × 512 = 102912 samples, exactly one voice segment). The data of each sub-segment is then windowed; a Hamming window is chosen here. This is the conventional step preceding the short-time Fourier transform in non-stationary signal processing.

205. Apply an FFT to each sub-segment (1024 sample points) to obtain its frequency-domain data, then take the modulus to obtain the magnitude spectrum of the sub-segment.

206. Splice the magnitude spectra of the sub-segments of the same voice segment, in order, into a two-dimensional matrix in which each column holds the magnitude spectrum of one sub-segment, finally generating the spectrogram matrix of that voice segment. Every voice segment is processed in the same way.
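
Steps 204 to 206 amount to a short-time Fourier transform; a minimal NumPy sketch under the stated parameters (not the patent's own code):

    import numpy as np

    N_FFT = 1024     # sub-segment length
    HOP_SUB = 512    # 50% overlap; 200 sub-segments then cover 102912 samples

    def segment_to_spectrogram(segment: np.ndarray) -> np.ndarray:
        window = np.hamming(N_FFT)                   # step 204: Hamming window
        columns = []
        for k in range(200):
            sub = segment[k * HOP_SUB : k * HOP_SUB + N_FFT]
            spectrum = np.fft.fft(sub * window)      # step 205: FFT
            columns.append(np.abs(spectrum))         # modulus -> magnitude spectrum
        # Step 206: one column per sub-segment, giving a 1024 x 200 matrix.
        return np.stack(columns, axis=1)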

207. Map each spectrogram matrix through a color space to generate the corresponding RGB three-color spectrogram image.

208. Rescale each spectrogram image to a size of 621 × 521.

209. Label each spectrogram image and store it as an image file in the BMP file format.
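
Steps 207 to 209 might look as follows; the log compression and the choice of colormap are assumptions (the patent specifies neither), and the patent gives 621 × 521 without stating which value is the width:

    import numpy as np
    from matplotlib import cm
    from PIL import Image

    def spectrogram_to_bmp(spec: np.ndarray, out_path: str) -> None:
        # Step 207: compress and scale magnitudes to [0, 1], then map to RGB.
        spec = np.log1p(spec)                                 # assumed compression
        spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-12)
        rgb = (cm.jet(spec)[..., :3] * 255).astype(np.uint8)  # assumed colormap
        # Step 208: rescale to the network input size of 621 x 521.
        img = Image.fromarray(rgb).resize((521, 621))         # PIL takes (w, h)
        # Step 209: store the labeled image in BMP format.
        img.save(out_path, format="BMP")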

FIG. 3 is a flowchart of the preprocessing procedure for expanding the number of voice data samples. It matches the flow of FIG. 2 except that a noise-adding unit, step 303, is inserted for the voice data samples. Multiple voice data samples can be obtained by adding white noise at different signal-to-noise ratios; adding noise at SNRs of 16 dB, 14 dB, 12 dB, and 10 dB, for example, makes the sample count 5 times the original. Note that, besides noise processing of the voice waveform data, the sample count can also be multiplied by operating on the final spectrogram images themselves, for instance by shifting the image horizontally by a few pixels or by adding noise to the image.
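
A sketch of the SNR-controlled noise addition of step 303 (NumPy; the stand-in waveform is only for illustration):

    import numpy as np

    def add_white_noise(wave: np.ndarray, snr_db: float) -> np.ndarray:
        # Scale white Gaussian noise so the mixture has the requested SNR.
        signal_power = np.mean(wave ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
        return wave + noise

    wave = np.random.randn(102912)  # stand-in for one real voice segment
    # Four noisy copies plus the original: 5x the original sample count.
    augmented = [wave] + [add_white_noise(wave, snr) for snr in (16, 14, 12, 10)]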

FIG. 4 is a block diagram of the processing flow in the neural network training phase, described in detail below with reference to FIG. 4:

401. The training set and the validation set are each shuffled independently, so that the pictures within the Batch groups of each epoch are not exactly identical from epoch to epoch. One epoch trains on all sample data in the training set.

402. Set the hyperparameters of the neural network, including the initialization weights and parameter values such as MiniBatchSize, MaxEpoch, InitLearnRate, and ValidationFreq. Specifically, the batch size MiniBatchSize is set to 10, the maximum number of epochs MaxEpoch to 100, the learning rate InitLearnRate to 0.0001, and the validation frequency ValidationFreq to 200.

403. Construct the convolutional neural network; its connection structure is shown in FIG. 5.

404. Train the neural network iteratively with a stochastic gradient optimization algorithm according to the preset parameters, tracking the accuracy in real time. The number of pictures trained per iteration is 10, and the validation set is used to test the language classification accuracy once every 200 training iterations. The validation set is used only to confirm the recognition rate and does not take part in the training computation itself.
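
The parameter names in step 402 match MATLAB's Deep Learning Toolbox; an equivalent loop can nevertheless be sketched in PyTorch. The momentum value is an assumption, and the function names are illustrative:

    import torch
    from torch import nn, optim

    def train(model, train_loader, val_loader, device="cpu"):
        # Step 402: batches of 10, up to 100 epochs, learning rate 1e-4,
        # validation every 200 iterations.
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
        iteration = 0
        for epoch in range(100):
            for images, labels in train_loader:
                model.train()
                optimizer.zero_grad()
                loss = criterion(model(images.to(device)), labels.to(device))
                loss.backward()
                optimizer.step()
                iteration += 1
                if iteration % 200 == 0:  # periodic validation (step 404)
                    acc = evaluate(model, val_loader, device)
                    print(f"iteration {iteration}: validation accuracy {acc:.3f}")

    def evaluate(model, loader, device="cpu"):
        # The validation set is only scored; no gradients are computed.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in loader:
                pred = model(images.to(device)).argmax(dim=1)
                correct += (pred == labels.to(device)).sum().item()
                total += labels.numel()
        return correct / total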

405. Stop training when the validation-set accuracy reaches the expected value or the number of epochs reaches the specified limit.

406. Save the resulting data, such as the network and the training parameters.

FIG. 5 is a schematic structural diagram of the language identification neural network, described in detail below with reference to FIG. 5:

501. The input layer takes an RGB color image with dimensions 621 × 521 × 3.

502. Convolutional layer 1 has 16 convolution kernels of size 5 × 5 and applies image edge padding so that the output image size remains unchanged.

503. Normalization layer 1 operates as follows.

An input mini-batch contains $m$ samples, i.e. $\{x_1, x_2, \ldots, x_m\}$, and the layer has two parameters to be learned, $\gamma$ and $\beta$.

Mean of the input sample set:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$

Variance of the input sample set:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2$$

Each sample is normalized:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

The output of the normalization layer is:

$$y_i = \gamma \hat{x}_i + \beta$$

The corresponding output sample set is $\{y_1, y_2, \ldots, y_m\}$.
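
A scalar NumPy sketch of the four equations above; the small constant eps follows standard batch normalization practice:

    import numpy as np

    def batch_norm(x: np.ndarray, gamma: float, beta: float, eps: float = 1e-5):
        # x holds the m samples {x_1, ..., x_m} of one mini-batch.
        mu = x.mean()                          # batch mean
        var = x.var()                          # batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each sample
        return gamma * x_hat + beta            # scale and shift (learned)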

504. Nonlinear layer 1 performs nonlinear processing with the ReLU (Rectified Linear Unit) activation function:

$$f(x) = \max(0, x)$$

505. Convolutional layer 2 has 32 convolution kernels, each of size 3 × 3, applies image edge padding, and uses a convolution stride of 2.

506. The normalization layer 2 is processed in the same manner as 503.

507. The nonlinear layer 2 is processed in the same manner as 504.

508. Convolutional layer 3 has 32 convolution kernels, each of size 3 × 3, and applies image edge padding.

509. The normalization layer 3 is processed in the same manner as 503.

510. The nonlinear layer 3 is processed in the same manner as 504.

511. Convolutional layer 4 has 32 convolution kernels, each of size 1 × 1, with a convolution stride of 2.

512. Addition layer: the output results of steps 511 and 510 are added together.

513. Pooling layer: arithmetic mean (average) pooling of size 2 × 2 with a stride of 2.

514. Fully connected layer with an output dimension of 2.

515. Softmax layer. The Softmax function is:

$$S_i = \frac{e^{V_i}}{\sum_{j} e^{V_j}}$$

where $V_i$ is the output of unit $i$ of the preceding layer and the index runs over the categories, here the language classes. $S_i$ is the corresponding output value; the function converts the raw class outputs into relative probabilities whose sum over all classes is 1.

516. The output layer gives the classification result.
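
Putting 501 through 516 together, a PyTorch sketch of the FIG. 5 structure follows. The patent does not state which layer feeds convolutional layer 4; the sketch assumes a shortcut from the output of nonlinear layer 1, the only choice under which the spatial sizes match at the addition layer. The forward pass returns logits, since a cross-entropy loss applies log-softmax itself; softmax is applied at inference to obtain the layer-515 probabilities.

    import torch
    from torch import nn

    class LanguageNet(nn.Module):
        def __init__(self, num_languages: int = 2):  # 514: output dimension 2
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, 5, padding=2)             # 502
            self.bn1 = nn.BatchNorm2d(16)                           # 503
            self.conv2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # 505
            self.bn2 = nn.BatchNorm2d(32)                           # 506
            self.conv3 = nn.Conv2d(32, 32, 3, padding=1)            # 508
            self.bn3 = nn.BatchNorm2d(32)                           # 509
            self.conv4 = nn.Conv2d(16, 32, 1, stride=2)             # 511 (assumed input)
            self.pool = nn.AvgPool2d(2, stride=2)                   # 513
            self.fc = nn.Linear(32 * 155 * 130, num_languages)      # 514
            self.relu = nn.ReLU()

        def forward(self, x):                         # x: (N, 3, 621, 521), step 501
            x1 = self.relu(self.bn1(self.conv1(x)))   # 502-504
            y = self.relu(self.bn2(self.conv2(x1)))   # 505-507
            y = self.relu(self.bn3(self.conv3(y)))    # 508-510
            y = y + self.conv4(x1)                    # 512: addition layer
            y = self.pool(y)                          # 513
            return self.fc(torch.flatten(y, 1))       # 514: logits

    model = LanguageNet(num_languages=13)  # one output per language (see step 101)
    probs = torch.softmax(model(torch.randn(1, 3, 621, 521)), dim=1)  # 515-516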

Compared with the prior art, the advantage of this scheme is that, through voice signal processing and convolutional neural network technology, the recognition speed reaches 10 spectrograms processed and recognized per second, and the entire process of preprocessing and recognizing a section of speech completes within 1 second; language identification of a section of speech is thus achieved in a way that fully meets practical requirements, with high recognition accuracy, fast processing, and low implementation cost. Moreover, the method does not depend on features peculiar to specific languages, and it can be used in many application scenarios such as friend-or-foe signal reconnaissance at borders, automatic translation, and entry inspection.

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

The invention has been described above by way of example; evidently, its implementation is not limited to the manner described. Various modifications of the inventive method concept and technical solution, or direct application of the concept and solution to other occasions without modification, all fall within the protection scope of the invention.
