Method for realizing language identification based on deep learning

Document No.: 1073214 | Published: 2020-10-16

Summary: This technique, a method for realizing language identification based on deep learning, was designed by 黄诗雅, 罗睦军 and 邓从健 on 2020-06-03. The invention discloses a method for realizing language identification based on deep learning, comprising: after a call recording file is obtained, generating a language text data set through the Alibaba Cloud (Aliyun) ASR and language identification interface; performing language text noise reduction on the returned recognition results; sampling the language texts under each category, judging the language of each category, and thereby completing the training corpus; mapping the words of the training corpus to indices to build a vocabulary-index mapping table, building a label-index mapping table for the language labels, reading word vectors from a pre-trained word vector model and feeding them to the model as initial values, converting the language texts and language labels to index form through the mapping tables, padding them to a fixed length, and submitting them to a deep learning classifier for training; and using the deep learning classifier to analyze the language text under test and find the language category with the highest probability. The invention reduces the burden of manual re-listening, saves manpower, and is efficient, automated and accurate.

1. A method for realizing language identification based on deep learning, characterized by comprising the following steps:

A) after a call recording file is obtained, generating a language text data set through the Alibaba Cloud (Aliyun) ASR and language identification interface;

B) performing language text noise reduction on the returned recognition results;

C) sampling the language texts under each category manually for inspection, and judging the language of each category, thereby completing the training corpus;

D) mapping the words of the training corpus to index representations to construct a vocabulary-index mapping table, constructing a label-index mapping table for the language labels, reading word vectors from a pre-trained word vector model and inputting them into the word vector model as initialization values, and finally converting the language texts and language labels to index representations through the vocabulary-index mapping table and the label-index mapping table, padding the index representations to a fixed length, and submitting them to a deep learning classifier for training;

E) analyzing the language text under test with the deep learning classifier and finding the language category with the highest probability.

2. The method for realizing language identification based on deep learning according to claim 1, wherein the step B) further comprises:

B1) screening the language texts whose language identification accuracy is higher than a set value, removing through conditional checks the misidentified language texts that fall outside the business domain, and keeping only the language texts with high identification accuracy;

B2) performing word segmentation on the language text, matching each word against a stop word list, and filtering out the stop words.

3. The method for realizing language identification based on deep learning according to claim 2, wherein the step D) further comprises:

D1) reading the training corpus into memory and performing word segmentation on each document;

D2) calculating the frequency of each word in the documents, filtering out words whose frequency is below the lowest threshold or above the highest threshold, mapping the remaining distinct words to index representations, thereby constructing the vocabulary-index mapping table, and constructing a label-index mapping table for all distinct language labels;

D3) using the word2vec word vector model open-sourced by Tencent AI, reading the word vectors corresponding to the vocabulary-index mapping table as initial values of the word2vec word vector model;

D4) converting the words of each document to indices through the vocabulary-index mapping table, applying fixed-length processing because document lengths differ, truncating documents longer than the highest threshold and padding documents shorter than the lowest threshold with <PAD>, and storing the vocabulary-index mapping table and the word vectors in a configuration file.

4. The method for realizing language identification based on deep learning of claim 3, wherein the deep learning classifier is a TEXTCNN text classifier.

5. The method for realizing language identification based on deep learning of claim 4, wherein the TEXTCNN text classifier is used to predict the class of the language text, and the language text data is converted into a text sequence of fixed length and then fed into a CNN network structure for training.

6. The method for realizing language identification based on deep learning of claim 5, wherein the CNN network structure is composed of an input layer, a convolutional layer, a pooling layer and a fully connected layer.

7. The method for realizing language identification based on deep learning of claim 6, wherein a text sequence c of fixed length n is input to the input layer, n being an integer with n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and a sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, wherein the word vectors $x_i$ are taken from a pre-trained word2vec model as the input of the input layer and are not fine-tuned during model training.

8. The method for realizing language identification based on deep learning of claim 7, wherein the convolutional layer uses m convolution kernels of different sizes, m being an integer with m ≥ 1; the height h of a convolution kernel is the window size and takes values from 2 to 8; the width of a convolution kernel equals the word vector dimension k, so that the kernel is $\omega \in \mathbb{R}^{hk}$; each sliding-window result is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and f is a non-linear function; one pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.

9. The method for realizing language identification based on deep learning of claim 8, wherein a max-pooling layer (Max-pool) is adopted.

10. The method for realizing language identification based on deep learning of claim 9, wherein one fully connected layer is used, $y = \omega \cdot z + b$; that is, the extracted features z are input into an LR classifier for classification.

Technical Field

The invention relates to the field of telecommunication, in particular to a method for realizing language identification based on deep learning.

Background

At present, language data for the customer service hotline is lacking, even though user attributes such as language type, address information and service requirements can be mined from call content. Subsequent service analysis needs to start from these basic user attributes to track how the service requirements of each user group change, to improve the complaint monitoring system for each user group, and to provide solid data support for refined user operation and maintenance. Without a basic user attribute indicator (the language used), recording data has to be labeled manually. However, because millions of users call the customer hotline every day to ask about services, a telecom operator that wants to classify hotline calls by language would need staff to re-listen to and label call recordings daily, and doing so purely by hand consumes a great deal of manpower and time.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method for realizing language identification based on deep learning that reduces the burden of manual re-listening, saves manpower, and is efficient, automated and accurate.

The technical scheme adopted by the invention for solving the technical problems is as follows: a method for realizing language identification based on deep learning is constructed, and comprises the following steps:

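A) after a call recording file is obtained, generating a language text data set through the Alibaba Cloud (Aliyun) ASR and language identification interface;

B) performing language text noise reduction on the returned recognition results;

C) sampling the language texts under each category manually for inspection, and judging the language of each category, thereby completing the training corpus;

D) mapping the words of the training corpus to index representations to construct a vocabulary-index mapping table, constructing a label-index mapping table for the language labels, reading word vectors from a pre-trained word vector model and inputting them into the word vector model as initialization values, and finally converting the language texts and language labels to index representations through the vocabulary-index mapping table and the label-index mapping table, padding the index representations to a fixed length, and submitting them to a deep learning classifier for training;

E) analyzing the language text under test with the deep learning classifier and finding the language category with the highest probability.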

In the method for implementing language identification based on deep learning according to the present invention, the step B) further includes:

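B1) screening the language texts whose language identification accuracy is higher than a set value, removing through conditional checks the misidentified language texts that fall outside the business domain, and keeping only the language texts with high identification accuracy;

B2) performing word segmentation on the language text, matching each word against a stop word list, and filtering out the stop words.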

In the method for implementing language identification based on deep learning according to the present invention, the step D) further includes:

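D1) reading the training corpus into memory and performing word segmentation on each document;

D2) calculating the frequency of each word in the documents, filtering out words whose frequency is below the lowest threshold or above the highest threshold, mapping the remaining distinct words to index representations, thereby constructing the vocabulary-index mapping table, and constructing a label-index mapping table for all distinct language labels;

D3) using the word2vec word vector model open-sourced by Tencent AI, reading the word vectors corresponding to the vocabulary-index mapping table as initial values of the word2vec word vector model;

D4) converting the words of each document to indices through the vocabulary-index mapping table, applying fixed-length processing because document lengths differ, truncating documents longer than the highest threshold and padding documents shorter than the lowest threshold with <PAD>, and storing the vocabulary-index mapping table and the word vectors in a configuration file.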
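Step D3 can be illustrated with a minimal sketch. It assumes the pre-trained vectors are available as a text file in the common word2vec text layout (a header line with the word count and dimension, then one word followed by its components per line). The function name `load_embedding_matrix`, the all-zero fallback for words missing from the file, and the toy data are illustrative assumptions and are not specified by the patent.

```python
import io

def load_embedding_matrix(lines, word_to_index, dim):
    """Build an embedding matrix (list of rows) aligned with word_to_index.

    Rows for words missing from the pre-trained file stay all-zero
    (an assumption; the patent does not say how such words are handled).
    """
    matrix = [[0.0] * dim for _ in range(len(word_to_index))]
    for line_no, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            continue  # skip the "<count> <dim>" header line
        word, values = parts[0], parts[1:]
        if word in word_to_index and len(values) == dim:
            matrix[word_to_index[word]] = [float(v) for v in values]
    return matrix

# Toy stand-in for a pre-trained vector file (hypothetical data).
pretrained = io.StringIO("3 2\n你好 0.1 0.2\n谢谢 0.3 0.4\nhello 0.5 0.6\n")
vocab = {"<PAD>": 0, "你好": 1, "hello": 2}
emb = load_embedding_matrix(pretrained, vocab, dim=2)
print(emb)
```

Only the words present in the vocabulary-index mapping table are kept, so the matrix stays small even when the pre-trained file is very large.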

In the method for realizing language identification based on deep learning, the deep learning classifier is a TEXTCNN text classifier.

In the method for realizing language identification based on deep learning, the TEXTCNN text classifier is used to predict the class of the language text; the language text data is converted into a text sequence of fixed length and then fed into a CNN network structure for training.

In the method for realizing language identification based on deep learning, the CNN network structure is composed of an input layer, a convolutional layer, a pooling layer and a fully connected layer.

In the method for realizing language identification based on deep learning, a text sequence c of fixed length n is input to the input layer, where n is an integer with n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and the sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$.

The word vectors $x_i$ are taken from a pre-trained word2vec model as the input of the input layer and are not fine-tuned during model training.

In the method for realizing language identification based on deep learning, the convolutional layer uses m convolution kernels of different sizes, where m is an integer with m ≥ 1. The height h of a convolution kernel is the window size and takes values from 2 to 8; the width of a convolution kernel equals the word vector dimension k, so that the kernel is $\omega \in \mathbb{R}^{hk}$. Each sliding-window result is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and f is a non-linear function. One pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.
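As a quick check of the window count, consider an illustrative sentence length n = 10 and kernel height h = 3 (both values are assumptions chosen for the example, not taken from the patent):

```latex
n - h + 1 = 10 - 3 + 1 = 8, \qquad c = [c_1, c_2, \ldots, c_8]
```

For the largest stated height, h = 8, the same sentence yields 10 - 8 + 1 = 3 window positions, so taller kernels produce shorter feature maps.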

In the method for realizing language identification based on deep learning, a max-pooling layer (Max-pool) is adopted, namely $\hat{c} = \max\{c\}$. With m convolution kernels, the pooled data is $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$.

Each pooling operation yields the global maximum of its feature map.

In the method for realizing language identification based on deep learning, one fully connected layer is used, $y = \omega \cdot z + b$; that is, the extracted features z are input into an LR classifier for classification.

The method for realizing language identification based on deep learning has the following beneficial effects. First, after the call recording files of a telecom operator's users are obtained, the recordings are transcribed and their language identified through the Aliyun speech recognition interface, and noise reduction retains only the texts with high recognition accuracy, which completes the training corpus. A network structure is then built to model and train on the corpus through deep learning, and finally the trained feature model automatically identifies the language of the operator's daily call recordings. The invention reduces the burden of manual re-listening, saves manpower, and is efficient, automated and accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating an embodiment of a method for implementing language identification based on deep learning according to the present invention;

FIG. 2 is a flow chart of a method for implementing language identification by deep learning in the embodiment;

FIG. 3 is a detailed flowchart of performing language text noise reduction on the returned recognition results in the embodiment;

FIG. 4 is a detailed flowchart of the generation of the word vector model in the embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the embodiment of the method for implementing language identification based on deep learning, a flow chart of the method for implementing language identification based on deep learning is shown in fig. 1. Fig. 2 is a flow chart of a method for implementing language identification by deep learning in this embodiment. In fig. 1, the method for implementing language identification based on deep learning includes the following steps:

Step S01: after the call recording file is obtained, a language text data set is generated through the Alibaba Cloud (Aliyun) ASR and language identification interface. In this step, after the call recording files of the telecom operator's users are obtained, the speech language recognition system downloads them through FTP and generates a language text data set through the Aliyun speech recognition interface, namely the Aliyun ASR and language identification interface.

Step S02: language text noise reduction is performed on the returned recognition results. Specifically, the speech language recognition system reads the call recording file, first obtains the transcribed language text and language labels through the Aliyun ASR transcription and language identification API, then removes language texts outside the business domain through conditional checks, and finally removes noise data by applying noise reduction to the language texts.

Step S03: the language texts under each category are sampled manually for inspection and the language of each category is judged, which completes the training corpus. Specifically, the speech language recognition system stores the language text files judged with high accuracy in the same file; the system's service staff spot-check each language text and rename it according to its true language, thereby completing the training corpus.

Step S04: the training corpus is converted to index form and submitted to a deep learning classifier for training. In this step, the text language identification system loads the training corpus, extracts the classification result from it, and stores the classification result in the model file. Specifically, the words of the training corpus are mapped to index representations to construct a vocabulary-index mapping table, and a label-index mapping table is constructed for the language labels. Word vectors are read from a pre-trained word vector model and input into the word vector model as initialization values. Finally, the language texts and language labels are converted to index representations through the vocabulary-index mapping table and the label-index mapping table, padded to a fixed length, and submitted to the deep learning classifier for training. The deep learning classifier is a text classifier implemented on the basis of TEXTCNN, namely a TEXTCNN text classifier.

Step S05: the deep learning classifier analyzes the language text under test and finds the language category with the highest probability. In this step, the text language identification system downloads the call recording text (language text) to be analyzed through FTP, runs the TEXTCNN text classifier on the language text under test, and finally finds the language category with the highest probability, that is, the recognition result with the highest probability.
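The final selection in step S05 can be sketched as follows. The patent only states that the category with the highest probability is chosen; turning classifier scores into probabilities with a softmax, the function names, the language labels and the score values are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_language(scores, index_to_label):
    """Return the label with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return index_to_label[best], probs[best]

# Hypothetical label-index table and classifier scores for one text.
index_to_label = {0: "Mandarin", 1: "Cantonese", 2: "English"}
label, prob = predict_language([1.2, 3.4, 0.3], index_to_label)
print(label, round(prob, 3))
```

Here `index_to_label` is simply the label-index mapping table read in the reverse direction, so the predicted index can be reported as a language name.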

The call recording files are transcribed and their language identified through the Aliyun speech recognition interface, and noise reduction retains only the language texts with high recognition accuracy, which completes the training corpus. A network structure is then built to model and train on the corpus through deep learning, and finally the trained feature model automatically identifies the language of the operator's daily call recordings. The method solves the problem that operators currently have to label millions of call recordings per day by hand, which consumes a large amount of manpower. The method is based on natural language processing and deep learning and offers high reliability, strong modeling capability and high accuracy. The whole process requires very little manual work and does not depend on the operator to provide a training corpus, which saves the operator a large amount of manpower and time.

Text classification prediction is performed by the TEXTCNN classifier. The TEXTCNN method converts language text data into a text sequence of fixed length and then feeds it into a CNN network structure for training. The CNN network structure consists of four main parts: an input layer, a convolutional layer, a pooling layer and a fully connected layer. The prediction steps are as follows:

(1) Input layer (word embedding layer): a text sequence c of fixed length n is input to the input layer, where n is an integer with n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and the sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$.

The word vectors $x_i$ are taken from a pre-trained word2vec model as the input of the input layer and are not fine-tuned during model training.

(2) Convolutional layer: the layer uses m convolution kernels of different sizes, where m is an integer with m ≥ 1. The height h of a convolution kernel is the window size and takes values from 2 to 8; the width of a convolution kernel equals the word vector dimension k, so that the kernel is $\omega \in \mathbb{R}^{hk}$. Each sliding-window result is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and f is a non-linear function. One pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.

(3) Pooling layer: a max-pooling layer (Max-pool) is used, namely $\hat{c} = \max\{c\}$. With m convolution kernels, the pooled data is $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$.

Each pooling operation yields the global maximum of its feature map.

(4) Fully connected layer: one fully connected layer is used, $y = \omega \cdot z + b$; the extracted features z are input into an LR classifier for classification.
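The four layers above can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: ReLU stands in for the unspecified non-linear function f, the weights are random rather than trained, and the sizes (n = 10, k = 4, three kernels, three classes) are invented for the example. The convolution follows the formula as printed in the text, with b added outside f; the widely used TextCNN formulation places b inside f.

```python
import random

def relu(v):
    return v if v > 0 else 0.0

def conv_feature_map(x, kernel, b, f=relu):
    """Slide a kernel of height h over n word vectors.

    x: n rows of k floats; kernel: h rows of k floats.
    Follows the formula as printed: c_i = f(w . x_{i:i+h-1}) + b.
    """
    n, h = len(x), len(kernel)
    feature_map = []
    for i in range(n - h + 1):  # n - h + 1 window positions
        dot = sum(kernel[r][j] * x[i + r][j]
                  for r in range(h) for j in range(len(kernel[r])))
        feature_map.append(f(dot) + b)
    return feature_map

def text_cnn_forward(x, kernels, biases, fc_w, fc_b):
    """Convolution -> max pooling -> one fully connected layer."""
    # Max pooling keeps one value (the maximum) per kernel.
    z = [max(conv_feature_map(x, w, b)) for w, b in zip(kernels, biases)]
    # Fully connected layer: y = w . z + b, one score per class.
    y = [sum(wi * zi for wi, zi in zip(row, z)) + bi
         for row, bi in zip(fc_w, fc_b)]
    return z, y

random.seed(0)
n, k, classes = 10, 4, 3            # illustrative sizes
heights = [2, 3, 4]                 # m = 3 kernels, heights within 2-8
x = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
kernels = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(h)]
           for h in heights]
biases = [0.0] * len(heights)
fc_w = [[random.uniform(-1, 1) for _ in heights] for _ in range(classes)]
fc_b = [0.0] * classes
z, y = text_cnn_forward(x, kernels, biases, fc_w, fc_b)
print(len(z), len(y))
```

Because pooling reduces every feature map to a single number, the feature vector z always has length m regardless of the kernel heights, which is what lets kernels of different sizes feed the same fully connected layer. The sketch assumes n is at least as large as the tallest kernel.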

For the present embodiment, the step S02 can be further refined, and the detailed flowchart is shown in fig. 3. In fig. 3, the step S02 further includes:

Step S21: the language texts whose language identification accuracy is higher than a set value are screened, the misidentified language texts that fall outside the business domain are removed through conditional checks, and only the language texts with high identification accuracy are kept.

Step S22: the language text is segmented into words, each word is matched against the stop word list, and the stop words are filtered out.
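Steps S21 and S22 can be sketched as a small filter. The record layout, the 0.9 confidence threshold, the stop word list and the whitespace tokenizer are all illustrative assumptions; the patent does not give the set value, and a real system would need a proper word segmenter for Chinese text. The business-domain check from S21 is omitted because its conditions are not described.

```python
def denoise(records, stop_words, min_confidence=0.9, tokenize=str.split):
    """Keep confidently identified texts and strip stop words.

    records: iterable of (text, language_label, confidence) tuples.
    min_confidence and whitespace tokenization are illustrative assumptions.
    """
    cleaned = []
    for text, label, confidence in records:
        if confidence <= min_confidence:
            continue  # step S21: drop low-confidence recognitions
        tokens = [t for t in tokenize(text) if t not in stop_words]  # step S22
        if tokens:  # skip texts that consist only of stop words
            cleaned.append((tokens, label))
    return cleaned

records = [
    ("please check my data plan", "English", 0.97),
    ("uh the the", "English", 0.95),
    ("garbled noise", "English", 0.40),
]
stop_words = {"uh", "the", "my", "please"}
print(denoise(records, stop_words))
```

Passing the tokenizer as a parameter keeps the filter independent of any particular segmentation tool.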

For the present embodiment, the step S04 can be further refined, and the detailed flowchart is shown in fig. 4. In fig. 4, the step S04 further includes:

Step S41: the training corpus is read into memory, and word segmentation is performed on each document.

Step S42: the frequency of each word in the documents is calculated, and words whose frequency is below the lowest threshold or above the highest threshold are filtered out. The remaining distinct words are mapped to index representations, which constructs the vocabulary-index mapping table. A label-index mapping table is also constructed for all distinct language labels.

Step S43: using the word2vec word vector model open-sourced by Tencent AI, the word vectors corresponding to the vocabulary-index mapping table are read and used as initial values of the word2vec word vector model.

Step S44: the words of each document are converted to indices through the vocabulary-index mapping table. Because document lengths differ, fixed-length processing is applied: documents longer than the highest threshold are truncated and documents shorter than the lowest threshold are padded with <PAD>. The vocabulary-index mapping table and the word vectors are then stored in a configuration file.
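Steps S42 and S44 can be sketched together as follows. The function names, the threshold values, the choice to reserve index 0 for `<PAD>`, and the decision to drop words that are not in the vocabulary are assumptions for illustration. The sketch also simplifies the two length thresholds in the text to a single target length.

```python
from collections import Counter

PAD = "<PAD>"

def build_vocab(docs, min_freq=1, max_freq=1000):
    """Map words whose frequency lies in [min_freq, max_freq] to indices."""
    counts = Counter(word for doc in docs for word in doc)
    vocab = {PAD: 0}
    for word in sorted(counts):  # sorted for a reproducible index order
        if min_freq <= counts[word] <= max_freq:
            vocab[word] = len(vocab)
    return vocab

def build_label_index(labels):
    """Map each distinct language label to an index."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def encode(doc, vocab, length):
    """Convert words to indices, then truncate or pad to a fixed length."""
    ids = [vocab[w] for w in doc if w in vocab][:length]
    return ids + [vocab[PAD]] * (length - len(ids))

# Toy segmented documents and labels (hypothetical data).
docs = [["check", "data", "plan"], ["data", "plan", "data", "fee", "bill"]]
labels = ["English", "English"]
vocab = build_vocab(docs, min_freq=1, max_freq=2)
label_index = build_label_index(labels)
encoded = [encode(d, vocab, length=4) for d in docs]
print(vocab, label_index, encoded)
```

In this toy run the word "data" occurs three times, exceeds `max_freq=2`, and is therefore excluded from the vocabulary, which mirrors the filtering of overly frequent words described in step S42.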

In summary, the invention relates to the fields of telecommunications, deep learning and natural language processing, and provides a deep-learning-based method for identifying the language of operator call texts. With deep learning, the training corpus can be produced from existing API language identification and text noise reduction while keeping early-stage manual labeling to a minimum. The training corpus is then modeled through deep learning, and finally unstructured text analysis and language identification are performed on the call recording text, which reduces the burden of manual re-listening and saves labor.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
