Mongolian text emotion analysis method fusing prior knowledge

Document No.: 1905182  Published: 2021-11-30

Abstract: This technology, "A Mongolian text emotion analysis method fusing prior knowledge" (一种融合先验知识的蒙古语文本情感分析方法), was designed and created on 2021-07-26 by 仁庆道尔吉, 刘馨远, 张倩, 张文静, 张毕力格图, 郎佳珺, 萨和雅 and 吉亚图. The method proceeds as follows: preprocess a Mongolian emotion text corpus containing emoticons; convert the preprocessed text words and emoticons into dynamic word vectors; create a Mongolian emotion dictionary and an emoticon emotion dictionary, and take the features they extract as the emotion features finally extracted by the model; build a Mongolian text emotion analysis model from a CNN and a gating mechanism; pre-train the model fused with prior knowledge on a large-scale Mongolian corpus to obtain a Mongolian text emotion analysis model fusing prior knowledge; and compare and evaluate the analysis results of this model against those of single-network analysis methods in terms of precision, recall, and F1 value, so as to improve Mongolian text emotion analysis performance.

1. A Mongolian text emotion analysis method fused with prior knowledge is characterized by comprising the following steps:

step 1: preprocessing a Mongolian emotion text corpus containing emoticons;

step 2: performing word segmentation on Mongolian corpora by using a BPE word segmentation technology;

step 3: converting the words obtained through preprocessing into dynamic word vectors;

step 4: respectively creating a Mongolian emotion dictionary and an emoticon emotion dictionary as prior knowledge of the model;

step 5: pre-training the model fused with prior knowledge on a large-scale corpus to obtain a Mongolian text emotion analysis model fusing prior knowledge;

step 6: comparing and evaluating the analysis result of the Mongolian text emotion analysis model fusing prior knowledge with the analysis result of a single network analysis method in terms of accuracy, precision, recall, and F1 value, so as to improve Mongolian text emotion analysis performance.

2. The Mongolian text emotion analysis method fusing prior knowledge according to claim 1, wherein in step 1 the preprocessing performs data cleaning on the obtained corpus to resolve data-source problems such as errors and noise in the original data. The preprocessing includes removing username information, removing URLs, and removing special characters.

3. The Mongolian text emotion analysis method fusing prior knowledge according to claim 2, wherein a Byte Pair Encoding (BPE) word segmentation technique is adopted to segment the preprocessed corpus, a GloVe model is used to train on the text corpus and the emoticons to generate corresponding word vectors, and the word vector results are used to greedily discover unknown words and correct the segmentation results.

4. The Mongolian text emotion analysis method fusing prior knowledge according to claim 3, wherein the objective function J(W) for generating word vectors by GloVe training is:

J(W) = \sum_{i,j=1}^{|V|} f(X_{ij}) (W_i^T W_j + b_i + b_j - \log X_{ij})^2

where W is the word vector matrix, W \in R^{|V| \times d}, |V| is the number of words and d is the word vector dimension; X_{ij} is the number of times word w_j occurs in the context of word w_i; W_i and W_j are the word vectors of w_i and w_j; b_i and b_j are bias terms; and f(X_{ij}) is a weight term that removes low-frequency noise:

f(X_{ij}) = (X_{ij} / X_{max})^\alpha if X_{ij} < X_{max}, otherwise 1

where X_{max} is the maximum value of X_i, and X_i = \sum_j X_{ij} is the number of occurrences of all words in the context of word w_i.

For the original segmentation result Y = w_1 w_2 ... w_m, compare, starting from the beginning, the word vector W_i of the current word w_i with the word vector W_{i+1} of the next word w_{i+1}; the cosine of the angle between them is:

cos θ = (W_i · W_{i+1}) / (||W_i|| · ||W_{i+1}||)

If this cosine value exceeds a preset threshold λ, the words w_i and w_{i+1} are considered to form a new word, whose combined word vector is the normalized sum of the two word vectors:

W_i ← (W_i + W_{i+1}) / ||W_i + W_{i+1}||

Greedy matching then continues with the word vector of the new word until the end of the sentence, yielding a corrected segmentation result Y', where m is the number of word vectors in the original segmentation result Y and n is the number of word vectors in Y'.

5. The Mongolian text emotion analysis method fusing prior knowledge according to claim 1, wherein in step 4 a Mongolian emotion dictionary and an emoticon emotion dictionary are respectively created as the prior knowledge of the model. The text emotion dictionary covers words of four emotions: happiness, like, sadness, and anger; for example, words such as "joy" and "happy" belong to the happiness category of the text emotion dictionary, words such as "like" and "want" belong to the like category, and emoticons of the corresponding type belong to the happiness category of the emoticon emotion dictionary.

6. The Mongolian text emotion analysis method fusing prior knowledge according to claim 5, wherein in step 5 a pre-training model fused with prior knowledge is used, and the new gated Tanh-ReLU unit can selectively output emotion features according to a given aspect or entity. This architecture is much simpler than the attention layers used in existing models. Furthermore, the computation of the model can easily be parallelized during training, since the convolutional layers have no temporal dependency of the kind found in LSTM layers, and the gating units also work independently.

7. The Mongolian text emotion analysis method fusing prior knowledge according to claim 4 or 5, wherein the concept of an integrated model is adopted: a convolutional neural network pre-trained with prior knowledge, combined with the emotion models of the fused text emotion dictionary and emoticon emotion dictionary, serves as the final emotion analysis model and extracts the relevant emotion features.

8. The Mongolian text emotion analysis method fusing prior knowledge according to claim 1, wherein in step 5 the network parameter weights learned by the neural network are trained on the large-scale Mongolian multi-modal emotion corpus to form the parameter matrices connecting the nodes of the neural network; the network parameter weights trained in the large-scale emotion analysis model are then migrated to the specific Mongolian multi-modal emotion analysis model for initialization, and finally the model is further trained with the Mongolian emotion text corpus.
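The weight-migration step of claim 8 can be sketched as follows. This is an illustrative sketch only: it assumes a model is represented as a plain name-to-array dictionary, and the helper name `transfer_parameters` is hypothetical, not part of the patent.

```python
import numpy as np

def transfer_parameters(pretrained: dict, target: dict) -> list:
    """Initialize the target model from pretrained weights: copy every
    parameter matrix whose name and shape match, leave the rest as-is."""
    copied = []
    for name, weights in pretrained.items():
        if name in target and target[name].shape == weights.shape:
            target[name] = weights.copy()
            copied.append(name)
    return copied

# The large-scale pretrained model initializes the task-specific model;
# layers with mismatched shapes keep their fresh initialization.
pretrained = {"conv.weight": np.ones((3, 3)), "out.weight": np.ones((4,))}
task_model = {"conv.weight": np.zeros((3, 3)), "out.weight": np.zeros((2,))}
print(transfer_parameters(pretrained, task_model))  # -> ['conv.weight']
```

After the copy, the task model would be fine-tuned on the Mongolian emotion text corpus as the claim describes.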

9. The Mongolian text emotion analysis method fusing prior knowledge according to claim 1, wherein in step 6 the precision is calculated as P = TP / (TP + FP), the recall as R = TP / (TP + FN), and the F1 value as F1 = 2PR / (P + R), where P denotes precision, R denotes recall, and F1 denotes the F1 value; TP denotes the number of samples that are actually positive and predicted as positive by the model; FN denotes the number of samples predicted as negative by the model but actually positive; FP denotes the number of samples predicted as positive by the model but actually negative; and TN denotes the number of samples that are actually negative and predicted as negative. Higher precision, recall, and F1 scores indicate better performance of the emotion analysis model.

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a Mongolian text emotion analysis method fusing prior knowledge.

Background

With the rapid development of internet technology, more and more people publish all kinds of comments on platforms such as microblogs, forums, movie websites, and shopping websites to share their moods, views, and opinions. As the times develop, the carriers of text have diversified, with the emoticon being one new carrier. The content published by users can carry different emotional colors: happy or like, sad or angry. The core of emotion analysis is to divide the emotion expressed by a text into four categories: happiness, like, sadness, and anger.

With the rise of artificial intelligence, deep learning methods have received wide attention; because such models have strong feature-learning ability, deep learning has gradually become an important approach to the emotion classification problem. However, for low-resource languages such as Mongolian, conventional text emotion analysis methods have the following disadvantages. First, the rich morphology of Mongolian words causes a serious unknown-word phenomenon in Mongolian text emotion analysis, and the presence of a large number of unknown words severely affects the accuracy of emotion analysis. Second, existing single neural network models offer poor real-time performance and weak classification results on the text emotion analysis problem.

Disclosure of Invention

To overcome the defects of the prior art, the invention aims to provide a Mongolian text emotion analysis method fused with prior knowledge, which has the following three characteristics. First, combining the BPE technique with a word vector correction method better alleviates the unknown-word problem caused by the complexity of Mongolian grammar. Second, the text and the emoticons are each represented in vector form through a pre-training model, so as to fully exploit the emotional characteristics of the text and the emoticons in the original data and analyze the emotional target from multiple directions. Third, a Mongolian emotion dictionary and an emoticon emotion dictionary are constructed as the prior knowledge of the pre-training model, and training on a Mongolian emotion text corpus yields a neural-network Mongolian emotion analysis model based on convolution and a gating mechanism, improving the quality of Mongolian emotion analysis.

In order to achieve the purpose, the invention adopts the technical scheme that:

a Mongolian text emotion analysis method fusing prior knowledge comprises the following steps:

step 1: preprocessing a Mongolian emotion text corpus containing emoticons;

step 2: performing word segmentation on Mongolian corpora by using a BPE word segmentation technology;

step 3: converting the words obtained through preprocessing into dynamic word vectors;

step 4: respectively creating a Mongolian emotion dictionary and an emoticon emotion dictionary as prior knowledge of the model;

step 5: pre-training the model fused with prior knowledge on large-scale corpora to obtain a Mongolian text emotion analysis model fusing prior knowledge;

step 6: comparing and evaluating the analysis result of the Mongolian text emotion analysis model fusing prior knowledge with the analysis result of a single network analysis method in terms of accuracy, precision, recall, and F1 value, so as to improve Mongolian text emotion analysis performance.

In step 1, the preprocessing performs data cleaning on the obtained corpus to resolve data-source problems such as errors and noise in the original data. The preprocessing includes removing username information, removing URLs, and removing special characters.
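A minimal sketch of this cleaning step using regular expressions; the patterns and the helper name `clean_text` are illustrative assumptions, not the patent's exact rules:

```python
import re

def clean_text(text: str) -> str:
    """Data cleaning: strip usernames, URLs, and special characters."""
    text = re.sub(r"@\S+", "", text)                    # remove @username mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"[#*&^%$~|<>{}\[\]]", "", text)      # remove special characters
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace

print(clean_text("@user check https://example.com great movie!! #fun"))
# -> "check great movie!! fun"
```

In practice the special-character set would be tuned to the Mongolian corpus, and emoticons would be deliberately preserved since they carry emotion features.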

In step 2, a Byte Pair Encoding (BPE) word segmentation technique is adopted to segment the preprocessed corpus, a GloVe model is used to train on the text corpus and the emoticons to generate corresponding word vectors, and the word vector results are used to greedily discover unknown words and correct the segmentation results.

The objective function J(W) for generating word vectors by GloVe training is:

J(W) = \sum_{i,j=1}^{|V|} f(X_{ij}) (W_i^T W_j + b_i + b_j - \log X_{ij})^2

where W is the word vector matrix, W \in R^{|V| \times d}, |V| is the number of words and d is the word vector dimension; X_{ij} is the number of times word w_j occurs in the context of word w_i; W_i and W_j are the word vectors of w_i and w_j; b_i and b_j are bias terms; and f(X_{ij}) is a weight term that removes low-frequency noise:

f(X_{ij}) = (X_{ij} / X_{max})^\alpha if X_{ij} < X_{max}, otherwise 1

where X_{max} is the maximum value of X_i, and X_i = \sum_j X_{ij} is the number of occurrences of all words in the context of word w_i.
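Under these definitions, one summand of the objective can be written out directly. The sketch below is illustrative; it uses the standard GloVe weighting with the commonly cited defaults X_max = 100 and α = 0.75, which the patent text does not itself specify:

```python
import numpy as np

def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Weight term f(X_ij): damps low-frequency co-occurrence noise, saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(Wi, Wj, bi, bj, x_ij):
    """One summand of J(W): f(X_ij) * (Wi . Wj + bi + bj - log X_ij)^2."""
    return glove_weight(x_ij) * (float(Wi @ Wj) + bi + bj - np.log(x_ij)) ** 2

print(glove_weight(100.0))  # -> 1.0 (frequent pairs are capped)
print(glove_weight(1.0))    # rare pairs are down-weighted
```

The full objective sums this loss over all co-occurring word pairs (i, j).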

For the original segmentation result Y = w_1 w_2 ... w_m, compare, starting from the beginning, the word vector W_i of the current word w_i with the word vector W_{i+1} of the next word w_{i+1}; the cosine of the angle between them is:

cos θ = (W_i · W_{i+1}) / (||W_i|| · ||W_{i+1}||)

If this cosine value exceeds a preset threshold λ, the words w_i and w_{i+1} are considered to form a new word, whose combined word vector is the normalized sum of the two word vectors:

W_i ← (W_i + W_{i+1}) / ||W_i + W_{i+1}||

Greedy matching then continues with the word vector of the new word until the end of the sentence, yielding a corrected segmentation result Y', where m is the number of word vectors in the original segmentation result Y and n is the number of word vectors in Y'.

In step 3, the words obtained through preprocessing are converted into a representation the computer can recognize and process, namely dynamic word vectors.

In step 4, a Mongolian emotion dictionary and an emoticon emotion dictionary are respectively created as the prior knowledge of the model. The text emotion dictionary covers words of four emotions: happiness, like, sadness, and anger. For example, words such as "joy" and "happy" belong to the happiness category of the text emotion dictionary, words such as "like" and "want" belong to the like category, and emoticons of the corresponding type belong to the happiness category of the emoticon emotion dictionary.

In step 5, a pre-training model fused with prior knowledge is used; its new gated Tanh-ReLU unit can selectively output emotion features according to a given aspect or entity. This architecture is much simpler than the attention layers used in existing models. Furthermore, the computation of the model can easily be parallelized during training, since the convolutional layers have no temporal dependency of the kind found in LSTM layers, and the gating units also work independently.

In step 6, the accuracy is calculated as Acc = (TP + TN) / (TP + TN + FP + FN), the precision as P = TP / (TP + FP), the recall as R = TP / (TP + FN), and the F1 value as F1 = 2PR / (P + R), where Acc denotes accuracy, P precision, R recall, and F1 the F1 value; TP denotes the number of samples that are actually positive and predicted as positive by the model; FN denotes the number of samples predicted as negative but actually positive; FP denotes the number of samples predicted as positive but actually negative; and TN denotes the number of samples that are actually negative and predicted as negative. Higher accuracy, precision, recall, and F1 scores indicate better performance of the emotion analysis model.

Compared with the prior art, the invention has the beneficial effects that:

(1) The invention combines the BPE technique with a word vector correction method, better alleviating the unknown-word problem caused by the complexity of Mongolian grammar.

(2) The method represents the text and the emoticons in vector form through the pre-training model, so as to fully exploit the emotional characteristics of the text and the emoticons in the original data and analyze the emotional target from multiple directions.

(3) A Mongolian emotion dictionary and an emoticon emotion dictionary are constructed as the prior knowledge of the pre-training model, and training on a Mongolian emotion text corpus yields a neural-network Mongolian emotion analysis model based on convolution and a gating mechanism, improving the quality of Mongolian emotion analysis.

Drawings

FIG. 1 is a flow chart of the Mongolian text emotion analysis method fusing prior knowledge.

FIG. 2 is an architectural diagram of a gated convolution model.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

As shown in FIG. 1, the Mongolian text emotion analysis method fusing prior knowledge of the invention comprises the following steps:

the first step is as follows: and preprocessing the Mongolian emotion text corpus containing the emoticons. The processing is to perform data cleaning on the acquired corpus and solve the data source problems, such as error of original data and messy and poor results. The preprocessing includes the steps of removing username information, removing URLS, removing special characters, etc.

The second step: before model training, the emotion text corpus is segmented. The invention uses Byte Pair Encoding (BPE) to segment the corpus. Because BPE iteratively replaces the most frequent pair of adjacent symbols in a string with a symbol that does not yet appear in the string, segmenting Mongolian words into stems and affixes keeps high-frequency words in the dictionary while splitting low-frequency words into finer-grained subunits, thereby alleviating data sparsity and reducing out-of-vocabulary words. The specific steps are as follows:

1. Add all characters in the corpus to the dictionary as the initial dictionary, rewrite every word as a sequence of characters, and append an end-of-word marker so that word boundary information can be recovered after a sentence is processed;

2. Count the symbol pairs in the corpus, find the most frequent pair (A, B), and replace it with "AB" throughout the corpus, adding the key "AB" to the dictionary; this step is called a merge operation;

3. Iterate the previous operation until n merge operations have been performed;

4. The resulting dictionary consists of characters, morphemes, words, and so on, and its size equals the size of the initial dictionary plus the number of merge operations n.
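The four steps above can be sketched as a minimal illustrative implementation; the function name `bpe_learn` and the `</w>` end-of-word marker are conventional choices, not mandated by the text:

```python
from collections import Counter

def bpe_learn(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each word becomes a tuple of characters with an end-of-word marker.
    vocab = Counter()
    for w in words:
        vocab[tuple(w) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (A, B)
        merges.append(best)
        # Replace every occurrence of the pair with the fused symbol "AB".
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_learn(["low", "low", "lower", "newest", "newest"], num_merges=3)
print(merges)
```

Each learned merge corresponds to one new dictionary entry, so the final dictionary size is the initial character dictionary plus `num_merges`, as step 4 states.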

With the development of the internet, new words continually emerge, and a purely frequency-based segmentation method that ignores the grammar and semantics of words has low accuracy. Therefore, to improve segmentation performance, after the BPE stage the method trains a GloVe model to generate word vectors, greedily discovers out-of-vocabulary words using the word vector results, and corrects the segmentation results.

The GloVe model obtains a vector representation of a word by decomposing a word-word matrix. It first constructs a co-occurrence matrix from the corpus: each entry counts how many times a word co-occurs with a given center word within a window of a given size, which expresses the relationship between words to a certain extent. The co-occurrence counts are gathered over the whole corpus, not just a single sentence or document, so they are global. Words with similar usage end up at a smaller "distance" from each other than from other words. For example, the words around "people's government" include "city government" and "administration"; the words around "scientific research" include "science and technology" and "research". Word vectors trained with the GloVe model therefore carry good grammatical and semantic information.
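Building the co-occurrence counts X_ij described above can be sketched like this; the window size and the helper name `cooccurrence` are illustrative:

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    """Count X_ij: how often word j appears within `window` words of center word i."""
    X = defaultdict(float)
    for sentence in corpus:
        for i, wi in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[(wi, sentence[j])] += 1.0
    return X

corpus = [["the", "city", "government", "issued", "a", "report"]]
X = cooccurrence(corpus, window=2)
print(X[("city", "government")])  # -> 1.0
```

GloVe implementations typically also down-weight context words by their distance from the center word; the uniform count above keeps the sketch minimal.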

The basic principle of segmentation correction with GloVe word vectors is as follows: if the words w_i and w_j frequently occur together, there is a high probability that they can be combined into a new word w_i w_j. The word vectors generated by the GloVe model have the corresponding property: if w_i and w_j are very likely to form a new word, the cosine of the angle θ between their word vectors W_i and W_j will be close to 1.

According to this principle, the following greedy correction method can be adopted; the specific steps are:

1. Convert the words segmented by the BPE technique into word vectors, training the GloVe model with the objective function:

J(W) = \sum_{i,j=1}^{|V|} f(X_{ij}) (W_i^T W_j + b_i + b_j - \log X_{ij})^2

where W is the word vector matrix, W \in R^{|V| \times d}, |V| is the number of words and d is the word vector dimension; X_{ij} is the number of times word w_j occurs in the context of word w_i; W_i and W_j are the word vectors of w_i and w_j; b_i and b_j are bias terms; and f(X_{ij}) is a weight term that removes low-frequency noise:

f(X_{ij}) = (X_{ij} / X_{max})^\alpha if X_{ij} < X_{max}, otherwise 1

where X_{max} is the maximum value of X_i, and X_i = \sum_j X_{ij} is the number of occurrences of all words in the context of word w_i.

2. For the original segmentation result Y = w_1 w_2 ... w_m, compare, starting from the beginning, the word vector W_i of the current word w_i with the word vector W_{i+1} of the next word w_{i+1}; the cosine of the angle between them is:

cos θ = (W_i · W_{i+1}) / (||W_i|| · ||W_{i+1}||)

3. If this cosine value exceeds a preset threshold λ, the words w_i and w_{i+1} are considered to form a new word; the combined word vector is the normalized sum of the two word vectors:

W_i ← (W_i + W_{i+1}) / ||W_i + W_{i+1}||

4. Continue greedy matching with the word vector of the new word until the end of the sentence, obtaining the corrected segmentation result Y', where m is the number of word vectors in the original segmentation result Y and n is the number of word vectors in the corrected result Y'.
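Steps 2 to 4 amount to a single left-to-right pass over the sentence. The sketch below is illustrative (the threshold value and function names are assumptions) and presumes each word already has a trained GloVe vector:

```python
import numpy as np

def cosine(u, v) -> float:
    """Cosine of the angle between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def correct_segmentation(words, vectors, lam=0.9):
    """Greedily merge adjacent words whose vectors' cosine exceeds threshold lam;
    the merged vector is the normalized sum of the two word vectors."""
    out_words, out_vecs = [words[0]], [vectors[0]]
    for w, v in zip(words[1:], vectors[1:]):
        if cosine(out_vecs[-1], v) > lam:
            merged = out_vecs[-1] + v
            out_vecs[-1] = merged / np.linalg.norm(merged)  # add and normalize
            out_words[-1] += w                              # fuse into one token
        else:
            out_words.append(w)
            out_vecs.append(v)
    return out_words, out_vecs

words, _ = correct_segmentation(
    ["un", "known", "word"],
    [np.array([1.0, 0.0]), np.array([0.99, 0.14]), np.array([0.0, 1.0])],
)
print(words)  # -> ['unknown', 'word']
```

Because the merged vector immediately becomes the "current word" vector, a single pass can fuse chains of more than two subunits, which is what "continue greedy matching with the word vector of the new word" describes.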

The third step: and converting the words obtained through preprocessing into dynamic word vectors.

The fourth step: create a Mongolian emotion dictionary and an emoticon emotion dictionary separately as the prior knowledge of the model. The text emotion dictionary covers words of four emotions: happiness, like, sadness, and anger. For example, words such as "joy" and "happy" belong to the happiness category of the text emotion dictionary, words such as "like" and "want" belong to the like category, and emoticons of the corresponding type belong to the happiness category of the emoticon emotion dictionary.

The fifth step: the invention adopts a pre-training model fused with prior knowledge; the pre-training model is Convolutional Neural Networks + Gating Mechanisms, built on convolutional layers and gating units. Each convolutional filter computes n-gram features of a different granularity from the embedding vectors at each position. The gating units at each position on top of the convolutional layer are also independent of each other, so the model is well suited to parallel computing. In addition, the model is equipped with two efficient filtering mechanisms: the gating units on top of the convolutional layer and a max-pooling layer, both of which can accurately generate and select aspect-related emotion features.
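The gated Tanh-ReLU unit at the heart of this model can be sketched with plain NumPy. This is a simplified sketch: the convolutional filters and aspect projections that would produce the two pre-activation feature maps are omitted, and the inputs are assumed to be those pre-activations.

```python
import numpy as np

def gtru(sentiment_feat: np.ndarray, aspect_feat: np.ndarray) -> np.ndarray:
    """Gated Tanh-ReLU Unit: the tanh branch produces candidate sentiment
    features; the ReLU gate, driven by the given aspect/entity signal,
    selects which of them pass through (element-wise product)."""
    s = np.tanh(sentiment_feat)       # candidate sentiment features
    g = np.maximum(0.0, aspect_feat)  # ReLU gate: negative evidence closes the gate
    return s * g

x = np.array([0.5, -2.0, 1.0])   # convolutional sentiment pre-activations
a = np.array([1.0, 0.3, -0.5])   # aspect-related pre-activations
print(gtru(x, a))                # third feature is gated out (gate = 0)
```

Because each position's gate depends only on that position's features, all positions can be computed in parallel, which is the parallelism advantage over LSTM layers noted above.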

The sixth step: compare and evaluate the analysis result of the Mongolian text emotion analysis model fusing prior knowledge with the analysis result of a single network analysis method in terms of accuracy, precision, recall, and F1 value, so as to improve Mongolian text emotion analysis performance.

The precision is calculated as P = TP / (TP + FP), the recall as R = TP / (TP + FN), and the F1 value as F1 = 2PR / (P + R). Here P denotes precision, the proportion of samples predicted as positive that are actually positive; R denotes recall, the proportion of actually positive samples that are predicted as positive. Under normal conditions precision and recall are in tension: raising one tends to lower the other. F1 denotes the F1 value, which balances precision against recall to evaluate the classification model comprehensively. TP (true positive) is the number of samples that are actually positive and predicted as positive by the model; FN (false negative) is the number of samples predicted as negative but actually positive; FP (false positive) is the number of samples predicted as positive but actually negative; TN (true negative) is the number of samples that are actually negative and predicted as negative. Higher precision, recall, and F1 scores indicate a better-performing emotion analysis model. Table 1 gives the confusion matrix required for the calculation:

TABLE 1 Confusion matrix

                     Predicted positive   Predicted negative
Actually positive    TP                   FN
Actually negative    FP                   TN
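The formulas above can be checked with a few lines; the helper names `accuracy` and `prf1` are illustrative:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    p = tp / (tp + fp)        # share of predicted positives that are correct
    r = tp / (tp + fn)        # share of actual positives that are found
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f1

print(prf1(tp=8, fp=2, fn=2))             # precision 0.8, recall 0.8, F1 0.8
print(accuracy(tp=8, tn=88, fp=2, fn=2))  # -> 0.96
```

For the four-class emotion task, these quantities would be computed per class from a one-vs-rest confusion matrix and then averaged.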
