Emotion classification model and emotion classification method based on data enhancement

Document No.: 830116 Publication date: 2021-03-30

Reading note: This technology, "Emotion classification model and emotion classification method based on data enhancement" (基于数据增强的情感分类模型及情感分类方法), was designed by 李博涵, 王文幻, 王萌, 历傲然, 杨新民, and 解文彬 on 2020-12-29. Abstract: The invention discloses an emotion classification model based on data enhancement. The model construction method comprises the following steps: (1) obtaining an original data set; (2) preprocessing and cleaning the original data set; (3) performing negation processing on each text; (4) inverting each text to form an opposite text; (5) labeling the original text and its corresponding opposite text; (6) collecting the opposite texts into an opposite text training set; (7) training a classifier on both the original data set and the opposite text training set to obtain the emotion classification model. The invention uses data enhancement to obtain an opposite text training set and an opposite text test set, converts text into word vectors using word embedding, and predicts text emotion from both the positive and the negative perspective, which increases the robustness of model prediction and improves its accuracy. The classification method provided by the invention can thereby effectively improve classification precision.

1. An emotion classification model based on data enhancement is characterized in that the model construction method comprises the following steps:

(1) acquiring a target short text data set from a social platform to obtain an original data set, analyzing the characteristics of the short texts, and determining the quantity of each type of data in the text data set;

(2) preprocessing and cleaning the original data set obtained in step (1);

(3) performing negation processing on each text processed in step (2), including negation trigger detection and negation scope detection: replacing each detected negation trigger with a mark, confirming the scope affected by the negation trigger, and delimiting the span associated with the negation trigger;

(4) inverting the text obtained in step (3): keeping the emotion words inside the negation scope unchanged, replacing the emotion words outside the negation scope with their antonyms from the emotion dictionary, and removing the negation trigger marks to form an opposite text;

(5) labeling the original text and the corresponding opposite text: if the original text has a positive label, the opposite text has a negative label, and vice versa;

(6) collecting the opposite texts into a data set that serves as the opposite text training set;

(7) training a classifier model on both the original data set and the opposite text training set obtained in step (6) to obtain the emotion classification model.

2. The emotion classification model based on data enhancement of claim 1, wherein the cleaning of the original data set in step (2) includes filtering out attributes, email addresses, special characters, and links contained in the text, removing useless stop words, ignoring text that is forwarded and modified by the user, and removing duplicate entries.

3. The emotion classification model based on data enhancement as claimed in claim 1, wherein the negation processing of the text in step (3) comprises the following specific steps:

3a, detecting negation triggers using rule-based keyword matching;

3b, replacing each detected negation trigger with the mark 'Negap';

3c, detecting the negation scope by combining conjunction analysis and punctuation mark recognition;

3d, confirming the negation scope.

4. The emotion classification model based on data enhancement as claimed in claim 1, wherein before the text is inverted in step (4), the emoticons existing in the text are marked, and the positive emoticons and the negative emoticons are replaced with the marks EMO_POS and EMO_NEG, respectively.

5. The emotion classification model based on data enhancement as claimed in claim 1, wherein in step (7), support vector machine, logistic regression, and naive Bayes classifiers are adopted for model training.

6. An emotion classification method is characterized by comprising the following steps:

(1) constructing an emotion classification model based on any one of claims 1-5;

(2) acquiring a target short text data set from a social platform to obtain an original test set, analyzing the characteristics of the short texts, and determining the quantity of each type of data in the text data set;

(3) preprocessing and cleaning the original test set obtained in step (2);

(4) performing negation processing on each text processed in step (3), including negation trigger detection and negation scope detection: replacing each detected negation trigger with a mark, confirming the scope affected by the negation trigger, and delimiting the span associated with the negation trigger;

(5) inverting the text obtained in step (4): keeping the emotion words inside the negation scope unchanged, replacing the emotion words outside the negation scope with their antonyms from the emotion dictionary, and removing the negation trigger marks to form an opposite text;

(6) labeling the original text and the corresponding opposite text: if the original text has a positive label, the opposite text has a negative label, and vice versa;

(7) collecting the opposite texts into a data set that serves as the opposite text test set;

(8) performing emotion analysis on the data in the original test set and the opposite text test set using the emotion classification model obtained in step (1), wherein the final prediction depends on the combined prediction results for the original test set and the opposite text test set.

Technical Field

The invention belongs to the technical field of data processing, and in particular relates to an emotion classification method for natural language processing that is supported by data enhancement technology.

Background

Nowadays, with the rise and development of social media and e-commerce websites, people are increasingly accustomed to posting their opinions about people and events on various platforms: commenting on stocks, political events, entertainment gossip and the like, or sharing their daily lives, on social platforms such as Twitter or Sina Weibo, or describing their experience with products bought from Amazon, Taobao, or other shopping websites.

Twitter, for example, is an American company providing social networking and microblogging services, dedicated to serving public conversation. It allows users to post messages of no more than 140 characters (the limit has been raised to 280 characters for languages other than Chinese, Japanese, and Korean). These messages are known as "tweets" and have been described as "the SMS of the Internet". Twitter is very popular worldwide: according to Twitter's financial reports, it had 187 million monetizable daily active users as of the third quarter of 2020, so new tweets are generated every second. This massive body of data contains rich emotional information, as people post tweets to share their lives or their views on people and events (for example, when COVID-19 spread around the world in early 2020, a large number of related updates and comments were posted on Twitter every day).

Because new comments are generated every second, this mass of data has driven the emergence of emotion analysis. Emotion analysis is the computational analysis of a speaker's or author's opinion, attitude, and mood toward a topic; it identifies non-trivial, subjective information in a text corpus. By tracking text and performing emotion analysis, decision makers can learn the views of stakeholders and plan subsequent development accordingly. Emotion analysis is often accompanied by opinion mining and text mining, and its framework mainly comprises the following subtasks: acquiring text data, cleaning and preprocessing the data, converting the text into machine-readable vectors, selecting features, and finally applying natural language processing and machine learning algorithms. Emotion analysis is a subtask of natural language processing and has been a research hotspot since 2011. The development of machine learning methods and the ready availability of large amounts of data have led to a great deal of research on emotion analysis, and research abroad on emotion processing of text with natural language processing has accumulated substantial resources in the form of English corpora and dictionaries. However, current emotion analysis methods tend to gain accuracy at the cost of higher time and space complexity and lower efficiency.

Traditional text modeling methods, such as the bag-of-words model, typically model only the syntactic environment of words, destroying the syntactic structure to some extent. To address this problem, researchers have proposed word embedding models, which represent words as continuous, low-dimensional weight vectors. The word embedding model has an inherent difficulty: polarity shift. Polarity shift refers to the emotion of a text being reversed during analysis (negative text is judged to be positive, and positive text is judged to be negative). The main cause of polarity shift is negation terms in the text. Negation terms appear frequently in colloquial text, and comments on social platforms such as Twitter are typical short colloquial texts, so negation processing is essential for reducing the likelihood of polarity shift when performing emotion analysis on them. However, most current emotion analysis models simply define the scope of a negation term as the words between the negation term and the first punctuation mark after it, that is, from the negation term to the end of the clause. This definition of negation is too simple and ignores the complexity of language. In addition, current algorithms for emotion analysis of tweets focus only on the original text actually acquired, and ignore the implicit deeper meaning and the opposing relation contained in the text. Therefore, there is still a need for an improved method for emotion analysis of short texts such as tweets.

Disclosure of Invention

Purpose of the invention: To address the shortcomings of the prior art, the invention improves the negation processing method by combining it with syntactic analysis and proposing a new negation scope hypothesis, generates a corresponding data set from the obtained original data set by means of data enhancement, and provides an emotion analysis model based on data enhancement.

The technical scheme is as follows: the invention discloses an emotion classification model based on data enhancement, and a model construction method thereof comprises the following steps:

(2) preprocessing and cleaning the original data set obtained in the step (1);

(6) collecting the opposite texts into a data set that serves as the opposite text training set;

(7) training a classifier model on both the original data set and the opposite text training set obtained in step (6) to obtain the emotion classification model.

In a further preferred technical scheme, the cleaning of the original data set in step (2) comprises filtering out attributes, email addresses, special characters, and links contained in the text, removing useless stop words, ignoring text that is forwarded and modified by the user, and deleting duplicate entries.
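The cleaning step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the tiny stop-word list, the reading of "attributes" as @-mentions, and the rule that a forwarded tweet starts with "RT " are all assumptions made for the example. Negation words are deliberately kept out of the stop-word list because the later negation step needs them.

```python
import re

# Assumed, illustrative stop-word list; the source does not specify one.
# Negation words are intentionally excluded so later steps can still see them.
STOP_WORDS = {"a", "an", "the", "of", "to", "is"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")          # assumption: "attributes" = @-mentions
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,!?;:'()\-]")


def clean_tweet(text):
    """Return a cleaned tweet, or None if it looks like a forwarded tweet."""
    if text.strip().lower().startswith("rt "):   # assumed retweet convention
        return None
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)     # before mentions, so addresses go whole
    text = MENTION_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)   # keeps punctuation used by emoticons
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


def clean_dataset(tweets):
    """Clean every tweet, skip forwarded ones, drop duplicates (order kept)."""
    seen, cleaned = set(), []
    for raw in tweets:
        text = clean_tweet(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Email addresses are removed before @-mentions so that an address is stripped as a whole rather than being split at the "@".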

Preferably, the negation processing of the text in step (3) comprises the following specific steps:

3a, detecting negation triggers using rule-based keyword matching;

3b, replacing each detected negation trigger with the mark 'Negap';

3c, detecting the negation scope by combining conjunction analysis and punctuation mark recognition;

3d, confirming the negation scope.
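Steps 3a to 3d can be illustrated with a deliberately simplified sketch. The trigger list, the set of contrastive conjunctions, and the scope rule (a scope starts after a trigger and ends at the next punctuation mark or contrastive conjunction) are assumptions chosen for the example; the full set of negation cases used by the invention is not reproduced here.

```python
import re

# Assumed, illustrative word lists; the source does not enumerate them.
NEGATION_TRIGGERS = {"not", "no", "never", "don't", "doesn't", "didn't",
                     "isn't", "wasn't", "can't", "won't"}
SCOPE_BREAKERS = {"but", "however", "although", "though", "yet"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}
NEG_MARK = "Negap"  # the mark named in the source


def tokenize(text):
    """Split into word tokens (apostrophes kept) and punctuation tokens."""
    return re.findall(r"[A-Za-z_']+|[.,!?;:]", text)


def mark_negation(text):
    """Return (tokens, in_scope).

    Triggers are replaced by NEG_MARK; in_scope[i] is True when tokens[i]
    lies inside the (simplified) negation scope of a preceding trigger.
    """
    tokens, in_scope = [], []
    active = False
    for tok in tokenize(text):
        low = tok.lower()
        if low in NEGATION_TRIGGERS:           # 3a + 3b: detect and replace
            tokens.append(NEG_MARK)
            in_scope.append(False)
            active = True
            continue
        if tok in PUNCTUATION or low in SCOPE_BREAKERS:
            active = False                     # 3c: scope ends here
        tokens.append(tok)
        in_scope.append(active)                # 3d: record the confirmed scope
    return tokens, in_scope
```

For "I don't like this film, and it is ugly." the sketch marks "like this film" as negated and leaves "ugly" outside the scope, which matches the behavior of the worked example given later in the description.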

Preferably, before the text is inverted in step (4), the emoticons existing in the text are marked, and the positive emoticons and the negative emoticons are replaced by the marks EMO_POS and EMO_NEG, respectively.
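A short sketch of the emoticon marking follows. The two emoticon lists are small assumed samples for illustration only, not the lists used by the invention.

```python
import re

# Assumed sample emoticons; illustrative only.
POSITIVE_EMOTICONS = [":)", ":-)", ":D", ";)", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'(", "=("]


def _pattern(emoticons):
    """Build an alternation that tries longer emoticons first."""
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POS_RE = _pattern(POSITIVE_EMOTICONS)
NEG_RE = _pattern(NEGATIVE_EMOTICONS)


def mark_emoticons(text):
    """Replace emoticons with the marks EMO_POS / EMO_NEG."""
    text = POS_RE.sub(" EMO_POS ", text)
    text = NEG_RE.sub(" EMO_NEG ", text)
    return " ".join(text.split())  # normalize whitespace
```

Every emoticon is passed through `re.escape`, since characters such as ")" and "(" are regex metacharacters.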

Preferably, the model training in step (7) is performed using support vector machine, logistic regression, and naive Bayes classifiers.

The emotion classification method comprises the following steps:

(1) constructing the emotion classification model;

(3) preprocessing and cleaning the original test set obtained in the step (2);

(7) collecting the opposite texts into a data set that serves as the opposite text test set;

Beneficial effects: (1) The method divides texts into positive and negative texts. By detecting negation triggers and their scopes, inverting emotion words, and inverting polarity labels, each negative text in the original data set can be converted into a positive text, and vice versa. A data set composed of such opposite texts is called an opposite data set, and the technique used to produce it is called data enhancement. The method uses data enhancement to obtain an opposite text training set and an opposite text test set, and converts text into word vectors using word embedding. When predicting the polarity of a test text, it considers not only how positive (negative) the original text is but also how negative (positive) the opposite text is. The generated opposite text data set makes effective use of the deeper emotional information contained in the text, so that emotion is predicted from both the positive and the negative perspective; this increases the robustness and the accuracy of model prediction, and the classification method provided by the invention can effectively improve classification precision.

(2) When detecting negation triggers and negation scopes, the complexity of language is taken fully into account: the negation scope is not simply defined as all words between the negation trigger and the first punctuation mark after it. Instead, punctuation mark recognition and conjunction analysis are combined, and several rules are defined to handle the complex case in which conjunctions appear in a negated sentence.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

Example:

1. The method for constructing the emotion classification model based on data enhancement comprises the following steps:

(2) preprocessing and cleaning the original data set acquired in step (1): filtering out attributes, email addresses, special characters, links and the like contained in the text, removing useless stop words, ignoring text that is forwarded and modified by a user, and deleting duplicate entries;

The negation processing of step (3) comprises the following specific steps:

3a, detecting negation triggers using rule-based keyword matching;

3b, replacing each detected negation trigger with the mark 'Negap';

3c, detecting the negation scope by combining conjunction analysis and punctuation mark recognition;

3d, confirming the negation scope.

To determine the negation triggers and their scopes more accurately, the negation scope is not simply defined as the words between the negation term and the first punctuation mark after it (that is, from the negation term to the end of the clause). Instead, conjunction analysis and punctuation mark recognition are combined to give the six negation cases shown in Table 1, considered from the perspective of coordinating conjunctions and contrastive conjunctions respectively:

Table 1. Six negation cases

(4) marking the emoticons in the text, replacing positive and negative emoticons with the marks EMO_POS and EMO_NEG respectively.

For the emoticons commonly found in tweets, Table 2 gives the emoticon replacement scheme, which covers 23 typical positive emoticons and 11 typical negative emoticons.

Table 2. Common emoticons and their replacement marks

(5) inverting the text obtained in step (4): keeping the emotion words inside the negation scope unchanged, replacing the emotion words outside the negation scope with their antonyms from the emotion dictionary, and removing the negation trigger marks to form an opposite text, where the antonyms are not antonyms in the absolute sense but words with opposite meanings;

(7) collecting the opposite texts into a data set that serves as the opposite text training set; the resulting opposite texts are shown in Table 3:

Table 3. Comparison of original texts and opposite texts

(8) training a classifier model on both the original data set and the opposite text training set obtained in step (7), using three traditional classifiers (support vector machine, logistic regression, and naive Bayes), to obtain the emotion classification model.
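The inversion and label-flipping steps of the construction method can be sketched as below. The antonym lexicon is a tiny assumed stand-in for the emotion dictionary, and treating the emoticon marks as invertible emotion words is inferred from the worked example later in the description; the input is assumed to be a token list with a parallel list of flags saying which tokens lie inside a negation scope.

```python
# Assumed miniature antonym lexicon standing in for the emotion dictionary.
ANTONYMS = {
    "ugly": "pretty", "pretty": "ugly",
    "good": "bad", "bad": "good",
    "like": "dislike", "dislike": "like",
    "love": "hate", "hate": "love",
    "EMO_POS": "EMO_NEG", "EMO_NEG": "EMO_POS",
}
NEG_MARK = "Negap"
LABEL_SWAP = {"positive": "negative", "negative": "positive"}


def make_opposite(tokens, in_scope, label):
    """Build the opposite text and its flipped label.

    tokens   -- tokens with negation triggers already replaced by NEG_MARK
    in_scope -- in_scope[i] is True if tokens[i] is inside a negation scope
    label    -- "positive" or "negative"
    """
    opposite = []
    for tok, scoped in zip(tokens, in_scope):
        if tok == NEG_MARK:
            continue                      # remove the negation trigger mark
        key = tok if tok in ANTONYMS else tok.lower()
        if not scoped and key in ANTONYMS:
            opposite.append(ANTONYMS[key])  # outside scope: swap for antonym
        else:
            opposite.append(tok)            # inside scope: keep unchanged
    return opposite, LABEL_SWAP[label]
```

Inside the negation scope the emotion word is kept because removing the trigger already reverses its meaning; swapping it as well would cancel the inversion.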

2. The emotion classification method based on the model comprises the following steps:

(1) obtaining an original test set and an opposite text test set using the same method as was used to obtain the original data set and the opposite text training set when constructing the model;

(2) performing emotion analysis on the data in the original test set and the opposite text test set using the obtained emotion classification model, where the final prediction depends on the combined prediction results for the original test set and the opposite text test set.

In the model prediction stage, the label of an original sample x is predicted by considering the pair of opposite texts x and x'. The main task is not to predict the class of x' in the opposite test data set, but to use x' to assist in predicting the class of x. Thus, the model considers not only how positive or negative the original text x is, but also how negative or positive its opposite text x' is. p(·|x) and p(·|x') denote the posterior probabilities of the original text x and the opposite text x', respectively, where · denotes either positive (+) or negative (-). In the prediction stage, the class of a text is therefore determined from both sides.

The degree of positive emotion of a test text is measured by two parts:

(1) how positive the original test text x is, p(+|x);

(2) how negative the opposite test text x' is, p(-|x').

The degree of negative emotion of a test text is measured by two parts:

(1) how negative the original test text x is, p(-|x);

(2) how positive the opposite test text x' is, p(+|x').

As shown in Table 4, the two data sets used by the invention are the Stanford data set and the Sanders Twitter sentiment corpus. The Stanford data set contains 160,000 training tweets, of which 80,000 are positive and 80,000 are negative. The Sanders Twitter sentiment corpus contains 570 positive and 654 negative tweets.

Table 4. Data set details

The texts of each category in the two data sets are randomly divided into five parts (four used as training data and one as test data), and data enhancement is applied to both the original training set and the original test set to generate the opposite training set and the opposite test set. All experimental results are reported and analyzed as the average accuracy over five-fold cross-validation. The logistic regression classifier is based on the LIBLINEAR toolkit, with all parameters at their default values. To demonstrate the effectiveness of the proposed framework, we also used a naive Bayes classifier based on the multinomial event model with Laplace smoothing, and a support vector machine classifier based on the LIBSVM toolkit. The support vector machine uses a linear kernel with the penalty parameter at its default value, and Platt's probabilistic output is used to approximate the posterior probability.
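To make the training setup concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with Laplace smoothing, trained on the union of an original training set and its opposite training set. It is a toy stand-in, not the toolkit-based classifiers used in the experiments; the class name `NaiveBayes`, the four training sentences, and the choice to ignore words unseen in training are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one by default) smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, tokens):
        """Return {label: posterior probability} for a token list."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.smoothing * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # assumption: skip unseen words
                    score += math.log((counts[tok] + self.smoothing) / denom)
            log_scores[label] = score
        top = max(log_scores.values())  # subtract max for numerical stability
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


# Toy data (assumed): original texts plus their opposite texts.
original = [(["i", "don't", "like", "it"], "negative"),
            (["i", "love", "it"], "positive")]
opposite = [(["i", "like", "it"], "positive"),
            (["i", "hate", "it"], "negative")]
docs, labels = zip(*(original + opposite))
model = NaiveBayes().fit(docs, labels)
probs = model.predict_proba(["don't", "like"])
```

Training on both sets means "like" is seen with a positive label (from the opposite text) as well as a negative one (from the negated original), which is the effect the description attributes to the data enhancement.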

Since the ultimate goal of the invention is to make more robust predictions on the original text, a trade-off parameter α (0 < α < 1) is introduced so that the prediction for the original text remains the primary concern and the opposite text does not take precedence over it. On the Stanford data set, prediction accuracy is higher when the trade-off parameter is between 0.4 and 0.8. The same trade-off parameter is then applied to the Sanders Twitter sentiment corpus. The two experiments show that the model classifies both data sets with high accuracy when α is between 0.4 and 0.8. To obtain better and more stable experimental results, we set α to 0.5.
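The combination of the two predictions can be sketched as follows. The description speaks of a weighted combination governed by α but the extracted text does not give the formula, so the linear interpolation used here, with weight (1 - α) on the original text and α on the opposite text, is an assumption, as is the function name `combine`.

```python
def combine(p_orig, p_opp, alpha=0.5):
    """Combine posteriors of the original text x and the opposite text x'.

    p_orig -- {"positive": p(+|x),  "negative": p(-|x)}
    p_opp  -- {"positive": p(+|x'), "negative": p(-|x')}
    alpha  -- trade-off parameter, 0 < alpha < 1 (assumed linear weighting)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Positive score: x looks positive AND its opposite x' looks negative.
    pos = (1 - alpha) * p_orig["positive"] + alpha * p_opp["negative"]
    # Negative score: x looks negative AND its opposite x' looks positive.
    neg = (1 - alpha) * p_orig["negative"] + alpha * p_opp["positive"]
    label = "positive" if pos >= neg else "negative"
    return label, pos, neg


# The original text is (wrongly) scored slightly positive, but its opposite
# text is scored strongly positive, so the combined label becomes negative.
label, pos, neg = combine({"positive": 0.6, "negative": 0.4},
                          {"positive": 0.9, "negative": 0.1})
```

With α = 0.5 the two texts contribute equally; values closer to 0 keep the decision closer to the prediction for the original text alone.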

The model is supported by data enhancement and is built on three classifiers: support vector machine, naive Bayes, and logistic regression. Among them, the support vector machine classifier obtained the best results, probably because our negation processing further improved the interpretability of the support vector machine.

The effectiveness of the model in handling polarity shift during the training stage is illustrated with a text sample.

Original training text: I don't like this eye shadow track, and it is unpretty. Label: negative

Opposite training text: I like this eye shadow track, and it is elegant. Label: positive

In everyday language, "like" is a word with strong positive emotion, but the negation word "not" in the original text shifts the polarity, and the positive word "like" is erroneously associated with the negative label of the original comment. Under maximum likelihood estimation, the weight of "like" then receives a negative contribution. Because the invention removes the negation when generating the opposite text in the model training stage, "like" is also correctly associated with the positive label of the inverted text, and its weight receives a positive contribution. From this the invention concludes that the learning error caused by negation can be partially compensated by negation processing in the model training stage.

The same text also illustrates why the invention can effectively address polarity shift in the prediction stage.

Original test text: I don't like this eye shadow track, and it is unpretty EMO_NEG.

Opposite test text: I like this eye shadow track, and it is elegant EMO_POS.

For the original test text, although a negation structure is present, "like" contributes strongly toward a positive prediction for the sample as a whole, so a conventional machine learning algorithm may incorrectly assign the original test text to the positive class. However, the negation term "not" is removed during data enhancement, so the positive effect of "like" applies without conflict in the opposite test sample; together with the positive emoticon represented by the mark EMO_POS, this makes it very likely that the opposite test sample is classified as positive. In addition, the original prediction and the opposite prediction are combined by weighting in the prediction stage, so that the opposite prediction supports the original prediction and compensates to some extent for its learning error. Moreover, when the negation scope is defined, it is not simply delimited by punctuation marks; complex negations involving conjunctions are taken into account, so negation coverage is considered more thoroughly. The model provided by the invention can therefore effectively reduce prediction errors caused by polarity shift in the prediction stage and improve the robustness of the prediction results.

The above analysis shows that, given effective negation processing of the text, generating opposite texts with data enhancement and taking both emotional sides of a text into account in the training and prediction stages can effectively improve the accuracy of emotion analysis, which demonstrates the effectiveness and practicability of the emotion classifier trained by this method from an application perspective.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
