Multi-label text data feature selection method and device

Document No.: 1242907  Publication date: 2020-08-18

Reading note: This technology, "Multi-label text data feature selection method and device" (一种多标记的文本类数据特征选择方法及装置), was designed and created by Sun Lin, Wang Tianxiang, Li Wenfeng and Li Mengmeng on 2020-04-03. Abstract: The invention relates to a multi-label text data feature selection method and device, belonging to the technical field of text data processing. The method first considers the second-order correlation between labels in the text data set and groups the labels so that the scoring criterion can be better applied to multi-label data sets; the final score of each feature is determined from the scores calculated for the feature in each label group, and a set number of the highest-scoring features is selected to form a feature set. Then, based on the obtained feature set, the neighborhood granularity of each sample is determined according to the classification interval of each sample with respect to the labels in the text data set, yielding a multi-label neighborhood decision system; the importance is calculated using the dependency of an improved neighborhood rough set, and the obtained feature set is screened again, thereby realizing feature selection for multi-label text data. Compared with the original neighborhood rough set feature selection method over the full attribute set, the invention has lower time complexity and finds the optimal feature subset more accurately.

1. A multi-label text data feature selection method is characterized by comprising the following steps:

1) acquiring a text data set containing a plurality of marks;

2) dividing the marks into three types of mark groups of positive correlation, negative correlation and irrelevance according to the second-order correlation between the marks in the text data set;

3) calculating the scores of the features in the mark groups according to the categories of the mark groups, determining the final score of each feature according to the score calculated by the features in each mark group, and selecting a set number of features with higher scores from the final scores to form a feature set;

4) determining the neighborhood granularity of each sample according to the classification interval of each sample to the mark in the text data set to obtain a multi-mark neighborhood rough set;

5) constructing a multi-mark neighborhood decision system according to the neighborhood granularity and the feature set, determining, under the multi-mark neighborhood decision system, the number of samples belonging to the set X_j, j = 1, 2, …, M, and the number belonging to the set X̄_j, and determining the dependency of the multi-label neighborhood rough set accordingly, wherein M is the number of decision attributes in the decision set, and X_j and X̄_j are the division of the sample set under the j-th mark, respectively representing the sample set which hits the j-th mark and the sample set which does not hit the j-th mark;

6) and calculating the importance of the condition attribute in the multi-marker neighborhood decision system relative to the decision attribute according to a multi-marker neighborhood rough set dependency formula, and screening the condition attribute according to the importance to realize feature selection of the text data.

2. The method of claim 1, wherein each feature score is calculated by the formula:

where C = {f_1, f_2, …, f_m} denotes the full feature set, L = {l_1, l_2, …, l_t} denotes the full label set, n_k denotes the number of class-k samples, f_{j,i} denotes the value of the i-th feature in the j-th sample, μ_i denotes the mean value of the i-th feature f_i over all samples, μ_k^i denotes the mean value of the i-th feature f_i in the k-th class, c denotes the total number of classes, and R_g(l_a, l_b) denotes the correlation weight of label l_a and label l_b.

3. The method of claim 1, wherein the sample-to-tag classification interval is:

where margin_l(x) is the classification interval of sample x with respect to label l_i, NM_l(x) is the list of heterogeneous-sample distances sorted in ascending order, NH_l(x) is the list of homogeneous-sample distances sorted in ascending order, |NH_l(x)| is the number of homogeneous samples, |NM_l(x)| is the number of heterogeneous samples, NM_l(x_i) and NH_l(x_i) respectively denote the i-th nearest heterogeneous sample and the i-th nearest homogeneous sample under the class label l, and Δ(x, NM_l(x_i)) and Δ(x, NH_l(x_i)) respectively denote the distances from the sample point x to NM_l(x_i) and NH_l(x_i).

4. The method of claim 3, wherein the neighborhood granularity is calculated by the formula:

where margin_{l_i}(x) is the classification interval of sample x with respect to label l_i, M is the number of labels, and the resulting value is the neighborhood granularity of sample x.

5. The method of claim 1, wherein the multi-label neighborhood decision system is MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} denotes the set of text data samples, B = {f_1, f_2, …, f_N} ⊆ C is a feature subset describing the text data, N ≤ |C|, L = {l_1, l_2, …, l_M} is the corresponding label set, and D = {l_1, l_2, …, l_m} is the set of classification decision attributes.

6. the method of claim 1, wherein the improved dependency of the multi-labeled neighborhood rough set is calculated by the formula:

where ρ_B(D) is a weight coefficient, |H(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X_j, j = 1, 2, …, M, under the feature set B, |M(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X̄_j under the feature set B, |U| is the number of samples in the training set, |L| is the number of labels in the label set, BN_D is the lower approximation of the multi-label neighborhood rough set, δ_B(x_i) is the set of samples within the neighborhood radius of the i-th sample under the feature subset B, D_j denotes the set of samples having class label l_j, D_i denotes the label set of sample x_i, U = {x_1, x_2, …, x_n} denotes the sample set, and B = {f_1, f_2, …, f_N} denotes the feature subset.

7. The method of claim 6, wherein the importance is calculated by the formula:

where sig(a, B, D) is the importance of the conditional attribute a ∈ C−B relative to the decision attribute D, γ_{B∪{a}}(D) is the dependency of the decision attribute D on the conditional attribute set B ∪ {a}, and γ_B(D) is the dependency of the decision attribute D on the conditional attribute set B.

8. A multi-labeled text class data feature selection apparatus comprising a processor and a memory, the processor executing a computer program stored by the memory to implement the multi-labeled text class data feature selection method of any one of claims 1-7.

Technical Field

The invention relates to a multi-label text data feature selection method and device, and belongs to the technical field of text data processing.

Background

Multi-label learning is a research hotspot in the fields of pattern recognition, machine learning, data mining, data analysis and the like. In multi-labeled learning, each instance is not only described by a set of feature vectors, but also corresponds to a plurality of decision attributes. There are also many problems in real life that fall into the category of multi-label learning, such as: a movie may belong to multiple categories simultaneously, such as "action", "science fiction" and "war"; a document may have multiple topics simultaneously, such as "medicine", "science and technology", and "artificial intelligence"; an image may be labeled with multiple semantics such as "street", "car" and "pedestrian" simultaneously. It is difficult to accurately classify the problems using the single label classification method, and therefore, in recent years, researchers have increasingly focused on multi-label learning.

Many challenges are faced in the process of studying multi-label classification: on one hand, each instance may have multiple category labels at the same time, and there is also some correlation between these labels; on the other hand, in multi-label data the dimensionality is usually high, which may cause the curse of dimensionality and seriously affect the classification performance of the classifier. Therefore, in data preprocessing, dimension reduction techniques are important. Feature extraction and feature selection are the main means of feature dimension reduction: the former converts the original high-dimensional features into a new low-dimensional space by a transformation or mapping method; the latter selects an optimal feature subset from the original feature space according to a certain evaluation criterion. There are three main approaches to feature selection for handling multi-label data: filter, wrapper and embedded methods. The filter method depends on general characteristics of the training data and treats feature selection as a preprocessing step, so it has lower computational cost and better generalization ability; the wrapper method uses a base model to perform multiple rounds of training, removes several features with small weight coefficients after each round, and then performs the next round of training on the new feature set, which is computationally expensive; the embedded method integrates the feature selection process into the training process to reduce the total time spent re-classifying different subsets.

The Fisher Score method, which evolved from Fisher Discriminant Analysis (FDA), is a relatively common feature selection method under supervised learning. In 2002, Guyou et al. proposed an F-score feature selection formula very similar to Fisher discriminant analysis; subsequently, Chen et al. proposed an expression of the F-score for binary problems; in 2010, Salih et al. first improved the F-Score so that it could be applied to multi-classification problems; in 2011, Gu et al. considered the correlation and redundancy among features, further improved the F-Score and proposed a generalized Fisher Score; in 2012, Xie et al. modified the multi-class Fisher Score in consideration of the dimension problem between features; in 2013, Tao et al. considered the overlap between categories and the consistency of features, and added a weighting coefficient to the traditional formula. However, the conventional Fisher Score can typically only be computed for single-label datasets.

Feature selection is an essential preprocessing step in multi-label learning, and multi-label learning is commonly used to handle many complex tasks. Among the various feature selection methods, the rough set, as a specific granular computing model, has attracted much attention due to the following advantage: it can discover data dependencies and reduce the number of features using only the attributes contained in the data set, without any other information and under the constraints of a limited information set. Zhang and Li proposed a rough-set-based multi-label algorithm for fractal endpoint detection in order to maintain good performance and handle noise with higher irregularity than speech. The main strategy of converting a multi-label feature selection task into a plurality of binary single-label feature selection tasks is called problem transformation, but problem transformation cuts off the relation among labels and easily produces unbalanced data. The traditional rough set model can only process discrete data, and data containing real values or noise are usually discretized in preprocessing, which may lead to low classification accuracy. In order to overcome this defect, many researchers have supplemented and improved the traditional rough set theory: for example, Li et al. studied a feature reduction method based on the neighborhood rough set and the discernibility matrix; Zhang et al. proposed different fuzzy relations based on different types of attributes to measure the similarity between samples, and proposed several robust Fuzzy Rough Set (FRS) models to enhance the robustness of classical FRS; Wang et al. constructed a local neighborhood rough set to process labeled data. In these improved models, however, the judgment condition used in the importance calculation is too strict, and features with similar importance cannot be further distinguished, so the finally selected features are not accurate enough.

Disclosure of Invention

The invention aims to provide a multi-label text data feature selection method and a multi-label text data feature selection device, and aims to solve the problems of low accuracy and complex algorithm of the existing multi-label text data feature selection method.

The invention provides a multi-label text data feature selection method for solving the technical problems, which comprises the following steps:

1) acquiring a text data set containing a plurality of marks;

2) dividing the marks into three types of mark groups of positive correlation, negative correlation and irrelevance according to the second-order correlation between the marks in the text data set;

3) calculating the scores of the features in the mark groups according to the categories of the mark groups, determining the final score of each feature according to the score calculated by the features in each mark group, and selecting a set number of features with higher scores from the final scores to form a feature set;

4) determining the neighborhood granularity of each sample according to the classification interval of each sample to the mark in the text data set to obtain a multi-mark neighborhood rough set;

5) constructing a multi-mark neighborhood decision system according to the neighborhood granularity and the feature set, determining, under the multi-mark neighborhood decision system, the number of samples belonging to the set X_j, j = 1, 2, …, M, and the number belonging to the set X̄_j, and determining the dependency of the multi-label neighborhood rough set accordingly, wherein M is the number of decision attributes in the decision set, and X_j and X̄_j are the division of the sample set under the j-th mark, respectively representing the sample set which hits the j-th mark and the sample set which does not hit the j-th mark;

6) and calculating the importance of the condition attribute in the multi-marker neighborhood decision system relative to the decision attribute according to the dependence of the multi-marker neighborhood rough set, and screening the condition attribute according to the importance to realize feature selection of the text data.

The invention also provides a multi-labeled text data feature selection device, which comprises a processor and a memory, wherein the processor executes a computer program stored by the memory to realize the multi-labeled text data feature selection method.

The method comprises the steps of firstly, considering second-order correlation between marks in a text data set, grouping the marks, calculating scores of features in each mark group according to the category of the mark group, improving a Fisher-Score method to be better suitable for a multi-mark data set, determining the final Score of each feature according to the Score calculated by each mark group according to the features, and selecting the set number of features with higher scores to form a feature set; and then based on the obtained feature set, determining the neighborhood granularity of each sample according to the classification interval of each sample to the mark in the text data set to obtain a multi-mark neighborhood rough set, calculating the importance degree by utilizing the dependency degree of the neighborhood rough set, and screening the obtained feature set again so as to realize feature selection of the multi-mark text data. Compared with the original neighborhood rough set feature selection algorithm aiming at the overall attributes, the time complexity is lower, and the optimal feature subset is searched more accurately.

Further, to better fit into the multi-labeled text dataset, the calculation formula for each feature score is:

where C = {f_1, f_2, …, f_m} denotes the full feature set, L = {l_1, l_2, …, l_t} denotes the full label set, n_k denotes the number of class-k samples, f_{j,i} denotes the value of the i-th feature in the j-th sample, μ_i denotes the mean value of the i-th feature f_i over all samples, μ_k^i denotes the mean value of the i-th feature f_i in the k-th class, c denotes the total number of classes, and R_g(l_a, l_b) denotes the correlation weight of label l_a and label l_b.

Further, to avoid noise data interference, the classification interval of the sample to the marker is:

where margin_l(x) is the classification interval of sample x with respect to label l_i, NM_l(x) is the list of heterogeneous-sample distances sorted in ascending order, NH_l(x) is the list of homogeneous-sample distances sorted in ascending order, |NH_l(x)| is the number of homogeneous samples, |NM_l(x)| is the number of heterogeneous samples, NM_l(x_i) and NH_l(x_i) respectively denote the i-th nearest heterogeneous sample and the i-th nearest homogeneous sample under the class label l, and Δ(x, NM_l(x_i)) and Δ(x, NH_l(x_i)) respectively denote the distances from the sample point x to NM_l(x_i) and NH_l(x_i).

Further, in order to more accurately divide the neighborhood rough set, the calculation formula of the neighborhood granularity is as follows:

where margin_{l_i}(x) is the classification interval of sample x with respect to label l_i, M is the number of labels, and the resulting value is the neighborhood granularity of sample x.

Further, the multi-label neighborhood decision system is MDNS ═ U, C ∪ D, >, U ═ x ═ X1,x2,…,xnDenotes a set of text data samples,B={f1,f2,…,fNc is a feature set describing the text data, N ≦ C |, L ≦ L1,l2,…,lMIs the corresponding label set, D ═ l1,l2,…,lmIs the set of classification decision attributes,

further, in order to effectively reduce the risk that the important attribute is overlooked, the calculation formula of the dependency of the multi-labeled neighborhood rough set is as follows:

where ρ_B(D) is a weight coefficient, |H(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X_j, j = 1, 2, …, M, under the feature set B, |M(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X̄_j under the feature set B, |U| is the number of samples in the training set, |L| is the number of labels in the label set, BN_D is the lower approximation of the multi-label neighborhood rough set, δ_B(x_i) is the set of samples within the neighborhood radius of the i-th sample under the feature subset B, D_j denotes the set of samples having class label l_j, D_i denotes the label set of sample x_i, U = {x_1, x_2, …, x_n} denotes the sample set, and B = {f_1, f_2, …, f_N} denotes the feature subset.

Further, the calculation formula of the importance is as follows:

where sig(a, B, D) is the importance of the conditional attribute a ∈ C−B relative to the decision attribute D, γ_{B∪{a}}(D) is the dependency of the decision attribute D on the conditional attribute set B ∪ {a}, and γ_B(D) is the dependency of the decision attribute D on the conditional attribute set B.

Drawings

FIG. 1 is a schematic diagram of a classification interval of a sample according to an embodiment of the present invention;

FIG. 2 is a flow chart of a multi-labeled text class data feature selection method of the present invention;

FIG. 3-a is a schematic diagram showing the comparison of the index AP of the present invention with that of the prior art method under the Business data set in the experimental example;

FIG. 3-b is a graphical representation of a comparison of the index CV of the present invention with the prior art method under the Business data set in the experimental example;

FIG. 3-c is a graph showing the comparison of the HL level of the present invention with that of the prior art method under the Business data set in the experimental example;

FIG. 3-d is a graphical representation of the RL comparison of the metrics of the present invention and the prior art method under the Business data set in the experimental example;

FIG. 3-e is a schematic representation of the index MicF1 of the present invention compared to the prior art method under the Business data set in the experimental example;

FIG. 4-a is a schematic diagram of the comparison of the index AP of the present invention with the index AP of the prior art method under the Computer data set in the experimental example;

FIG. 4-b is a graphical representation of the comparison of the index CV of the present invention with that of the prior art method under the Computer data set in the experimental example;

FIG. 4-c is a graphical representation of the HL comparison between the present invention and the prior art method in the Computer data set of the experimental example;

FIG. 4-d is a schematic diagram showing the comparison of the RL index of the present invention with that of the prior art method under the Computer data set in the experimental example;

FIG. 4-e is a schematic representation of the MicF1 index of the present invention compared to the prior art method under the Computer data set in the experimental example;

fig. 5 is a block diagram showing the structure of the multi-labeled text data feature selection apparatus according to the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings.

Method embodiment

Before describing the specific means of the invention, some background knowledge is introduced: the concepts related to mutual information, the Fisher-Score algorithm, and the neighborhood rough set algorithm.

1) Mutual information correlation concept

Assuming that A, B are two events and P (A) >0, the conditional probability of event B occurring under the condition of event A occurring is:
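(The image for formula (1) is not reproduced here; it is presumably the standard definition of conditional probability.)

P(B | A) = P(AB) / P(A)    (1)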

For a discrete random variable X = {x_1, x_2, …, x_n}, the information entropy of the random variable X can be represented as:
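(The image for formula (2) is not reproduced here; from the definitions that follow it is presumably the standard Shannon entropy.)

H(X) = − Σ_{i=1}^{n} P(x_i) log P(x_i)    (2)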

in the formula, P (x)i) For the occurrence of an event xiThe probability of (d); n is the total number of events (states) that can occur. Obviously, for a fully determined variable X, h (X) ═ 0; for random variable X, there are H (X)>0 (non-negative), and the value of h (x) increases (increases) with the increase of the state number n, i.e. the larger the number of values of the random variable, the more the state number, the larger the information entropy, the larger the chaos degree, and when the random distribution is uniform, the entropy is the largest.

For two different discrete random variables X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m}, the joint entropy of the random variable X and the random variable Y is defined as follows:
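(The image for formula (3) is not reproduced here; it is presumably the standard joint entropy.)

H(X, Y) = − Σ_{i=1}^{n} Σ_{j=1}^{m} P(x_i, y_j) log P(x_i, y_j)    (3)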

where P(x_i, y_j) is the joint probability of x_i and y_j, i.e., the probability that the events x_i and y_j occur simultaneously.

For two different discrete random variables X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m}, the conditional entropy of the random variable X with respect to the random variable Y can be defined as:
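(The image for formula (4) is not reproduced here; it is presumably the standard conditional entropy.)

H(X | Y) = − Σ_{j=1}^{m} P(y_j) Σ_{i=1}^{n} P(x_i | y_j) log P(x_i | y_j)    (4)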

where P(y_j) is the probability that the event y_j occurs alone and P(x_i | y_j) is the conditional probability that the event x_i occurs given that the event y_j has occurred. Obviously, when X and Y are completely independent, H(X | Y) = H(X); when X and Y are fully correlated, H(X | Y) = 0; for generally correlated variables, H(X | Y) > 0. Similarly, the conditional entropy of the random variable Y with respect to the random variable X can be defined as:
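(The image for formula (5) is not reproduced here; it is presumably the symmetric counterpart of formula (4).)

H(Y | X) = − Σ_{i=1}^{n} P(x_i) Σ_{j=1}^{m} P(y_j | x_i) log P(y_j | x_i)    (5)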

For the variable X, the reduction of its uncertainty caused by the occurrence of the variable Y, which reflects the correlation between the two, is called mutual information and is defined as follows:

I(X,Y)=H(X)-H(X|Y) (6)

where H (X) is the information entropy of the random variable X, and H (X | Y) is the conditional entropy of the random variable X with respect to the random variable Y. It can be proved that mutual information is non-negative, i.e. I (X, Y) ≧ 0, and also has reciprocity, i.e.:

I(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=I(Y,X) (7)

the joint entropy of the random variable X and the random variable Y is:
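(The image for formula (8) is not reproduced here; it presumably relates the joint entropy to the individual entropies and the mutual information.)

H(X, Y) = H(X) + H(Y) − I(X, Y)    (8)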

in the formula, H (X) and H (Y) represent information entropies of the random variable X and the random variable Y, respectively, and H (X, Y) represents a joint entropy of the random variable X and the random variable Y.

Mutual information has the drawback of not being normalized. In order to compare the degree of interdependence between different variables, it can be normalized using a generalized correlation function:
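(The image for formula (9) is not reproduced here; a common normalized form consistent with the properties stated below is assumed.)

R_g(X, Y) = 2 I(X, Y) / (H(X) + H(Y))    (9)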

where 0 ≤ R_g(X, Y) ≤ 1. It can be seen that when the random variables X and Y are fully correlated, I(X, Y) = H(X) = H(Y) and R_g(X, Y) = 1; when X and Y are completely independent, I(X, Y) = 0 and R_g = 0. The larger the value of R_g, the stronger the correlation between the random variables X and Y.

Mutual information measures the statistical dependence between a certain feature and a certain category. For a feature f and a category l_i, the mutual information can be defined as:
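(The image for formula (10) is not reproduced here; from the sign analysis that follows, it is presumably the pointwise mutual information between feature occurrence and category membership.)

MI(f, l_i) = log [ P(f, l_i) / (P(f) P(l_i)) ] = log [ P(f | l_i) / P(f) ]    (10)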

where P(f, l_i) represents the probability that a training sample both contains the feature f and belongs to the category l_i, P(f) represents the probability that a training sample contains the feature f, P(l_i) represents the probability that a training sample belongs to the category l_i, and P(f | l_i) represents the probability that a sample of category l_i contains the feature f. As can be seen from equation (10), when P(f | l_i) > P(f), MI(f, l_i) > 0, indicating that the feature f and the category l_i are positively correlated, and a larger value of MI(f, l_i) indicates a stronger positive correlation between f and l_i; conversely, when P(f | l_i) < P(f), MI(f, l_i) < 0, indicating that the feature f and the category l_i are negatively correlated, and a smaller value of MI(f, l_i) indicates a stronger negative correlation between f and l_i.

2) Fisher-Score algorithm

Fisher-Score is an effective standard for judging sample characteristics, and the traditional Fisher-Score is derived from a Fisher linear discriminant method, and essentially selects the characteristics with small intra-class difference and large inter-class difference.

Given a feature set {f_1, f_2, …, f_m} and training samples x_j ∈ R^m, j = 1, 2, …, N, drawn from c (c ≥ 2) categories, the between-class divergence S_b(f_i) of the i-th feature f_i of the training samples and the within-class divergence S_w^k(f_i) of the class-k samples on the i-th feature f_i are defined as:
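(The images for formulas (11) and (12) are not reproduced here; from the definitions that follow, they are presumably the standard between-class and within-class divergences.)

S_b(f_i) = Σ_{k=1}^{c} n_k (μ_k^i − μ_i)^2    (11)

S_w^k(f_i) = Σ_{j=1}^{n_k} (f_{j,i}^{(k)} − μ_k^i)^2    (12)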

in the formula, nkIs the number of class k samples,is the mean value, μ, of class k samples under the i characteristiciIs the average of the whole sample under the ith feature,for the jth sample in the kth class sample in the ith feature fiThe following values. Thus, the ith feature f of the training sample can be obtainediThe Fisher Score of (A) is:

As can be seen from equation (13), the larger the between-class divergence S_b(f_i) of the i-th feature f_i and the smaller the sum of the within-class divergences of the c classes, the larger the value of FS(f_i), which indicates that the feature f_i has stronger discriminating power and greater importance.

3) Neighborhood rough set algorithm

Let U denote the sample space and x be a given sample, the classification interval for sample x is expressed as:

margin(x)=Δ(x-NM(x))-Δ(x-NH(x)) (14)

where NH(x) represents the homogeneous sample closest to sample x in the sample space U, called the Nearest Hit (NH) of x, and NM(x) represents the heterogeneous sample closest to sample x in the sample space U, called the Nearest Miss (NM) of x. Δ(x − NM(x)) and Δ(x − NH(x)) represent the distances from the sample point x to NM(x) and NH(x), respectively (see FIG. 1).

Suppose U is a sample space, and for any x ∈ U, x may be associated with labels from the label set L = {l_1, l_2, …, l_t}. Given l ∈ L, the classification interval of sample x under the label l is defined as:

ml(x)=Δ(x,NMl(x))-Δ(x,NHl(x)) (15)

where NH_l(x) denotes the homogeneous sample in the sample space U closest to x under the class label l, and NM_l(x) denotes the heterogeneous sample closest to x under the class label l. Δ(x, NH_l(x)) and Δ(x, NM_l(x)) respectively represent the distances from the sample point x to NH_l(x) and NM_l(x).

Assume the sample space is U and the label set is L = {l_1, l_2, …, l_t}. For a given label l ∈ L, when the classification interval m_l(x) of sample x under the label l satisfies m_l(x) > 0, the neighborhood of x is:

δ_l(x) = {y | Δ(x, y) ≤ m_l(x), y ∈ U}    (16)

When m_l(x) ≤ 0, let m_l(x) = 0.

Let DS = <U, Δ> be a non-empty metric space, x ∈ U, and δ ≥ 0; the δ-neighborhood of the point x is expressed as:

δ(x) = {y | Δ(x, y) ≤ δ, y ∈ U}.    (17)

Consider the set of all samples U = {x_1, x_2, …, x_n}; let A = {a_1, a_2, …, a_N} be a set of conditional attributes describing the samples and D = {l_1, l_2, …, l_m} be the set of classification decision attributes. Given <U, A, D>, if A generates a set of neighborhood relations on U, then <U, A, D> is called a neighborhood decision system.

Given a non-empty finite set Ω in real space and a neighborhood relation N on it, i.e., a pair NS = <U, N>, for any subset X ⊆ U, the upper and lower approximations of X in the neighborhood approximation space NS = <U, N> are respectively:
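(The images for formulas (18) and (19) are not reproduced here; they are presumably the standard neighborhood upper and lower approximations.)

N̄X = {x_i | δ(x_i) ∩ X ≠ ∅, x_i ∈ U}    (18)

N̲X = {x_i | δ(x_i) ⊆ X, x_i ∈ U}    (19)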

the approximate boundaries of X are:

for single label learning, the lower approximation of the neighborhood rough set embodies the ability of the attribute set to classify samples by borrowing the concept of the neighborhood. In multi-label learning, the definition of the lower approximation is also similar. The related concepts and properties of the multi-labeled neighborhood rough set model are given below.

In the multi-label neighborhood decision system MNDS = <U, C, D, f, Δ, δ>, with the label set L = {l_1, l_2, …, l_m}, D_j denotes the set of samples having class label l_j and D_i denotes the set of labels possessed by the sample x_i. For a given X ⊆ U, the approximation space of the multi-label neighborhood rough set is defined as:

the approximate boundary of X is

In the multi-label neighborhood decision system MNDS = <U, C ∪ D>, the lower approximation is called the positive region of the multi-label classification at the knowledge level given by the attribute set B, and is denoted POS_B(D). Thus, the dependency of the multi-label classification can be expressed as:
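(The image for formula (24) is not reproduced here; it is presumably the classical positive-region dependency, consistent with the range 0 ≤ r_B(D) ≤ 1 stated below.)

r_B(D) = |POS_B(D)| / |U|    (24)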

In the multi-label neighborhood decision system MNDS = <U, C ∪ D>, 0 ≤ r_B(D) ≤ 1, and:

1) when r_B(D) = 1, D is strongly dependent on B;

2) when 0 < r_B(D) < 1, D is weakly dependent on B;

3) when r_B(D) = 0, D is completely independent of B.

The definition of the dependency degree reflects the importance of the decision attributes to the condition attributes, so that the dependency degree of the result classification attributes on the condition attributes can be inspected, and the key attributes which play a decisive role in classification can be effectively found. Thus, the importance of the conditional attribute a ∈ C-B on the conditional attribute B relative to the decision attribute set D can be expressed as:

sig(a,B,D)=γB∪{a}(D)-γB(D). (25)

As can be seen from the definition of the attribute importance, when sig(a, B, D) = 0, the attribute a is redundant. There are two cases: either the attribute a is not related to the current classification task, or the classification information contained in the attribute a is already contained in other attributes; in both cases the attribute is said to be redundant.

On the basis of the technology, firstly, the Fisher-Score method is improved by combining the mutual information theory basis and the second-order mark correlation, so that the Fisher-Score method can be better suitable for a multi-mark data set; then, calculating the Score of each feature according to an MLFisher-Score (improved Fisher-Score) method, and sequencing the calculation results in a descending order to obtain a feature sequence; selecting attributes with higher scores from the characteristic sequences obtained by calculation through an MLFisher-Score method; and finally, under the attributes, according to a neighborhood rough set improved by each sample in the text data set for the labeled classification interval, using an attribute dependency and an importance calculation formula in the improved neighborhood rough set to select the features of the multi-labeled text, wherein the implementation flow of the method is shown in FIG. 2, and the specific process is as follows.

1. Multi-labeled text data is acquired.

2. According to the second-order correlation between the marks in the acquired text data set, the marks are divided into three types of mark groups, namely positive correlation, negative correlation and non-correlation.

The marker sets of the multi-marker data are all binary distributions, i.e., present or absent. In order to better recognize whether the relationship between two marks is a positive correlation, a negative correlation or an uncorrelated correlation, the invention calculates the correlation between two marks according to the formula (26) on the basis of the formula (10),

For a given multi-label dataset MNDS = <X, C ∪ L>, where X = {x_1, x_2, …, x_n} denotes the sample corpus, C = {f_1, f_2, …, f_m} denotes the full attribute set, and L = {l_1, l_2, …, l_t} denotes the complete label set, the correlation between label l_i and label l_j is:

where P(l_j) is the probability that the label l_j is hit, and P(l_j | l_i) is the probability that the label l_j is hit given that the label l_i is hit. In the same way, the correlation between the complement labels l̄_i and l̄_j can be obtained:

Formula (26) and formula (27) respectively calculate the correlation between label l_i and label l_j and the correlation between the complement labels l̄_i and l̄_j. From these two formulas, a new way of calculating the correlation between labels is defined, namely:

where MI(l_i | l_j) is the correlation of labels l_i and l_j, MI(l̄_i | l̄_j) is the correlation of the complement labels l̄_i and l̄_j, and l̄_i and l̄_j indicate that the labels l_i and l_j are missed.

By analyzing the whole tag set, the tag set in most multi-tag data sets is found to be a sparse matrix, that is, the number of missed tags is much larger than that of hit tags, and obviously, for this reason, the knowledge importance of two tags hitting at the same time is much larger than that of two tags missing at the same time. To address this situation, a corresponding improvement is made to equation (28), as follows:

in the formula, θ is an importance parameter and is calculated as follows

In the formula, the numerator is the total number of label hits over the whole label set, l_i^j denotes the j-th label of the i-th sample, and nt is the total number of samples multiplied by the total number of labels. Obviously, 0 ≤ θ ≤ 1; the sparser the label matrix, the smaller the value of θ and the greater the correlation weight given to labels being hit simultaneously. The positive or negative correlation between labels is calculated by equation (29): if the result ρ_ij is greater than 0, the labels l_i and l_j are positively correlated; if ρ_ij is less than 0, the labels l_i and l_j are negatively correlated; and if ρ_ij equals 0, the labels l_i and l_j are uncorrelated. However, when ρ_ij is close to 0, the two labels are in fact nearly independent rather than independent only at that exact point. Therefore ρ_ij is first normalized so that all its values are mapped into the interval [−1, 1]; it is then specified that when |ρ_ij| ≤ 0.2, the labels l_i and l_j are uncorrelated; when −1 ≤ ρ_ij < −0.2, the labels l_i and l_j are negatively correlated; and when 0.2 < ρ_ij ≤ 1, the labels l_i and l_j are positively correlated.
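As an illustration of this grouping step, the following Python sketch computes a pairwise label-correlation score from a binary label matrix and groups each label pair by the ±0.2 thresholds described above. It is only a sketch: the way the hit-based and miss-based terms are mixed through θ inside group_labels() is an assumption, since equations (26)–(29) are not reproduced in full here, and all function and variable names are illustrative.

import numpy as np

def pair_correlation(Y, i, j, eps=1e-12):
    # Hit-based and miss-based correlation of labels i and j (sketch of eqs. (26) and (27)).
    p_j = Y[:, j].mean()                                               # P(l_j)
    p_j_given_i = (Y[:, i] * Y[:, j]).sum() / max(Y[:, i].sum(), 1)    # P(l_j | l_i)
    mi_hit = np.log((p_j_given_i + eps) / (p_j + eps))
    ni, nj = 1 - Y[:, i], 1 - Y[:, j]                                  # complement (missed) labels
    p_nj = nj.mean()
    p_nj_given_ni = (ni * nj).sum() / max(ni.sum(), 1)
    mi_miss = np.log((p_nj_given_ni + eps) / (p_nj + eps))
    return mi_hit, mi_miss

def group_labels(Y):
    # Group every label pair as positive / negative / uncorrelated using the thresholds from the text.
    n, t = Y.shape
    theta = Y.sum() / (n * t)                       # eq. (30): overall hit ratio of the label matrix
    rho = np.zeros((t, t))
    for i in range(t):
        for j in range(i + 1, t):
            mi_hit, mi_miss = pair_correlation(Y, i, j)
            # Assumed combination: the sparser the matrix (small theta), the heavier the hit-based term.
            rho[i, j] = rho[j, i] = (1 - theta) * mi_hit + theta * mi_miss
    rho = rho / (np.abs(rho).max() + 1e-12)         # normalise all values into [-1, 1]
    groups = {"positive": [], "negative": [], "uncorrelated": []}
    for i in range(t):
        for j in range(i + 1, t):
            if abs(rho[i, j]) <= 0.2:
                groups["uncorrelated"].append((i, j))
            elif rho[i, j] > 0.2:
                groups["positive"].append((i, j))
            else:
                groups["negative"].append((i, j))
    return rho, groups

For example, rho, groups = group_labels(Y) with Y an n × t 0/1 label matrix yields the three groups used in the following step.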

If two labels are considered as a group, then all label groups have the following four cases, denoted respectively {1,1}, {1,0}, {0,1} and {0,0}. First, for a positively correlated label group, {1,0} and {0,1} are regarded as the same category for convenience of description, and the features that can better distinguish this category from the two cases {1,1} and {0,0} should be considered; secondly, for a negatively correlated label group, the two cases {1,1} and {0,0} are regarded as the same category, and the features that can better distinguish this category from the two cases {1,0} and {0,1} are considered; finally, for an uncorrelated label group, no cases are merged, and the features that can better distinguish {1,1}, {1,0}, {0,1} and {0,0} from one another are considered.

If there is a significant correlation between two labels, for example the two categories "finance" and "economy" in text classification, there is obviously a strong positive correlation between them, that is, the two labels often appear at the same time or are absent at the same time. In this case these two situations are taken as primary and the other situations as auxiliary: simultaneous appearance and simultaneous absence can be regarded as two opposite subjects, while the other situations cannot be ignored completely, since they may still be determined by some other labels or some key features. Therefore, at this time the three cases {{1,0}, {0,1}}, {1,1} and {0,0} are considered; the negatively correlated and uncorrelated cases are handled in the same way.

3. And calculating the scores of the features in the mark groups according to the categories of the mark groups, determining the final score of each feature according to the score calculated by the features in each mark group, and selecting a set number of features with higher scores from the final scores to form a feature set.

If a feature is discriminative, the variance of the feature values within the same class of samples should be as small as possible, and the variance between different classes of samples should be as large as possible, which facilitates subsequent operations such as classification and prediction. However, because the correlations among label groups differ in strength, using the Fisher-Score method for feature selection within a label group with stronger correlation obviously gives a more reliable result. The original Fisher-Score formula can only consider single-label data, while text data mostly belongs to the multi-label category. According to the second-order correlation among labels, the correlation between one label and another is positive, negative or irrelevant, i.e., whether one label occurs or not provides more or less information about whether another label occurs. Since the amount of data is too large, the correlation between labels is difficult to determine by some fixed value; therefore the correlation strength between labels is analyzed according to the result of formula (29), and the weight of the knowledge obtained from a label group differs with its correlation strength: the stronger the correlation of a label group, the higher the weight of the knowledge it provides and the higher its contribution to the feature score.

For a given multi-label dataset MNDS = <X, C ∪ L>, where X = {x_1, x_2, …, x_n} denotes the sample corpus, C = {f_1, f_2, …, f_m} denotes the feature corpus, and L = {l_1, l_2, …, l_t} denotes the label corpus, for a label group formed by a pair of labels (l_a, l_b) the score of each feature in that group is as follows:
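(The image of the group score formula is not reproduced here; based on the definitions that follow and the classical Fisher Score of formula (13), it presumably takes a correlation-weighted form such as the following, which should be read as a reconstruction rather than the exact expression of the invention.)

FS_{(a,b)}(f_i) = R_g(l_a, l_b) · [ Σ_{k=1}^{c} n_k (μ_k^i − μ_i)^2 ] / [ Σ_{k=1}^{c} Σ_{j=1}^{n_k} (f_{j,i} − μ_k^i)^2 ]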

where n_k denotes the number of class-k samples, f_{j,i} denotes the value of the i-th feature in the j-th sample, μ_i denotes the mean value of the i-th feature f_i over all samples, μ_k^i denotes the mean value of the i-th feature f_i in the k-th class, and c denotes the total number of classes, whose value depends on the correlation type: when l_a and l_b are positively or negatively correlated, c = 3; when l_a and l_b are uncorrelated, c = 4. R_g(l_a, l_b) (formula (9)) denotes the correlation weight of label l_a and label l_b. As can be seen from the formula, the stronger the correlation between label l_a and label l_b, the higher the feature score calculated, i.e., the scores calculated for more strongly correlated label groups carry greater importance.

The feature scores calculated in each label group are then combined by weighted averaging, and the results are finally arranged in descending order to obtain the feature sequence of the preprocessed multi-label data set; this feature sequence is also called the feature set.

4. And determining the neighborhood granularity of each sample according to the classification interval of each sample to the mark in the text data set to obtain a multi-mark neighborhood rough set.

In the original boundary-region calculation, only the distances from the target sample to the nearest homogeneous sample and the nearest heterogeneous sample are considered, so the calculation is very sensitive to noise. When analyzing text data sets, the original granularity formula is easily disturbed by noisy data. By considering, via the Euclidean distance, a portion of the samples rather than all of them, the interference of noisy data can be effectively avoided; at the same time, when the analyzed sample is itself a noise sample, the improved granularity formula can discard it more accurately, avoiding the large deviation in the calculation result that occurs when the target sample is noise or close to noise samples.

For a given multi-label neighborhood decision system MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} denotes the set of samples, B = {f_1, f_2, …, f_N} ⊆ C denotes a feature subset, N ≤ |C|, and L = {l_1, l_2, …, l_M} is the corresponding label set, the classification interval of a target sample x with respect to label l_i is:
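(The image for formula (33) is not reproduced here; from the definitions that follow, the margin presumably averages the distances to the heterogeneous and homogeneous samples instead of using only the nearest ones.)

margin_l(x) = (1 / |NM_l(x)|) Σ_i Δ(x, NM_l(x_i)) − (1 / |NH_l(x)|) Σ_i Δ(x, NH_l(x_i))    (33)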

where NM_l(x) is the list of heterogeneous-sample distances sorted in ascending order, NH_l(x) is the list of homogeneous-sample distances sorted in ascending order, |NH_l(x)| is the number of homogeneous samples, |NM_l(x)| is the number of heterogeneous samples, and NM_l(x_i) and NH_l(x_i) respectively denote the i-th nearest heterogeneous sample and the i-th nearest homogeneous sample under the class label l. Δ(x, NM_l(x_i)) and Δ(x, NH_l(x_i)) respectively represent the distances from the sample point x to NM_l(x_i) and NH_l(x_i). If the calculated result margin_l(x) is less than 0, the sample is likely to be noise, and margin_l(x) of that sample is set to 0. The neighborhood radius of each sample under all labels is then defined as follows.
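(The image for formula (34) is not reproduced here; from the symbol explanations in claim 4, the neighborhood granularity is presumably the label-wise average of the classification intervals; the symbol δ(x) is assumed here.)

δ(x) = (1 / M) Σ_{i=1}^{M} margin_{l_i}(x)    (34)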

It can be seen that equation (34) is the neighborhood granularity for each sample. At this time, the neighborhood granularity of each sample is already calculated, and according to the new neighborhood granularity, the approximate calculation formula of the multi-marker neighborhood rough set is as follows:

where δ_B(x_i) is the set of samples within the neighborhood radius, calculated by equation (34), of the i-th sample under the feature subset B.

5. A multi-label neighborhood decision system is constructed from the neighborhood granularity and the feature set; under this system, the number of samples belonging to the set X_j, j = 1, 2, …, M, and the number belonging to the set X̄_j are determined, and the dependency of the multi-label neighborhood rough set is determined from these counts.

When the original multi-label neighborhood rough set is used for feature selection, the partition obtained when the dependency is calculated in the original way is often poor; in addition, the traditional multi-label neighborhood rough set feature selection algorithm only considers which labels a sample with certain features is likely to have, and ignores which labels a sample with certain features is likely not to have. Therefore, in order to solve these problems, the invention improves the dependency function of the original neighborhood rough set.

The samples are divided as follows: U = {x_1, x_2, …, x_n} denotes the sample set and L = {l_1, l_2, …, l_M} is the corresponding label set; then:
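(The partition formulas are not reproduced here; from the description that follows, they are presumably:)

X_j = {x_i ∈ U | sample x_i hits label l_j},  X̄_j = U − X_j,  j = 1, 2, …, M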

where X_j and X̄_j are the division of the sample set under the j-th label, respectively representing the sample set hitting the j-th label and the sample set not hitting the j-th label.

Given the multi-label neighborhood decision system MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} denotes the set of samples, B = {f_1, f_2, …, f_N} ⊆ C denotes a feature subset, N ≤ |C|, and L = {l_1, l_2, …, l_M} is the corresponding label set, for the two families of sets divided as above, X = {X_1, X_2, …, X_M} and X̄ = {X̄_1, X̄_2, …, X̄_M}, the following definitions hold:

according to two divided sets X ═ X1,X2,…,XMAnddecision attribute D vs. conditional attribute subsetThe dependency of (d) can be expressed as:

where ρ_B(D) is a weight coefficient, |H(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X_j, j = 1, 2, …, M, under the feature set B, |M(δ_B(x_i))| represents the number of samples in δ_B(x_i) belonging to the set X̄_j under the feature set B, |U| is the number of samples in the training set, and |L| is the number of labels in the label set. From equation (41) it can be derived that 0 ≤ ρ_B(D) ≤ 1 and that it has the same monotonicity as the original dependency; the improved dependency formula γ_B(D) still satisfies the following properties:

1) when γ_B(D) = 1, D is strongly dependent on B;

2) when 0 < γ_B(D) < 1, D is weakly dependent on B;

3) when γ_B(D) = 0, D is completely independent of B.

Text data sets often have characteristics such as large data scale and high dimensionality. With the original dependency calculation formula, a newly added dimension or feature in the reduct subset usually has only a slight effect on the overall granularity: the samples falling in a neighborhood change little, so the numbers of samples in the upper and lower approximation sets do not change much as a feature is added. This causes a problem: the relative importance of some features cannot be judged accurately, and although one of those features may contain vital information, that key feature is overlooked for this reason, which fundamentally reduces the quality of the feature subset. The invention adopts an improved dependency calculation, namely formula (42), which effectively expands the mapping range and effectively reduces the risk that important attributes are overlooked.

In the multi-label neighborhood decision system, the definition of the dependency degree reflects the importance degree of the decision attribute to the condition attribute, and the dependency degree of the result classification attribute to the condition attribute can be examined, and the key attribute which determines the classification can be found, so that the purposes of feature selection and minimum feature subset discovery are achieved.

6. And calculating the importance of the condition attribute in the multi-marker neighborhood decision system relative to the decision attribute according to the dependence of the multi-marker neighborhood rough set, and screening the condition attribute according to the importance to realize feature selection of the text data.

In the multi-label neighborhood decision system MNDS = <U, C ∪ D, δ>, for B ⊆ C and a ∈ B, if γ_B(D) ≠ γ_{B−{a}}(D), then a is said to be necessary in B relative to the decision attribute D; otherwise it is unnecessary.

In the multi-label neighborhood decision system MNDS = <U, C ∪ D, δ>, for B ⊆ C, if:

(1) γ_B(D) = γ_C(D)

(2)

then B is called an attribute reduct of C. That is, if the dependency calculated under the current feature subset B is equal to the dependency calculated under the full feature set C, the process terminates, and the feature subset B at that point is the finally selected feature set. In the formula, γ_B(D) represents the dependency of the decision attribute D on the conditional attribute set B. On the basis of the neighborhood rough set dependency calculation formula (42), for any attribute subset B ⊆ C, the formula for calculating the importance of a conditional attribute a ∈ C−B relative to the decision attribute D is:
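(The image of the importance formula is not reproduced here; consistent with formula (25), it is presumably:)

sig(a, B, D) = γ_{B∪{a}}(D) − γ_B(D)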

from the viewpoint of attribute dependency, the importance of the attribute may provide an effective feature selection method, and if sig (a, B, D) is 0, it indicates that the attribute a is a redundant attribute or an irrelevant attribute, i.e., the attribute a is not relevant to the current classification task or the classification information included in the attribute a is already included in other attributes. Therefore, according to the importance degree, each attribute in the feature set can be screened, and redundant attributes or irrelevant attributes can be removed.
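The screening described in steps 5 and 6 amounts to a greedy forward reduction driven by the dependency and importance measures. The following Python sketch illustrates that loop; its dependency() uses the classical positive-region form of the neighborhood rough set as a simplified stand-in for the improved weighted formula (42), whose exact expression is not reproduced here, and names such as radius are illustrative assumptions.

import numpy as np

def neighborhoods(X, B, radius):
    # delta_B(x_i): indices of the samples within each sample's neighborhood radius under feature subset B.
    sub = X[:, B]
    dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    return [np.where(dist[i] <= radius[i])[0] for i in range(X.shape[0])]

def dependency(X, Y, B, radius):
    # Simplified positive-region dependency of the label set on feature subset B (stand-in for eq. (42)).
    if not B:
        return 0.0
    pos = 0
    for i, nb in enumerate(neighborhoods(X, list(B), radius)):
        # x_i is counted in the positive region if its whole neighborhood agrees with it on every label.
        if all(np.array_equal(Y[j], Y[i]) for j in nb):
            pos += 1
    return pos / X.shape[0]

def greedy_reduct(X, Y, radius, eps=1e-6):
    # Forward selection: repeatedly add the attribute with the largest importance sig(a, B, D).
    all_attrs = list(range(X.shape[1]))
    B, gamma_B = [], 0.0
    gamma_C = dependency(X, Y, all_attrs, radius)
    while gamma_C - gamma_B > eps:
        sig = {a: dependency(X, Y, B + [a], radius) - gamma_B for a in all_attrs if a not in B}
        if not sig:
            break
        best = max(sig, key=sig.get)
        if sig[best] <= 0:        # no remaining attribute increases the dependency
            break
        B.append(best)
        gamma_B += sig[best]      # gamma_B now equals dependency(X, Y, B, radius)
    return B

A call such as B = greedy_reduct(X, Y, radius), with X the matrix of candidate features from step 3, Y the binary label matrix and radius the per-sample neighborhood granularity from step 4, returns the indices of the finally selected features.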

Device embodiment

The apparatus proposed in this embodiment, as shown in fig. 5, includes a processor and a memory, where a computer program operable on the processor is stored in the memory, and the processor implements the method of the above method embodiment when executing the computer program.

That is, the method in the above method embodiment should be understood as follows: the flow of the multi-label text data feature selection method may be implemented by computer program instructions. These computer program instructions may be provided to a processor, such that execution of the instructions by the processor results in the implementation of the functions specified in the method flow described above.

The processor referred to in this embodiment refers to a processing device such as a microprocessor MCU or a programmable logic device FPGA;

the memory referred to in this embodiment includes a physical device for storing information, and generally, information is digitized and then stored in a medium using an electric, magnetic, optical, or the like. For example: various memories for storing information by using an electric energy mode, such as RAM, ROM and the like; various memories for storing information by magnetic energy, such as hard disk, floppy disk, magnetic tape, magnetic core memory, bubble memory, and U disk; various types of memory, CD or DVD, that store information optically. Of course, there are other ways of memory, such as quantum memory, graphene memory, and so forth.

The apparatus comprising the memory, the processor and the computer program is realized by the processor executing corresponding program instructions in the computer, and the processor can be loaded with various operating systems, such as windows operating system, linux system, android, iOS system, and the like.

As another embodiment, the device may further comprise a display for showing the results for the reference of the staff.

In order to evaluate the invention comprehensively, experiments are carried out below on test data sets to judge the effectiveness of the invention and to compare it with other existing algorithms on each index.

Two multi-label text datasets were selected for this experiment; a specific description of the datasets is shown in Table 1. The datasets can be downloaded from http: net/datasets.html. In order to evaluate the effectiveness of the algorithm proposed by the present invention, a comparison was made with four existing multi-label feature selection algorithms: MDFS, MDFS-O, MSSL, and GLOCAL (multi-label learning with global and local label correlation).

These experiments were run on the MATLAB 2016b platform, under Windows 10 with a 3.00 GHz processor and 8.00 GB of memory. The multi-label classification model ML-KNN is used for experimental evaluation, with the smoothing parameter set to 1 and the neighborhood parameter k set to 10. In order to reduce errors, the training set is divided into 10 parts and the calculation is carried out by ten-fold cross-validation with averaging.

TABLE 1

In the first part, the number of selected features (N) and five evaluation indexes, namely Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL) and Micro-averaged F1 (MicF1), are used to analyze and measure the experimental results.

Let the test set be given; according to the prediction function f_l(x), the ranking function can be defined as rank_f(x, l) ∈ {1, 2, …, |L|}.

N: the number of features selected after dimensionality reduction.

Average Precision (AP): over the predicted label rankings of all samples, the average probability that a label ranked above a relevant label of a sample is itself a relevant label of that sample; it is defined as:

where R_i = {l | Y_il = +1} represents the set of labels related to sample x_i, and R̄_i = {l | Y_il = −1} represents the set of labels unrelated to sample x_i.

Coverage (CV): measures, on average, how many steps are needed to go down the ranked label list of each sample in order to cover all labels related to that sample; it is defined as follows:

Hamming Loss (HL): measures how often a sample is misclassified on a single class label; it is defined as:

where the operator in the formula denotes the exclusive-or (symmetric difference) operation.

Ranking Loss (RL): the average probability, over all samples, that an unrelated label is ranked before a related label; it is defined as:

Micro-averaged F1 (MicF1): obtained by averaging the corresponding elements of the confusion matrices of all labels; it is defined as:

where micp_ij and micr_ij denote the micro precision and micro recall, respectively.

In the above 5 evaluation indexes, the larger the values of the indexes AP and MicF1 are, the better the classification performance is; the smaller the values of indexes CV, HL and RL are, the better the classification performance is, and the optimal value is 0.

FIGS. 3-a, 3-b, 3-c, 3-d and 3-e compare the evaluation indexes of the present invention with four other multi-label feature selection algorithms on the text data set Business, and FIGS. 4-a, 4-b, 4-c, 4-d and 4-e give the same comparison on the text data set Computer; the evaluation indexes are Average Precision, Coverage, Hamming Loss, Ranking Loss and Micro-F1. In these figures, only the indexes under the first 100 features are considered, and the line graphs are plotted at intervals of 10.

In the first part of the experiment, FIGS. 3-a, 3-b, 3-c, 3-d and 3-e show the comparisons of the indexes Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL) and Micro-F1 (MicF1), respectively, on the data set Business. As can be seen from the figures, the performance of the invention on the five indexes is better than that of the other four algorithms; only when more features are selected does the performance of the invention on the Hamming Loss (HL) index become worse. FIGS. 4-a, 4-b, 4-c, 4-d and 4-e show the comparison results of the indexes Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL) and Micro-F1 (MicF1), respectively, on the data set Computer. As can be seen from the figures, when the number of selected features is less than 70, the five indexes of the method perform similarly to MDFS and better than the other three algorithms; when the number of features is greater than 70, the five indexes of the method perform better than all four other algorithms.

In order to further analyze the experimental results, tables 2 to 6 respectively show the specific numerical values of the present invention and the existing four algorithms under the indicators RL, HL, CV, AP and mic.

TABLE 2

TABLE 3

TABLE 4

TABLE 5

TABLE 6

The data shown in bold in Tables 2–6 are the optimal values of the corresponding rows; it is clear from the tables that the invention achieves the optimal performance on each index compared with the other four multi-label feature selection algorithms.

The experimental result further shows that the method can select the feature subset with smaller scale and stronger classification capability for the text data set, and has certain advantages compared with the conventional multi-label feature selection algorithm.
