Power dispatching monitoring data anomaly detection method based on feature correlation partition regression

Document No.: 191183    Publication date: 2021-11-02

Reading note: this technology, "Power dispatching monitoring data anomaly detection method based on feature correlation partition regression", was designed by 高欣, 刘治宇, 李康生, 贾欣, 薛冰, 傅世元, 黄旭 and 黄子健 on 2021-08-23. Main content: an embodiment of the invention provides a power dispatching monitoring data anomaly detection method based on feature correlation partition regression, comprising: dividing power dispatching monitoring historical data into a training set and a test set, and calculating the correlation coefficient matrix among the training set features based on the Pearson correlation coefficient; dividing the training set into feature subspaces according to the calculated correlation coefficient matrix; selecting a feature as the pseudo label according to the degree of feature correlation within each feature subspace, using the remaining features as prediction attributes, and training a regression model for predicting the pseudo label based on support vector regression (SVR); dividing the test set into the same feature subspaces as the training set, and calculating the degree of abnormality of the test set samples in each feature subspace with the corresponding regression model; calculating the corresponding weight according to the degree of correlation within each feature subspace; and obtaining the detection result for the test set samples from the weighted, integrated final anomaly score.

1. A power dispatching monitoring data anomaly detection method based on feature correlation partition regression, characterized by comprising the following steps:

(1) calculating the correlation among the features, specifically:

randomly selecting part of all power dispatching monitoring historical data as a training set S, and using the remaining historical data as a test set T; the power dispatching monitoring historical data are real-time process resource occupation data, collected by the power dispatching monitoring system, that relate to the services of the power dispatching system, and their feature attributes comprise process CPU occupancy, memory occupancy, disk IO, network IO, thread count and network connection count; the feature dimension of the samples in the data set is N, and the corresponding Pearson correlation coefficient matrix C is calculated from the sample features of the training set S:

$$C = (\rho_{ab})_{N \times N}, \qquad \rho_{ab} = \frac{\mathrm{Cov}(x_a, x_b)}{\sqrt{\mathrm{Var}(x_a)\,\mathrm{Var}(x_b)}}$$

wherein $x_a$ and $x_b$ are the values of the samples in the training set S under the a-th and b-th feature attributes respectively, $\rho_{ab}$ is the correlation coefficient between $x_a$ and $x_b$, $a, b \in \{1, 2, \ldots, N\}$ and $a \neq b$; $\mathrm{Cov}(x_a, x_b)$ is the covariance between $x_a$ and $x_b$, $\mathrm{Var}(\cdot)$ is the corresponding variance, and $\rho_{ab} = \rho_{ba}$;

(2) Dividing a feature subspace, specifically comprising:

determining the number k of feature subspaces to be divided according to the feature dimension N of the samples in the training set S, and letting the i-th feature subspace $S_i$ contain $n_i$ features, then:

$$k = \mathrm{int}(\alpha \times N) + 1$$

wherein $i = 1, 2, \ldots, k$, $\mathrm{int}(\cdot)$ denotes rounding down, and the partition coefficient of the feature subspaces is $\alpha = 0.2$, which controls how many feature dimensions each feature subspace contains;

letting S' be the feature space, with the whole training set S as its initial value; in each iteration, obtaining from the Pearson correlation coefficient matrix C calculated in step (1) the feature $y_i$ with the highest correlation coefficient in the feature space S', and extracting the j features $x_i^1, x_i^2, \ldots, x_i^j$ ranked highest by correlation with $y_i$, where $j = n_i - 1$; $y_i$ and each of $x_i^1, \ldots, x_i^j$ are $l \times 1$ vectors containing the values of all samples in the training set S under the corresponding feature attribute, and l is the number of samples in the training set S; letting $x_i = [x_i^1, x_i^2, \ldots, x_i^j]$, taking $x_i$ and $y_i$ together as the i-th feature subspace $S_i$, updating the feature space $S' \leftarrow S' - S_i$, and continuing the iteration, the termination condition being $S' = \varnothing$, where $\varnothing$ denotes the empty set; the division into all k feature subspaces is thereby completed;

(3) training a feature subspace regression model, specifically:

taking the feature $y_i$ in the feature subspace $S_i$ obtained in step (2) as the pseudo label and the remaining features $x_i$ as prediction attributes, and training, based on the support vector regression (SVR) algorithm, a regression model $f_i(x_i)$ for predicting the pseudo label $y_i$; the general form of the trained model is:

$$f_i(x_i) = w^{T} x_i + b$$

wherein w and b are parameters obtained by model training, $w = [w_1, w_2, \ldots, w_j]$, and b is a constant term;

(4) calculating the abnormal degree of the test set samples in each characteristic subspace by using the trained regression model, specifically comprising the following steps:

dividing the feature attributes of the samples in the test set T into the same feature subspaces as the training set S, according to the division result of step (2); $x_i'$ denotes the prediction attributes of the samples in the test set T in the i-th feature subspace and $y_i'$ the corresponding pseudo-label attribute; each feature is an $l' \times 1$ vector, where $l'$ is the number of samples in the test set T;

predicting, with the model $f_i$ trained in step (3), the samples in the test set T after the feature subspace division: from $x_i'$, the predicted value $f_i(x_i')$ corresponding to the pseudo label $y_i'$ is obtained, and the difference vector $r_i$ between the true and predicted values of the pseudo label is calculated:

$$r_i = \left| y_i' - f_i(x_i') \right|$$

the entries of $r_i$ correspond one-to-one to the samples in the test set T; $r_i$ is taken as the anomaly score of the test set samples in the i-th feature subspace, and the larger a sample's value in $r_i$, the higher its degree of abnormality is considered to be; integrating the results of all feature subspaces gives the anomaly score set $r = [r_1, r_2, \ldots, r_k]$;

(5) Weighting each feature subspace result, specifically:

calculating the weight of each feature subspace from the degree of correlation of the feature subspaces obtained in step (2), forming a one-dimensional weight vector $\psi = [\psi_1, \psi_2, \ldots, \psi_k]$; the weight $\psi_i$ of the i-th feature subspace $S_i$ is the maximum value in the correlation coefficient vector of $y_i$, namely:

$$\psi_i = \mathrm{argmax}(C_i)$$

wherein $C_i \in C$ is the correlation coefficient vector of $y_i$, $m_i$ is the position of the feature $y_i$ in the Pearson correlation coefficient matrix C, and $\mathrm{argmax}(\cdot)$ here returns the maximum value itself; the higher the weight $\psi_i$, the higher the degree of correlation between the attributes in the i-th feature subspace, the better the trained model performs, and the more reliable the calculated anomaly score;

(6) integrating the anomaly scores $r_i$ and the weights $\psi_i$ of the feature subspaces obtained in steps (4) and (5) to obtain a final anomaly score, and obtaining the detection result for the samples in the test set T from the final anomaly score, thereby realizing anomaly detection of the power dispatching monitoring data.

2. The power dispatching monitoring data anomaly detection method based on feature correlation partition regression according to claim 1, wherein in step (1), 80% of the power dispatching monitoring historical data is used as the training set S and 20% as the test set T.

3. The power dispatching monitoring data anomaly detection method based on feature correlation partition regression according to claim 1, wherein in step (6), integrating the anomaly scores $r_i$ and the weights $\psi_i$ of the feature subspaces obtained in steps (4) and (5) to obtain the final anomaly score, and obtaining the detection result for the samples in the test set T from the final anomaly score, specifically comprises:

calculating the final anomaly score $R$ from the obtained weight vector $\psi$ and the anomaly score set $r$:

$$R = r \cdot \psi^{T} = \sum_{i=1}^{k} \psi_i r_i$$

where $\cdot$ denotes the matrix product;

ranking the samples in the test set T from high to low by their values in the final anomaly score $R$, and marking the top t% of the ranked test set samples as anomalous, with $5 \le t \le 10$, thereby realizing anomaly detection of the power dispatching monitoring data.

[ technical field ]

The invention relates to a power dispatching monitoring data anomaly detection method, in particular to a power dispatching monitoring data anomaly detection method based on feature correlation partition regression.

[ background of the invention ]

The smart grid is a new type of power grid built on the physical grid by tightly integrating modern sensing and measurement, communication, information, computer and control technologies with it; it covers power generation, transmission, transformation, distribution, consumption and dispatching. The smart grid dispatching control center is the command center for grid operation and control. Its stability directly affects the stability of the services it provides, and a crash of the control center system can cause great losses to operators and users. Artificial intelligence is widely applied in power systems, where it can effectively improve working efficiency and help ensure safe operation. Because the monitoring system generates a large amount of monitoring data in a short time while the grid is running, it is difficult to label the data as normal or abnormal manually, for example by consulting experts. The stored historical grid dispatching monitoring data therefore often lack accurate label information. Meanwhile, because the grid system is robust, the amount of abnormal data the monitoring system can collect is far smaller than the amount of normal data. Unsupervised anomaly detection methods, which do not require data labels, are therefore becoming an important way to solve problems in this field.
Anomalies are typically regarded as data points that lie in sparsely populated regions of the data set, far from their neighbors. Most unsupervised algorithms therefore mine the distribution of the data set and rely on differences in density or distance among samples in the feature space to separate normal from anomalous data. Although such methods are simple and fast, power grid data are high-dimensional and contain irrelevant attributes that are hard to identify; under these conditions, conventional unsupervised anomaly detection methods based on sample distribution are easily disturbed, which degrades detection performance. An anomaly detection method that is tailored to the characteristics of power dispatching monitoring data and can effectively improve detection accuracy without data labels is therefore of great significance for strengthening grid state monitoring and safeguarding grid security.

[ summary of the invention ]

In view of this, the invention provides a power dispatching monitoring data anomaly detection method based on feature correlation partition regression, so as to improve the performance of power dispatching monitoring data anomaly detection.

The invention provides a power dispatching monitoring data anomaly detection method based on feature correlation partition regression, which comprises the following steps:

(1) calculating the correlation among the features, specifically:

randomly selecting part of all power dispatching monitoring historical data as a training set S, and using the remaining historical data as a test set T; the power dispatching monitoring historical data are real-time process resource occupation data, collected by the power dispatching monitoring system, that relate to the services of the power dispatching system, and their feature attributes comprise process CPU occupancy, memory occupancy, disk IO, network IO, thread count and network connection count; the feature dimension of the samples in the data set is N, and the corresponding Pearson correlation coefficient matrix C is calculated from the sample features of the training set S:

$$C = (\rho_{ab})_{N \times N}, \qquad \rho_{ab} = \frac{\mathrm{Cov}(x_a, x_b)}{\sqrt{\mathrm{Var}(x_a)\,\mathrm{Var}(x_b)}}$$

wherein $x_a$ and $x_b$ are the values of the samples in the training set S under the a-th and b-th feature attributes respectively, $\rho_{ab}$ is the correlation coefficient between $x_a$ and $x_b$, $a, b \in \{1, 2, \ldots, N\}$ and $a \neq b$; $\mathrm{Cov}(x_a, x_b)$ is the covariance between $x_a$ and $x_b$, $\mathrm{Var}(\cdot)$ is the corresponding variance, and $\rho_{ab} = \rho_{ba}$;

(2) Dividing a feature subspace, specifically comprising:

determining the number k of feature subspaces to be divided according to the feature dimension N of the samples in the training set S, and letting the i-th feature subspace $S_i$ contain $n_i$ features, then:

$$k = \mathrm{int}(\alpha \times N) + 1$$

wherein $i = 1, 2, \ldots, k$, $\mathrm{int}(\cdot)$ denotes rounding down, and the partition coefficient of the feature subspaces is $\alpha = 0.2$, which controls how many feature dimensions each feature subspace contains;

letting S' be the feature space, with the whole training set S as its initial value; in each iteration, obtaining from the Pearson correlation coefficient matrix C calculated in step (1) the feature $y_i$ with the highest correlation coefficient in the feature space S', and extracting the j features $x_i^1, x_i^2, \ldots, x_i^j$ ranked highest by correlation with $y_i$, where $j = n_i - 1$; $y_i$ and each of $x_i^1, \ldots, x_i^j$ are $l \times 1$ vectors containing the values of all samples in the training set S under the corresponding feature attribute, and l is the number of samples in the training set S; letting $x_i = [x_i^1, x_i^2, \ldots, x_i^j]$, taking $x_i$ and $y_i$ together as the i-th feature subspace $S_i$, updating the feature space $S' \leftarrow S' - S_i$, and continuing the iteration, the termination condition being $S' = \varnothing$, where $\varnothing$ denotes the empty set; the division into all k feature subspaces is thereby completed;

(3) training a feature subspace regression model, specifically:

taking the feature $y_i$ in the feature subspace $S_i$ obtained in step (2) as the pseudo label and the remaining features $x_i$ as prediction attributes, and training, based on the support vector regression (SVR) algorithm, a regression model $f_i(x_i)$ for predicting the pseudo label $y_i$; the general form of the trained model is:

$$f_i(x_i) = w^{T} x_i + b$$

wherein w and b are parameters obtained by model training, $w = [w_1, w_2, \ldots, w_j]$, and b is a constant term;

(4) calculating the abnormal degree of the test set samples in each characteristic subspace by using the trained regression model, specifically comprising the following steps:

dividing the feature attributes of the samples in the test set T into the same feature subspaces as the training set S, according to the division result of step (2); $x_i'$ denotes the prediction attributes of the samples in the test set T in the i-th feature subspace and $y_i'$ the corresponding pseudo-label attribute; each feature is an $l' \times 1$ vector, where $l'$ is the number of samples in the test set T;

predicting, with the model $f_i$ trained in step (3), the samples in the test set T after the feature subspace division: from $x_i'$, the predicted value $f_i(x_i')$ corresponding to the pseudo label $y_i'$ is obtained, and the difference vector $r_i$ between the true and predicted values of the pseudo label is calculated:

$$r_i = \left| y_i' - f_i(x_i') \right|$$

the entries of $r_i$ correspond one-to-one to the samples in the test set T; $r_i$ is taken as the anomaly score of the test set samples in the i-th feature subspace, and the larger a sample's value in $r_i$, the higher its degree of abnormality is considered to be; integrating the results of all feature subspaces gives the anomaly score set $r = [r_1, r_2, \ldots, r_k]$;

(5) Weighting each feature subspace result, specifically:

calculating the weight of each feature subspace from the degree of correlation of the feature subspaces obtained in step (2), forming a one-dimensional weight vector $\psi = [\psi_1, \psi_2, \ldots, \psi_k]$; the weight $\psi_i$ of the i-th feature subspace $S_i$ is the maximum value in the correlation coefficient vector of $y_i$, namely:

$$\psi_i = \mathrm{argmax}(C_i)$$

wherein $C_i \in C$ is the correlation coefficient vector of $y_i$, $m_i$ is the position of the feature $y_i$ in the Pearson correlation coefficient matrix C, and $\mathrm{argmax}(\cdot)$ here returns the maximum value itself; the higher the weight $\psi_i$, the higher the degree of correlation between the attributes in the i-th feature subspace, the better the trained model performs, and the more reliable the calculated anomaly score;

(6) integrating the anomaly scores $r_i$ and the weights $\psi_i$ of the feature subspaces obtained in steps (4) and (5) to obtain a final anomaly score, and obtaining the detection result for the samples in the test set T from the final anomaly score, thereby realizing anomaly detection of the power dispatching monitoring data;

in step (1), 80% of the power dispatching monitoring historical data is used as the training set S and 20% as the test set T;

in step (6), integrating the anomaly scores $r_i$ and the weights $\psi_i$ of the feature subspaces obtained in steps (4) and (5) to obtain the final anomaly score, and obtaining the detection result for the samples in the test set T from the final anomaly score, specifically comprises:

calculating the final anomaly score $R$ from the obtained weight vector $\psi$ and the anomaly score set $r$:

$$R = r \cdot \psi^{T} = \sum_{i=1}^{k} \psi_i r_i$$

where $\cdot$ denotes the matrix product;

ranking the samples in the test set T from high to low by their values in the final anomaly score $R$, and marking the top t% of the ranked test set samples as anomalous, with $5 \le t \le 10$, thereby realizing anomaly detection of the power dispatching monitoring data.

The power dispatching monitoring data anomaly detection method improves the anomaly detection accuracy of the power dispatching monitoring data.

According to the above technical solution, the invention has the following beneficial effects:

in the technical solution of the invention, features are selected as pseudo labels, in place of true data labels, on the basis of the different correlations among feature attributes, and the features of the data set are partitioned according to this correlation information, so that strongly correlated features are used for regression prediction and the patterns among features are mined; meanwhile, the reliability of feature prediction under different degrees of correlation is taken into account by introducing the correlation coefficient as the weight of each partition's prediction result, which alleviates to some extent the performance degradation caused by increasing dimensionality and reduces the influence of irrelevant attributes, thereby improving the performance of power dispatching monitoring data anomaly detection.

[ description of the drawings ]

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.

FIG. 1 is a schematic diagram of a frame flow of a power scheduling monitoring data anomaly detection method based on feature correlation partition regression according to the present invention;

FIG. 2 is a flow chart diagram of a partition method based on feature correlation;

FIG. 3 is a schematic flow chart of a weighted regression prediction method based on feature correlation;

FIG. 4 is a schematic diagram of an abnormal detection method for power dispatching monitoring data based on feature correlation partition regression according to the present invention;

FIG. 5 is a schematic of the input data and output results of the algorithm of the present invention.

[ detailed description ]

For better understanding of the technical solutions of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings.

It should be understood that the described embodiments of the invention are only some, but not all embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a power dispatching monitoring data anomaly detection method based on feature correlation partition regression. To meet the requirements of power dispatching monitoring data anomaly detection, feature subspaces with a higher degree of correlation are divided according to the correlations among the features and are used to train regression models that detect anomalies in the data under test.

Fig. 1 is a schematic frame flow diagram of a power scheduling monitoring data anomaly detection method based on feature correlation partition regression, which includes the following steps:

Step 101, dividing the power dispatching monitoring historical data into a training set and a test set, and calculating the correlation coefficient matrix among the training set features based on the Pearson correlation coefficient.

Specifically, 80% of all power dispatching monitoring historical data is randomly selected as the training set S, and the remaining 20% is used as the test set T. The power dispatching monitoring historical data are real-time process resource occupation data, collected by the power dispatching monitoring system, that relate to the services of the power dispatching system; their feature attributes include process CPU occupancy, memory occupancy, disk IO, network IO, thread count and network connection count. The feature dimension of the samples in the data set is N, and the corresponding Pearson correlation coefficient matrix C is calculated from the sample features of the training set S:

$$C = (\rho_{ab})_{N \times N}, \qquad \rho_{ab} = \frac{\mathrm{Cov}(x_a, x_b)}{\sqrt{\mathrm{Var}(x_a)\,\mathrm{Var}(x_b)}}$$

where $x_a$ and $x_b$ are the values of the samples in the training set S under the a-th and b-th feature attributes respectively, $\rho_{ab}$ is the correlation coefficient between $x_a$ and $x_b$, $a, b \in \{1, 2, \ldots, N\}$ and $a \neq b$. $\mathrm{Cov}(x_a, x_b)$ is the covariance between $x_a$ and $x_b$, $\mathrm{Var}(\cdot)$ is the corresponding variance, and $\rho_{ab} = \rho_{ba}$.
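As an illustration only, the split and the correlation matrix of step 101 can be sketched in Python with NumPy. The data below are synthetic, and the names `X`, `S`, `T` and `pearson_matrix` are assumptions made for this sketch rather than part of the described method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monitoring samples: rows are samples, columns stand for the six
# attributes named in the text (CPU, memory, disk IO, network IO, threads,
# connections). The values are synthetic, for illustration only.
X = rng.normal(size=(200, 6))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)  # make two columns correlated

# Random 80% / 20% split into training set S and test set T.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
S, T = X[idx[:n_train]], X[idx[n_train:]]


def pearson_matrix(data):
    """Return the N x N matrix of rho_ab = Cov(a, b) / sqrt(Var(a) * Var(b))."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / (len(data) - 1)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


C = pearson_matrix(S)
```

The same matrix can be obtained with `np.corrcoef(S, rowvar=False)`; the explicit version is shown only to mirror the formula.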

Step 102, dividing the training set into feature subspaces according to the calculated correlation coefficient matrix.

Specifically, the number k of feature subspaces to be divided is determined from the feature dimension N of the samples in the training set S, and the i-th feature subspace $S_i$ is set to contain $n_i$ features:

$$k = \mathrm{int}(\alpha \times N) + 1$$

where $i = 1, 2, \ldots, k$, $\mathrm{int}(\cdot)$ denotes rounding down, and the partition coefficient of the feature subspaces is $\alpha = 0.2$, which controls how many feature dimensions each feature subspace contains.

Let S' be the feature space, with the whole training set S as its initial value. In each iteration, the feature $y_i$ with the highest correlation coefficient in the feature space S' is obtained from the Pearson correlation coefficient matrix C calculated in step 101, and the j features $x_i^1, x_i^2, \ldots, x_i^j$ ranked highest by correlation with $y_i$ are extracted, where $j = n_i - 1$. $y_i$ and each of $x_i^1, \ldots, x_i^j$ are $l \times 1$ vectors containing the values of all samples in the training set S under the corresponding feature attribute, and l is the number of samples in the training set S. Let $x_i = [x_i^1, x_i^2, \ldots, x_i^j]$; $x_i$ and $y_i$ together form the i-th feature subspace $S_i$, the feature space is updated as $S' \leftarrow S' - S_i$, and the iteration continues until $S' = \varnothing$, where $\varnothing$ denotes the empty set. The division into all k feature subspaces is then complete.
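The greedy partitioning loop of step 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text here does not give the rule for the subspace sizes $n_i$, so the sketch splits N as evenly as possible; it ranks features by absolute correlation; and it reads "the feature with the highest correlation coefficient" as the remaining feature whose strongest correlation with another remaining feature is largest. The function name `partition_features` is hypothetical.

```python
import numpy as np


def partition_features(C, alpha=0.2):
    """Greedily group feature indices into k subspaces from correlation matrix C.

    Returns a list of (label_index, predictor_indices) pairs.
    """
    N = C.shape[0]
    k = int(alpha * N) + 1
    # ASSUMPTION: subspace sizes n_i split N as evenly as possible.
    sizes = [N // k + (1 if i < N % k else 0) for i in range(k)]

    A = np.abs(np.asarray(C, dtype=float))  # ASSUMPTION: rank by |rho|
    np.fill_diagonal(A, -np.inf)            # ignore self-correlation (a != b)

    remaining = list(range(N))
    subspaces = []
    for n_i in sizes:
        sub = A[np.ix_(remaining, remaining)]
        # Pseudo-label candidate: strongest link to any other remaining feature.
        label = remaining[int(np.argmax(sub.max(axis=1)))]
        others = [f for f in remaining if f != label]
        # Keep the j = n_i - 1 remaining features most correlated with the label.
        others.sort(key=lambda f: A[label, f], reverse=True)
        predictors = others[: n_i - 1]
        subspaces.append((label, predictors))
        remaining = [f for f in remaining if f != label and f not in predictors]
    return subspaces


# Illustrative 6 x 6 correlation matrix with two groups of related features.
C = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.7, 0.0, 0.1, 0.0],
    [0.8, 0.7, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.6, 0.5],
    [0.0, 0.1, 0.0, 0.6, 1.0, 0.4],
    [0.1, 0.0, 0.1, 0.5, 0.4, 1.0],
])
subspaces = partition_features(C)
```

For N = 6 and α = 0.2 this gives k = int(1.2) + 1 = 2 subspaces, and the example matrix is grouped into features {0, 1, 2} and {3, 4, 5}.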

Step 103, selecting a feature as the pseudo label according to the degree of feature correlation within each feature subspace, using the remaining features as prediction attributes, and training a regression model for predicting the pseudo label based on support vector regression (SVR).

Specifically, the feature $y_i$ in the feature subspace $S_i$ obtained in step 102 is used as the pseudo label and the remaining features $x_i$ as prediction attributes, and a regression model $f_i(x_i)$ for predicting the pseudo label $y_i$ is trained based on the support vector regression (SVR) algorithm. The general form of the trained model is:

$$f_i(x_i) = w^{T} x_i + b$$

where w and b are parameters obtained by model training, $w = [w_1, w_2, \ldots, w_j]$, and b is a constant term.
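One way to obtain a model of this linear form is scikit-learn's `SVR` with a linear kernel, sketched below on synthetic data for a single subspace. The choice of library, the kernel and the hyperparameter values are assumptions for illustration; the text specifies only that SVR is used.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic training data for one feature subspace S_i (illustrative only):
# x_i holds j = 2 prediction attributes, y_i is the pseudo-label feature.
x_i = rng.normal(size=(200, 2))
y_i = 2.0 * x_i[:, 0] - 1.0 * x_i[:, 1] + 0.5 + 0.01 * rng.normal(size=200)

# A linear kernel yields a model of the form f_i(x_i) = w^T x_i + b.
# C (the SVR regularization strength, unrelated to the correlation matrix C)
# and epsilon are assumed values, not taken from the text.
f_i = SVR(kernel="linear", C=10.0, epsilon=0.01).fit(x_i, y_i)

w = f_i.coef_.ravel()          # weight vector [w_1, ..., w_j]
b = float(f_i.intercept_[0])   # constant term
```

With a linear kernel, `f_i.predict(x)` agrees with `x @ w + b`, which makes the correspondence with the general form explicit.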

Step 104, dividing the test set into the same feature subspaces as the training set, and calculating the degree of abnormality of the test set samples in each feature subspace with the corresponding regression model.

Specifically, the feature attributes of the samples in the test set T are divided into the same feature subspaces as the training set S, according to the division result of step 102. $x_i'$ denotes the prediction attributes of the samples in the test set T in the i-th feature subspace and $y_i'$ the corresponding pseudo-label attribute; each feature is an $l' \times 1$ vector, where $l'$ is the number of samples in the test set T.

The model $f_i$ trained in step 103 is used to predict the samples in the test set T after the feature subspace division: from $x_i'$, the predicted value $f_i(x_i')$ corresponding to the pseudo label $y_i'$ is obtained, and the difference vector $r_i$ between the true and predicted values of the pseudo label is calculated:

$$r_i = \left| y_i' - f_i(x_i') \right|$$

The entries of $r_i$ correspond one-to-one to the samples in the test set T. $r_i$ is taken as the anomaly score of the test set samples in the i-th feature subspace; the larger a sample's value in $r_i$, the higher its degree of abnormality is considered to be. Integrating the results of all feature subspaces gives the anomaly score set $r = [r_1, r_2, \ldots, r_k]$.
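The per-subspace scoring of step 104 can be sketched as below. The sketch assumes the difference is taken as an absolute residual, and it uses two stand-in linear functions in place of trained SVR models; `subspace_scores`, the example subspaces and the test matrix are all hypothetical.

```python
import numpy as np


def subspace_scores(T, subspaces, models):
    """Return an l' x k matrix whose column i is the absolute residual of subspace i."""
    columns = []
    for (label, predictors), f_i in zip(subspaces, models):
        x_test = T[:, predictors]   # prediction attributes of the test samples
        y_test = T[:, label]        # true pseudo-label values of the test samples
        columns.append(np.abs(y_test - f_i(x_test)))
    return np.column_stack(columns)


# Illustrative inputs: two subspaces over four features and two stand-in
# linear models (hypothetical coefficients, not trained here).
subspaces = [(0, [1]), (2, [3])]
models = [lambda x: 2.0 * x[:, 0], lambda x: x[:, 0] + 1.0]

T = np.array([
    [2.0, 1.0, 4.0, 3.0],   # consistent with both models
    [2.0, 1.0, 9.0, 3.0],   # feature 2 deviates from its prediction
    [7.0, 1.0, 4.0, 3.0],   # feature 0 deviates from its prediction
])
r = subspace_scores(T, subspaces, models)
```

Here the second and third samples each receive a score of 5 in the subspace whose learned relationship they break, and 0 elsewhere.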

Step 105, calculating the corresponding weight according to the degree of correlation within each feature subspace.

Specifically, the weight of each feature subspace is calculated from the degree of correlation of the feature subspaces obtained in step 102, forming a one-dimensional weight vector $\psi = [\psi_1, \psi_2, \ldots, \psi_k]$. The weight $\psi_i$ of the i-th feature subspace $S_i$ is the maximum value in the correlation coefficient vector of $y_i$, namely:

$$\psi_i = \mathrm{argmax}(C_i)$$

where $C_i \in C$ is the correlation coefficient vector of $y_i$, $m_i$ is the position of the feature $y_i$ in the Pearson correlation coefficient matrix C, and $\mathrm{argmax}(\cdot)$ here returns the maximum value itself. The higher the weight $\psi_i$, the higher the degree of correlation between the attributes in the i-th feature subspace, the better the trained model performs, and the more reliable the calculated anomaly score.
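A minimal sketch of the weight computation in step 105 follows. It assumes that the correlation coefficient vector of a pseudo label is the corresponding row of C with the diagonal entry removed (consistent with the a ≠ b condition in step 101) and that the signed coefficients are used as they are; `subspace_weights` and the example matrix are hypothetical.

```python
import numpy as np


def subspace_weights(C, label_positions):
    """Return psi, where psi_i is the largest off-diagonal entry in row m_i of C."""
    C = np.asarray(C, dtype=float)
    weights = []
    for m_i in label_positions:
        row = np.delete(C[m_i], m_i)   # drop rho_aa = 1, since a != b is required
        weights.append(row.max())      # maximum value, not its index
    return np.array(weights)


# Illustrative correlation matrix; the pseudo labels sit at positions 0 and 3.
C = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.7, 0.0, 0.1, 0.0],
    [0.8, 0.7, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.6, 0.5],
    [0.0, 0.1, 0.0, 0.6, 1.0, 0.4],
    [0.1, 0.0, 0.1, 0.5, 0.4, 1.0],
])
psi = subspace_weights(C, [0, 3])
```

In this example the first subspace, whose features are more strongly correlated, receives the larger weight (0.9 against 0.6).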

Step 106, obtaining the detection result for the test set samples from the weighted, integrated final anomaly score.

Specifically, the anomaly scores $r_i$ and the weights $\psi_i$ of the feature subspaces obtained in steps 104 and 105 are integrated to obtain the final anomaly score, and the detection result for the samples in the test set T is obtained from the final anomaly score, thereby realizing anomaly detection of the power dispatching monitoring data, as follows:

the final anomaly score $R$ is calculated from the obtained weight vector $\psi$ and the anomaly score set $r$:

$$R = r \cdot \psi^{T} = \sum_{i=1}^{k} \psi_i r_i$$

where $\cdot$ denotes the matrix product;

the samples in the test set T are ranked from high to low by their values in the final anomaly score $R$, and the top t% of the ranked test set samples are marked as anomalous, with $5 \le t \le 10$, thereby realizing anomaly detection of the power dispatching monitoring data.
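The weighted integration and the top-t% rule of step 106 can be sketched as follows. The score matrix is arranged with one row per test sample and one column per subspace, and the number of flagged samples is rounded up; both choices, along with the name `detect` and the example values, are assumptions made for this illustration.

```python
import numpy as np


def detect(r, psi, t=10.0):
    """Return (final scores, boolean anomaly flags) for the test samples."""
    R = r @ psi                              # weighted sum over subspaces, one score per sample
    n_flag = int(np.ceil(len(R) * t / 100))  # ASSUMPTION: round the count up
    order = np.argsort(-R, kind="stable")    # indices from highest to lowest score
    flags = np.zeros(len(R), dtype=bool)
    flags[order[:n_flag]] = True
    return R, flags


# Illustrative scores for 10 test samples in k = 2 subspaces.
r = np.full((10, 2), 0.1)
r[4] = [3.0, 0.2]   # large residual in the strongly correlated subspace
r[7] = [0.1, 2.0]   # large residual in the weakly correlated subspace
psi = np.array([0.9, 0.6])

R, flags = detect(r, psi, t=10.0)
```

With t = 10 and ten samples, exactly one sample is flagged: sample 4, whose large residual falls in the subspace with the higher weight.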

FIG. 2 is a schematic flow chart of the partitioning method based on feature correlation, which partitions the whole feature space according to the correlation coefficients among the training set features. The input is all data in the training set; after the loop shown in the figure, every feature dimension of the training set samples has been assigned to its feature subspace.

FIG. 3 is a schematic flow chart of the weighted regression prediction method based on feature correlation. In each feature subspace $S_i$, the feature $y_i$ is used as the pseudo label and the remaining features as prediction attributes, and a regression model is trained based on the support vector regression (SVR) algorithm. After training, the model predicts the samples in the test set T after the feature subspace division, giving the predicted value for the pseudo label $y_i'$ of the test set T, from which the difference for each test set sample is calculated; the larger the calculated difference, the higher the degree of abnormality of the sample is considered to be. The weight of each feature subspace is calculated from its degree of correlation, and finally all results are integrated according to each sample's anomaly scores in the feature subspaces and the corresponding weights.

FIG. 4 is a schematic diagram of the power dispatching monitoring data anomaly detection method based on feature correlation partition regression, which mainly includes 6 stages: calculating the correlation among features, dividing the feature subspaces, training the regression models, calculating the anomaly scores, calculating the feature subspace weights, and integrating to obtain the final anomaly score and the detection result. In the stage of calculating the correlation among features, the historical power dispatching monitoring data is randomly split, with 80% used as the training set and the remaining 20% as the test set, and a correlation coefficient matrix is calculated from the sample features of the training set. In the stage of dividing the feature subspaces, the whole feature space is partitioned based on the correlation coefficients among the training set features; all training set data are taken as input, and the features of all training set samples are divided into the feature subspaces. In the stage of training the regression models, one feature dimension in each feature subspace is selected as the pseudo label according to the correlation coefficients, the remaining features are used as prediction attributes, and a regression model is trained with the Support Vector Regression (SVR) algorithm. In the stage of calculating the anomaly scores, the test set is divided in the same way as the training set, and the difference between the predicted value and the true value of each test set sample is calculated on each regression model, giving the anomaly score of each test set sample in each subspace. In the stage of calculating the feature subspace weights, the anomaly scores of the test set are weighted according to the highest correlation coefficient within each feature subspace of the divided training set. A higher weight indicates a higher degree of correlation among the attributes in the feature subspace, and therefore a better-performing trained model and a more reliable anomaly score. In the stage of integrating to obtain the final anomaly score and the result, the final anomaly score is obtained by integrating the anomaly scores with the weights of the corresponding feature subspaces, and the detection result for the test set samples is obtained from the final anomaly score, thereby realizing anomaly detection on the power dispatching monitoring data.
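The correlation, partition, weighting, and integration stages can be sketched in plain Python as follows. The text states only that the subspaces are derived from the Pearson correlation matrix and that the weights are based on the highest correlation coefficient in each subspace. The greedy threshold rule in `partition_features`, the threshold of 0.5, the exclusion of features that have no correlated partner, and the normalization of the weights so that they sum to 1 are therefore assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_matrix(columns):
    """Correlation matrix of the training set features (one list per feature)."""
    d = len(columns)
    return [[1.0 if i == j else pearson(columns[i], columns[j])
             for j in range(d)] for i in range(d)]


def partition_features(corr, threshold=0.5):
    """Assumed rule: each unassigned feature seeds a subspace and pulls in
    every unassigned feature with |r| >= threshold. Features left without
    a partner are skipped, since no regression can be trained on them."""
    unassigned = list(range(len(corr)))
    subspaces = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed] + [j for j in unassigned
                          if abs(corr[seed][j]) >= threshold]
        unassigned = [j for j in unassigned if j not in group]
        if len(group) > 1:
            subspaces.append(group)
    return subspaces


def subspace_weights(corr, subspaces):
    """Weight each subspace by its highest |r|, normalized to sum to 1
    (the normalization is an assumption)."""
    peaks = [max(abs(corr[i][j]) for i in g for j in g if i != j)
             for g in subspaces]
    total = sum(peaks)
    return [p / total for p in peaks]


def integrate_scores(scores_per_subspace, weights):
    """Final anomaly score of each sample: weighted sum over subspaces."""
    n = len(scores_per_subspace[0])
    return [sum(w * s[k] for w, s in zip(weights, scores_per_subspace))
            for k in range(n)]
```

For example, given four training features where features 0 and 1 are strongly correlated and features 2 and 3 are strongly correlated, `partition_features(correlation_matrix(columns))` groups them into two subspaces, and `integrate_scores` combines the per-subspace anomaly scores using the weights from `subspace_weights`.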

FIG. 5 is a schematic diagram of the input data and output results of the algorithm of the present invention. The input of the algorithm is the real-time process resource occupation data, related to the services of the power dispatching system, that is collected by the power dispatching monitoring system; its feature attributes include process CPU occupancy, memory occupancy, disk IO, network IO, thread count, and network connection count. The output of the algorithm is a ranking by anomaly score: the top t% of the input data is reported as anomalous and the rest as normal, where t typically satisfies 5 ≤ t ≤ 10.
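The output rule (rank by anomaly score and report the top t% as anomalous) can be sketched as follows. The function name `flag_top_percent` and the choice to round the number of flagged samples up are assumptions; the text does not say how a fractional count is handled.

```python
import math


def flag_top_percent(scores, t=5.0):
    """Return one boolean per sample: True for the top t% by anomaly score."""
    # Assumption: round the number of flagged samples up.
    n_flag = math.ceil(len(scores) * t / 100.0)
    # Indices sorted from the highest anomaly score to the lowest.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    flagged = set(order[:n_flag])
    return [k in flagged for k in range(len(scores))]


# 20 samples with t = 5: exactly one sample is reported as anomalous.
demo_scores = [0.1] * 20
demo_scores[7] = 0.9
print(flag_top_percent(demo_scores, t=5.0).index(True))
```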

Algorithm 1 gives the complete framework pseudocode of the power dispatching monitoring data anomaly detection method based on feature correlation partition regression.

In a specific embodiment, 21 public data sets were used for testing. The data sets come from various domains and were preprocessed so that they contain only a very small number of anomalies. Details of the data sets are shown in Table 1. To reduce the randomness of the results, all results are averages over 25 runs.

TABLE 1 Data sets used in the specific embodiment

Data set | Total number of samples | Number of anomalies | Feature dimension | Imbalance ratio
PenDigits | 4934 | 10 | 15 | 493.4
Pop_failures | 509 | 15 | 18 | 33.9
Hepatitis | 70 | 3 | 19 | 22.3
Messidor_features | 567 | 27 | 19 | 21.0
Cardiotocography | 1681 | 33 | 20 | 50.9
Waveform | 3443 | 100 | 20 | 34.4
Annthyroid | 3365 | 67 | 20 | 50.2
Parkinson | 50 | 2 | 21 | 25.0
mHealth | 697 | 20 | 23 | 34.9
WDBC | 367 | 10 | 30 | 36.7
WPBC | 155 | 4 | 32 | 38.7
Biodeg | 730 | 31 | 41 | 23.5
Spectf | 218 | 7 | 44 | 31.1
Lymphography | 148 | 6 | 46 | 24.7
Spam-Base | 2579 | 51 | 56 | 50.6
Sonar | 100 | 4 | 60 | 25.0
Green | 225 | 9 | 62 | 25.0
MEU_Mobile | 1070 | 50 | 71 | 21.4
KDDCup99 | 4811 | 20 | 78 | 240.6
Mice_Protein | 519 | 12 | 79 | 43.3
Movement_libras | 347 | 11 | 90 | 31.5

To verify the effectiveness of the proposed algorithm, the comparison algorithms in the embodiment of the present invention fall into two categories: three methods based on feature prediction (DEMUD, ALSO, and DELR) and four methods based on sample distribution (LOF, KNN, COPOD, and LGOD). The embodiment of the present invention is denoted CFPR in the tables. The parameters of the comparison algorithms are shown in Table 2.

TABLE 2 Parameters of the comparison algorithms

The AUC index is used for evaluation in the embodiment of the present invention. G-mean is commonly used to evaluate algorithm performance under data imbalance, but the AUC index is generally more suitable for judging the quality of an unsupervised anomaly detection method. AUC is widely used in the field of anomaly detection because it is not affected by class imbalance and can be calculated from the ranking of the anomaly scores alone: the AUC value is obtained from the rank values of the positive and negative class samples in the ranking list. In the embodiment, the anomaly class is regarded as the positive class, so the AUC directly reflects the performance of the algorithm on anomalous data: the larger the AUC, the higher the anomaly detection accuracy and the better the algorithm performs.
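A rank-based AUC calculation of the kind described above can be sketched as follows. This uses the standard rank-sum (Mann-Whitney) identity, with tied scores receiving their average rank. The function name `rank_auc` and the label convention (1 for the anomaly class, 0 for normal) are assumptions for illustration.

```python
def rank_auc(scores, labels):
    """AUC from the ranks of the anomaly scores; labels use 1 for anomalies."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)

    # Assign 1-based ascending ranks; tied scores share their average rank.
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Rank-sum identity: AUC = (R_pos - n_pos (n_pos + 1) / 2) / (n_pos n_neg)
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(rank_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Because only the ranks enter the formula, the result depends on the ordering of the anomaly scores and not on their scale.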

The anomaly determination ratio t of the detection result in the embodiment of the present invention is set to 5, that is, the top 5% of samples ranked by anomaly score are judged to be anomalous.

The AUC results of the embodiment of the present invention and the comparison methods on the public data sets are shown in Table 3. The power dispatching monitoring data anomaly detection method based on feature correlation partition regression obtains AUC values exceeding those of the other methods on most public data sets, and obtains the highest average AUC.

TABLE 3 AUC results on the public data sets

The embodiment of the present invention is also applied to three kinds of service anomalies of the smart grid dispatching control system: data jumping, application disconnection, and telemetry table not refreshing.

Table 4 shows the AUC results of the embodiment of the present invention and the comparison methods on the three anomalies.

TABLE 4 AUC results on the three anomaly types

Anomaly type | DEMUD | ALSO | DELR | LOF | KNN | COPOD | LGOD | CFPR
Data jumping | 0.8614 | 0.9994 | 0.9926 | 0.5417 | 0.4396 | 0.9800 | 0.2482 | 0.9852
Application disconnection | 0.9510 | 0.9955 | 0.9969 | 0.6981 | 0.9063 | 0.9923 | 0.9959 | 0.9868
Telemetry table not refreshing | 0.9848 | 0.9853 | 0.9928 | 0.5517 | 0.9927 | 0.9922 | 0.7628 | 0.9952

It can be seen from Table 4 that the present invention achieves the best AUC on the telemetry table not refreshing anomaly. Because the power dispatching monitoring data anomaly detection method based on feature correlation partition regression mines hidden information in the data by searching for correlations between data features, it does not stand out on the data jumping and application disconnection anomalies, but it does not lag far behind the other methods either. Taken together, the comparison on the three types of real power dispatching monitoring anomalies and the comparison on a large number of public data sets show that the method can effectively improve anomaly detection accuracy when the power dispatching monitoring data has higher dimensionality and more irrelevant attributes, and can obtain relatively stable anomaly detection results in other cases.

In summary, the embodiments of the present invention have the following beneficial effects:

In the technical scheme, the historical power dispatching monitoring data is divided into a training set and a test set, and a correlation coefficient matrix among the training set features is calculated based on the Pearson correlation coefficient. The training set is divided into feature subspaces according to the calculated correlation coefficient matrix. Within each feature subspace, a feature is selected as the pseudo label according to the degree of feature correlation, the remaining features are used as prediction attributes, and a regression model for predicting the pseudo label is trained based on Support Vector Regression (SVR). The test set is divided into the same feature subspaces as the training set, and the anomaly degree of the test set samples in each feature subspace is calculated using the corresponding regression model. The weight of each feature subspace is calculated according to the degree of correlation within it, and the final anomaly score obtained by weighted integration is taken as the detection result for the test set samples. Compared with other unsupervised algorithms, the method can obtain higher anomaly detection accuracy.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
