Voltage missing data identification method based on improved random forest algorithm

Document No.: 68801  Published: 2021-10-01

Note: This technology, "Voltage missing data identification method based on improved random forest algorithm" (一种基于改进随机森林算法的电压缺失数据辨识方法), was designed and created on 2021-04-13 by 李绍坚, 韦明超, 罗淑芳, 莫江婷, 甘静, 夏斌, 王益成, 周觅路, 韦社敏, 鲁林军 and 陈柏. Abstract: The invention discloses a voltage missing data identification method based on an improved random forest algorithm, comprising the following steps: acquire historical power grid data, select all associated attributes corresponding to the missing data, and perform attribute partitioning; obtain a learning sample set through comprehensive attribute weighting; repeatedly sample the learning set to obtain several similar sample sets; train a random forest regression model with the similar sample sets as input; improve the prediction precision of the random forest regression; take the mean prediction of all decision trees as the filling result, evaluate it, and complete filling when the result is within the tolerance range. The method improves the identification precision of missing data and thereby the filling precision of missing power grid values.

1. A voltage missing data identification method based on an improved random forest algorithm, characterized in that it comprises the following steps:

S1: acquiring historical power grid data, selecting all associated attributes corresponding to the missing data, and performing attribute partitioning;

S2: obtaining a learning sample set through comprehensive attribute weighting;

S3: repeatedly sampling the learning samples to obtain several similar sample sets;

S4: training a random forest regression model with the similar sample sets as input;

S5: improving the random forest regression prediction precision by reducing the correlation among the decision trees and improving the precision of each decision tree;

S6: taking the mean prediction of all decision trees as the filling result, evaluating the filling result, and completing filling when the result is within the tolerance range.

2. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 1, wherein the comprehensive attribute weighting comprises the following steps:

S21: calculating the cross-correlation coefficient among the associated attributes, and storing the attributes whose cross-correlation coefficient is larger than a given threshold into a cross-correlation set HG;

S22: performing attribute error expectation calculation on the cross-correlation set HG, and storing the attributes whose error expectation is greater than the strong correlation threshold into a strongly correlated attribute set QX;

S23: establishing weights among the attributes in the strongly correlated attribute set QX by the entropy weight method to obtain a weight vector; sorting the comprehensive attribute weighted values SX obtained from the strong correlation coefficients in descending order, setting a selection threshold, and selecting the samples larger than the selection threshold as the learning sample set.

3. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 2, wherein the cross-correlation coefficient among the associated attributes is calculated as follows.

When the Pearson coefficient is used for the population, as shown in equation (1):

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)    (1)

where X and Y are two random variables of different attributes, σ_X and σ_Y are the standard deviations of X and Y respectively, and cov(X, Y) is their covariance, as shown in equation (2):

cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − μ_X)(y_i − μ_Y)    (2)

where n represents the number of samples.

When the Pearson coefficient is used for a sample, as shown in equation (3):

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )    (3)

where x_i and y_i are the i-th observations of the variables X and Y, and x̄ and ȳ are the sample means of X and Y respectively.

The cross-correlation coefficients among the attributes are calculated with the Pearson coefficient, and the attributes whose cross-correlation coefficient is larger than the given threshold are stored in the cross-correlation set HG.

4. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 2, wherein the attribute error expectation EXPError(X_k, Y_k) of the cross-correlation set HG is calculated from the following quantities:

Cov(X_k, Y_k) is the covariance of X_k and Y_k; Var[X_k] is the variance of X_k; Var[Y_k] is the variance of Y_k.

If EXPError(X_k, Y_k) > β, the attribute is a strongly correlated attribute and is stored in the strongly correlated attribute set QX.

5. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 2, wherein the weights among the attributes in the strongly correlated attribute set QX are established by the entropy weight method, giving the weight vector:

W = [w_1, w_2, ..., w_m]    (5)

where m is the number of strongly correlated attributes.

The comprehensive attribute weighted value SX is obtained from the strong correlation coefficients:

SX = w_1·S_1 + w_2·S_2 + ... + w_m·S_m    (6)

The comprehensive attribute weighting results of the historical section data are sorted in descending order, a selection threshold is set, and the samples whose weighted value exceeds the threshold are selected as the learning sample set.

6. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 1, wherein the attribute partitioning is based on the Gini index (GI), which determines every partition down to the termination points; the GI is given by

GI(U) = 1 − Σ_{j=1}^{m} P_j²

where P_j is the occurrence frequency of class-j elements, U represents the data set, and m represents the number of classes.

Partitioning on any attribute T splits U into U_1 and U_2, and the GI of the sample set U partitioned on attribute T is

GI_{U,T} = (|U_1| / |U|)·GI(U_1) + (|U_2| / |U|)·GI(U_2)

For any attribute, the partition whose subsets yield the smallest GI is chosen as the split.

7. The voltage missing data identification method based on an improved random forest algorithm as claimed in claim 1, wherein the random forest regression prediction precision is improved by reducing the correlation among the decision trees and improving the precision of each decision tree, with the following specific steps:

Step S51: the set of all decision trees h(X, θ_k), k = 1, ..., N_tree constitutes the random forest f, where h(X, θ_k) denotes an unpruned decision tree and θ_k is the independent, identically distributed random vector of the k-th decision tree; majority voting is used for classification problems and the arithmetic mean is used for regression problems to obtain the final prediction of the random forest;

Step S52: the confidence of correct classification is measured by the edge (margin) function Q(X, Y):

Q(X, Y) = av_k I(h(X, θ_k) = Y) − max_{j≠Y} av_k I(h(X, θ_k) = j)

where X is the input vector, which contains at most J different categories; Y is the correct output category; j denotes one of the J categories; I(·) is the indicator function; and av_k(·) averages over the trees k = 1, ..., N_tree;

Step S53: as the edge function shows, the larger Q(X, Y) is, the higher the confidence of correct classification, so the generalization error of the random forest regression can be defined as shown in equation (7):

E* = S_{X,Y}(Q(X, Y) < 0)    (7)

where S_{X,Y} is the classification error rate function of the input vector X;

Step S54: for all sequences θ_k, as the number of trees increases, E* converges almost surely to

E* → S_{X,Y}( S_θ(h(X, θ) = Y) − max_{j≠Y} S_θ(h(X, θ) = j) < 0 )    (8)

where S_θ is the classification error rate over the set θ; by this theorem the generalization of the random forest regression converges to an upper bound, and increasing the number of trees does not overfit the prediction;

Step S55: the upper bound of the random forest regression generalization error is shown in equation (9):

E* ≤ η(1 − ζ²) / ζ²    (9)

where η is the average correlation coefficient of the trees and ζ is the average strength of the trees.

Technical Field

The invention relates to the technical field of missing voltage values, a problem frequently occurring in power system data fusion, and in particular to a voltage missing data identification method based on an improved random forest algorithm.

Background

With the rapid development of power grids, various systems depend increasingly on data; however, during data acquisition and transmission, the loss or corruption of part of the data is inevitable due to channel measurement, human factors and the like. Missing or abnormal data affect the operation of the system and further data analysis, leading to abnormal output results.

Although existing research fills missing data reasonably well, there has been little research into and analysis of the attributes associated with the missing values, even though these associated attributes strongly influence the filling result. The improved random forest algorithm based on comprehensive attribute weighting identifies the missing data, improves the identification precision of the missing data, and thereby improves the filling precision of missing power grid values.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a voltage missing data identification method based on an improved random forest algorithm, which realizes the identification of missing data, improves the identification precision of missing data, and improves the filling precision of missing power grid values.

In order to achieve this purpose, the invention provides a voltage missing data identification method based on an improved random forest algorithm, which comprises the following steps:

S1: acquiring historical power grid data, selecting all associated attributes corresponding to the missing data, and performing attribute partitioning;

S2: obtaining a learning sample set through comprehensive attribute weighting;

S3: repeatedly sampling the learning samples to obtain several similar sample sets;

S4: training a random forest regression model with the similar sample sets as input;

S5: improving the random forest regression prediction precision by reducing the correlation among the decision trees and improving the precision of each decision tree;

S6: taking the mean prediction of all decision trees as the filling result, evaluating the filling result, and completing filling when the result is within the tolerance range.

The cross-correlation coefficients among the associated attributes are calculated, and the attributes whose cross-correlation coefficient is larger than a given threshold are stored in a cross-correlation set HG.

The cross-correlation coefficient among the associated attributes is calculated as follows.

When the Pearson coefficient is used for the population, as shown in equation (1):

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)    (1)

where X and Y are two random variables of different attributes, σ_X and σ_Y are the standard deviations of X and Y respectively, and cov(X, Y) is their covariance, as shown in equation (2):

cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − μ_X)(y_i − μ_Y)    (2)

where n represents the number of samples.

When the Pearson coefficient is used for a sample, as shown in equation (3):

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )    (3)

where x_i and y_i are the i-th observations of the variables X and Y, and x̄ and ȳ are the sample means of X and Y respectively.

The attributes whose cross-correlation coefficient is larger than the given threshold are stored in the cross-correlation set HG.

Attribute error expectation calculation is performed on the cross-correlation set HG, and the attributes whose error expectation is greater than the strong correlation threshold are stored in a strongly correlated attribute set QX.

The attribute error expectation EXPError(X_k, Y_k) of the cross-correlation set HG is calculated from the following quantities: Cov(X_k, Y_k) is the covariance of X_k and Y_k; Var[X_k] is the variance of X_k; Var[Y_k] is the variance of Y_k.

If EXPError(X_k, Y_k) > β, the attribute is a strongly correlated attribute and is stored in the strongly correlated attribute set QX.

Weights among the attributes in the strongly correlated attribute set QX are established by the entropy weight method, giving the weight vector

W = [w_1, w_2, ..., w_m]    (5)

where m is the number of strongly correlated attributes.

The comprehensive attribute weighted value SX is obtained from the strong correlation coefficients:

SX = w_1·S_1 + w_2·S_2 + ... + w_m·S_m    (6)

The comprehensive attribute weighting results of the historical section data are sorted in descending order, a selection threshold is set, and the samples whose weighted value exceeds the threshold are selected as the learning sample set.

The growth of each decision tree is completed by attribute partitioning, and every partition down to the termination points is determined by the Gini index (GI):

GI(U) = 1 − Σ_{j=1}^{m} P_j²

where P_j is the occurrence frequency of class-j elements, U represents the data set, and m represents the number of classes.

Partitioning on any attribute T splits U into U_1 and U_2, and the GI of the sample set U partitioned on attribute T is

GI_{U,T} = (|U_1| / |U|)·GI(U_1) + (|U_2| / |U|)·GI(U_2)

For any attribute, the partition whose subsets yield the smallest GI is chosen as the split: the smaller GI_{U,T} is, the better the partitioning effect on attribute T.

The random forest regression prediction precision is improved by reducing the correlation among the decision trees and improving the precision of each decision tree, as follows.

The set of all decision trees h(X, θ_k), k = 1, ..., N_tree constitutes the random forest f, where h(X, θ_k) denotes an unpruned decision tree and θ_k is the independent, identically distributed random vector of the k-th decision tree. Majority voting is used for classification problems and the arithmetic mean is used for regression problems to obtain the final prediction of the random forest.

The confidence of correct classification is measured by the edge (margin) function Q(X, Y):

Q(X, Y) = av_k I(h(X, θ_k) = Y) − max_{j≠Y} av_k I(h(X, θ_k) = j)

where X is the input vector, which contains at most J different categories; Y is the correct output category; j denotes one of the J categories; I(·) is the indicator function; and av_k(·) averages over the trees k = 1, ..., N_tree.

The larger the edge function, the higher the confidence of correct classification, so the generalization error of the random forest regression can be defined as shown in equation (7):

E* = S_{X,Y}(Q(X, Y) < 0)    (7)

where S_{X,Y} is the classification error rate function of the input vector X. Applying the law of large numbers to equation (7) yields the following theorem: for all sequences θ_k, as the number of trees increases, E* converges almost surely to

E* → S_{X,Y}( S_θ(h(X, θ) = Y) − max_{j≠Y} S_θ(h(X, θ) = j) < 0 )    (8)

where S_θ is the classification error rate over the set θ. By this theorem the generalization of the random forest regression converges to an upper bound, and increasing the number of trees does not overfit the prediction.

The upper bound of the random forest regression generalization error is shown in equation (9):

E* ≤ η(1 − ζ²) / ζ²    (9)

where η is the average correlation coefficient of the trees and ζ is the average strength of the trees.

As η decreases and ζ increases, the generalization error upper bound of the random forest decreases further, which is more conducive to error control.

The learning sample set is repeatedly sampled to obtain several similar sample sets.

The similar sample sets are taken as input to train the random forest regression model.

For each decision tree, a subset of the same size is drawn from the initial set for training; K decision trees are generated, training the random forest.

The random forest regression prediction precision is improved by reducing the correlation among the decision trees and improving the precision of each decision tree.

The trained random forest performs judgment and classification; the mean prediction of all trees is taken as the filling result, the filling result is evaluated, and filling is complete if the result is within the tolerance range.
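To make the overall flow concrete, a minimal end-to-end sketch follows. It assumes scikit-learn's RandomForestRegressor as a stand-in for the improved model and assumes that only the target voltage column contains missing values; the function name fill_voltage and all parameter choices are illustrative, not the patented implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fill_voltage(history: np.ndarray, target_col: int) -> np.ndarray:
    """history: learning samples (rows) of associated attributes plus the target
    voltage attribute; NaN marks the missing voltage values to be filled."""
    other = [c for c in range(history.shape[1]) if c != target_col]
    missing = np.isnan(history[:, target_col])

    rf = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
    rf.fit(history[~missing][:, other], history[~missing][:, target_col])

    # The filling result is the mean prediction over all decision trees.
    return rf.predict(history[missing][:, other])
```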

The beneficial effects of the invention are: based on the improved random forest algorithm with comprehensive attribute weighting, the attributes associated with the missing-value attributes are studied and analyzed, the associated attributes most similar to the data to be filled are obtained by screening, and the identification precision of the missing data is improved, thereby improving the filling precision of missing power grid values.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative work.

FIG. 1 is a schematic diagram of the improved random forest algorithm based on comprehensive attribute weighting in an implementation of the present invention;

FIG. 2 is a root mean square error plot of the filling results of the different algorithms;

FIG. 3 is a plot of the filling accuracy of the different algorithms;

FIG. 4 is a comparison of the filling results of the improved random forest algorithm against the true values.

Detailed Description

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of the improved random forest algorithm based on comprehensive attribute weighting in an implementation of the present invention.

As shown in FIG. 1, the improved random forest algorithm based on comprehensive attribute weighting comprises the following steps.

step 1: and acquiring historical data of the power grid, and performing different attribute division from all corresponding associated attributes of the selected missing data.

Step 2: the growth of each decision tree is completed by attribute partitioning, and every partition down to the termination points is determined by the Gini index (GI), as shown in equation (1):

GI(U) = 1 − Σ_{j=1}^{m} P_j²    (1)

where P_j is the occurrence frequency of class-j elements, U represents the data set, and m represents the number of classes.

Partitioning on any attribute T splits U into U_1 and U_2; the GI of the sample set U partitioned on attribute T is shown in equation (2):

GI_{U,T} = (|U_1| / |U|)·GI(U_1) + (|U_2| / |U|)·GI(U_2)    (2)

For any attribute, the partition whose subsets yield the smallest GI is chosen as the split: the smaller GI_{U,T} is, the better the partitioning effect on attribute T.
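For illustration only, the following is a minimal Python sketch of the Gini criterion of equations (1) and (2); the function names and the binary threshold split are assumptions of the sketch, not part of the claimed method.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """GI(U) = 1 - sum_j P_j^2, with P_j the occurrence frequency of class j in U."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def split_gini(values: np.ndarray, labels: np.ndarray, threshold: float) -> float:
    """Weighted GI of partitioning U into U1 (value <= threshold) and U2, equation (2)."""
    left = values <= threshold
    n = labels.size
    return (left.sum() / n) * gini(labels[left]) + ((~left).sum() / n) * gini(labels[~left])

# The candidate partition with the smallest weighted GI is chosen for attribute T:
# best_t = min(np.unique(values)[:-1], key=lambda t: split_gini(values, labels, t))
```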

Step 3: calculate the cross-correlation coefficients among the attributes with the Pearson coefficient, and store the attributes whose cross-correlation coefficient is larger than a given threshold into the cross-correlation set HG.

The cross-correlation coefficient among the associated attributes is calculated as follows.

When the Pearson coefficient is used for the population, as shown in equation (3):

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)    (3)

where X and Y are two random variables of different attributes, σ_X and σ_Y are the standard deviations of X and Y respectively, and cov(X, Y) is their covariance, as shown in equation (4):

cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − μ_X)(y_i − μ_Y)    (4)

where n represents the number of samples.

When the Pearson coefficient is used for a sample, as shown in equation (5):

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )    (5)

where x_i and y_i are the i-th observations of the variables X and Y, and x̄ and ȳ are the sample means of X and Y respectively.

The attributes whose cross-correlation coefficient is larger than the given threshold are stored in the cross-correlation set HG.
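A short Python sketch of the sample Pearson computation of equation (5) and the construction of HG follows; taking the absolute value of r and the dictionary layout of the attributes are assumptions of the sketch.

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Sample Pearson coefficient r of equation (5)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def build_hg(attrs: dict, target: str, threshold: float) -> list:
    """HG: attributes whose cross-correlation with the target exceeds the threshold."""
    return [name for name, col in attrs.items()
            if name != target and abs(pearson(col, attrs[target])) > threshold]
```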

Step 4: further calculate the error expectation EXPError(X_k, Y_k) of all attributes in the cross-correlation set HG, where Cov(X_k, Y_k) is the covariance of X_k and Y_k; Var[X_k] is the variance of X_k; Var[Y_k] is the variance of Y_k.

Step 5: if EXPError(X_k, Y_k) > β (β being the strong correlation threshold), the attribute is strongly correlated and is retained in the strongly correlated attribute set QX; if EXPError(X_k, Y_k) < β, return to Step 4.

Step 6: the entropy weight method is used to establish the weights among the attributes in the set QX, giving the weight vector

W = [w_1, w_2, ..., w_m]    (7)

where m is the number of strongly correlated attributes.

Step 7: the comprehensive attribute weighted value SX is obtained from the strong correlation coefficients:

SX = w_1·S_1 + w_2·S_2 + ... + w_m·S_m    (8)

The comprehensive attribute weighting results of the historical section data are sorted in descending order, a selection threshold is set, and the samples whose weighted value exceeds the threshold are selected as the learning sample set.

Step 8: the learning sample set is further repeatedly sampled to obtain several similar sample sets.

Step 9: the similar sample sets are taken as input to train the random forest regression model.
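Steps 8 and 9 correspond to standard bootstrap aggregation; a sketch using scikit-learn regression trees is shown below, where the tree count K and the max_features setting are illustrative choices, not values from the source.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def train_forest(X: np.ndarray, y: np.ndarray, k: int = 100) -> list:
    """Draw K bootstrap ('similar') sample sets of the same size as the learning
    set and grow one regression tree on each."""
    forest = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))         # sampling with replacement
        tree = DecisionTreeRegressor(max_features="sqrt")  # random attribute subset per split
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def forest_predict(forest: list, X: np.ndarray) -> np.ndarray:
    """Regression: the arithmetic mean over all trees is the final prediction."""
    return np.mean([t.predict(X) for t in forest], axis=0)
```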

Step 10: the random forest regression prediction precision is improved by reducing the correlation among the decision trees and improving the precision of each decision tree, as follows.

The set of all decision trees h(X, θ_k), k = 1, ..., N_tree constitutes the random forest f, where h(X, θ_k) denotes an unpruned decision tree and θ_k is the independent, identically distributed random vector of the k-th decision tree. Majority voting is used for classification problems and the arithmetic mean is used for regression problems to obtain the final prediction of the random forest.

The confidence of correct classification is measured by the edge (margin) function Q(X, Y), as shown in equation (9):

Q(X, Y) = av_k I(h(X, θ_k) = Y) − max_{j≠Y} av_k I(h(X, θ_k) = j)    (9)

where X is the input vector, which contains at most J different categories; Y is the correct output category; j denotes one of the J categories; I(·) is the indicator function; and av_k(·) averages over the trees k = 1, ..., N_tree.

As can be seen from equation (9), the larger the edge function, the higher the confidence of correct classification, so the generalization error of the random forest regression can be defined as shown in equation (10):

E* = S_{X,Y}(Q(X, Y) < 0)    (10)

where S_{X,Y} is the classification error rate function of the input vector X.

For all sequences θ_k, as the number of trees increases, E* converges almost surely to

E* → S_{X,Y}( S_θ(h(X, θ) = Y) − max_{j≠Y} S_θ(h(X, θ) = j) < 0 )

where S_θ is the classification error rate over the set θ. By this theorem the generalization of the random forest regression converges to an upper bound, and increasing the number of trees does not overfit the prediction.

The upper bound of the random forest regression generalization error is shown in equation (11):

E* ≤ η(1 − ζ²) / ζ²    (11)

where η is the average correlation coefficient of the trees and ζ is the average strength of the trees.

As η decreases and ζ increases, the generalization error upper bound of the random forest regression decreases further, which is more conducive to error control. Therefore, the prediction precision of the random forest regression is improved by: 1. reducing the correlation among the trees; 2. improving the precision of each individual decision tree.
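The edge function of equation (9), and the estimate of E* in equation (10) as the fraction of samples with negative margin, can be computed from individual tree votes as sketched below for the classification case. Fitted classifiers, a sorted class array, and labels drawn from that array are assumptions of the sketch.

```python
import numpy as np

def margin(trees: list, X: np.ndarray, y: np.ndarray, classes: np.ndarray) -> np.ndarray:
    """Q(X, Y) = av_k I(h(X, theta_k) = Y) - max_{j != Y} av_k I(h(X, theta_k) = j)."""
    votes = np.stack([t.predict(X) for t in trees])                 # (k, n) predicted labels
    share = np.stack([(votes == c).mean(axis=0) for c in classes])  # (J, n) vote shares
    rows, cols = np.searchsorted(classes, y), np.arange(len(y))
    correct = share[rows, cols]              # av_k I(h = Y); fancy indexing copies
    share[rows, cols] = -np.inf              # exclude the true class from the max
    return correct - share.max(axis=0)

# E* of equation (10), estimated as the fraction of samples with negative margin:
# E_star = float((margin(trees, X, y, classes) < 0).mean())
```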

Step 11: the mean of all decision trees' final predictions is taken as the filling result; the filling result is evaluated, and filling is complete when the result is within the tolerance range.
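A sketch of the evaluation in Step 11 follows; using RMSE as the evaluation metric matches the experiments below, while the tolerance value itself is application-specific and assumed.

```python
import numpy as np

def filling_complete(filled: np.ndarray, reference: np.ndarray, tolerance: float) -> bool:
    """Filling is complete when the RMSE of the filled values against the
    reference values lies within the tolerance range."""
    rmse = float(np.sqrt(np.mean((filled - reference) ** 2)))
    return rmse <= tolerance
```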

The comparative data analysis of the voltage missing data identification method based on the improved random forest algorithm is as follows.

selecting and constructing a plurality of data sets from the big data of the power grid, selecting missing attributes according to conditions, and constructing missing data sets with the missing rates of 1%, 3%, 5%, 10%, 15%, 20%, 25% and 30% by a random deletion method. And respectively applying the improved random forest algorithm, the random forest algorithm and the in-situ algorithm to carry out experiments under different deficiency rates, and analyzing and comparing experimental results obtained by the algorithms according to the root mean square error and the filling accuracy.

And taking a certain voltage loss value of the actual power grid as a filling target, constructing loss data sets with different loss rates, and testing the performances of the three algorithms. In order to fully express the performance of each algorithm, 10 missing data sets are constructed for each missing rate in a mode of randomly generating missing values, the average value of results obtained by applying the algorithm to each data set is taken as a final experimental result, and the experimental results are synthesized for analysis and comparison.
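The random-deletion construction of the test data sets can be sketched as follows; the column layout and the NaN encoding of deleted values are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_missing(data: np.ndarray, col: int, rate: float) -> np.ndarray:
    """Randomly delete `rate` of the values in one attribute column,
    e.g. rate in {0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30}."""
    out = data.astype(float)                   # copy; NaN marks deleted values
    idx = rng.choice(len(out), size=int(round(rate * len(out))), replace=False)
    out[idx, col] = np.nan
    return out

# Ten data sets per missing rate, averaged over runs, reproduce the protocol:
# sets = [make_missing(grid_data, target_col, 0.10) for _ in range(10)]
```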

From FIG. 2 it can be seen that the improved random forest algorithm proposed herein has the smallest root mean square error at every missing rate and the best filling effect, and that the root mean square error increases as the missing rate increases.

The missing-value filling accuracy decreases as the missing rate increases, as shown in FIG. 3. At a 1% missing rate the filling accuracy of all three algorithms exceeds 60%, indicating that each algorithm fills well when only a small amount of data is missing. The filling accuracy of the proposed improved random forest algorithm is clearly better than that of the standard random forest algorithm at missing rates of 3%-15%, while above 15% the filling accuracy of the random forest algorithm differs little from that of the in-situ algorithm. Under all missing conditions, the filling effect of the improved random forest algorithm is clearly better than that of the random forest algorithm and the in-situ algorithm.

From the root mean square error and filling accuracy analysis, the filling effect of the proposed improved random forest algorithm is better than that of the other two algorithms. To display the actual filling effect more intuitively, a missing rate of 10% is taken, with the data set containing multiple segments of continuously missing data, and the proposed improved random forest algorithm is applied to fill the missing power grid values. FIG. 4 compares the filling results of 27 groups of continuously missing data against the true values; the correlation between the filled values and the true values is high, meeting the data filling requirement.

It should be understood that the above-described embodiments are illustrative only, and should not be taken as limiting the scope of the invention, since numerous modifications may be made by those skilled in the art without departing from the spirit and scope of the invention.
