Method for measuring correlation between three-dimensional variables and method for optimizing indexes

Document No. 1536343. Published 2020-02-14.

Reading note: this technology, "Method for measuring correlation between three-dimensional variables and method for optimizing indexes" (一种三维变量间相关性衡量方法及指标优化方法), was created by 王树良, 耿晶 and 刘传鲁 on 2019-11-07. Abstract: The invention discloses a method for measuring the correlation between three-dimensional variables and an index optimization method. It relates to the technical field of data mining, can measure the correlation among three-dimensional variables, and can further indirectly optimize indexes that are difficult to optimize directly. The method comprises: constructing a three-dimensional variable consisting of three random variables X, Y and Z; constructing a three-dimensional variable sample; establishing, from the sample, a three-dimensional scatter diagram of the distribution of X, Y and Z, in which X, Y and Z lie along the x, y and z dimensions respectively; dividing the scatter diagram with an $x_0 \times y_0 \times z_0$ cubic grid, where $x_0$, $y_0$ and $z_0$ take values at random; for each choice of $x_0$, $y_0$, $z_0$, calculating the maximum mutual information value of X, Y and Z, and taking the largest of all these maximum mutual information values as the maximum three-dimensional information coefficient (MTDIC); and using MTDIC as the correlation value among the three-dimensional variables.

1. A method for measuring the correlation between three-dimensional variables, characterized by comprising the following steps:

constructing a three-dimensional variable; the three-dimensional variable includes three random variables X, Y and Z;

acquiring actual data of the three random variables and constructing a three-dimensional variable sample; establishing a three-dimensional scatter diagram of the distribution of the three random variables X, Y and Z according to the three-dimensional variable sample, wherein X, Y and Z are distributed in the x dimension, the y dimension and the z dimension respectively;

dividing the three-dimensional scatter diagram by using a cubic grid, wherein the cubic grid is an $x_0 \times y_0 \times z_0$ cubic grid, and $x_0$, $y_0$, $z_0$ are the numbers of grid divisions in the x dimension, the y dimension and the z dimension respectively; $x_0$, $y_0$, $z_0$ take values at random;

for each choice of values of $x_0$, $y_0$, $z_0$, calculating the maximum mutual information value of the three random variables X, Y and Z, and taking the maximum of all the maximum mutual information values as the maximum three-dimensional information coefficient MTDIC;

using MTDIC as the correlation value among the three-dimensional variables.

2. The method of claim 1, wherein calculating, for each choice of values of $x_0$, $y_0$, $z_0$, the maximum mutual information value of the three random variables X, Y and Z is specifically as follows:

let D be a finite set of the random variables X, Y and Z; $D|(x_0, y_0, z_0)$ is a cubic grid set G obtained by an $x_0 \times y_0 \times z_0$ division of the set D; $I(D|(x_0, y_0, z_0))$ is the mutual information value under the division mode of the cubic grid set G;

[Formula image FDA0002264440730000011, not reproduced in this text]

wherein in the cubic grid set G, the sample space of the random variable X is randomly divided into $x_0$ sequences, the sample space of the random variable Y is randomly divided into $y_0$ sequences, and the sample space of the random variable Z is randomly divided into $z_0$ sequences; the maximum mutual information value of the three random variables X, Y and Z is calculated for each choice of $x_0$, $y_0$, $z_0$;

wherein $p(x_i)$ is the probability that the random variable X in the cubic grid set G belongs to the $x_i$-th sequence; $p(y_j)$ is the probability that the random variable Y in the cubic grid set G belongs to the $y_j$-th sequence; $p(z_k)$ is the probability that the random variable Z in the cubic grid set G belongs to the $z_k$-th sequence; i takes all integers from 1 to $x_0$, j takes all integers from 1 to $y_0$, and k takes all integers from 1 to $z_0$;

$p(x_i, y_j)$ is the joint probability that, in the cubic grid set G, the random variable X belongs to the $x_i$-th sequence and the random variable Y belongs to the $y_j$-th sequence; $p(x_i, z_k)$ is the joint probability that X belongs to the $x_i$-th sequence and Z belongs to the $z_k$-th sequence; $p(y_j, z_k)$ is the joint probability that Y belongs to the $y_j$-th sequence and Z belongs to the $z_k$-th sequence; $p(x_i, y_j, z_k)$ is the joint probability that X belongs to the $x_i$-th sequence, Y belongs to the $y_j$-th sequence and Z belongs to the $z_k$-th sequence.

3. The method of claim 2, wherein the value range of $x_0$, $y_0$, $z_0$ is:

$x_0 \times y_0 \times z_0 < B$;

where B is a function of the three-dimensional variable sample size N, with $B = N^{0.6}$.

4. The method according to any one of claims 1 to 3, wherein dividing the three-dimensional scatter diagram by using the cubic grid is specifically:

in the cubic grid, the x dimension is evenly divided according to the value of $x_0$, the y dimension is evenly divided according to the value of $y_0$, and the z dimension is randomly divided according to the value of $z_0$.

5. An index optimization method based on correlation measurement among three-dimensional variables, characterized in that no more than two indexes to be optimized are selected, the indexes to be optimized and other selected indexes form an index three-dimensional variable, and the method for measuring the correlation between three-dimensional variables as claimed in any one of claims 1 to 4 is applied to the index three-dimensional variable to obtain the correlation value of the index three-dimensional variable;

if the correlation value of the index three-dimensional variable exceeds a set correlation threshold, the other selected indexes are indexes related to the index to be optimized;

the index to be optimized is optimized by adjusting the related indexes;

the set correlation threshold is set according to an empirical value.

Technical Field

The invention relates to the technical field of data mining, in particular to a method for measuring correlation between three-dimensional variables and a method for optimizing indexes.

Background

In data mining analysis, correlation analysis among variables is an important step. If a data set contains hundreds of variables, they can form thousands of variable combinations, with many important dependencies hidden among them. Mining these latent relationships is becoming increasingly valuable. In research on correlation analysis, many methods are currently used to measure the correlation between two variables, such as Pearson's Correlation Coefficient, Distance Correlation, Maximal Correlation, a random current-based method, the maximum Correlation Coefficient, and the Maximal Information Coefficient (MIC). However, research on measuring the correlation among multiple variables is still relatively rare, and it still relies mainly on analyzing the correlation between pairs of variables.

An article on the new statistical method MIC, published in the journal Science, mentions two important properties for measuring the correlation between variables: generality and equitability. Generality means that, given enough sample data, the statistic can capture a wide range of relationship types, not limited to specific function types such as linear, exponential, logarithmic or parabolic relationships but covering all other functional relationships as well. Many important relationships are not single functional forms, and many cannot be described by one specific function, for example a superposition of two functions. Equitability means that the correlation measure gives similar scores to relationships of different types that have the same noise level. For example, a noisy linear relationship should not mask a strong sinusoidal relationship. Equitability is difficult to formalize for general relationships, but it has a clear interpretation in the basic case of functional relationships: an equitable statistic should give similar scores to functional relationships with similar $R^2$ values (where $R^2$ is the coefficient of determination). For example, at a reasonable sample size, a sinusoidal relationship with noise level $R^2 = 0.8$ and a linear relationship with the same $R^2$ value should have similar MIC values.

Conventional methods for measuring the correlation between two random variables can hardly satisfy both generality and equitability. Existing approaches to the correlation among three-dimensional variables are essentially based on analyzing the correlation between pairs of variables with these conventional methods, rather than treating the three random variables as a whole. This exposes two significant problems. First, a bivariate correlation measure that is deficient in equitability and generality is used to handle the relationship among multidimensional variables, so the resulting strength of the correlation among three-dimensional variables is also biased. Second, the correlation among three-dimensional variables is still analyzed pair by pair, and there is no statistic that measures the overall correlation among the three variables to serve as theoretical support.

At present, correlation analysis among variables is widely applied. The most obvious application is the indirect optimization of physical indexes that are difficult to optimize: by computing the correlation between such an index and other indexes, other indexes that are strongly correlated with it can be found, and the index that is difficult to optimize directly is then optimized by optimizing those other indexes.

In particular, if there are two physical indexes that are difficult to optimize directly and a certain correlation exists between them, three-dimensional variable correlation can be used to find another physical index that is strongly correlated with both and is easy to optimize directly. Optimizing that index then achieves the goal of simultaneously optimizing the two indexes that are difficult to optimize directly.

For example, cyanobacteria (blue-green algae) pollution in lakes currently receives worldwide attention, but no method for inhibiting the growth of cyanobacteria exists at present. A three-dimensional variable method can therefore be used to find other algae and water quality indexes that are strongly correlated with cyanobacteria, which helps environmental protection departments control the strongly correlated biochemical water quality indexes.

Therefore, a method for measuring the overall correlation among three-dimensional variables is urgently needed in both data mining processing and physical index optimization.

Disclosure of Invention

In view of this, the invention provides a method for measuring correlation between three-dimensional variables and an index optimization method, which can obtain a quantitative value of correlation between three-dimensional variables, thereby measuring correlation between three-dimensional variables and further indirectly optimizing an index which is difficult to directly optimize.

In order to achieve the above object, an embodiment of the present invention provides a method for measuring correlation between three-dimensional variables, including:

Constructing a three-dimensional variable; the three-dimensional variable includes three random variables X, Y and Z.

Acquiring actual data of the three random variables and constructing a three-dimensional variable sample; establishing a three-dimensional scatter diagram of the distribution of the three random variables X, Y and Z according to the three-dimensional variable sample, wherein X, Y and Z are distributed in the x dimension, the y dimension and the z dimension respectively.

Dividing the three-dimensional scatter diagram by using a cubic grid, wherein the cubic grid is an $x_0 \times y_0 \times z_0$ cubic grid, and $x_0$, $y_0$, $z_0$ are the numbers of grid divisions in the x dimension, the y dimension and the z dimension respectively; $x_0$, $y_0$, $z_0$ take values at random.

For each choice of values of $x_0$, $y_0$, $z_0$, the maximum mutual information value of the three random variables X, Y and Z is calculated, and the maximum of all the maximum mutual information values is taken as the maximum three-dimensional information coefficient MTDIC.

MTDIC is used as the correlation value between three-dimensional variables.
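The outer step described above, taking the largest of the per-grid maximum mutual information values, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `mtdic` and the callable `max_mi_for_grid` (standing for whatever routine returns the maximum mutual information value for one choice of $x_0$, $y_0$, $z_0$) are hypothetical, and the scores in the usage lines are made-up numbers.

```python
from typing import Callable, Iterable, Tuple

Grid = Tuple[int, int, int]  # one choice of (x0, y0, z0)


def mtdic(grids: Iterable[Grid],
          max_mi_for_grid: Callable[[Grid], float]) -> float:
    """Return the largest per-grid maximum mutual information value.

    `max_mi_for_grid` is a hypothetical callable supplied by the caller;
    it is expected to return the maximum mutual information value of
    X, Y and Z for a single (x0, y0, z0) grid size.
    """
    scores = [max_mi_for_grid(grid) for grid in grids]
    if not scores:
        raise ValueError("at least one (x0, y0, z0) grid size is required")
    return max(scores)


# Usage with made-up per-grid scores (illustrative values only).
toy_scores = {(2, 2, 2): 0.1, (2, 2, 3): 0.4, (2, 3, 3): 0.25}
coefficient = mtdic(toy_scores.keys(), toy_scores.get)
```

Keeping the per-grid scoring routine as a parameter separates the grid search from the choice of mutual information formula, which the source gives only as a formula image.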

Further, calculating, for each choice of values of $x_0$, $y_0$, $z_0$, the maximum mutual information value of the three random variables X, Y and Z is specifically as follows:

Let D be a finite set of the random variables X, Y and Z; $D|(x_0, y_0, z_0)$ is a cubic grid set G obtained by an $x_0 \times y_0 \times z_0$ division of the set D; $I(D|(x_0, y_0, z_0))$ is the mutual information value under the division mode of the cubic grid set G.

[Formula image BDA0002264440740000041, not reproduced in this text]

In the cubic grid set G, the sample space of the random variable X is randomly divided into $x_0$ sequences, the sample space of the random variable Y is randomly divided into $y_0$ sequences, and the sample space of the random variable Z is randomly divided into $z_0$ sequences; the maximum mutual information value of the three random variables X, Y and Z is calculated for each choice of $x_0$, $y_0$, $z_0$.

Here $p(x_i)$ is the probability that the random variable X in the cubic grid set G belongs to the $x_i$-th sequence; $p(y_j)$ is the probability that the random variable Y in the cubic grid set G belongs to the $y_j$-th sequence; $p(z_k)$ is the probability that the random variable Z in the cubic grid set G belongs to the $z_k$-th sequence; i takes all integers from 1 to $x_0$, j takes all integers from 1 to $y_0$, and k takes all integers from 1 to $z_0$.

$p(x_i, y_j)$ is the joint probability that, in the cubic grid set G, the random variable X belongs to the $x_i$-th sequence and the random variable Y belongs to the $y_j$-th sequence; $p(x_i, z_k)$ is the joint probability that X belongs to the $x_i$-th sequence and Z belongs to the $z_k$-th sequence; $p(y_j, z_k)$ is the joint probability that Y belongs to the $y_j$-th sequence and Z belongs to the $z_k$-th sequence; $p(x_i, y_j, z_k)$ is the joint probability that X belongs to the $x_i$-th sequence, Y belongs to the $y_j$-th sequence and Z belongs to the $z_k$-th sequence.
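The seven probability tables defined above can be estimated from cell counts, as in the following sketch. The function names, the 0-based cell indices and the dictionary keys are all illustrative choices. The mutual information formula itself is available in the source only as an image, so it is not reproduced here. Purely as an assumption, the sketch also computes co-information (interaction information), a standard three-variable quantity that uses exactly these marginal, pairwise and triple probabilities. It may differ from the patent's formula, and it can be negative.

```python
import numpy as np


def probability_tables(ix, iy, iz, x0, y0, z0):
    """Estimate the probabilities from 0-based cell indices of each sample."""
    ix, iy, iz = (np.asarray(a, dtype=int) for a in (ix, iy, iz))
    counts = np.zeros((x0, y0, z0))
    np.add.at(counts, (ix, iy, iz), 1)       # count samples per grid cell
    p_xyz = counts / counts.sum()            # p(x_i, y_j, z_k)
    return {
        "xyz": p_xyz,
        "xy": p_xyz.sum(axis=2),             # p(x_i, y_j)
        "xz": p_xyz.sum(axis=1),             # p(x_i, z_k)
        "yz": p_xyz.sum(axis=0),             # p(y_j, z_k)
        "x": p_xyz.sum(axis=(1, 2)),         # p(x_i)
        "y": p_xyz.sum(axis=(0, 2)),         # p(y_j)
        "z": p_xyz.sum(axis=(0, 1)),         # p(z_k)
    }


def entropy_bits(p):
    """Shannon entropy in bits; empty cells contribute nothing."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def co_information(t):
    """ASSUMPTION: a stand-in score, not necessarily the patent's formula."""
    return (entropy_bits(t["x"]) + entropy_bits(t["y"]) + entropy_bits(t["z"])
            - entropy_bits(t["xy"]) - entropy_bits(t["xz"])
            - entropy_bits(t["yz"]) + entropy_bits(t["xyz"]))


# Usage: four samples in which X, Y and Z always fall in the same cell index.
tables = probability_tables([0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1], 2, 2, 2)
score = co_information(tables)
```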

Further, the value range of $x_0$, $y_0$, $z_0$ is: $x_0 \times y_0 \times z_0 < B$, where B is a function of the three-dimensional variable sample size N, with $B = N^{0.6}$.
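The constraint above bounds the set of grid sizes that need to be examined. The following sketch enumerates every integer triple whose product is below $N^{0.6}$. The function name `candidate_grids` and the lower limit of two divisions per dimension (`min_bins=2`) are assumptions made for illustration; the source does not state a minimum.

```python
def candidate_grids(n_samples, exponent=0.6, min_bins=2):
    """List all (x0, y0, z0) with x0 * y0 * z0 < n_samples ** exponent.

    ASSUMPTION: each dimension has at least `min_bins` divisions.
    """
    bound = n_samples ** exponent            # B = N^0.6 in the text
    grids = []
    x0 = min_bins
    while x0 * min_bins * min_bins < bound:
        y0 = min_bins
        while x0 * y0 * min_bins < bound:
            z0 = min_bins
            while x0 * y0 * z0 < bound:
                grids.append((x0, y0, z0))
                z0 += 1
            y0 += 1
        x0 += 1
    return grids


# Usage: grid sizes allowed for a sample of 1000 points.
grids = candidate_grids(1000)
```

For very small samples the list can be empty under this assumption, because even a 2 × 2 × 2 grid already has a product of 8.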

Further, dividing the three-dimensional scatter diagram by using the cubic grid is specifically as follows:

In the cubic grid, the x dimension is evenly divided according to the value of $x_0$, the y dimension is evenly divided according to the value of $y_0$, and the z dimension is randomly divided according to the value of $z_0$.
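The division step can be sketched as below. This is a hedged illustration: "evenly divided" is read here as equal-width intervals over the observed range (an equal-frequency reading is also possible), "randomly divided" is read as cut points drawn uniformly over the observed range, and the helper names `even_edges`, `random_edges` and `assign_cells` are hypothetical.

```python
import numpy as np


def even_edges(values, n_bins):
    """Equal-width interval edges over the observed range (assumed reading)."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, n_bins + 1)


def random_edges(values, n_bins, rng):
    """Interval edges with n_bins - 1 uniformly drawn interior cut points."""
    lo, hi = float(np.min(values)), float(np.max(values))
    cuts = np.sort(rng.uniform(lo, hi, size=n_bins - 1))
    return np.concatenate(([lo], cuts, [hi]))


def assign_cells(values, edges):
    """Map each value to a 0-based interval index using the interior edges."""
    return np.digitize(values, edges[1:-1])


# Usage: split a small sample evenly into 2 intervals and randomly into 3.
sample = np.array([0.0, 1.0, 2.0, 3.0])
even_cells = assign_cells(sample, even_edges(sample, 2))
rand_edges = random_edges(sample, 3, np.random.default_rng(0))
rand_cells = assign_cells(sample, rand_edges)
```

Applying `assign_cells` separately to the x, y and z coordinates gives the three cell indices of each sample point in the cubic grid.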

Another embodiment of the invention further provides an index optimization method based on correlation measurement among three-dimensional variables: no more than two indexes to be optimized are selected; the indexes to be optimized and other selected indexes form an index three-dimensional variable; and the above method for measuring the correlation between three-dimensional variables is applied to the index three-dimensional variable to obtain its correlation value.

If the correlation value of the index three-dimensional variable exceeds a set correlation threshold, the other selected indexes are indexes related to the index to be optimized.

The index to be optimized is optimized by adjusting the related indexes.

The set correlation threshold is set based on an empirical value.
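The threshold screening described above can be sketched as a simple filter. The function name `related_indexes`, the index names and the correlation values below are all hypothetical and serve only to illustrate the comparison against an empirically chosen threshold.

```python
def related_indexes(correlations, threshold):
    """Return candidate indexes whose correlation value exceeds the threshold.

    `correlations` maps the name of each candidate index to the correlation
    value of the index three-dimensional variable it forms with the index
    (or indexes) to be optimized. Results are ordered strongest first.
    """
    kept = [(name, value) for name, value in correlations.items()
            if value > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Usage with made-up correlation values and a made-up threshold.
example = {"index_A": 0.82, "index_B": 0.35, "index_C": 0.67}
selected = related_indexes(example, threshold=0.6)
```

The indexes returned are the ones that would then be adjusted in order to optimize the target index indirectly.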

Beneficial effects:

1. The invention provides a method for measuring the correlation between three-dimensional variables, which obtains the correlation among the three variables by dynamically dividing the three-dimensional space of the variables and calculating mutual information.

2. The invention also provides an index optimization method based on the three-dimensional variable correlation measurement method. For physical indexes that are difficult to optimize, the method computes the correlation between such an index and other indexes to find other indexes that are strongly correlated with it, so that the index that is difficult to optimize directly is optimized indirectly by optimizing those other indexes.

Drawings

Fig. 1 is a flowchart of a method for measuring correlation between three-dimensional variables according to an embodiment of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.
