Dimension reduction method applied to gene expression profile data

Document No.: 1467598  Publication date: 2020-02-21

Note: This invention, "A dimension reduction method applied to gene expression profile data" (一种应用于基因表达谱数据的降维方法), was created by 李杰, 赵准, 周理 and 王亚东 on 2019-11-05. Its main content is as follows. The invention provides a dimension reduction method applied to gene expression profile data, belonging to the field of computer applications. Step one: "prune" the sample data matrix using the regression coefficients, and perform principal component analysis on the pruned matrix to obtain the principal components U_θ. Step two: "scale" the sample data matrix using the regression coefficients, and perform principal component analysis on the scaled matrix to obtain the principal components U'. Step three: fuse U_θ and U' with the weight α to obtain the new principal components X'_θ. Step four: build a classification regression model on X'_θ to verify the dimension-reduction effect. The invention addresses the problems that existing dimension reduction algorithms show unstable classification performance across different feature counts, that classifiers built on them predict poorly, and that the classification effect is therefore not ideal.

1. A dimension reduction method applied to gene expression profile data is characterized by comprising the following steps:

Step one: "prune" the sample data matrix using the regression coefficients, and perform principal component analysis on the pruned matrix to obtain the principal components U_θ;

Step two: "scale" the sample data matrix using the regression coefficients, and perform principal component analysis on the scaled matrix to obtain the principal components U';

Step three: fuse the principal components U_θ and U' with the weight α to obtain the new principal components X'_θ;

Step four: build a classification regression model on the principal components X'_θ and verify the dimension-reduction effect.

2. The dimension reduction method applied to gene expression profile data according to claim 1, wherein the step one comprises the following steps:

Step 1.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 1.2: select the feature data whose standard regression coefficient values exceed a threshold θ to form the pruned matrix, where θ ∈ [0, 1];

Step 1.3: extract the principal components of the pruned matrix, retaining δ of them.

3. The dimension reduction method applied to gene expression profile data according to claim 2, wherein step 1.1 is specifically: consider the traditional linear regression model:

Y=a+hX (1)

Let h be the standard regression coefficient, an index measuring the effect of each single variable. According to the least-squares principle:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²  (2)

h_j denotes the influence of the sample x values of the j-th feature on the dependent variable y, where

x̄_j = (1/n)Σᵢ x_ij,  ȳ = (1/n)Σᵢ y_i  (3)

4. The method of claim 2, wherein step 1.2 is specifically: let H_θ = {j : |h_j| ≥ θ}, and let X_θ denote the data matrix formed by the features in the set H_θ. The transformation from the data X to X_θ is a supervised "pruning" process that eliminates the data of the features not belonging to the set H_θ.

5. The dimension reduction method applied to gene expression profile data according to claim 2, wherein step 1.3 is specifically: the SVD of X_θ is represented by

X_θ = U_θS_θV_θᵀ  (4)

U_θ contains the left singular vectors, the eigenvectors of XXᵀ; S_θ is the diagonal matrix formed by the singular values, the singular values being the arithmetic square roots of the non-negative eigenvalues of XXᵀ; V_θ contains the right singular vectors, an orthogonal matrix whose columns are the eigenvectors of XᵀX. In this expression each column of the U matrix is a principal component, sorted by the magnitude of the singular values, that is: u_1, u_2, u_3, u_4, ..., u_k with s_1 ≥ s_2 ≥ s_3 ≥ ... ≥ s_k, where u_1 is the first principal component.

6. The dimension reduction method applied to the gene expression profile data according to claim 1, wherein the step two comprises the following steps:

Step 2.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 2.2: scale the original sample data matrix values with the standard regression coefficients to obtain the scaled matrix;

Step 2.3: extract the principal components of the scaled matrix, retaining δ of them.

7. The dimension reduction method applied to gene expression profile data according to claim 6, wherein step 2.1 is specifically: from formula (2), the value of the regression coefficient h_j of the j-th feature is:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²

8. The dimension reduction method applied to gene expression profile data according to claim 6, wherein step 2.2 is specifically: in the linear regression model (1), a unit change in X corresponds to a change of h units in Y; that is, X and Y are related by a scaling factor of h. "Scaling" X in this manner and centring it gives:

X'=h*X-mean(h*X) (5)。

9. The dimension reduction method applied to gene expression profile data according to claim 6, wherein step 2.3 is specifically: let X be an n×p data matrix, where n is the number of samples and p the number of features; taking the j-th feature as an example, the matrix transformation applied to X_j is:

X'_j = h_j*X_j − mean(h_j*X_j)  (6)

where x_ij denotes the j-th gene of the i-th sample and h_j is the regression coefficient between the j-th feature and the dependent variable Y.

the SVD representation of X' is:

X' = U'S'V'ᵀ  (7)

Formula (7) is rearranged to express U' in terms of X':

U' = X'V'(S')⁻¹  (8)

X' is then processed according to the number δ of principal components to be retained.

10. The dimension reduction method applied to gene expression profile data according to claim 1, wherein in step three, specifically, the principal components U_θ and U' are fused with the weight α to obtain the new principal components X'_θ, which are the data after dimension reduction; substituting equations (4) and (8) gives:

X'_θ = α*U_θ + (1 − α)*U' = α*X_θV_θ(S_θ)⁻¹ + (1 − α)*X'V'(S')⁻¹  (9)

wherein α is a weighting coefficient, and the value range is between 0 and 1.

11. The dimension reduction method applied to gene expression profile data according to claim 1, wherein in step four, specifically, the regression model is as follows:

Y=a+bX+ε (10)

Substituting the weight-fused new principal components X'_θ into the regression model gives equation (11):

Y = a + bX'_θ + ε  (11)

A function transformation g(Y) is applied once to Y, where g is the sigmoid function:

g(Y) = 1/(1 + e^(−Y))  (12)

As Y tends to positive infinity, g(Y) tends to 1; as Y tends to negative infinity, g(Y) tends to 0. The function g(Y) thus constrains the output values to the interval (0, 1), and g(Y) is the probability of classifying a sample into a given class.

Substituting equation (11) into equation (12) yields the classification model:

g(Y) = 1/(1 + e^(−(a + bX'_θ)))  (13)

A threshold β is set on g(Y) to classify the data. If the threshold is set to 0.5, then y = 1 when g(Y) > 0.5 and y = 0 when g(Y) < 0.5; g(Y) = 0.5 is the critical case, in which classification accuracy degrades.

The training set is substituted into formula (13) to solve for the parameters a and b of the regression model; the obtained classification regression model is then used to predict the test set and verify the classification effect.

Technical Field

The invention relates to a dimension reduction method applied to gene expression profile data, and belongs to the field of bioinformatics.

Background

With the spread of high-throughput technologies, gene expression profile data are growing explosively. Such data are characterised by high dimensionality and small sample sizes. Taking the gene expression profile data of dengue disease samples (GSE25001) as an example, the data set contains 209 samples covering 22184 genes (dimensions). How to reduce the dimensionality of gene expression profile data has therefore become a research hotspot.

Principal component regression (PCR) is widely studied as a classical dimension reduction algorithm. In addition, many improved algorithms with better dimension-reduction performance have emerged in various fields, such as the improved segmented principal component analysis (MPCA) method, the mixed bilinear probabilistic principal component analysis (MBPPCA) method, and the cascaded fusion of deep PCA and kernel PCA (Deep PCA-KPCA) algorithm. These algorithms reduce dimensionality well, but most of them optimise against the sample data alone and do not fully exploit the information carried by the classification labels; as a result, although the selected features contain rich sample information, the classification effect is not ideal. Researchers therefore hope for an effective data dimension reduction method that helps select sample features accurately and eliminates redundant features irrelevant to the classification result, so as to better uncover deep-level structure in the data, distinguish different classes of samples, and achieve a more satisfactory classification effect.

To solve such problems, researchers have proposed the SPCR and Y-aware PCR methods. SPCR screens the sample features using the regression coefficient between the sample measurements and the target attribute under each feature, so that only the features closely related to the predicted attribute are considered; such a transformation is more targeted. SPCR classifies well when few principal components are retained, but its performance gradually deteriorates as the number of retained components grows. The Y-aware PCR method uses the regression coefficients to compress the entire matrix, ensuring that the data dimension-reduction process is carried out with "awareness" of the Y variable. Compared with SPCR, its classification is only average when few principal components are retained, but it gradually improves as the number of retained components grows.

To further improve dimension-reduction performance and classification precision, this invention improves on the SPCR and Y-aware PCR methods according to the strengths and weaknesses of each algorithm, and proposes the Y-SPCR method.

Disclosure of Invention

The invention provides a dimension reduction method applied to gene expression profile data, aiming to solve the problem that existing algorithms do not fully exploit the information carried by the classification labels, so that although the selected features contain rich sample information the classification effect is not ideal. Experimental verification and analysis of SPCR and Y-aware PCR show that the dimension-reduction performance of SPCR is inversely proportional to the number of retained principal components, while that of Y-aware PCR is directly proportional to it. Exploiting these complementary strengths and weaknesses, the invention fuses Y-aware PCR with SPCR and proposes a weighted fusion algorithm based on the two, Y-SPCR. In Y-SPCR, the regression coefficient between each feature and the classification label is computed first; the coefficients are used to "prune" or "scale" the original data; principal component analysis is performed separately on the pruned and the scaled matrices to obtain their principal components; the two sets of principal components are fused with a weight to obtain new principal components; and a classification regression model is built on the new principal components to verify the dimension-reduction effect. "Pruning" extracts the data under the features that satisfy a threshold condition and discards the data that do not. "Scaling" obtains, by a matrix transformation, a sample space with a better linear-additivity effect. Y-SPCR not only inherits the dimension-reduction performance of SPCR and Y-aware PCR but can reach the optimum at any number of features.

A dimension reduction method applied to gene expression profile data, the dimension reduction method comprising the following steps:

Step one: "prune" the sample data matrix using the regression coefficients, and perform principal component analysis on the pruned matrix to obtain the principal components U_θ;

Step two: "scale" the sample data matrix using the regression coefficients, and perform principal component analysis on the scaled matrix to obtain the principal components U';

Step three: fuse the principal components U_θ and U' with the weight α to obtain the new principal components X'_θ;

Step four: build a classification regression model on the principal components X'_θ to verify the dimension-reduction effect.
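As a concrete illustration of the four steps above, the following is a minimal numpy sketch of the Y-SPCR pipeline. It assumes a samples-by-genes matrix X with binary labels y, column-wise centring before each SVD, and that both branches retain the same δ components; the function and parameter names are illustrative, not taken from the patent.

```python
import numpy as np

def y_spcr(X, y, theta=0.2, delta=10, alpha=0.5):
    """Hypothetical sketch of the four Y-SPCR steps."""
    # standard regression coefficient of each feature against y (eq. (2))
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    h = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    # step one: "prune" features with |h_j| < theta, then PCA via SVD
    X_theta = Xc[:, np.abs(h) >= theta]        # assumes >= delta features survive
    U_theta = np.linalg.svd(X_theta, full_matrices=False)[0][:, :delta]
    # step two: "scale" each feature by h_j and centre (eq. (5)), then PCA via SVD
    Xs = X * h
    Xs -= Xs.mean(axis=0)
    U_prime = np.linalg.svd(Xs, full_matrices=False)[0][:, :delta]
    # step three: weighted fusion of the two component sets (eq. (9))
    return alpha * U_theta + (1 - alpha) * U_prime
```

Step four then fits a logistic classification regression model on the returned components, as described below.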

Further, the first step comprises the following steps:

Step 1.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 1.2: select the feature data whose standard regression coefficient values exceed a threshold θ to form the pruned matrix, where θ ∈ [0, 1];

Step 1.3: extract the principal components of the pruned matrix, retaining δ of them.

Further, step 1.1 is specifically: consider the traditional linear regression model:

Y=a+hX (1)

Let h be the standard regression coefficient, an index measuring the effect of each single variable. According to the least-squares principle:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²  (2)

h_j denotes the influence of the sample x values of the j-th feature on the dependent variable y, where

x̄_j = (1/n)Σᵢ x_ij,  ȳ = (1/n)Σᵢ y_i  (3)

Further, step 1.2 is specifically: let H_θ = {j : |h_j| ≥ θ}, and let X_θ denote the data matrix formed by the features in the set H_θ. The transformation from the data X to X_θ is a supervised "pruning" process that eliminates the data of the features not belonging to the set H_θ.

Further, step 1.3 is specifically: the SVD of X_θ is represented by

X_θ = U_θS_θV_θᵀ  (4)

U_θ contains the left singular vectors, the eigenvectors of XXᵀ; S_θ is the diagonal matrix formed by the singular values, the singular values being the arithmetic square roots of the non-negative eigenvalues of XXᵀ; V_θ contains the right singular vectors, an orthogonal matrix whose columns are the eigenvectors of XᵀX. In this expression each column of the U matrix is a principal component, sorted by the magnitude of the singular values, that is: u_1, u_2, u_3, u_4, ..., u_k with s_1 ≥ s_2 ≥ s_3 ≥ ... ≥ s_k, where u_1 is the first principal component.

Further, the step two comprises the following steps:

Step 2.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 2.2: scale the original sample data matrix values with the standard regression coefficients to obtain the scaled matrix;

Step 2.3: extract the principal components of the scaled matrix, retaining δ of them.

Further, step 2.1 is specifically: from formula (2), the value of the regression coefficient h_j of the j-th feature is:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²

Further, step 2.2 is specifically: in the linear regression model (1), a unit change in X corresponds to a change of h units in Y; that is, X and Y are related by a scaling factor of h. "Scaling" X in this manner and centring it gives:

X'=h*X-mean(h*X) (5)。

further, the third step is to set X as a data matrix of n × p, n is the number of samples, p is the number of features, and take the jth feature as an example, for XjAnd (3) matrix transformation is carried out:

X'_j = h_j*X_j − mean(h_j*X_j)  (6)

where x_ij denotes the j-th gene of the i-th sample and h_j is the regression coefficient between the j-th feature and the dependent variable Y.

the SVD representation of X' is:

X' = U'S'V'ᵀ  (7)

Formula (7) is rearranged to express U' in terms of X':

U' = X'V'(S')⁻¹  (8)

X' is then processed according to the number δ of principal components to be retained.

Further, in step three, specifically, the principal components U_θ and U' are fused with the weight α to obtain the new principal components X'_θ, which are the data after dimension reduction; substituting equations (4) and (8) gives:

X'_θ = α*U_θ + (1 − α)*U' = α*X_θV_θ(S_θ)⁻¹ + (1 − α)*X'V'(S')⁻¹  (9)

wherein α is a weighting coefficient, and the value range is between 0 and 1.

Further, in step four, specifically, the regression model is as follows:

Y=a+bX+ε (10)

Substituting the weight-fused new principal components X'_θ into the regression model gives equation (11):

Y = a + bX'_θ + ε  (11)

A function transformation g(Y) is applied once to Y, where g is the sigmoid function:

g(Y) = 1/(1 + e^(−Y))  (12)

As Y tends to positive infinity, g(Y) tends to 1; as Y tends to negative infinity, g(Y) tends to 0. The function g(Y) thus constrains the output values to the interval (0, 1), and g(Y) is the probability of classifying a sample into a given class.

Substituting equation (11) into equation (12) yields the classification model:

g(Y) = 1/(1 + e^(−(a + bX'_θ)))  (13)

A threshold β is set on g(Y) to classify the data. If the threshold is set to 0.5, then y = 1 when g(Y) > 0.5 and y = 0 when g(Y) < 0.5; g(Y) = 0.5 is the critical case, in which classification accuracy degrades.

The training set is substituted into formula (13) to solve for the parameters a and b of the regression model; the obtained classification regression model is then used to predict the test set and verify the classification effect.

The main advantages of the invention are as follows. Experimental verification and analysis of SPCR and Y-aware PCR show that the dimension-reduction performance of SPCR is inversely proportional to the number of retained principal components, while that of Y-aware PCR is directly proportional to it. Exploiting these complementary strengths and weaknesses, the invention fuses Y-aware PCR with SPCR into a weighted fusion algorithm based on the two, Y-SPCR. In Y-SPCR, the regression coefficient between each feature and the classification label is computed first; the coefficients are used to "prune" or "scale" the original data; principal component analysis is performed separately on the pruned and the scaled matrices to obtain their principal components; the two sets of principal components are fused with a weight to obtain new principal components; and a classification regression model is built on the new principal components to verify the dimension-reduction effect. "Pruning" extracts the data under the features that satisfy a threshold condition and discards the data that do not. "Scaling" obtains, by a matrix transformation, a sample space with a better linear-additivity effect. Y-SPCR not only inherits the dimension-reduction performance of SPCR and Y-aware PCR but can reach the optimum at any number of features. The invention thus solves the problem that existing algorithms do not fully exploit the information carried by the classification labels, so that although the selected features contain rich sample information the classification effect is not ideal.

Drawings

FIG. 1 is a data processing flow of the Y-SPCR algorithm;

FIG. 2 shows the classification accuracy of each algorithm on different data sets: FIG. 2(a) algorithm performance for different numbers of retained features on GSE62627; FIG. 2(b) on GSE2034; FIG. 2(c) on GSE25001; FIG. 2(d) on GSE27272;

FIG. 3 is a line graph of the F1 scores of the algorithms for different numbers of principal components: FIG. 3(a) F1 scores on GSE62627; FIG. 3(b) F1 scores on GSE2034; FIG. 3(c) F1 scores on GSE25001; FIG. 3(d) F1 scores on GSE27272;

FIG. 4 is a ROC curve of each algorithm under a GSE62627 data set; FIG. 4(a) is a ROC curve for the PCR algorithm; FIG. 4(b) is a ROC curve for the SPCR algorithm; FIG. 4(c) is a ROC curve of the Y-PCR algorithm; FIG. 4(d) is a ROC curve of the Y-SPCR algorithm; FIG. 4(e) is a ROC curve for the RFP algorithm; FIG. 4(f) is a ROC curve for the T-test algorithm;

FIG. 5 is a flow chart of a dimension reduction method applied to gene expression profile data according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Referring to fig. 5, the present invention provides an embodiment of a dimension reduction method applied to gene expression profile data, the dimension reduction method comprising the following steps:

Step one: "prune" the sample data matrix using the regression coefficients, and perform principal component analysis on the pruned matrix to obtain the principal components U_θ;

Step two: "scale" the sample data matrix using the regression coefficients, and perform principal component analysis on the scaled matrix to obtain the principal components U';

Step three: fuse the principal components U_θ and U' with the weight α to obtain the new principal components X'_θ;

Step four: build a classification regression model on the principal components X'_θ to verify the dimension-reduction effect.

In this preferred embodiment, the following steps are included in step one:

Step 1.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 1.2: select the feature data whose standard regression coefficient values exceed a threshold θ to form the pruned matrix, where θ ∈ [0, 1];

Step 1.3: extract the principal components of the pruned matrix, retaining δ of them.

In the preferred embodiment of this section, step 1.1 is specifically as follows. Consider the traditional linear regression model:

Y=a+hX (1)

Let h be the standard regression coefficient, an index measuring the effect of each single variable. According to the least-squares principle:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²  (2)

h_j denotes the influence of the sample x values of the j-th feature on the dependent variable y, where

x̄_j = (1/n)Σᵢ x_ij,  ȳ = (1/n)Σᵢ y_i  (3)
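As a minimal sketch of this computation (assuming centred least squares as in eqs. (2)-(3); the function name is illustrative), the per-feature coefficients can be obtained as:

```python
import numpy as np

def standard_regression_coefficients(X, y):
    """Least-squares slope h_j of y on each column of X (eq. (2))."""
    Xc = X - X.mean(axis=0)   # centre each feature with the means of eq. (3)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
```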

In the preferred embodiment of this section, step 1.2 is specifically: let H_θ = {j : |h_j| ≥ θ}, and let X_θ denote the data matrix formed by the features in the set H_θ. The transformation from the data X to X_θ is a supervised "pruning" process that eliminates the data of the features not belonging to the set H_θ.
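A one-line numpy sketch of this supervised pruning (the helper name `prune` is hypothetical):

```python
import numpy as np

def prune(X, h, theta):
    """Keep only the columns j in H_theta, i.e. those with |h_j| >= theta."""
    return X[:, np.abs(h) >= theta]
```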

In the preferred embodiment of this section, step 1.3 is specifically: the SVD of X_θ is represented by

X_θ = U_θS_θV_θᵀ  (4)

U_θ contains the left singular vectors, the eigenvectors of XXᵀ. S_θ is the diagonal matrix formed by the singular values (all elements off the diagonal are 0); the singular values are the arithmetic square roots of the non-negative eigenvalues of XXᵀ [38]. V_θ contains the right singular vectors, an orthogonal matrix whose columns are the eigenvectors of XᵀX. Matrix multiplication in fact corresponds to a spatial transformation: the new vector obtained after multiplication is the original vector rotated toward another direction and stretched to another length. Singular value decomposition therefore essentially rotates a vector from the orthogonal basis space of V to the orthogonal basis space of U and scales it in each direction according to S, each singular value giving the specific degree of scaling. In this expression each column of the U matrix is a principal component, sorted by the magnitude of the singular values, that is: u_1, u_2, u_3, u_4, ..., u_k with s_1 ≥ s_2 ≥ s_3 ≥ ... ≥ s_k, where u_1 is the first principal component.
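Assuming the pruned matrix has been column-centred first (a usual PCA convention not spelled out here), the retained principal components can be taken directly from the SVD, since numpy returns the singular values in descending order:

```python
import numpy as np

def principal_components(X_theta, delta):
    """First delta columns of U_theta from the SVD of eq. (4)."""
    U, S, Vt = np.linalg.svd(X_theta, full_matrices=False)
    return U[:, :delta]   # u_1 ... u_delta, sorted by s_1 >= s_2 >= ...
```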

In this preferred embodiment, step two includes the following steps:

Step 2.1: calculate the standard regression coefficient between each feature and the classification attribute;

Step 2.2: scale the original sample data matrix values with the standard regression coefficients to obtain the scaled matrix;

Step 2.3: extract the principal components of the scaled matrix, retaining δ of them, where the retained number δ is an integer.

In the preferred embodiment of this section, step 2.1 is specifically: from formula (2), the value of the regression coefficient h_j of the j-th feature is:

h_j = Σᵢ(x_ij − x̄_j)(y_i − ȳ) / Σᵢ(x_ij − x̄_j)²

In this preferred embodiment, step 2.2 is specifically: in the linear regression model (1), a unit change in X corresponds to a change of h units in Y; that is, X and Y are related by a scaling factor of h. "Scaling" X in this manner and centring it gives:

X'=h*X-mean(h*X) (5)。

In the preferred embodiment of this section, step 2.3 is specifically: let X be an n×p data matrix, where n is the number of samples and p the number of features; taking the j-th feature as an example, the matrix transformation applied to X_j is:

X'_j = h_j*X_j − mean(h_j*X_j)  (6)

where x_ij denotes the j-th gene of the i-th sample and h_j is the regression coefficient (slope) between the j-th feature and the dependent variable Y. If feature j is more sensitive to the dependent variable Y (i.e., h_j is larger), its scaled X'_j varies with a more pronounced amplitude, which is more beneficial for prediction and classification.
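A minimal sketch of this "scaling", applying eqs. (5)-(6) to all features at once (the function name is illustrative):

```python
import numpy as np

def scale_y_aware(X, h):
    """Weight column j by h_j, then centre: X' = h*X - mean(h*X)."""
    Xs = X * h                  # broadcasts h_j across column j
    return Xs - Xs.mean(axis=0)
```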

The SVD representation of X' is:

X' = U'S'V'ᵀ  (7)

Formula (7) is rearranged to express U' in terms of X' (note that V'ᵀV' = I):

U' = X'V'(S')⁻¹  (8)

This Y-aware transformation accompanies the data dimension-reduction process and ensures that the first principal component is the one with the greatest effect on Y; that is, the most informative variables carry the greatest loadings in the first principal component.

X' is then processed according to the retained number δ of principal components.
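The identity of eq. (8) can be checked numerically on a stand-in matrix (random data here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Xs = rng.standard_normal((20, 8))            # stand-in for the scaled matrix X'
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
U_recovered = Xs @ Vt.T @ np.diag(1.0 / S)   # eq. (8): U' = X' V' S'^{-1}
print(np.allclose(U, U_recovered))           # True
```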

In the preferred embodiment of this section, in step three, specifically, the principal components U_θ and U' are fused with the weight α to obtain the new principal components X'_θ, which are the data after dimension reduction; substituting equations (4) and (8) gives:

X'_θ = α*U_θ + (1 − α)*U' = α*X_θV_θ(S_θ)⁻¹ + (1 − α)*X'V'(S')⁻¹  (9)

where α is a weighting coefficient with values in [0, 1]; it determines how strongly the SPCR and Y-aware PCR methods each influence the performance. The α value is set to adjust each method's influence on the data: in essence, the algorithm performing worse is scaled down strongly and the algorithm performing better is scaled down weakly, so that the Y-SPCR algorithm attains satisfactory performance in different scenarios.
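Read as a convex combination, the fusion itself is one line; a sketch, assuming both component matrices are n×δ (the helper name is hypothetical):

```python
import numpy as np

def fuse(U_theta, U_prime, alpha):
    """Weighted fusion of eq. (9): alpha toward SPCR, 1-alpha toward Y-aware PCR."""
    return alpha * U_theta + (1.0 - alpha) * U_prime
```

In practice α would be chosen on validation data, for example over a coarse grid such as np.linspace(0, 1, 11).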

In the preferred embodiment of this section, step four is specifically as follows. In large-scale data analysis, regression analysis predicts the target attribute by modelling; in effect it attempts to find some specific relationship (linear or non-linear) between the target variable and the independent variables. The regression model is as follows:

Y=a+bX+ε (10)

Substituting the weight-fused new principal components X'_θ into the regression model gives equation (11):

Y = a + bX'_θ + ε  (11)

The algorithm uses logistic regression to solve the classification problem; that is, a function transformation g(Y) is applied once to Y, where g is the sigmoid function:

g(Y) = 1/(1 + e^(−Y))  (12)

It has a very useful property: as Y tends to positive infinity, g(Y) tends to 1; as Y tends to negative infinity, g(Y) tends to 0. The function g(Y) thus constrains the output values to the interval (0, 1), and g(Y) is the probability of classifying a sample into a given class.

Substituting equation (11) into equation (12) yields the classification model:

g(Y) = 1/(1 + e^(−(a + bX'_θ)))  (13)

A threshold β is set on g(Y) to classify the data (generally 0.5, depending on the actual analysis). With the threshold at 0.5, y = 1 when g(Y) > 0.5 and y = 0 when g(Y) < 0.5; g(Y) = 0.5 is the critical case, in which classification accuracy degrades.

The training set is substituted into formula (13) to solve for the parameters a and b of the regression model; the obtained classification regression model is then used to predict the test set and verify the classification effect.
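A self-contained gradient-descent sketch of this logistic classification regression model (eqs. (10)-(13)); the learning rate, epoch count, and function names are illustrative assumptions, not prescribed by the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(Z, y, lr=0.1, epochs=2000):
    """Fit g(Y) = sigmoid(a + Z @ b) on the reduced data Z by gradient descent."""
    n, k = Z.shape
    a, b = 0.0, np.zeros(k)
    for _ in range(epochs):
        grad = sigmoid(a + Z @ b) - y   # gradient of the log-loss w.r.t. a + Zb
        a -= lr * grad.mean()
        b -= lr * Z.T @ grad / n
    return a, b

def predict(Z, a, b, beta=0.5):
    """Classify with threshold beta on g(Y) (beta = 0.5 by default)."""
    return (sigmoid(a + Z @ b) > beta).astype(int)
```

Fitting on the training components and calling predict on the test components reproduces the verification procedure described above.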
