Method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian

Document No.: 1075102  Publication date: 2020-10-16

Reading note: This technology, "Method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian" (互斥性约束图拉普拉斯的异质性癌症驱动基因识别方法), was designed by 习佳宁 and 黄庆华 on 2020-06-23. Abstract: The invention provides a method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian. First, cancer genome variation data and a gene interaction network are acquired. Next, the heterogeneity of the cancer is described with a matrix model, and the sample parameters of the heterogeneous cancer are estimated in a differentiated manner through mutual exclusion constrained matrix decomposition. Then, a mutual exclusion constrained matrix decomposition optimization function jointly regularized by the associated interaction network is constructed, and the parameters of driver genes affected by interactions in local samples are corrected through iterative solution. Finally, driver genes are identified with an outlier test. The method addresses the differentiated estimation of cancer sample parameters and the effective identification of local-sample driver genes affected by interactions, and thereby identifies driver genes that vary only in local samples from the gene variation data of heterogeneous cancer samples.

1. A method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian, characterized by comprising the following steps:

step 1: acquiring cancer genome variation data and a gene interaction relationship network, and performing ID unification on gene names through a gene annotation database;

step 2: constructing a 0/1 vector for each sample according to whether each gene is varied in the input cancer samples, the elements of the vector being the 0/1 values indicating whether each gene is varied, and concatenating all 0/1 vectors to form a variation matrix $X = [x_{ij}]_{m \times n}$, where $m$ is the number of samples, $n$ is the number of genes, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, and $x_{ij}$ is the element in row $i$ and column $j$ of $X$; if the $j$-th gene of the $i$-th sample is varied, $x_{ij}$ takes the value 1, otherwise 0;

step 3: for the input $m$ samples and $n$ genes, setting up empty matrices $U = [u_{ik}]_{m \times r}$ and $V = [v_{jk}]_{n \times r}$, where $r$ is a given parameter dimension with $r < m, n$; the matrix $U$ is denoted the sample parameter matrix and each of its elements a sample parameter, $u_{ik}$ being the $k$-th parameter of the $i$-th sample in the $r$-dimensional space; the matrix $V$ is denoted the gene parameter matrix and each of its elements a gene parameter, $v_{jk}$ being the $k$-th parameter of the $j$-th gene in the $r$-dimensional space; $k$ is the index over the parameter dimensions of the $r$-dimensional space, $k = 1, 2, \ldots, r$;

step 4: solving the mutual exclusion constrained matrix decomposition model to obtain the sample parameters and gene parameters in the matrices $U$ and $V$:

Figure FDA0002553749240000011

where $l$ indicates the $l$-th parameter of the $r$-dimensional space currently under consideration, $l = 1, 2, \ldots, r$, and the threshold is an adjustable value with a range of (0, 0.1);

step 5: calculating the graph Laplacian regularization term $\mathrm{Reg}_Y(V)$ of the gene parameters according to the following formula:

where $Y$ denotes the set of interaction relationships among genes, $Y = \{(i, j) \mid \text{genes } i \text{ and } j \text{ interact}\}$, $i$ and $j$ denote the $i$-th and $j$-th genes respectively, and $(i, j)$ denotes a pair of interacting genes; $v_i$ is the $i$-th row of the matrix $V$, and $v_j$ is the $j$-th row of $V$; $I$ is an indicator function,

Figure FDA0002553749240000013

iteratively solving the following matrix decomposition optimization function with the joint graph Laplacian regularization term to obtain the sample parameter matrix $U'$ and the gene parameter matrix $V'$ that incorporate the gene interaction network:

Figure FDA0002553749240000021

where $\lambda$ is the tuning parameter of the regularization term and takes a real value greater than zero;

step 6: for the matrix $U'$ obtained in step 5, taking the samples corresponding to the indices of the maximum values of the $k$-th column elements as the local samples of the $k$-th subgroup, $k = 1, 2, \ldots, r$, thereby obtaining the local samples of all $r$ subgroups;

step 7: performing driver gene detection on the local samples of each of the $r$ subgroups with an outlier detection method, first obtaining the null hypothesis distributions corresponding to each group of local samples, specifically:

first, for the $k$-th of the $r$ subgroups, $k = 1, 2, \ldots, r$, selecting the rows of the variation matrix $X$ that correspond to the local samples of the $k$-th subgroup to form the $k$-th sub-matrix, and performing random walk processing on the $k$-th sub-matrix with a random walk with restart algorithm using the gene interaction relationship set $Y$; then, randomly rearranging the matrix obtained after the random walk and adding up all $1 \times n$ row vectors of the rearranged matrix, the resulting $1 \times n$ vector being one sample of the distribution levels of the $n$ genes, in which the values of the $n$ dimensions represent the distribution levels of the $n$ genes in the current sampling; repeating the random rearrangement sampling 10000 times to obtain 10000 sampling results for the $n$ genes, and, from the values of these 10000 sampling results, constructing $n$ value-frequency distributions, one per gene, which serve as the $n$ null hypothesis distributions corresponding to the $n$ genes of the local samples of the $k$-th subgroup;

step 8: among the $n$ null hypothesis distributions of the $k$-th group of local samples, the null hypothesis distribution corresponding to the $j$-th gene is the $j$-th of the $n$ distributions; comparing the value of the element $v'_{jk}$ of the gene parameter matrix $V'$ with the abscissa of the $j$-th null hypothesis distribution, and taking the area of the distribution to the right of the compared position as the test p-value; applying the Benjamini-Hochberg false discovery rate correction to the test p-value to obtain a corrected p-value; if the corrected p-value is less than 0.05, the $j$-th gene is regarded as a driver gene of the $k$-th group of local samples; letting $j$ run from 1 to $n$ and $k$ from 1 to $r$ and processing as above yields the identification result of whether each gene is a driver gene for every group of local samples.

Technical Field

The invention belongs to the technical field of bioinformatics and genome data mining, and particularly relates to a method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian.

Background

Cancer is a highly malignant disease caused mainly by variations in driver genes. However, cancer genomes also contain many passenger variations unrelated to carcinogenesis, which seriously confound the identification of driver genes. Because driver gene variations are more likely than passenger variations to recur across multiple samples, existing studies mainly treat driver genes as genes that are frequently varied across samples: based on the gene variation data of cancer samples, they test the statistical significance of each gene's variation rate to find driver genes with significantly high variation frequency. For example, Lawrence et al., in "Lawrence M S, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes [J]. Nature, 2013, 499(7457): 214", proposed a statistical test of variation frequency corrected by the background variation rate of each gene, to identify genes with significantly high variation frequency in cancer samples. Kumar et al., in "Kumar R D, Swamidass S J, Bose R. Unsupervised detection of cancer driver mutations with parsimony-guided learning [J]. Nature Genetics, 2016, 48(10): 1288", further imposed parsimony constraints on driver gene prediction to reduce false positives in identification. However, cancer also exhibits tumor heterogeneity, that is, the varied driver genes differ greatly between samples, which complicates the distribution of driver genes in local samples. In cancers with tumor heterogeneity, a driver gene that is varied only in a local subset of samples shows a low variation rate relative to the whole sample set. Although existing methods can identify genes with a high variation rate within each class of samples when the sample subclasses of a heterogeneous cancer are known, they cannot distinguish the differing local samples when the sample classes are unavailable, and therefore cannot identify the local-sample driver genes of heterogeneous cancers.

Because a driver gene may become abnormal under the influence of interactions with other varied genes, related studies also treat the variation rate as each gene's functional-abnormality influence and model its propagation through gene interaction relationships, in order to select driver genes with a high degree of influence. For example, Raphael et al., in "Leiserson M D M, Vandin F, Wu H T, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes [J]. Nature Genetics, 2015, 47(2): 106", take gene variation frequency as the influence, propagate it through gene interactions, and use the post-propagation score as the degree to which each gene is affected by interactions, so as to identify associated driver genes shared across cancer samples. Because such propagation reaches a large number of non-varied genes and thus produces false positives, Cho et al., in "Cho A, Shim J E, Kim E, et al. MUFFINN: cancer gene discovery via network analysis of somatic mutation data [J]. Genome Biology, 2016, 17(1): 129", let frequently varied genes influence only their directly interacting genes, restricting the propagation to avoid false identifications caused by multi-step propagation. To filter out irrelevant genes at the level of statistical significance, Horn et al., in "Horn H, Lawrence M S, Chouinard C R, et al. NetSig: network-based discovery from cancer genomes [J]. Nature Methods, 2018, 15(1): 61", characterize the statistical significance of the influence of frequently varied genes through interaction relationships, further improving the prediction of shared associated driver genes. However, these propagation-based studies consider only the influence of gene variation rates at the level of the whole sample set and still cannot account for the influence of gene interactions within local samples. For heterogeneous cancers, the influence of gene interactions on local samples is therefore lost in modeling, so driver genes affected by gene interactions in local samples are missed.

In summary, current research has the following problems: 1) when the sample classes of a heterogeneous cancer are unavailable, the varied driver genes differ greatly between samples, so they show a low variation rate across all samples and are difficult to identify effectively; 2) existing methods use the variation frequency of genes over the whole sample set as the criterion, so in heterogeneous cancers the driver genes affected by gene interactions in local samples are missed. Because a method for distinguishing cancer samples without sample class labels is lacking, and the missed detection of interaction-affected local-sample driver genes has not been adequately addressed, research on the pathogenic mechanisms and the clinical diagnosis and treatment of heterogeneous cancers is hindered.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian. First, cancer genome variation data and a gene interaction network are acquired. Next, the heterogeneity of the cancer is described with a matrix model, and the sample parameters of the heterogeneous cancer are estimated in a differentiated manner through mutual exclusion constrained matrix decomposition. Then, a mutual exclusion constrained matrix decomposition optimization function jointly regularized by the associated interaction network is constructed, and the parameters of driver genes affected by interactions in local samples are corrected through iterative solution. Finally, driver genes are identified with an outlier test. The method addresses the differentiated estimation of cancer sample parameters and the effective identification of local-sample driver genes affected by interactions, and thereby identifies driver genes that vary only in local samples from the gene variation data of heterogeneous cancer samples.

A method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian, characterized by comprising the following steps:

step 1: acquiring cancer genome variation data and a gene interaction relationship network, and performing ID unification on gene names through a gene annotation database;

step 2: constructing a 0/1 vector for each sample according to whether each gene is varied in the input cancer samples, the elements of the vector being the 0/1 values indicating whether each gene is varied, and concatenating all 0/1 vectors to form a variation matrix $X = [x_{ij}]_{m \times n}$, where $m$ is the number of samples, $n$ is the number of genes, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, and $x_{ij}$ is the element in row $i$ and column $j$ of $X$; if the $j$-th gene of the $i$-th sample is varied, $x_{ij}$ takes the value 1, otherwise 0;

step 3: for the input $m$ samples and $n$ genes, setting up empty matrices $U = [u_{ik}]_{m \times r}$ and $V = [v_{jk}]_{n \times r}$, where $r$ is a given parameter dimension with $r < m, n$; the matrix $U$ is denoted the sample parameter matrix and each of its elements a sample parameter, $u_{ik}$ being the $k$-th parameter of the $i$-th sample in the $r$-dimensional space; the matrix $V$ is denoted the gene parameter matrix and each of its elements a gene parameter, $v_{jk}$ being the $k$-th parameter of the $j$-th gene in the $r$-dimensional space; $k$ is the index over the parameter dimensions of the $r$-dimensional space, $k = 1, 2, \ldots, r$;

step 4: solving the mutual exclusion constrained matrix decomposition model to obtain the sample parameters and gene parameters in the matrices $U$ and $V$:

where $l$ indicates the $l$-th parameter of the $r$-dimensional space currently under consideration, $l = 1, 2, \ldots, r$, and the threshold is an adjustable value with a range of (0, 0.1);

step 5: calculating the graph Laplacian regularization term $\mathrm{Reg}_Y(V)$ of the gene parameters according to the following formula:

where $Y$ denotes the set of interaction relationships among genes, $Y = \{(i, j) \mid \text{genes } i \text{ and } j \text{ interact}\}$, $i$ and $j$ denote the $i$-th and $j$-th genes respectively, and $(i, j)$ denotes a pair of interacting genes; $v_i$ is the $i$-th row of the matrix $V$, and $v_j$ is the $j$-th row of $V$; $I$ is an indicator function with elements $I_Y(i, j)$: if genes $i$ and $j$ are joined by an edge in the gene interaction network, $I_Y(i, j)$ takes the value 1, otherwise $I_Y(i, j)$ takes the value 0;

iteratively solving the following matrix decomposition optimization function with the joint graph Laplacian regularization term to obtain the sample parameter matrix $U'$ and the gene parameter matrix $V'$ that incorporate the gene interaction network:

Figure BDA0002553749250000041

where $\lambda$ is the tuning parameter of the regularization term and takes a real value greater than zero;

step 6: for the matrix $U'$ obtained in step 5, taking the samples corresponding to the indices of the maximum values of the $k$-th column elements as the local samples of the $k$-th subgroup, $k = 1, 2, \ldots, r$, thereby obtaining the local samples of all $r$ subgroups;

step 7: performing driver gene detection on the local samples of each of the $r$ subgroups with an outlier detection method, first obtaining the null hypothesis distributions corresponding to each group of local samples, specifically:

first, for the $k$-th of the $r$ subgroups, $k = 1, 2, \ldots, r$, selecting the rows of the variation matrix $X$ that correspond to the local samples of the $k$-th subgroup to form the $k$-th sub-matrix, and performing random walk processing on the $k$-th sub-matrix with a random walk with restart algorithm using the gene interaction relationship set $Y$; then, randomly rearranging the matrix obtained after the random walk and adding up all $1 \times n$ row vectors of the rearranged matrix, the resulting $1 \times n$ vector being one sample of the distribution levels of the $n$ genes, in which the values of the $n$ dimensions represent the distribution levels of the $n$ genes in the current sampling; repeating the random rearrangement sampling 10000 times to obtain 10000 sampling results for the $n$ genes, and, from the values of these 10000 sampling results, constructing $n$ value-frequency distributions, one per gene, which serve as the $n$ null hypothesis distributions corresponding to the $n$ genes of the local samples of the $k$-th subgroup;

step 8: among the $n$ null hypothesis distributions of the $k$-th group of local samples, the null hypothesis distribution corresponding to the $j$-th gene is the $j$-th of the $n$ distributions; comparing the value of the element $v'_{jk}$ of the gene parameter matrix $V'$ with the abscissa of the $j$-th null hypothesis distribution, and taking the area of the distribution to the right of the compared position as the test p-value; applying the Benjamini-Hochberg false discovery rate correction to the test p-value to obtain a corrected p-value; if the corrected p-value is less than 0.05, the $j$-th gene is regarded as a driver gene of the $k$-th group of local samples; letting $j$ run from 1 to $n$ and $k$ from 1 to $r$ and processing as above yields the identification result of whether each gene is a driver gene for every group of local samples.

The beneficial effects of the invention are: 1) to address the large differences among heterogeneous cancer samples at the level of genomic variation, the parameters of the cancer samples are estimated in a differentiated manner, which solves the problem that heterogeneous cancer samples are difficult to distinguish when sample class labels are missing; 2) to address the missed detection of driver genes in local samples under the influence of gene interactions, the driver gene parameters are regularized through the gene interaction relationships, which bridges the gene parameters at the level of interaction influence and improves driver gene identification in local samples.

Drawings

FIG. 1 is a flow chart of the method of the invention for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian.

Detailed Description

The present invention is further described below with reference to the drawing and an embodiment; the invention includes, but is not limited to, the following embodiment.

As shown in FIG. 1, the invention provides a method for identifying heterogeneous cancer driver genes based on a mutual exclusion constraint graph Laplacian, implemented as follows:

step 1: the Cancer gene mutation data were obtained from The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) databases. Data for the gene interaction network is collected from the STRING interaction database. The ID unification of gene names is carried out by a gene Annotation Database for Annotation, Visualization and normalized Discovery (DAVID), so as to eliminate the phenomenon of synonyms and synonyms of genes in data of different sources.

Step 2: the heterogeneity of the cancer is described by using a matrix model, and the rows and columns of the matrix respectively describe the differential local samples and the corresponding driving genes thereof. Different local samples are described through multivariate parameters, and the difference of parameter values represents the difference between the local samples so as to effectively quantify the difference of the samples. The method specifically comprises the following steps: constructing 0/1 vector for each sample according to the variation condition of each gene in the inputted cancer sample, wherein the elements of the vector are 0/1 values of whether each gene is varied, and all 0/1 vectors are spliced to form a variation matrix X ═ Xij]m×nWherein m is the number of samples, n is the number of genes, i is 1,2, …, m, j is 1,2, …, n, xijIs the ith row and j column elements in the matrix X if the jth sample of the ith row and j column elementsVariation of gene xijThe value is 1, otherwise the value is 0.

Step 3: Multiple parameters are used to describe the degree of abnormality of genes in local samples, reflecting the driver genes of each local sample of the heterogeneous cancer; these are denoted sample parameters and gene parameters. That is, for the input $m$ samples and $n$ genes, empty matrices $U = [u_{ik}]_{m \times r}$ and $V = [v_{jk}]_{n \times r}$ are set up, where $r$ is a user-defined parameter dimension that must satisfy $r < m, n$; the matrix $U$ is denoted the sample parameter matrix and each of its elements a sample parameter, $u_{ik}$ being the $k$-th parameter of the $i$-th sample in the $r$-dimensional space; the matrix $V$ is denoted the gene parameter matrix and each of its elements a gene parameter, $v_{jk}$ being the $k$-th parameter of the $j$-th gene in the $r$-dimensional space, $k = 1, 2, \ldots, r$.

Step 4: The low-dimensional parameters of the variation data are estimated by mutual exclusion constrained matrix decomposition, which ensures that different local samples are distinguished:

Figure BDA0002553749250000051

where $l$ indicates the $l$-th parameter of the $r$-dimensional space currently under consideration, $k$ is the index over the parameter dimensions of the $r$-dimensional space, and the threshold is a user-adjustable value with an admissible range of (0, 0.1).

The above formula realizes the mutual exclusion constraint by limiting the covariance to a small value, which highlights local samples whose genomic variations differ greatly and yields a differentiated estimate of the sample parameters of the heterogeneous cancer. In the estimation result, the product of the sample parameters and the gene parameters approximately reconstructs the original gene variation data matrix, where each row describes the differences between samples at the level of gene variation and each column represents the degree of abnormality of each gene in a local sample.
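The two ingredients described for step 4, reconstruction of $X$ from $U$ and $V$ and a covariance bounded by a small threshold, can be checked numerically with the sketch below. Since the formula itself appears only as an image, everything specific here is an assumption: the squared Frobenius error, the covariance being taken between distinct columns of $U$, the signed bound `cov <= eps`, and all function names.

```python
import numpy as np

def reconstruction_error(X, U, V):
    """Squared Frobenius error between X and the product U V^T (assumed loss)."""
    return float(np.sum((X - U @ V.T) ** 2))

def max_cross_covariance(U):
    """Largest covariance between two distinct columns of U (assumed constraint target)."""
    C = np.atleast_2d(np.cov(U, rowvar=False))   # r x r covariance of parameter dimensions
    r = C.shape[0]
    if r < 2:
        return 0.0
    off_diagonal = ~np.eye(r, dtype=bool)        # exclude the variances on the diagonal
    return float(C[off_diagonal].max())

def satisfies_mutual_exclusion(U, eps=0.05):
    """Check the assumed constraint cov(u_k, u_l) <= eps for all k != l."""
    return max_cross_covariance(U) <= eps

# Two sample groups that never share a parameter dimension.
U_excl = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_toy = U_excl @ V_toy.T
```

Under these assumptions, `U_excl` passes the check because its two columns are negatively correlated, while a matrix whose columns rise and fall together would fail it.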

Step 5: An indicator function $I$ expresses whether each pair of genes is adjacent in the gene interaction network: if genes $i$ and $j$ are joined by an edge, $I_Y(i, j)$ takes the value 1, otherwise $I_Y(i, j)$ takes the value 0. According to the association relationships of the gene interaction network, the distances between gene parameters are regularized with the graph Laplacian method:

where $\mathrm{Reg}_Y(V)$ is the graph Laplacian regularization term of the gene parameters. $Y$ denotes the set of interacting gene pairs, that is, $Y = \{(i, j) \mid \text{gene } i \text{ interacts with gene } j\}$, where $i$ and $j$ denote the $i$-th and $j$-th genes respectively and $(i, j)$ denotes a pair of interacting genes. $v_i$ is the $i$-th row of the matrix $V$, and $v_j$ is the $j$-th row of $V$.

The above formula reflects the degree of association of the gene parameters with respect to the interaction network in each local sample. By jointly estimating the association-bridging regularization term together with the sample parameters and gene parameters, the following mutual exclusion constrained matrix decomposition optimization function, jointly regularized by the associated interaction network, is constructed:

where $\lambda$ is the tuning parameter of the regularization term, chosen by the user as a real number greater than zero.

By solving the above formula iteratively, the gene parameters of interacting genes in the local samples are gradually drawn toward each other, which corrects the gene parameters affected by interactions in the local samples.
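For orientation, the sketch below computes a graph Laplacian regularizer in its standard textbook form, the sum of squared distances $\|v_i - v_j\|^2$ over interacting gene pairs, and verifies that it equals $\mathrm{tr}(V^\top L V)$ with $L = D - A$. The exact formula for $\mathrm{Reg}_Y(V)$ is not reproduced in the text, so this form (including the absence of any scaling constant) is an assumption, as are the function names. Each undirected pair is assumed to be listed once.

```python
import numpy as np

def reg_pairwise(V, edges):
    """Sum of squared distances ||v_i - v_j||^2 over interacting gene pairs (i, j)."""
    return float(sum(np.sum((V[i] - V[j]) ** 2) for i, j in edges))

def reg_laplacian(V, edges):
    """The same quantity written as trace(V^T L V) with L = D - A."""
    n = V.shape[0]
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0        # indicator I_Y(i, j) as a symmetric adjacency matrix
    L = np.diag(A.sum(axis=1)) - A     # graph Laplacian: degree matrix minus adjacency
    return float(np.trace(V.T @ L @ V))

# Toy gene parameters (n = 3 genes, r = 2) and two interactions.
V_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges_demo = [(0, 1), (1, 2)]
```

Minimizing a term of this kind pulls the parameter rows of interacting genes toward each other, which matches the behaviour described for the iterative solution in step 5.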

Step 6: The finally obtained sample parameter matrix $U'$ reflects the assignment of samples to local sample groups. The samples corresponding to the indices of the maximum values of the $k$-th column elements are taken as the local samples of the $k$-th subgroup, $k = 1, 2, \ldots, r$, which yields the local samples of all $r$ subgroups.
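One way to read step 6 is that each sample joins the subgroup whose column holds that sample's largest parameter. The sketch below implements this reading; it is an interpretation of the wording, not a confirmed detail of the method, and `local_samples` is a hypothetical helper name.

```python
import numpy as np

def local_samples(U):
    """Group sample indices by the column holding each sample's largest parameter."""
    labels = np.argmax(U, axis=1)      # subgroup index for every sample
    return {k: np.flatnonzero(labels == k).tolist() for k in range(U.shape[1])}

# Toy sample parameter matrix with m = 3 samples and r = 2 subgroups.
groups = local_samples(np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]))
```

With this toy matrix, samples 0 and 2 form the first subgroup and sample 1 forms the second.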

Step 7: Driver gene detection is performed on the local samples of each of the $r$ subgroups with an outlier detection method. The null hypothesis distributions corresponding to each group of local samples are first obtained as follows:

First, for the $k$-th of the $r$ subgroups, the rows of the variation matrix $X$ that correspond to the local samples of the $k$-th subgroup are selected to form the $k$-th sub-matrix. Random walk processing is performed on the $k$-th sub-matrix with a random walk with restart algorithm using the gene interaction relationship set $Y$. The matrix obtained after the random walk is then randomly rearranged, and all $1 \times n$ row vectors of the rearranged matrix are added up; the resulting $1 \times n$ vector is one sample of the distribution levels of the $n$ genes, in which the values of the $n$ dimensions represent the distribution levels of the $n$ genes in the current sampling. The random rearrangement sampling is repeated 10000 times to obtain 10000 sampling results for all $n$ genes. From the values of these 10000 sampling results, $n$ value-frequency distributions are constructed, one per gene, which serve as the $n$ null hypothesis distributions corresponding to the $n$ genes of the local samples of the $k$-th subgroup.
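The procedure of step 7 can be sketched as follows. Several details are not fixed by the text and are assumptions here: the restart probability (`restart=0.5`), the closed-form solution of the random walk with restart on a row-normalized adjacency matrix, and the reading of "random rearrangement" as shuffling the gene positions independently within each sample row. Function names are hypothetical.

```python
import numpy as np

def random_walk_with_restart(X_k, A, restart=0.5):
    """Propagate each row of X_k over the network with adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; isolated genes keep an all-zero row.
    W = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    n = A.shape[0]
    # Steady state of p <- (1 - restart) * p W + restart * p0, solved in closed form.
    return restart * X_k @ np.linalg.inv(np.eye(n) - (1.0 - restart) * W)

def null_distributions(F, n_perm=10000, seed=0):
    """Return an n_perm x n array; column j holds the null samples of gene j."""
    rng = np.random.default_rng(seed)
    null = np.empty((n_perm, F.shape[1]))
    for t in range(n_perm):
        shuffled = rng.permuted(F, axis=1)   # shuffle gene positions within each sample
        null[t] = shuffled.sum(axis=0)       # add up all 1 x n row vectors
    return null

# Toy network of three genes in a chain, and two local samples.
A_demo = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X_k_demo = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
F_demo = random_walk_with_restart(X_k_demo, A_demo)
null_demo = null_distributions(F_demo, n_perm=200)
```

The example uses 200 permutations to keep it fast; the text specifies 10000 repetitions.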

Step 8: Among the $n$ null hypothesis distributions of the $k$-th group of local samples, the null hypothesis distribution corresponding to the $j$-th gene is the $j$-th of the $n$ distributions. The value of the element $v'_{jk}$ of the gene parameter matrix $V'$ is compared with the abscissa of the $j$-th null hypothesis distribution, and the area of the distribution to the right of the compared position is taken as the test p-value. The Benjamini-Hochberg false discovery rate correction is applied to the test p-value to obtain a corrected p-value; if the corrected p-value is less than 0.05, the $j$-th gene is regarded as a driver gene of the $k$-th group of local samples. Letting $j$ run from 1 to $n$ and $k$ from 1 to $r$ and processing as above yields the identification result of whether each gene is a driver gene for every group of local samples.
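The test in step 8 can be sketched with an empirical right-tail p-value and a plain NumPy implementation of the Benjamini-Hochberg adjustment. The set of p-values over which the correction is applied is not stated in the text; applying it across the genes of one subgroup is an assumption of this example, and the function names are hypothetical.

```python
import numpy as np

def right_tail_p(null_col, observed):
    """Fraction of null samples at or above the observed value (right-tail area)."""
    return float(np.mean(np.asarray(null_col) >= observed))

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)            # p_(i) * n / i
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out

def call_drivers(p_values, alpha=0.05):
    """Indices of genes whose corrected p-value is below alpha."""
    return np.flatnonzero(benjamini_hochberg(p_values) < alpha).tolist()
```

For one subgroup $k$, the p-value of gene $j$ would be `right_tail_p(null[:, j], V_prime[j, k])`, where `null` is the array of sampled null values and `V_prime` stands for the matrix $V'$; both names are placeholders.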
