Method for identifying cell types and components in tissue samples based on gene expression profiles

Document No.: 1088702 | Publication date: 2020-10-20

Reading note: This technology, "Method for identifying cell types and components in tissue samples based on gene expression profiles" (基于基因表达谱识别组织样本中细胞类型及组分的方法), was devised by 李华梅, 赵小粼 and 刘宏德 on 2020-06-28. The invention relates to a method for identifying cell types and components in a tissue sample based on a gene expression profile, comprising: 1) obtaining specificity scores for all genes in a gene expression matrix; 2) identifying potential marker genes from those specificity scores in combination with a statistical test framework; 3) mapping the identified marker genes to their corresponding cell types using a mutual linearity strategy, filtering out low-confidence marker genes, and constructing a signature matrix that represents cell type specificity and has the minimum condition number; 4) incorporating weighted least squares into a robust linear model and combining it with the signature matrix to build a deconvolution model that predicts the cell components of the tissue sample. The invention provides a way to directly measure the specificity of a gene across any number of conditions and establishes a cell type identification algorithm, enabling the identification of cell type-specific genes and the prediction of cell components in tissue samples.

1. A method for identifying cell types and components in a tissue sample based on gene expression profiles, characterized in that the method comprises the following steps:

1) evaluating the specificity of genes across different cell types with an information-entropy-based cell type specificity scoring model, so as to obtain specificity scores for all genes in the gene expression matrix;

2) identifying potential marker genes by combining the specificity scores obtained in step 1) with a statistical test framework;

3) mapping the marker genes identified in step 2) to their corresponding cell types using a mutual linearity strategy, filtering out low-confidence marker genes, and constructing a signature matrix that represents cell type specificity and has the minimum condition number;

4) using weighted robust linear regression, that is, incorporating weighted least squares into a robust linear model and combining it with the signature matrix, to construct a deconvolution model and predict the cell components in the tissue sample.

2. The method for identifying cell types and components in a tissue sample based on gene expression profiles according to claim 1, characterized in that step 1) is implemented as follows: the specificity score of each gene across the k cell types in the purified samples is calculated using the gene specificity formula (1), wherein:

$S_i'$ is the specificity score;

$X_{ij}$ denotes the expression of the i-th gene in the j-th cell type;

the remaining term is the mean expression of the i-th gene across the cell types;

to make the specificity score more robust to noise, a tanh-transformed weight is incorporated into the specificity score formula, as follows:

$S_i = \tanh(\lambda W_i) \cdot S_i'$ (2)

Figure FDA0002557450600000013

wherein:

$S_i$ is the final specificity score of the i-th gene;

$\lambda$ is a tuning parameter with a default value of 0.1;

$W_i$ is the weight of the i-th gene;

$x_{i\cdot}$ is the expression level of the i-th gene across the k cell types;

$x_{t\cdot}$ is the expression level of the t-th gene across the k cell types;

g is the total number of genes.

3. The method for identifying cell types and components in a tissue sample based on gene expression profiles according to claim 2, characterized in that step 2) is implemented as follows:

first, the center of the distribution of S* is estimated by kernel density estimation, and a normal distribution is fitted around this center point; the P value of each gene's specificity score in S is then determined by a z-test, and genes with a P value less than or equal to 0.01 are taken as candidate marker genes; wherein

Figure FDA0002557450600000024

Figure FDA0002557450600000025

here, H0 and H1 denote the null hypothesis and the alternative hypothesis, respectively.

4. The method for identifying cell types and components in a tissue sample based on gene expression profiles according to claim 3, characterized in that step 3) is implemented as follows:

3.1) mapping marker genes to their corresponding cell types: the π value in formula (5) measures the difference between a gene's expression in a specific cell type and its average expression in the other cell types, and the gene with the highest π value among the different cell types is used as the seed gene;

$X_{ij}$ denotes the expression of the i-th gene in the j-th cell type;

$x_{i\cdot}$ is the expression level of the i-th gene across the k cell types;

3.2) calculating the mutual linearity between each candidate marker gene and the seed gene: candidate markers are mapped to cell types by a method based on the mutual linearity strategy combined with Monte Carlo sampling, as shown in formula (6);

Figure FDA0002557450600000022

wherein:

sgn(·) denotes the sign function;

$\rho_{ij}$ denotes the mutual linearity value between gene i and seed gene j;

$r_{ij}$ is the correlation coefficient;

3.3) estimating the significance of each $\rho_{ij}$ from step 3.2) by Monte Carlo sampling: first, the null distribution for each cell type is obtained by computing the mutual linearity between the non-candidate marker genes and the seed marker gene, see formula (7);

Figure FDA0002557450600000023

5. The method for identifying cell types and components in a tissue sample based on gene expression profiles according to claim 4, characterized in that step 4) is implemented as follows: the gene expression profile of the tissue sample is a convolution of the gene expression of the cell types present in the sample; estimating the unknown cell type fractions from the signature matrix is described by the linear regression m = f × B, where m is the expression profile of the bulk sample, B is the signature matrix, and f is the vector of coefficients describing how m varies with B; deconvolution is performed with a robust linear model (RLM), which is more resilient to noise; weighted least squares is incorporated into the RLM; once the deconvolution model has converged, the regression coefficients are extracted, negative coefficients are set to 0, and the remaining coefficients are normalized to sum to 1, yielding a vector of estimated cell fractions.

Technical Field

The invention relates to a deconvolution method for cell type identification, and in particular to a method for identifying cell types and components in tissue samples based on gene expression profiles.

Background

Gene expression profiles vary across tissues and cell types, as well as across developmental stages, physiological conditions, external stimuli and pathological conditions. Genes that are specifically expressed under such conditions, also known as marker genes, can be used to determine cell identity and help to understand the molecular mechanisms behind disease. For gene expression data from bulk samples, marker genes are also key to accurately predicting cellular components. In recent years, various deconvolution algorithms (which resolve the cell types present, and their proportions, from the gene expression data of mixed cells) have been proposed. Most of them require the marker genes of each cell type to be known in advance, usually from a large number of experiments, so computing potential marker genes directly from expression data is of great significance for the deconvolution of tissue samples.

Identifying cell type-specific genes is an important prerequisite for the deconvolution of mixed samples. Most strategies identify genes with significant changes by pairwise comparison and then screen them for cell type specificity. However, marker genes selected by pairwise comparison do not capture the specificity of gene expression across multiple (> 2) conditions. Developing a method that directly measures the specificity of a gene across any number of conditions, and establishing a cell type identification algorithm on top of it, is therefore of great significance for research into disease mechanisms.

Disclosure of Invention

In order to solve the technical problems in the background art, the invention provides a method for directly measuring the specificity of genes under any conditions, establishes a cell type recognition algorithm and realizes the identification of cell type specific genes and the prediction of cell components in a tissue sample.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for identifying cell types and components in a tissue sample based on gene expression profiles, characterized in that the method comprises the following steps:

1) evaluating the specificity of genes across different cell types with an information-entropy-based cell type specificity scoring model, so as to obtain specificity scores for all genes in the gene expression matrix;

2) identifying potential marker genes by combining the specificity scores obtained in step 1) with a statistical test framework;

3) mapping the marker genes identified in step 2) to their corresponding cell types using a mutual linearity strategy, filtering out low-confidence marker genes, and constructing a signature matrix that represents cell type specificity and has the minimum condition number;

4) using weighted robust linear regression, that is, incorporating weighted least squares into a robust linear model and combining it with the signature matrix, to construct a deconvolution model and predict the cell components in the tissue sample.

Preferably, step 1) is implemented as follows: the specificity score of each gene across the k cell types in the purified samples is calculated using the gene specificity formula (1), wherein:

$S_i'$ is the specificity score;

$X_{ij}$ denotes the expression of the i-th gene in the j-th cell type;

the symbol shown in Figure BDA0002557450610000022 is the mean expression of the i-th gene across the cell types;

to make the specificity score more robust to noise, a tanh-transformed weight is incorporated into the specificity score formula, as follows:

$S_i = \tanh(\lambda W_i) \cdot S_i'$ (2)

wherein:

$S_i$ is the final specificity score of the i-th gene;

$\lambda$ is a tuning parameter with a default value of 0.1;

$W_i$ is the weight of the i-th gene;

$x_{i\cdot}$ is the expression level of the i-th gene across the k cell types;

$x_{t\cdot}$ is the expression level of the t-th gene across the k cell types;

g is the total number of genes.

Preferably, step 2) is implemented as follows:

first, the center of the distribution of S* is estimated by kernel density estimation, and a normal distribution is fitted around this center point; the P value of each gene's specificity score in S is then determined by a z-test, and genes with a P value less than or equal to 0.01 are taken as candidate marker genes; wherein

the symbol shown in Figure BDA0002557450610000025 is the mean of the background distribution (i.e., the Gaussian distribution) of the specificity scores;

Figure BDA0002557450610000024

here, H0 and H1 denote the null hypothesis and the alternative hypothesis, respectively.

Preferably, step 3) is implemented as follows:

3.1) mapping marker genes to their corresponding cell types: the π value in formula (5) measures the difference between a gene's expression in a specific cell type and its average expression in the other cell types, and the gene with the highest π value among the different cell types is used as the seed gene;

Figure BDA0002557450610000031

$X_{ij}$ denotes the expression of the i-th gene in the j-th cell type;

$x_{i\cdot}$ is the expression level of the i-th gene across the k cell types;

3.2) calculating the mutual linearity between each candidate marker gene and the seed gene: candidate markers are mapped to cell types by a method based on the mutual linearity strategy combined with Monte Carlo sampling, as shown in formula (6);

wherein:

sgn(·) denotes the sign function;

$\rho_{ij}$ denotes the mutual linearity value between gene i and seed gene j;

$r_{ij}$ is the correlation coefficient;

3.3) estimating the significance of each $\rho_{ij}$ from step 3.2) by Monte Carlo sampling: first, the null distribution for each cell type is obtained by computing the mutual linearity between the non-candidate marker genes and the seed marker gene, see formula (7);

Preferably, step 4) is implemented as follows: the gene expression profile of the tissue sample is a convolution of the gene expression of the cell types present in the sample; estimating the unknown cell type fractions from the signature matrix is described by the linear regression m = f × B, where m is the expression profile of the bulk sample, B is the signature matrix, and f is the vector of coefficients describing how m varies with B; deconvolution is performed with a robust linear model (RLM), which is more resilient to noise; weighted least squares is incorporated into the RLM; once the deconvolution model has converged, the regression coefficients are extracted, negative coefficients are set to 0, and the remaining coefficients are normalized to sum to 1, yielding a vector of estimated cell fractions.

Compared with the prior art, the invention has the following notable advantages. The method for identifying cell types and components in tissue samples based on gene expression profiles comprises: 1) evaluating the specificity of genes across different cell types with an information-entropy-based cell type specificity scoring model, so as to obtain specificity scores for all genes in the gene expression matrix; 2) identifying potential marker genes by combining the specificity scores obtained in step 1) with a statistical test framework; 3) mapping the marker genes identified in step 2) to their corresponding cell types using a mutual linearity strategy, filtering out low-confidence marker genes, and constructing a signature matrix that represents cell type specificity and has the minimum condition number; 4) using weighted robust linear regression, that is, incorporating weighted least squares into a robust linear model and combining it with the signature matrix, to construct a deconvolution model and predict the cell components in the tissue sample. The invention packages this process into a single R package, LinDeconSeq. LinDeconSeq has a stronger ability to identify cell type-specific genes, reduces false-positive marker genes, and also shows good accuracy in predicting cell components.

Drawings

FIG. 1 is a schematic analysis flow chart of the present invention;

FIG. 2 is a heatmap of marker gene expression identified by LinDeconSeq for AML analysis.

FIG. 3 shows the results of deconvolution of the TCGA-AML sample and the healthy sample by LinDeconSeq.

FIG. 4 shows the results of the LinDeconSeq prognostic analysis in leukemia patients.

Detailed Description

The invention provides a method for identifying cell types and components in tissue samples based on gene expression profiles, implemented as LinDeconSeq, a tool that identifies cell type-specific marker genes and deconvolves cell types; its workflow is shown in FIG. 1. The workflow has two stages. In stage 1, a set of marker genes is identified and assigned to cell types; in stage 2, the signature matrix determined in stage 1 and weighted robust regression are used to predict the cellular components of the tissue sample.

Identifying cell types using LinDeconSeq involves the following steps:

1) The specificity of genes across different cell types is evaluated with an information-entropy-based cell type specificity scoring model, so as to obtain specificity scores for all genes in the gene expression matrix. The specificity score of each gene across the k cell types in the purified samples is calculated using the gene specificity formula (1):

Figure BDA0002557450610000041

where $S_i'$ is the specificity score, $X_{ij}$ denotes the expression of the i-th gene in the j-th cell type, and the remaining term is the mean expression of the i-th gene across the cell types. To make the specificity score more robust to noise, a tanh-transformed weight is incorporated into it:

$S_i = \tanh(\lambda W_i) \cdot S_i'$ (2)

where $S_i$ is the final specificity score of the i-th gene, $\lambda$ is a tuning parameter (default 0.1), and $W_i$ is the weight of the i-th gene; $x_{i\cdot}$ is the expression level of the i-th gene across the k cell types; $x_{t\cdot}$ is the expression level of the t-th gene across the k cell types; g is the total number of genes.
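The tanh weighting of formula (2) can be sketched as follows. This is only a minimal illustration: the raw scores $S_i'$ and the weights $W_i$ are passed in as plain lists, because formula (1) and the weight definition are images in the source, and the function name `weighted_specificity` is hypothetical.

```python
import math

def weighted_specificity(raw_scores, weights, lam=0.1):
    """Apply formula (2): S_i = tanh(lam * W_i) * S_i'."""
    if len(raw_scores) != len(weights):
        raise ValueError("raw_scores and weights must have the same length")
    return [math.tanh(lam * w) * s for s, w in zip(raw_scores, weights)]

# Made-up values: two genes with the same raw score but different weights,
# plus one gene with a moderate weight.
raw = [2.0, 2.0, 0.5]
w = [0.0, 50.0, 5.0]
scores = weighted_specificity(raw, w)
```

Because tanh is 0 at 0 and saturates towards 1 for large arguments, a gene with a negligible weight has its score pushed towards 0, while a gene with a large weight keeps almost all of its raw score.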

2) The resulting specificity scores are combined with a statistical test framework to identify potential marker genes. First, the center of the distribution of S* (the specificity scores of all genes, which is close to a normal distribution) is estimated by kernel density estimation. A normal distribution is then fitted around this center point, the P value of each gene's specificity score in S is determined by a z-test, and genes with a P value less than or equal to 0.01 are taken as candidate marker genes. Here the mean is that of the background (Gaussian) distribution of the specificity scores.

Figure BDA0002557450610000052
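The screening in step 2) can be sketched as below, under several assumptions that the text does not confirm: the kernel bandwidth follows a normal-reference rule of thumb, the spread of the background distribution is estimated only from scores at or below the center, and the z-test is one-sided (upper tail). The function names `kde_mode` and `candidate_markers` are hypothetical.

```python
import math
from statistics import NormalDist

def kde_mode(values, grid_size=512):
    """Return the grid point where a Gaussian kernel density estimate peaks."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        raise ValueError("scores must not all be identical")
    bandwidth = 1.06 * sd * n ** (-0.2)  # normal-reference rule of thumb (assumption)
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for step in range(grid_size):
        x = lo + (hi - lo) * step / (grid_size - 1)
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def candidate_markers(scores, alpha=0.01):
    """Return (indices of candidate markers, P values) from a one-sided z-test."""
    center = kde_mode(scores)
    # Assumption: the background spread is estimated only from scores at or
    # below the center, so that high-scoring markers do not inflate it.
    lower = [s for s in scores if s <= center]
    sigma = math.sqrt(sum((s - center) ** 2 for s in lower) / len(lower))
    background = NormalDist(mu=center, sigma=sigma)
    # Assumption: one-sided upper-tail test (only unusually high scores count).
    p_values = [1.0 - background.cdf(s) for s in scores]
    candidates = [i for i, p in enumerate(p_values) if p <= alpha]
    return candidates, p_values

# Made-up scores: 200 background genes plus two clear outliers (indices 200, 201).
background_scores = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
scores = background_scores + [8.0, 9.0]
candidates, p_values = candidate_markers(scores)
```

With a threshold of 0.01, roughly 1% of pure background genes are still expected to pass, which is one reason a later filtering step on the candidates is useful.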

3) The marker genes are mapped to their corresponding cell types using a mutual linearity strategy, low-confidence marker genes are filtered out, and a signature matrix that represents cell type specificity and has the minimum condition number is constructed. The higher the Pearson correlation coefficient (PCC) between two cell types, the more marker genes they share. A method based on a mutual linearity strategy is therefore proposed: for each cell type, P values are estimated from the degree of collinearity between each candidate marker gene and the seed marker gene, i.e. the one with the highest significance score; candidate marker genes with $P_{ij} \le 0.05$ are assigned to the corresponding cell type, and the remaining low-confidence marker genes are filtered out.
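The "minimum condition number" criterion can be illustrated with a short sketch. It assumes that several candidate signature matrices (genes × cell types) have already been built and that the one with the smallest 2-norm condition number is kept; how the candidates are generated is not described in the text, and `pick_best_conditioned` is a hypothetical helper that relies on NumPy.

```python
import numpy as np

def pick_best_conditioned(candidates):
    """Return (index, condition number) of the best-conditioned candidate matrix."""
    conds = [float(np.linalg.cond(np.asarray(c, dtype=float))) for c in candidates]
    best = int(np.argmin(conds))
    return best, conds[best]

# Made-up candidates with 3 genes and 2 cell types each.
nearly_collinear = [[10.0, 9.0], [9.0, 8.1], [1.0, 1.0]]  # columns almost parallel
well_separated = [[10.0, 1.0], [9.0, 1.0], [1.0, 8.0]]    # columns clearly distinct
best_index, best_cond = pick_best_conditioned([nearly_collinear, well_separated])
```

A lower condition number means the cell type columns are less collinear, so the regression in the deconvolution step is less sensitive to noise in the mixture.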

3.a) Mapping marker genes to their corresponding cell types. Ideally, the expression of a cell type-specific gene is restricted to one cell type and is robust across biological replicates of that cell type. In theory, therefore, a candidate marker gene that is expressed in only a single cell type is likely to be a marker for that cell type. On this basis, the π value (formula 5) is used to measure the difference between a gene's expression in a particular cell type and its average expression in the other cell types, and the gene with the highest π value among the different cell types is used as the seed gene.

Figure BDA0002557450610000053
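Seed gene selection can be sketched as follows. Formula (5) is an image in the source, so the π value used here (expression in one cell type minus the mean expression in the other cell types) is an assumption based on the verbal description, as is picking one seed per cell type; the names `pi_values` and `pick_seed_genes` are hypothetical.

```python
def pi_values(expr_row):
    """Assumed pi value: expression in one cell type minus the mean over the others."""
    k = len(expr_row)
    total = sum(expr_row)
    return [x - (total - x) / (k - 1) for x in expr_row]

def pick_seed_genes(expression):
    """expression maps gene name -> list of expression values, one per cell type.

    Returns, for each cell type index, the gene with the highest assumed pi value.
    """
    k = len(next(iter(expression.values())))
    seeds = []
    for j in range(k):
        seeds.append(max(expression, key=lambda g: pi_values(expression[g])[j]))
    return seeds

# Made-up expression values for four genes across three cell types.
expression = {
    "geneA": [10.0, 1.0, 1.0],
    "geneB": [1.0, 9.0, 2.0],
    "geneC": [2.0, 2.0, 8.0],
    "geneD": [5.0, 5.0, 5.0],  # uniformly expressed, never a seed
}
seeds = pick_seed_genes(expression)
```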

3.b) Calculating the mutual linearity between each candidate marker gene and the seed gene. Because gene expression is complex and cell lineages are closely related, it is difficult to assign candidate marker genes to specific cell types. For example, a marker gene of one cell type may also be overexpressed in some other cell types, which makes its assignment to a cell type ambiguous. Since marker genes belonging to the same cell type have similar expression patterns, they can be highly correlated, or mutually linear. A method based on a mutual linearity strategy combined with Monte Carlo sampling is therefore proposed to map candidate markers to cell types, as shown in formula (6).

Here sgn(·) denotes the sign function, $\rho_{ij}$ denotes the mutual linearity value between gene i and seed gene j, and $r_{ij}$ is the correlation coefficient.

3.c) Monte Carlo sampling is used to estimate the significance $p_{ij}$ of each $\rho_{ij}$ above, which makes it possible to test the null hypothesis that a candidate marker is indistinguishable from the background genes. The null distribution for each cell type is first obtained by computing the mutual linearity between the non-candidate marker genes and the seed marker gene, see formula (7).
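The Monte Carlo test in step 3.c) can be sketched as an empirical P value. Formulas (6) and (7) are images in the source, so this sketch swaps in the plain Pearson correlation as a stand-in for the mutual linearity value $\rho_{ij}$; the sampling scheme (drawing non-candidate genes with replacement) and the names `pearson` and `empirical_p_value` are likewise assumptions.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def empirical_p_value(candidate, seed_profile, background, n_draws=1000, rng=None):
    """Share of randomly drawn background genes at least as linear with the seed."""
    rng = rng or random.Random(0)
    observed = pearson(candidate, seed_profile)
    # Null distribution: statistic of non-candidate genes against the same seed.
    null = [pearson(rng.choice(background), seed_profile) for _ in range(n_draws)]
    exceed = sum(1 for value in null if value >= observed)
    return (exceed + 1) / (n_draws + 1)  # add-one correction avoids P = 0

# Made-up data: 200 background genes measured in 8 purified samples.
data_rng = random.Random(42)
background = [[data_rng.random() for _ in range(8)] for _ in range(200)]
seed_profile = [10, 9, 1, 1, 1, 1, 1, 1]
p_marker = empirical_p_value([8, 9, 1, 2, 1, 1, 2, 1], seed_profile, background)
p_anti = empirical_p_value([1, 1, 10, 9, 8, 10, 9, 10], seed_profile, background)
```

A candidate whose profile follows the seed gene receives a small P value and would be kept at the 0.05 threshold, whereas a gene with the opposite pattern receives a P value near 1 and would be filtered out.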

4) Weighted robust linear regression (w-RLM) is used: weighted least squares is incorporated into a robust linear model and combined with the signature matrix to construct a deconvolution model that predicts the cell components of the tissue sample. The gene expression profile of a tissue sample is regarded as a convolution of the gene expression of the cell types present in the sample. Since the main goal of deconvolution is to estimate the unknown cell type fractions from the signature matrix, the problem can be described by the linear regression m = f × B, where m is the expression profile of the bulk sample, B is the signature matrix, and f is the vector of coefficients describing how m varies with B. Deconvolution is performed with a robust linear model (RLM), which is more resilient to noise. To further reduce bias in the estimated cell type fractions, weighted least squares is incorporated into the RLM (w-RLM). Weighted least squares adjusts the contribution of each gene to the optimal solution, which mitigates the bias caused by imbalanced gene expression levels: a gene with a low average expression level would otherwise contribute little. The weighting therefore helps to remove prediction bias. Once the deconvolution model has converged, the regression coefficients are extracted, negative coefficients are set to 0, and the remaining coefficients are normalized to sum to 1, producing a vector of estimated cell fractions.
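The w-RLM step can be sketched with iteratively reweighted least squares. The text does not name the robust estimator or the gene weighting scheme, so this sketch assumes Huber weights with the conventional tuning constant 1.345, a MAD-based residual scale, and user-supplied per-gene weights. LinDeconSeq itself is an R package, so the NumPy function `weighted_robust_deconvolution` is purely illustrative.

```python
import numpy as np

def weighted_robust_deconvolution(m, B, gene_weights=None, n_iter=50, tol=1e-8, c=1.345):
    """Estimate cell fractions f from m ~ B f (B: genes x cell types)."""
    m = np.asarray(m, dtype=float)
    B = np.asarray(B, dtype=float)
    prior = np.ones(len(m)) if gene_weights is None else np.asarray(gene_weights, dtype=float)
    f = np.linalg.lstsq(B, m, rcond=None)[0]  # ordinary least-squares start
    for _ in range(n_iter):
        resid = m - B @ f
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745  # MAD scale
        if scale < 1e-12:
            break  # essentially perfect fit, nothing to down-weight
        u = np.abs(resid) / scale
        huber = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(prior * huber)  # combine gene weights with robust weights
        f_new = np.linalg.lstsq(B * sw[:, None], m * sw, rcond=None)[0]
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    f = np.clip(f, 0.0, None)  # negative coefficients are set to 0
    total = f.sum()
    return f / total if total > 0 else f  # normalize to sum to 1

# Made-up signature matrix: 6 genes x 3 cell types, and a noise-free mixture.
B = np.array([[10, 1, 1], [9, 2, 1], [1, 8, 1], [2, 10, 1], [1, 1, 9], [1, 2, 10]], dtype=float)
true_fractions = np.array([0.5, 0.3, 0.2])
mixture = B @ true_fractions
estimated = weighted_robust_deconvolution(mixture, B)
```

Passing, for example, larger `gene_weights` for lowly expressed genes is one way to balance their contribution; the exact weights used by the method are not given in the text.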

To verify the accuracy of the invention, three benchmark datasets (GSE64098, GSE19830 and GSE65133) were collected to evaluate the algorithm; the cell types covered by each dataset and the corresponding cell proportions in each sample are known. Accuracy was evaluated using the Pearson correlation coefficient (r) and the root mean square deviation (RMSD) between the predicted and true proportions. LinDeconSeq predicts the cellular components of tissue samples more accurately than CIBERSORT and lsfit (Tables 1-3). In addition, to assess the potential of LinDeconSeq on clinical data, the acute myeloid leukemia (AML) dataset GSE74246, which comprises 49 RNA-seq samples purified by fluorescence-activated cell sorting (FACS) and covers 13 human blood cell types, was analyzed with LinDeconSeq. Functional annotation of the marker genes mapped to the 13 cell types shows that these markers specifically reflect the functions of the different cells (FIG. 2A, which shows the expression of the signature genes of the 13 blood cell types, with the functional annotation of each gene group on the right). Comparison with two other marker gene identification algorithms further shows that LinDeconSeq identifies marker genes better and is more robust to noise (FIG. 2B-D). The signature matrix derived by LinDeconSeq also exhibits strong cell type specificity, which is the basis for the subsequent deconvolution (FIG. 3A shows the expression of the signature matrix). In particular, clustering all genes and the marker genes with the t-SNE algorithm shows that the marker genes selected by LinDeconSeq discriminate well between cell types (FIG. 3B and 3C show the t-SNE clustering results for all genes and for the marker genes, respectively; each point represents a FACS-purified cell sample). These analyses demonstrate that marker gene identification by LinDeconSeq is sound and effective.
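The two accuracy measures used in the benchmark, the Pearson correlation coefficient (r) and the RMSD between predicted and true fractions, can be computed as in the sketch below. The fractions are made-up numbers for illustration, not results from the datasets above.

```python
import math

def pearson_r(pred, truth):
    """Pearson correlation coefficient between predicted and true fractions."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)

def rmsd(pred, truth):
    """Root mean square deviation between predicted and true fractions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

# Made-up fractions for one sample with three cell types.
truth = [0.5, 0.3, 0.2]
pred = [0.45, 0.35, 0.2]
r_value = pearson_r(pred, truth)
rmsd_value = rmsd(pred, truth)
```

A higher r and a lower RMSD both indicate better agreement; reporting the two together is useful because r alone is insensitive to a constant offset or scaling of the predictions.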

TABLE 1 comparison of prediction results of different algorithms in data set GSE64098

Figure BDA0002557450610000062

Figure BDA0002557450610000071

TABLE 2 comparison of prediction results of different algorithms in data set GSE19830

Figure BDA0002557450610000072

TABLE 3 comparison of prediction results in data set GSE65133 for different algorithms

To further examine the deconvolution performance of LinDeconSeq, AML samples from the TCGA database were analyzed. The cell fractions in TCGA-AML were predicted using LinDeconSeq with the signature matrix derived in step 3), and the results show that AML patients are highly heterogeneous in their cell fractions (FIG. 3D shows the cell fractions of 179 TCGA-AML patients, with one entry per sample and one color per cell type). The Pearson correlation coefficient (PCC, r) between the cell fractions predicted by LinDeconSeq and by CIBERSORT is high, indicating good agreement between the two methods (FIG. 3E shows the cell fractions predicted for TCGA-AML patients by LinDeconSeq and CIBERSORT; each point represents a specific cell type in a sample). This agreement supports the accuracy with which LinDeconSeq predicts the cellular components of tissue samples. In addition, comparing the fractions of the 13 cell types between healthy and AML samples shows that most of them differ significantly, suggesting that cellular components have potential value for the diagnosis of AML (FIG. 3F shows the fractions of the 13 major blood cell types in AML and healthy samples; each point in a group represents the fraction of a particular cell type, the bold line in each box is the median, and the bottom and top of the box are the 25th and 75th percentiles, i.e. the interquartile range). Three diagnostic models were built from the cell fractions predicted by LinDeconSeq, and receiver operating characteristic (ROC) curves were plotted for each model; they show that AML can be diagnosed from cell components with very high accuracy (FIG. 3G shows the ROC curves based on the predicted fractions of different cell components in AML and healthy samples). Taken together, these results show that the cell components predicted by LinDeconSeq reveal differences between individuals in different states.

In addition, the cell components predicted by LinDeconSeq are promising for identifying disease subtypes. Applying the PAM classification algorithm to the TCGA-AML cell fractions yielded two potential AML subtypes, which differ significantly in the fractions of some cell types, such as granulocyte-monocyte progenitor cells (GMP) (FIG. 4A is a heat map of the cell type fractions in the subgroups of the TCGA-AML samples). GMP cells play an important role in distinguishing the two subtypes. Prognostic analysis showed that the two subtypes differ significantly in survival time, a finding that was confirmed in the TARGET-AML data (FIG. 4B-E: B is the Kaplan-Meier curve of overall survival for the two AML subgroups in the 179 TCGA-AML samples; C is a heat map of the correlation coefficients between cellular components and differentially expressed genes in the 179 TCGA-AML samples; D is the Kaplan-Meier curve of overall survival for the two subgroups predicted in the TARGET-AML data with a random forest classifier; E is the distribution of the fractions of the 13 major blood cell types in the predicted subgroups of the TARGET-AML samples). These analyses demonstrate the value of LinDeconSeq in clinical applications.
