Model for predicting cell proliferation activity by using 87 genes as biomarkers

Document No.: 1129365. Publication date: 2020-10-02.

Abstract: Model for predicting cell proliferation activity by using 87 genes as biomarkers (以87个基因作为生物标志物预测细胞增殖活性的模型), created by 吴超 and 郑敏 on 2020-06-17. The invention provides a model for predicting cell proliferation activity using 87 genes as biomarkers. The expression level of the cell proliferation gene set is positively correlated with the proliferation activity of cells. The invention provides a method for evaluating cell proliferation activity without in vitro culture. Combined with single-cell sequencing, it allows the proliferation activity of each cell type in vivo to be determined quickly and simply. The invention helps determine whether significantly proliferating normal cells are present in cancer tissue. When a large number of such cells are present in cancer tissue, treatment and evaluation approaches that target cell proliferation markers will be confounded and may fail; when they are not, such approaches are expected to succeed. The invention therefore provides auxiliary guidance for cancer diagnosis and treatment based on cell proliferation mechanisms.

1. A model for predicting cell proliferation activity using 87 genes as biomarkers, comprising the steps of:

(1) establishing a cell proliferation gene set:

1) data acquisition

Obtaining single-cell RNA-Seq data of different types of normal cells from the Tabula Muris database, obtaining cancer and paracancerous tissue RNA-Seq data from The Cancer Genome Atlas (TCGA) database, obtaining tissue RNA-Seq data from the GTEx database, and obtaining cell line RNA-Seq data and cell proliferation activity data from the CCLE database;

2) stem/progenitor cell specific expression gene set mining

a) Classifying in vivo normal single cells in the Tabula Muris database into 81 classes according to cell type, and calculating the expression value $X_{ji}$ of a gene j in a cell type i,

wherein m is the total number of cells belonging to cell type i, and n is the number of cells in cell type i whose read count for gene j is greater than 0; the expression values of all genes in cell type i are calculated in this way, and the expression values of all genes in the 81 cell types are calculated in turn;

b) dividing the 81 cell types into two groups: a stem/progenitor cell group and an other-cell group;

c) using hierarchical clustering analysis to mine genes that are highly expressed in the stem/progenitor cell group and expressed at very low levels in the other cell group, as a stem/progenitor cell specific gene set;

3) cell proliferation gene set mining

a) Obtaining the expression values of the genes in the stem/progenitor cell specific gene set in each normal tissue sample in the GTEx database; because terminally differentiated cells without proliferative activity make up the bulk of most normal tissues, performing hierarchical clustering analysis on these genes to obtain a gene group of 87 genes with low expression in normal tissues;

b) obtaining the expression values of the 87 genes in the cancer and paracancerous tissue samples of the TCGA database; for each gene j, calculating its Z-score standardized expression value $Y_j$ across all cancer and paracancerous tissue samples; for a sample k, listing the expression vector of the 87 genes as $\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$ and calculating the gene set expression value as the median of this vector; comparing the gene set expression values of the samples of each cancer with those of all paracancerous samples using a T test; and, because most cancer tissue consists of highly proliferative cancer cells, confirming that the gene set is highly expressed in cancer tissue and lowly expressed in paracancerous tissue, and thus that the gene group of 87 genes is a cell proliferation gene set;

(2) using the above cell proliferation gene set to establish a model for predicting cell proliferation activity:

1) prediction of in vitro cultured cancer cell line proliferation activity by cell proliferation gene set

a) Obtaining the expression values of the genes in the cell proliferation gene set in each cancer cell line in the CCLE database; for each gene j, calculating its Z-score standardized expression value $Z_j$ across all cell line samples; for a cell line sample k, listing the expression vector of the 87 genes as $\{Z_{1k}, Z_{2k}, \ldots, Z_{87k}\}$ and calculating the gene set expression value as the median of these 87 values, thereby obtaining the cell proliferation gene set expression value of each cell line sample;

b) obtaining partial cell proliferation activity data in a CCLE database;

c) performing Pearson correlation analysis on the cell proliferation gene set expression values of the cell line samples and the doubling time data of the corresponding cell lines, confirming that, in cancer cell lines derived from solid tumors, cell proliferation activity is significantly positively correlated with the expression value of the cell proliferation gene set consisting of the 87 genes, so that the proliferation activity of cancer cell lines derived from solid tumors can be predicted from the expression level of the cell proliferation gene set;

2) establishing a cell proliferation activity prediction model

a) Classifying the single cells in the Tabula Muris database into 81 types according to cell types, and obtaining the gene expression values of various cells as above;

b) carrying out hierarchical clustering analysis on 81 cell types by using the expression values of the 87 genes, and clustering the cell types into 2-3 types through clustering analysis;

c) calculating the cell proliferation gene set expression value of each of the 81 cell types: obtaining the expression values of the 87 genes of the cell proliferation gene set in each cell type; for a cell type i and a gene j with expression value $X_{ji}$, listing the expression vector of the 87 genes as $\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$ and calculating the cell proliferation gene set expression value as the median of these 87 values;

d) according to the result of cluster analysis, 81 cell types are clustered into 2-3 different cell type groups, expression value vectors of cell proliferation gene sets of each cell type group are obtained, the expression values of the cell proliferation gene sets of the different cell type groups are compared, whether the expression value of the cell proliferation gene set of a certain cell type group is significantly higher than that of other cell type groups is judged by taking P <0.05 as a threshold value, and therefore the cell type of a high-expression cell proliferation gene and the cell type of a low-expression cell proliferation gene are confirmed, and the proliferation activity of the 81 cell types is evaluated.

2. The model of claim 1 for predicting cell proliferation activity using 87 genes as biomarkers, wherein the 87 genes are: ANLN, ARHGAP11, ASF1, ATAD, AURKA, AURKB, BIRC, BRCA, BUB1, CCNA, CCNB, CDC, CDCA, CDK, CDT, CENPA, CENPE, CENPF, CENPH, CENPK, CENPM, CENPW, CEP, CKAP2, CLSPN, DBF, DLGAP, ESCO, FEN, FOXM, HIRIP, HIST1H2, HMMR, KIF20, KIF, KIFC, LMNB, LRWD, LRMAD 2L, GAP, MKI, NCAPG, NCAPH, NDC, NEIL, NUDT, NUF, SAP, PBK, MYT, PRK, RACK, RAC, GAP, TKI, NCAPG, NCAPH, NDC, NEIL, NUDT, SAP, PRK, PRB, PRC, PRK, PRC, PRBC.

3. The model of claim 1 for predicting cell proliferation activity using 87 genes as biomarkers, wherein the databases are: Tabula Muris database: https://tabula-muris.ds.czbiohub.org/; The Cancer Genome Atlas (TCGA) database: http://cancergenome.nih.gov/; GTEx database: https://www.gtexportal.org/; CCLE database: https://portals.broadinstitute.org/ccle

4. The model of claim 1 for predicting cell proliferation activity using 87 genes as biomarkers, wherein the partial cell proliferation activity data obtained from the CCLE database in step (2) is doubling time data, and the comparison of the cell proliferation gene set expression values of different cell type groups is performed using a T test.

Technical Field

The invention belongs to the fields of gene technology and biomedicine, and particularly relates to a method for predicting cell proliferation activity using 87 genes as biomarkers.

Background

Massive, uncontrolled proliferation of cancer cells is a key mechanism of tumorigenesis. Therapies such as chemotherapy have been developed to target this mechanism. In addition, a number of cell proliferation gene markers, such as MKI67, MCM2 and PCNA, have been developed; the mRNA or protein expression level of these markers is used to indicate the proliferation activity of cancer cells and thereby to assist in evaluating the prognosis of patients after surgery. In particular, the Ki-67 index, based on the protein expression of MKI67, records the proportion of Ki-67-positive cells in a pathological sample and is used to evaluate the prognosis of patients with cancers such as lung cancer, breast cancer, prostate cancer, cervical cancer, colorectal cancer, bladder cancer and lymphoma.

Proliferation is not unique to cancer cells. Large numbers of proliferating cells have been shown to be present in human skin, bone marrow and gastrointestinal tissue. When cancer arises in these tissues, the expression of proliferation markers such as MKI67 measured in a patient's postoperative cancer tissue sample derives partly from cancer cells and partly from normal proliferating cells, and therefore does not accurately reflect the proliferation activity of the cancer cells. Citing insufficient supporting data, the tumor marker guideline committee of the American Society of Clinical Oncology (ASCO) does not recommend the Ki-67 index as a routine prognostic marker for patients newly diagnosed with breast cancer. This is partly because normal immune organs such as bone marrow and lymph nodes contain large numbers of proliferating cells, and the Ki-67 index cannot reliably distinguish normal proliferating cells from tumor cells in a patient's pathological sample. The estimate of cancer cell proliferation activity becomes less accurate as a result, and so does the ability to predict patient prognosis.

In vitro culture can help identify the proliferative capacity of normal cells. However, this approach faces two major difficulties: 1. some cells cannot be cultured in vitro; 2. because the in vitro environment differs greatly from the in vivo environment for some cells, their proliferative capacity under culture conditions does not reflect their true proliferative capacity in vivo.

Disclosure of Invention

In view of the differences in proliferation activity among cell types in the human body and the difficulty of evaluating proliferation activity with current culture-based methods, the invention provides a method for evaluating cell proliferation activity using a cell proliferation gene set of 87 genes as markers. To achieve this, the invention adopts the following technical scheme.

1. Establishing a cell proliferation gene set, wherein the cell proliferation gene set consists of 87 genes, and the specific implementation steps are as follows:

(1) data acquisition

Single-cell RNA-Seq data for different types of normal cells were obtained from the Tabula Muris database (https://tabula-muris.ds.czbiohub.org/), cancer and paracancerous tissue RNA-Seq data were obtained from The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/), tissue RNA-Seq data were obtained from the GTEx (Genotype-Tissue Expression Project) database (https://www.gtexportal.org/), and cell line RNA-Seq data and cell proliferation activity data were obtained from the CCLE (Cancer Cell Line Encyclopedia) database (https://portals.broadinstitute.org/ccle).

(2) Stem/progenitor cell specific expression gene set mining

a) The in vivo normal single cells in the Tabula Muris database are classified into 81 types according to cell type, and the gene expression values of each cell type are calculated. The expression value $X_{ji}$ of a gene j in a cell type i is calculated as follows:

Figure BDA0002543800860000021

where m is the total number of cells belonging to cell type i and n is the number of cells in cell type i whose read count for gene j is greater than 0. The expression values of all genes in cell type i are calculated in this way, and the expression values of all genes in the 81 cell types are calculated in turn.

b) The 81 cell types are divided into two groups: a stem/progenitor cell group and an other-cell group.

c) Hierarchical clustering analysis is used to mine genes that are highly expressed in the stem/progenitor cell group and lowly expressed in the other cell group, as a stem/progenitor cell specific gene set.
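As a rough illustration of the selection goal in step c), the sketch below uses a simple threshold filter as a stand-in for the hierarchical clustering named in the text; it is not the clustering procedure itself. The 0.5 cutoff follows the description of FIG. 1, while the `low` cutoff, the function name `candidate_specific_genes`, and all gene names, cell type names and values are hypothetical.

```python
def candidate_specific_genes(expr, stem_types, high=0.5, low=0.1):
    """Return genes above `high` in at least one stem/progenitor cell type
    and at most `low` in every other cell type (threshold stand-in only)."""
    selected = []
    for gene, by_type in expr.items():
        in_stem = [v for t, v in by_type.items() if t in stem_types]
        in_other = [v for t, v in by_type.items() if t not in stem_types]
        if max(in_stem) > high and max(in_other) <= low:
            selected.append(gene)
    return selected

# Hypothetical per-cell-type expression values (gene -> cell type -> value).
expr = {
    "GENE_A": {"progenitor": 1.2, "neuron": 0.0, "hepatocyte": 0.05},
    "GENE_B": {"progenitor": 0.9, "neuron": 0.8, "hepatocyte": 0.7},
    "GENE_C": {"progenitor": 0.2, "neuron": 0.0, "hepatocyte": 0.0},
}
selected = candidate_specific_genes(expr, stem_types={"progenitor"})
```

Here only GENE_A passes: GENE_B is expressed broadly, and GENE_C never exceeds the high cutoff in the stem/progenitor group.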

(3) Cell proliferation gene set mining

a) The expression values of the genes in the stem/progenitor cell specific gene set are obtained for each normal tissue sample in the GTEx database. Because terminally differentiated cells without proliferative activity make up the bulk of most normal tissues, hierarchical clustering analysis is performed on these genes to obtain a gene group of 87 genes with low expression in normal tissues (87 genes including ANLN, ARHGAP11, ASF1, ATAD, AURKA, AURKB, BIRC, BRCA, BUB1, CCNA, CCNB, CDC, CDCA, CDK, CDT, CENPA, CENPE, CENPF, CENPH, CENPK, CENPM, CENPW, CEP, CKAP2, CLSPN, DBF, DLGAP, ECT, ESCO, FEN, FOXM, HIRIP, HIST1H2, HMMR, KIF20, KIF, KIFC, LMNB, RRM, MAD2L, RRM, TARACK, TAMPK, PRMCK, PR.

b) The expression values of the 87 genes in the cancer and paracancerous tissue samples of the TCGA database are obtained. For each gene j (1 ≤ j ≤ 87), its Z-score standardized expression value $Y_j$ across all cancer and paracancerous tissue samples is calculated. For a sample k, the expression vector of the 87 genes is $\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$, and the gene set expression value is calculated as the median of this vector ($\mathrm{median}\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$). A T test is then used to compare the gene set expression values of the samples of each cancer with those of all paracancerous samples. Because most cancer tissue consists of highly proliferative cancer cells, this confirms that the gene set is highly expressed in cancer tissue and lowly expressed in paracancerous tissue, and thus that the gene group of 87 genes is a cell proliferation gene set.
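The scoring in step b) can be sketched as follows, using only the Python standard library. This is a minimal sketch under stated assumptions: the Z-score uses the population standard deviation, the group comparison uses Welch's form of the t statistic (the text says only "T test"), and the three gene names and all values are hypothetical toy data rather than TCGA data. Obtaining a P-value would additionally need a t-distribution, for example via `scipy.stats.ttest_ind`.

```python
from math import sqrt
from statistics import mean, median, pstdev, variance

def zscore_rows(expr):
    """Z-score each gene (dict key) across all samples (list positions)."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)  # assumption: population SD
        # A gene with zero variance carries no signal; map it to 0.
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def gene_set_scores(expr):
    """Per-sample gene set value: median of the Z-scored genes in the set."""
    z = zscore_rows(expr)
    n_samples = len(next(iter(z.values())))
    return [median(z[g][k] for g in z) for k in range(n_samples)]

def welch_t(a, b):
    """Welch's t statistic comparing two groups of gene set values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical toy data: 3 genes x 6 samples (3 cancer, then 3 paracancerous).
expr = {
    "GENE_A": [8.0, 9.0, 10.0, 2.0, 3.0, 4.0],
    "GENE_B": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "GENE_C": [7.0, 7.5, 8.0, 3.0, 3.5, 4.0],
}
labels = ["cancer"] * 3 + ["adjacent"] * 3

scores = gene_set_scores(expr)
cancer = [s for s, lab in zip(scores, labels) if lab == "cancer"]
adjacent = [s for s, lab in zip(scores, labels) if lab == "adjacent"]
t = welch_t(cancer, adjacent)
```

Taking the median rather than the mean keeps a single outlying gene from dominating the gene set value for a sample.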

2. Using the above cell proliferation gene set to establish a model for predicting cell proliferation activity, with the following specific implementation steps:

(1) prediction of in vitro cultured cancer cell line proliferation activity by cell proliferation gene set

a) The expression values of the genes in the cell proliferation gene set are obtained for each cancer cell line in the CCLE database. Similarly, for each gene j (1 ≤ j ≤ 87), its Z-score standardized expression value $Z_j$ across all cell line samples is calculated. For a cell line sample k, the expression vector of the 87 genes is $\{Z_{1k}, Z_{2k}, \ldots, Z_{87k}\}$, and the gene set expression value is calculated as the median of these 87 values ($\mathrm{median}\{Z_{1k}, Z_{2k}, \ldots, Z_{87k}\}$). In this way, the cell proliferation gene set expression value of each cell line sample is calculated.

b) Partial cell proliferation activity data (doubling time) were obtained in the CCLE database.

c) Pearson correlation analysis was performed on the cell proliferation gene set expression values of the cell line samples and the doubling time data of the corresponding cell lines. It was confirmed that, in cancer cell lines derived from solid tumors, cell proliferation activity is significantly positively correlated with the expression value of the cell proliferation gene set consisting of the 87 genes; that is, the expression level of the cell proliferation gene set is predictive of the proliferation activity of cancer cell lines derived from solid tumors.
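The correlation in step c) can be illustrated with a small standard-library sketch. The scores and doubling times below are hypothetical values chosen only to show the computation, and `pearson_r` is an illustrative helper name. Because a shorter doubling time means faster proliferation, a negative coefficient against doubling time corresponds to a positive relationship between gene set expression and proliferation activity.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical gene set expression values and doubling times (hours)
# for five cell lines; not CCLE data.
scores = [-0.8, -0.3, 0.1, 0.6, 1.2]
doubling_time_h = [70.0, 55.0, 48.0, 35.0, 24.0]

r = pearson_r(scores, doubling_time_h)
```

With these toy values `r` is strongly negative, which is the direction expected if higher gene set expression goes with faster division.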

(2) Establishing a cell proliferation activity prediction model

a) The single cells in the Tabula Muris database are classified into 81 types according to the cell types, and the gene expression values of various cells are obtained as above.

b) Hierarchical clustering analysis was performed on 81 cell types using the above expression values of 87 genes. Cell types were clustered into 2-3 classes by cluster analysis.

c) The cell proliferation gene set expression value of each of the 81 cell types is calculated. The expression values of the 87 genes of the cell proliferation gene set are obtained for each cell type. For a cell type i and a gene j (1 ≤ j ≤ 87) with expression value $X_{ji}$, the expression vector of the 87 genes is $\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$, and the cell proliferation gene set expression value is calculated as the median of these 87 values ($\mathrm{median}\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$).

d) According to the results of the cluster analysis, the 81 cell types are clustered into 2-3 cell type groups. For each cell type group, the vector of cell proliferation gene set expression values is obtained, and the values of the different cell type groups are compared (T test, two-tailed). Using P < 0.05 as the threshold, it is determined whether the cell proliferation gene set expression value of one cell type group is significantly higher than that of the other groups. This identifies the cell types with high and low expression of the cell proliferation genes and thereby evaluates the proliferation activity of the 81 cell types.

Thus, the proliferation activity of 81 normal cell types in vivo is evaluated using the expression level of the cell proliferation gene set.
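Steps b) to d) can be sketched with a small pure-Python agglomerative clustering followed by a per-group summary of the gene set values. The text does not state the distance metric or linkage, so Euclidean distance and average linkage are assumptions here; the cell type names, the three-gene profiles and the helper name `agglomerative` are hypothetical. The significance test between groups is left out of this sketch.

```python
from math import dist
from statistics import mean, median

def agglomerative(points, k):
    """Average-linkage agglomerative clustering on Euclidean distance.
    `points` maps a name to a coordinate tuple; returns k lists of names."""
    clusters = [[name] for name in points]

    def linkage(a, b):
        return mean(dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical expression of 3 proliferation genes in 5 cell types.
profiles = {
    "hypothetical_progenitor_1": (2.1, 1.8, 2.4),
    "hypothetical_progenitor_2": (1.9, 2.2, 2.0),
    "hypothetical_terminal_1": (0.1, 0.0, 0.2),
    "hypothetical_terminal_2": (0.0, 0.1, 0.1),
    "hypothetical_terminal_3": (0.2, 0.1, 0.0),
}

groups = agglomerative(profiles, k=2)
# Gene set value per cell type: median over the genes in the set.
set_score = {name: median(vec) for name, vec in profiles.items()}
# Median gene set value of each cell type group, sorted low to high.
group_medians = sorted(median(set_score[n] for n in g) for g in groups)
```

In this toy example the two progenitor-like types form one group with a clearly higher gene set value than the group of terminal-like types.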

The invention identifies, from single-cell RNA-Seq data, a set of cell proliferation related gene markers consisting of 87 genes. Using this set, the proliferation activity of different normal cell types in vivo is evaluated, comprehensively identifying the normal cell types that proliferate rapidly in vivo. This helps determine whether normal cells with proliferative activity are present in cancer tissue. When a large number of such cells are present in cancer tissue, treatment and evaluation approaches that target cell proliferation markers will be confounded and may fail.

The method has the following advantages. (1) Culture-based methods for determining cell proliferation activity require normal tissue cells to be cultured in vitro; at present, some tissue cells cannot be cultured in vitro, and for others the proliferation activity in vitro differs greatly from that in vivo because of the culture conditions. The present method requires no in vitro culture. (2) The normal cell proliferation activity results obtained by the method help determine whether a large number of normal cells with proliferative capacity are present in cancer tissue, thereby guiding cancer treatment and evaluation approaches that target the cell proliferation mechanism.

Drawings

FIG. 1: Cluster analysis heatmap of genes highly expressed in the stem/progenitor cell group. In the figure, one column represents one cell type and one row represents one gene. Genes with an expression level > 0.5 in any cell type of the stem/progenitor cell group were clustered into 15 gene groups, and a gene group of 162 genes was found that is markedly expressed in the stem/progenitor cell group and expressed at very low levels in other cell types. In the figure, Epi-SC indicates epidermal stem cells, and the numbers 1-7 indicate Slamf1-positive multipotent progenitor cells (1), megakaryocyte-erythroid progenitor cells (2), late pro-B cells (3), granulocyte-monocyte progenitor cells (4), granulocytopoietic cells (5), lymphoid progenitor cells (6), and natural killer precursor cells (7); these 8 cell types constitute the stem/progenitor cell group.

FIG. 2: Cluster analysis heatmap of the stem/progenitor cell specific expression gene set in samples of 54 human normal tissues. In the figure, one column represents one sample, one row represents one gene, and samples of the same color belong to the same tissue type. The genes of the stem/progenitor cell specific expression gene set were clustered into 2 gene groups across 17382 samples of 54 human normal tissues. The gene group of 87 genes was found to be highly expressed only in (1) cultured skin fibroblasts, (2) EBV-transformed lymphocytes and (3) testis tissue, and lowly expressed in all other tissues.

FIG. 3: Box plot of the expression level of the cell proliferation gene set in different cancers. Expression level values of the cell proliferation gene set were obtained from 9630 samples of 32 kinds of cancer and paracancerous tissue, and all paracancerous samples were pooled (Control). The expression level of the cell proliferation gene set in each cancer was compared with Control using the t-test. Cancer types highlighted in red have a cell proliferation gene set expression level significantly higher than that of the paracancerous group (two-tailed P-value < 0.05).

FIG. 4: Correlation analysis of the expression level of the cell proliferation gene set and the optimal doubling time of the cells. Each point in the figure represents a cell line; the abscissa is the expression level of the cell proliferation gene set of the cell line and the ordinate is the doubling time of the cell line (as provided by the supplier). The Pearson correlation coefficient and P-value between the expression level of the cell proliferation gene set and the optimal doubling time of the cells are calculated.

FIG. 5: Cluster analysis heatmap of 81 different normal cell types in Tabula Muris. In the figure, one column represents one cell type and one row represents one gene of the cell proliferation gene set. The 81 normal cell types were grouped into three classes according to the expression levels of the genes in the cell proliferation gene set.

Detailed Description

The invention is described in detail below with reference to the drawings and examples, which are only preferred embodiments of the invention, and it should be noted that a person skilled in the art may make several modifications and additions without departing from the method of the invention, and these modifications and additions should also be regarded as the scope of protection of the invention.
