Drug relocation method based on differential expression data

文档序号:1075096 发布日期:2020-10-16 浏览:11次 中文

阅读说明:本技术 基于差异表达数据的药物重定位方法 (Drug relocation method based on differential expression data ) 是由 鱼亮 何丹 于 2020-06-28 设计创作,主要内容包括:本发明提出一种基于差异表达数据的药物重定位方法,主要解决现有技术不筛选差异基因且预测药物准确度低的问题,其方案为:获取基因数据,计算基因表达变化值及其显著值,筛选出差异表达显著的基因集合;利用该基因集合和致病基因构建疾病的两种查询基因集合;分别获取这两种基因集合与药物的相关性值及其对应的药物预测准确度;对比这两种基因集合预测的药物准确度;计算筛选后的基因集合与标准正态分布的差异值,根据该差异值和计算的表达阈值构建疾病最优查询基因集合的预测模型;使用该模型预测其他疾病的基因集合并计算预测药物的准确度,筛选出具有潜在治疗效果的候选药物。本发预测准确度高,可用于预测对疾病具有潜在治疗效果的候选药物。(The invention provides a drug relocation method based on differential expression data, which mainly solves the problems that differential genes are not screened and the accuracy of drug prediction is low in the prior art, and the scheme is as follows: acquiring gene data, calculating a gene expression change value and a significant value thereof, and screening out a gene set with significant differential expression; constructing two inquiry gene sets of diseases by using the gene set and the pathogenic genes; respectively acquiring correlation values of the two gene sets and the medicines and corresponding medicine prediction accuracy; comparing the accuracy of the predicted drugs of the two gene sets; calculating a difference value between the screened gene set and the standard normal distribution, and constructing a prediction model of the optimal disease query gene set according to the difference value and the calculated expression threshold value; the model is used for predicting gene sets of other diseases, calculating the accuracy of the predicted drugs and screening out candidate drugs with potential treatment effects. The prediction accuracy is high, and the method can be used for predicting candidate drugs with potential treatment effects on diseases.)

1. A method of drug relocation based on differential expression data comprising the steps of:

(1) downloading pathogenic gene data by using an OMIM database, and searching a pathogenic gene set L of the disease;

(2) downloading gene expression data, calculating each gene expression change value logFC and corresponding false positive FDR, and screening genes;

(3) two query gene sets were constructed:

(3a) deleting the top m genes and the bottom k genes in the screened gene set with obvious differential expression to obtain a disease de-noising differential gene set A, wherein k is more than or equal to 0, and m is more than or equal to 0;

(3b) sorting the pathogenic gene set L from small to large according to the logFC sequence of each gene, adding the gene with the logFC <0 to the top of the disease de-noising difference gene set A, and adding the gene with the logFC >0 to the bottom of the disease de-noising difference gene set A to obtain a disease gene set B of the difference and pathogenic genes;

(4) respectively calculating correlation value vectors AS of the difference gene set A and each medicine for denoising diseasesAAnd correlation value vector AS of disease gene set B of difference and pathogenic gene and each medicineB

(5) Calculating the accuracy P of the predicted drug of the denoised differential gene set AAAnd calculating the difference and the accuracy P of the disease gene set B for predicting the medicineB

(6) Comparing the accuracy P obtained in the step (5)AAnd PB

If PBHigher than PAIncreasing the values of the parameters k and m, deleting more noises, and repeating the steps (3) - (5) until PB≤PA(ii) a Otherwise, executing the step (7);

(7) recording the current difference gene set A of denoising the disease as an optimal inquiry gene set A 'of the disease, and determining a threshold value T at two ends of the optimal inquiry gene set A' of the diseaseupAnd Tdown

(8) Setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number as 20, and judging whether the sum of the current top threshold number and the current bottom threshold number is lower than the minimum value of 20: if yes, continuing to download other disease data from the TCGA database, and obtaining two end threshold values of newly downloaded diseases by adopting the steps (3) - (7)Andexecuting (9) until the total threshold number sum is not less than 20; otherwise, directly executing (9);

(9) constructing a prediction model of the optimal difference gene set threshold of the disease:

(9a) randomly taking 100000 numerical points from standard normal distribution as background values, recording the numerical points as vectors E, using Kolmogorov-Smirnov test statistic J to represent distribution difference between differential expression value of disease genes and 100000 background points, using the distribution difference as independent variable X of model, and using threshold values T at two ends of optimal query gene set of all diseasesupAnd TdownA dependent variable Y as a model;

(9b) according to the parameters obtained in the step (9a), constructing a prediction model as a linear model of | Y | ═ aX + b, wherein a is the slope of the model, and b is the intercept of the model;

(10) predicting the optimal query gene set of other diseases by using a prediction model, and calculating the accuracy P of M drugs before drug ranking through (4) - (5)MAnd then selecting the M medicines to obtain all candidate medicines with potential treatment effects on diseases, wherein M represents the number of medicines and is more than or equal to 10.

2. The method according to claim 1, wherein the expression change value logFC of each gene is calculated in (2) by the following formula:

Figure FDA0002556935160000021

wherein e islThe expression value of the gene in the sample l is shown, l is the sample number in the gene expression data, when S +1 is not less than l and not more than S + W, the sample l is the cancer patient sample, when S +1 is not less than l and not more than S + W, the sample l is the normal human sample, S is the number of the cancer patient samples, and W is the number of the normal human samples.

3. The method of claim 1, wherein the false positive FDR for gene correspondence is calculated in (2) as follows:

(2a) constructing an expression value vector θ of the gene from the gene expression data:

θ=(e1,e2,…,el,…,eS,eS+1,eS+2,…,eS+f,…,eS+W),

wherein l represents the sample number of the cancer patient in the gene expression data, elExpressing the expression value of the gene in the cancer sample l, and 1. ltoreq. l.ltoreq.S, S representing the number of cancer patient samples, es+fExpressing the expression value of the gene in a normal human sample S + f, wherein S + f is more than or equal to 1 and less than or equal to W, f represents the sample number of the normal human in the gene expression data, W represents the number of normal human samples, and S + W represents a baseTotal number of samples due to expression data;

(2b) randomly scrambling the serial numbers of the samples in the expression value vector theta of the gene to obtain the expression value vector labeled by the scrambled samplesFor vector theta*Performing random test, and calculating theta in the v random test*Is calculated as a statistical result value Hv

Figure FDA0002556935160000032

(2c) Repeating the step (2b) for 1000 times, and calculating the statistical significance value F of each gene expression change value:

Figure FDA0002556935160000033

wherein C represents the checking statistic value, and P represents theta in the random test*Is higher than the set of C, | P { H |V|HVV is more than C and more than or equal to 1 and less than or equal to 1000, and represents the number of elements in the set P;

(2e) sorting the F from small to large, calculating the false positive FDR of the gene:

where N represents the total number of genes and i represents the ordering of the genes.

4. The method according to claim 1, wherein the screening of the gene from the gene expression change value logFC and the false positive FDR corresponding to the gene in (2) comprises setting a threshold for differential change of the gene expression change value logFC and a threshold for false positive FDR as α, respectively, and comparing the gene expression change value logFC and its false positive FDR with their respective thresholds to screen the gene with | logFC | ≧ and FDR ≦ α as the gene whose differential expression is significant.

5. The method according to claim 1, wherein (4) the correlation value vector AS of the denoised differential gene set A and each drug is calculated separatelyAAnd correlation value vector AS of disease gene set B of difference and pathogenic gene and each medicineBThe implementation is as follows:

(4a) taking a difference gene set A subjected to disease denoising AS a query list of a Kolmogorov-Smirnov method, taking an ordered list of genes under the action of each drug AS a reference list set in the Kolmogorov-Smirnov method, and calculating a correlation value vector AS of the disease query gene set A and all drugsA

(4a1) Constructing a position vector V (1.. n) of each gene in the reference list set, wherein n represents the number of genes in the reference list;

(4a2) query logFC in list>0 as a query list Q of Kolmogorov-SmirnovupWill query logFC in the list<0 as a Down-Regulation query List Q of Kolmogorov-SmirnovdownSeparately, for each drug, a set of query genes Q is computationally investigatedupAnd downregulation of the query Gene set QdownEnrichment score of ESupAnd ESdown

Figure FDA0002556935160000042

Wherein the content of the first and second substances,a score representing the gene ranked top in both the query list and the reference list,

denotes the score of a gene before the query list and after the reference list, p denotes the gene rank, Vup(p) represents the position of the genes ranked as p in the upper query list in the reference list, s1 represents the number of genes in the upper query list,

Figure FDA0002556935160000045

Figure FDA0002556935160000046

(4a3) calculating the correlation value as of the difference gene set A and each medicine for denoising the diseaseAIf ES in (4c)upAnd ESdownThe same sign, then asA0, otherwise, asA=ESup-ESdownAs for all drugs by the following formulaAAnd (3) score normalization:

Figure FDA0002556935160000047

wherein the content of the first and second substances,represents the correlation value of the jth drug and the disease, j is more than or equal to 1 and less than or equal to q, q represents the total number of the drugs,

Figure FDA0002556935160000049

(4a4) combining the normalized values of the drugs and the diseases to obtain a correlation value vector:

Figure FDA0002556935160000051

(4b) calculating correlation value vector AS of disease gene set B of difference and pathogenic genes and all medicinesB

(4b1) The disease gene set B is used as a query list of a Kolmogorov-Smirnov method, an ordered list of genes under the action of each medicament is used as a reference list set in the Kolmogorov-Smirnov method, and the normalized correlation values of the disease gene set B of the difference and pathogenic genes and the medicament j are calculated by adopting the same steps as (4a1) - (4a3)

(4b2) Combining the normalized values of the drugs and the diseases to obtain the correlation value vectors of the disease gene set B of the difference and the pathogenic genes and all the drugs:

6. the method of claim 1, wherein the accuracy P of the disease de-noised differential gene set A predictive drug is calculated in (5)AAnd calculating the difference and the accuracy P of the disease gene set B for predicting the medicineBThe implementation is as follows;

(5a) using the correlation value vector AS in (4)ACalculating the accuracy P of the predicted drug of the denoised differential gene set AA

(5a1) The medicines are sequenced according to the sequence from small to large to obtain a first ordered medicine sequence R of the differential gene set A for denoising diseasesACalculating a correlation value vector ASAAbsolute value of (a):

Figure FDA0002556935160000054

wherein the content of the first and second substances,j is more than or equal to 1 and less than or equal to q, and q represents the total number of the medicaments;

(5a2) sequencing all the medicaments according to the rule that the absolute value is from large to small to obtain a second ordered medicament sequence AR of the differential gene set A for denoising the diseaseA

(5a3) Downloading a medicine set omega related to the disease from a related database, and respectively sorting R two medicines of a difference gene set A for denoising the diseaseAAnd ARACalculation accuracy PA1And PA2

Figure FDA0002556935160000061

Wherein M isA1Is shown in the drug order RANumber of top ranked drugs, UA1Is shown in the drug order RAMiddle front MA1The number of known drugs in the drug set omega, MA2Expressed in the drug order ARANumber of top ranked drugs, UA2Expressed in the drug order ARAMiddle front MA2The number of drugs known in the drug set Ω;

(5a4) comparison PA1And PA2If P is the size ofA1≥PA2The accuracy P of the drug prediction of the disease de-noised difference gene set AA=PA1Otherwise, PA=PA2

(5b) Using the correlation value vector AS in (4)BCalculating the accuracy P of the drug predicted by the disease gene set B of differences and causative genesB

(5b1) Correlation value vector AS of disease gene set B and all drugs according to difference and pathogenic genesBThe accuracies P of the disease gene sets B for which the differences and causative genes were obtained in the two ranks were calculated in the same steps as (5a1) - (5a3)B1And PB2

(5b2) Comparison PB1And PB2If P is the size ofB1≥PB2Accuracy P of drug prediction of disease Gene set B of differences and disease genesB=PB1Otherwise, PB=PB2

7. The method of claim 1, wherein (7) a threshold T is determined across the set A' of disease-optimal query genesupAnd TdownThe implementation is as follows:

(7a) genes of logFC >0 and logFC <0 in the gene set whose differential expression was significant, obtained for (2), respectively, constructed normal distributions symmetrical with respect to logFC and symmetrical with respect to logFC:

(logFCg)up=-logFCg-2*,logFCg<0,

(logFCg)down=-logFCg+2*,logFCg>0,

therein, logFCgExpression Change value of Gene g, (logFC)g)upRepresents a value that is symmetric about logFC (logFC) constructed by an existing gene gg)downA threshold value indicating a difference change in gene expression change value logFC, which is a value symmetrical about logFC constructed by an existing gene g;

(7b) calculating the mean value μ of the normal distribution in (7a) symmetric about logFCdownAnd standard deviation σdown:

Figure FDA0002556935160000071

Wherein G represents the number of genes in a normal distribution;

(7c) calculating the mean value μ of the normal distribution in (7a) about logFC-symmetryupAnd standard deviation σup

Figure FDA0002556935160000072

(7d) The maximum value and the minimum value of logFC in the disease optimal query gene set A' are respectively recorded as (logFC)maxAnd (logFC)minCalculating corresponding threshold value T according to the results of (7b) - (7c)upAnd Tdown

Figure FDA0002556935160000074

8. The method of claim 1, wherein the values of slope a and intercept b of the prediction model | Y | ═ aX + b are calculated in (9b) by the following formula:

Figure FDA0002556935160000076

Figure FDA0002556935160000077

wherein the content of the first and second substances,

Figure FDA0002556935160000078

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a drug relocation method which can be used for predicting candidate drugs with good treatment effects.

Background

It is well known in the art that drug development takes a long time, and it usually takes 11.4-13.5 years to develop new drugs from beginning to end, with costs of each drug being $ 1800 and 1800 billion. Despite the significant time and money invested in drug development, over 90% of drugs end up failing. For example, from 1949 to 2014 the U.S. food and drug administration FDA co-approved 150 drugs for cancer treatment, making the available drugs insufficient due to the existence of cancer subtypes and drug resistance, and it is therefore of great importance to explore new drug development strategies. Drug relocation is a new strategy for old drugs, saving a lot of time and cost required for drug development. Traditional drug development typically involves five stages: drug discovery and preclinical testing, drug safety review, clinical study of the drug, FDA review approval, and post-marketing FDA safety monitoring. While drug relocation has only four steps: compound identification, compound acquisition, drug development, and FDA after-market safety monitoring. The existing drug is used as a potential candidate drug to evaluate the treatment effect on the disease. This has the advantage of making the process of presuming a drug faster and less costly, taking advantage of several important drug properties that have been established including efficacy, pharmacokinetics, pharmacodynamics, and toxicity.

The current methods for drug relocation main stream can be divided into: methods based on drug information, genomics and transcriptomics data.

A method for drug relocation based on drug information.

Drug information-based approaches are primarily directed to predicting new indications for drugs by using four classes of information for drugs to explore common features among compounds. The first category is the use of chemical structure information of drugs to relocate drugs, and this category of methods uses the structural and chemical properties of drug molecules to calculate similarities between drugs based on known quantitative relationships between chemical structures and biochemical activities. Its activity is significantly increased or lost due to a slight change in the structure of the molecule. Therefore, such methods have a high false positive in application. The second method is to use the spatial structure information of the drug to reposition the drug, and this method mainly utilizes the spatial structure modeling of the drug molecule and the protein to simulate their direct physical interaction, which depends on the analysis of the drug molecule and the protein structure, however, some of the protein structures analyzed at present have errors and many important protein structures are not completely decomposed, which makes the modeling of the interaction between the compound and the protein incomplete, and therefore, this method also has high false positive. The third category is to use the side effect information of the drug to relocate the drug, the side effect information of the drug reflects the physiological consequences and the phenotypic expression of the drug, most of the methods are performed on the assumption that the side effect is associated with the disease when the drugs related to the disease share a plurality of side effects, and the accuracy of the prediction of the drug depends on whether the characteristics of the side effect of the drug have definite definitions. It is well known in the art that although strict preclinical assessments of side effects of drugs are performed, years of clinical use and post-market monitoring are required before complete identification of the side effects of a newly approved drug is possible. In addition, data redundancy is also a concern in the side effect field. The fourth category is the use of drug target information to relocate drugs, and such methods generally use drug targets and disease-related genes as drug and disease mediators to predict new drug-disease relationships. However, due to the small number of targets for current drugs, such methods often require the addition of additional information, introducing new noise. In summary, many drug relocation methods based on drug information face the problem of high false positives due to insufficient information.

Secondly, a drug relocation method based on genomics.

Genomics is the study of the structure, function and inheritance of a genome, including the interactions between genes and the human environment. Its main task is to determine the molecular sequences that make up the deoxyribonucleotide DNAs of an organism's genome. Current methods of drug relocation based on genomics can be divided into mathematical statistics and network methods. The mathematical statistics method is mainly used for judging the relationship between the drugs and the diseases by performing enrichment analysis on the gene sets corresponding to the drugs and the diseases. This approach is simple, easy to understand and implement, but depending on the disease or drug related gene set, the accuracy of the gene set may affect the accuracy of the results. The method using the network mainly constructs a genetic-related network or a disease network by using various data such as whole genome association study data, gene interaction data, metabolic data, and electronic medical records. The dependence of the method on the original data is large, the reliability of the result can be directly influenced by the incompleteness of the data set, for example, the existing protein network is incomplete, about 88 percent of the single nucleotide mutations related to diseases are positioned in intergenic and intronic regions, and the genetic effect of the parts can be hardly explained by the existing gene interaction data.

And thirdly, a drug relocation method based on transcriptomics.

Transcriptomes are the necessary links of proteomes that link genomic genetic information with biological functions, and are the main means for studying gene expression. The transcriptomics-based drug relocation method mainly aims at establishing a connection relation between differentially expressed genes or non-coding RNA in diseases and differentially expressed genes under drug perturbation to predict the influence of drugs on the diseases. For example, Chen et al, in the context of up-and down-regulation of genes, relocate the drug by dividing the genes into two parts, up-and down-regulating UCDB in disease and DCUB in disease but up-and down-regulating DCUB in drug. The method comprises the steps of using a human function connection network FLN as a background network, respectively mapping k genes with highest expression in diseases and m genes with lowest expression under the action of drugs into the network, using the k genes as seed genes, calculating prediction scores of the k genes and neighbor nodes, arranging the genes in a descending order according to the prediction scores, representing the condition that the m down-regulated genes are ranked in front through an area AUC under a characteristic curve of a subject, and using the AUC value as the score of the disease prediction drugs. Similarly, the m down-regulated genes are used as seed genes to calculate the fraction of the drug predicted disease. And synthesizing the two prediction scores to obtain the mutual prediction scores of the two groups of genes under the parameters k and m, sequencing the medicines in a descending order through the scores, calculating AUC by taking the medicines approved by FDA and the medicines used clinically as standards, obtaining the prediction result of UCDB, updating the parameters, repeating the experiment, and taking the result with the highest finally calculated AUC value as the final result of UCDB. Similarly, the prediction result of DCUB can be calculated. The accuracy of this method depends largely on the accuracy of the background network. In addition, the parameters of the disease are different in both modes, so a large number of experiments are required, which undoubtedly increases the difficulty of using the algorithm.

In addition, the join graph approach was first proposed by the join graph CMap database. The method utilizes transcriptome data to carry out reverse matching on the expression patterns of the drugs and the diseases, namely the expression patterns of genes in the diseases and the drugs are opposite, and the high expression in the diseases needs to be under the action of the drugs and vice versa. Whether the effect of the candidate drug on the disease is treatment or aggravation can be judged through the positive and negative matching scores so as to achieve the aim of screening the drug. The framework of the connection diagram method is mainly divided into three parts: gene expression profiles, disease query gene sets, and matching algorithms. At present, most of researches based on a connection diagram method framework are focused on gene expression profiles and matching algorithms, and the selection of a disease query gene set has a crucial influence on the accuracy of a prediction result. In addition, the methods select genes with remarkable differential expression, particularly those with high differential expression, as a query gene set of diseases, and completely do not consider pathogenic genes which are closely related to the diseases, so that the accuracy of the predicted result is low.

Disclosure of Invention

The invention aims to overcome the defects of the transcriptomics-based drug relocation method, and provides a drug relocation method based on differential expression data so as to provide a model for predicting an optimal disease query gene set and improve the accuracy of a prediction result.

The technical idea of the invention is as follows: finding an optimal query gene set for the disease through pathogenic genes and differential expression genes of the disease; constructing a linear model for predicting an optimal inquiry gene set of diseases according to experimental results of various diseases; and predicting the optimal query gene set of other diseases through the linear model to calculate the association score of the diseases and the medicines.

According to the technical idea, the implementation scheme of the invention comprises the following steps:

(1) downloading pathogenic gene data by using an OMIM database, and searching a pathogenic gene set L of the disease;

(2) downloading gene expression data, calculating each gene expression change value logFC and corresponding false positive FDR, and screening genes;

(3) two query gene sets were constructed:

(3a) deleting the top m genes and the bottom k genes in the screened gene set with obvious differential expression to obtain a disease de-noising differential gene set A, wherein k is more than or equal to 0, and m is more than or equal to 0;

(3b) sorting the pathogenic gene set L from small to large according to the logFC sequence of each gene, adding the gene with the logFC <0 to the top of the disease de-noising difference gene set A, and adding the gene with the logFC >0 to the bottom of the disease de-noising difference gene set A to obtain a disease gene set B of the difference and pathogenic genes;

(4) respectively calculating correlation value vectors AS of the difference gene set A and each medicine for denoising diseasesAAnd correlation value vector AS of disease gene set B of difference and pathogenic gene and each medicineB

(5) Calculating the accuracy P of the predicted drug of the denoised differential gene set AAAnd calculating the difference and the accuracy P of the disease gene set B for predicting the medicineB

(6) Comparing the accuracy P obtained in the step (5)AAnd PB

If PBHigher than PAIncreasing the values of the parameters k and m, deleting more noises, and repeating the steps (3) - (5) until PB≤PA(ii) a Otherwise, executing the step (7);

(7) recording the current difference gene set A of denoising the disease as an optimal inquiry gene set A 'of the disease, and determining a threshold value T at two ends of the optimal inquiry gene set A' of the diseaseupAnd Tdown

(8) Setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number as 20, and judging whenWhether the sum of the top threshold number and the bottom threshold number of the front is lower than a minimum value of 20: if yes, continuing to download other disease data from the TCGA database, and obtaining two end threshold values of the newly downloaded disease by adopting the steps (3) to (7)And

Figure BDA0002556935170000042

executing (9) until the total threshold number sum is not less than 20; otherwise, directly executing (9);

(9) constructing a prediction model of the optimal difference gene set threshold of the disease:

(9a) randomly taking 100000 numerical points from standard normal distribution as background values, recording the numerical points as vectors E, using Kolmogorov-Smirnov test statistic J to represent distribution difference between differential expression value of disease genes and 100000 background points, using the distribution difference as independent variable X of model, and using threshold values T at two ends of optimal query gene set of all diseasesupAnd TdownA dependent variable Y as a model;

(9b) according to the parameters obtained in the step (9a), constructing a prediction model as a linear model of | Y | ═ aX + b, wherein a is the slope of the model, and b is the intercept of the model;

(10) predicting the optimal query gene set of other diseases by using a prediction model, and calculating the accuracy P of M drugs before drug ranking through (4) - (5)MAnd then selecting the M medicines to obtain all candidate medicines with potential treatment effects on diseases, wherein M represents the number of medicines and is more than or equal to 10.

Compared with the prior art, the invention has the following advantages:

1. in the invention, in the process of acquiring the disease query gene set, pathogenic gene data are added, the query gene sets of two diseases are constructed by combining the differential genes and the pathogenic genes, the influence of the pathogenic genes on the diseases is fully considered, and the noise removal operation is carried out on the differential genes of the diseases by comparing the query gene sets of the two diseases.

2. The invention uses not less than 10 diseases to denoise the differential genes, uses the threshold value to express the gene set, uses the experimental data of various diseases and the threshold value thereof to construct the prediction model of the optimal differential gene set, uses the model to predict the optimal differential gene set for other diseases, and obtains more accurate medicine prediction accuracy compared with the prior art.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a graph showing the accuracy of the prediction of sarcoma drug relocation using the present invention.

Detailed Description

The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.

This example illustrates breast cancer in the TCGA database of cancer genomes, where all diseases in the TCGA database where the causative gene is present are used.

Referring to fig. 1, the drug relocation method based on differential expression data is implemented by the following steps:

step 1, downloading pathogenic gene data.

This example utilizes the OMIM database to search the set of causative genes for breast cancer, for a total of 22 genes, designated set L.

And 2, downloading gene expression data, calculating each gene expression change value logFC and corresponding false positive FDR, and screening genes.

2.1) downloading breast cancer data and calculating the gene expression change value logFC:

2.1.1) downloading gene expression data of the breast cancer data from a TCGA database, and identifying the sample types of the gene expression data in the TCGA according to a Barcode code provided by the TCGA to obtain a cancer patient sample set and a normal person sample set;

2.1.2) calculating the expression change value logFC of the genes in the breast cancer according to the downloaded breast cancer data:

Figure BDA0002556935170000051

wherein e islThe expression value of the gene in a sample l is represented, l represents the sample serial number in the gene expression data, when S +1 is not less than l and not more than S + W, the sample l is a cancer patient sample, when S +1 is not less than l and not more than S + W, the sample l is a normal person sample, S represents the number of the cancer patient samples, and W represents the number of the normal person samples;

2.2) calculating the false positive FDR corresponding to the gene in the breast cancer:

2.2.1) constructing an expression value vector θ for a gene from gene expression data in breast cancer:

θ=(e1,e2,…,el,…,es,es+1,es+2,…,es+f,…,eS+W),

wherein e islExpressing the expression value of the gene in the cancer sample l, and 1. ltoreq. l.ltoreq.S, eS+fExpressing the expression value of the gene in a normal human sample S + f, wherein S + f is more than or equal to 1 and less than or equal to W, f represents the sample serial number of the normal human in the gene expression data, and S + W represents the total sample number of the gene expression data;

2.2.2) randomly scrambling the serial numbers of the samples in the expression value vector theta of the genes in the breast cancer to obtain the expression value vector labeled by the scrambled samplesWherein the content of the first and second substances,

Figure BDA0002556935170000062

expressing the expression value of the gene in the cancer sample l after the label is disturbed,expressing the expression value of the gene in a normal human sample S + f after the label is disturbed;

2.2.3) vector θ*Performing random test, and calculating theta in the v random test*Is calculated as a statistical result value Hv

2.2.4) repeating the step (2.2.3) for 1000 times, and calculating the statistical significance value F of each gene expression change value:

wherein C represents the checking statistic value, and P represents theta in the random test*Is higher than the set of C, | P { H |V|HVV is more than C and more than or equal to 1 and less than or equal to 1000, and represents the number of elements in the set P;

2.2.5) rank the F from small to large, calculate the false positive FDR of the gene:

wherein N represents the total number of genes and i represents the ordering of the genes;

2.3) screening genes from the gene expression variance logFC and false positive FDR corresponding to the genes:

2.3.1) setting the threshold for differential changes in gene expression change value logFC to 1.5 and the threshold for false positive FDR to α to 0.05, respectively;

2.3.2) comparing the gene expression change value logFC and the false positive FDR with respective threshold values, and screening genes of which the logFC | is more than or equal to 1.5 and the FDR is less than or equal to 0.05 as the genes with obvious differential expression of the breast cancer, wherein 3166 genes with obvious differential expression of the breast cancer are selected in the example.

And 3, constructing two query gene sets of the breast cancer.

3.1) deleting the topmost m genes and the bottommost k genes in the screened gene set with obvious differential expression to obtain a de-noised differential gene set A, wherein k is more than or equal to 0, m is more than or equal to 0, and the parameter selected in the de-noised differential gene set A for the breast cancer in the example is k which is 0 and m which is 0;

3.2) sequencing a pathogenic gene set L of the breast cancer from small to large according to the logFC sequence of each gene, adding the gene with the logFC <0 to the top of a differential gene set A for denoising the breast cancer, and adding the gene with the logFC >0 to the bottom of the differential gene set A for denoising the breast cancer to obtain a disease gene set B of the difference of the breast cancer and the pathogenic gene;

step 4, respectively calculating correlation value vectors AS of the difference gene set A for breast cancer denoising and each medicineAAnd the difference of breast cancer and the correlation value vector AS of the disease gene set B of the pathogenic gene and each medicineB

4.1) calculating the correlation value vector AS of the difference gene set A and each medicine for denoising the breast cancerA

4.1.1) taking a difference gene set A for denoising the breast cancer as a query list of a Kolmogorov-Smirnov method, and taking an ordered list of genes under the action of each medicament as a reference list set in the Kolmogorov-Smirnov method;

4.1.2) constructing a position vector V (1.. n) of each gene in the reference list set, wherein n represents the number of genes in the reference list;

4.1.3) will query logFC in the list>0 as a query list Q of Kolmogorov-SmirnovupWill query logFC in the list<0 as a Down-Regulation query List Q of Kolmogorov-SmirnovdownSeparately, for each drug, a set of query genes Q is computationally investigatedupEnrichment score of ESupAnd downregulation of the query Gene set QdownEnrichment score of ESdown

Figure BDA0002556935170000071

Wherein the content of the first and second substances,score representing genes ranked top in the query list and reference list

Denotes the score of a gene before the query list and after the reference list, p denotes the gene rank, Vup(p) represents the position of the genes ranked as p in the upper query list in the reference list, and s1 represents the number of genes in the upper query list;

Figure BDA0002556935170000081

scores representing genes ranked later in the query list and the reference list;

score, V, representing genes ranked later in the query list and earlier in the reference listdown(p) represents the position of the genes ranked as p in the query list below in the reference list, and s2 represents the number of genes in the query list below;

4.1.4) enrichment score ES of the query Gene set according to (4.1.3)upAnd downregulating enrichment scores ES of the query gene setdownCalculating the correlation value as of the difference gene set A and each medicine for denoising the diseaseA

If ESupAnd ESdownThe same sign, then asA=0,

Otherwise, asA=ESup-ESdown

4.1.5) as for all drugs in the differential Gene set A byAThe scores are normalized:

wherein the content of the first and second substances,

Figure BDA0002556935170000084

the j is more than or equal to 1 and less than or equal to q, q represents the total number of the medicaments,

Figure BDA0002556935170000085

indicates all the drug scores as in the differential gene set AAThe maximum value of (a) is,

Figure BDA0002556935170000086

indicates all the drug scores as in the differential gene set AAThe minimum value of (a) is determined,representing the correlation value of the j < th > drug and the disease after normalization in the difference gene set A;

4.1.6) combining the medicines with the normalized values of the diseases thereof to obtain the correlation value vector of the difference gene set A for denoising the breast cancer and all medicines:

Figure BDA0002556935170000088

wherein the content of the first and second substances,

Figure BDA0002556935170000089

expressing the normalized correlation value of the jth drug and the disease of the difference gene set A for denoising the breast cancer, wherein j is more than or equal to 1 and less than or equal to q, and q expresses the total number of the drugs;

4.2) calculating correlation value vector AS of disease gene set B of breast cancer difference and pathogenic genes and each medicineB:

4.2.1) taking the disease gene set B as a query list of the Kolmogorov-Smirnov method, taking the ordered list of the genes under the action of each medicament as a reference list set in the Kolmogorov-Smirnov method, and adopting the same steps as (4.1.1) to (4.1.5) to calculate the difference of the breast cancer and the normalized correlation value of the disease gene set B of the pathogenic genes and the medicament j

4.2.2) combining the normalized values of the drugs and the breast cancer to obtain the difference of the breast cancer and the correlation value vector of the disease gene set B of the pathogenic genes and all the drugs:

wherein the content of the first and second substances,

Figure BDA0002556935170000093

the correlation value of the j drug of the disease gene set B representing the difference of the breast cancer and the pathogenic gene after being normalized with the disease.

Step 5, respectively calculating the predicted medicine accuracy P of two gene sets, namely a difference gene set A for denoising diseases and a disease gene set B for differentiating and causing diseasesAAnd PB

5.1) utilizing the correlation value vector AS of the difference gene set A and all medicines for denoising the breast cancer in the step 4ACalculating the predicted drug accuracy P of the difference gene set A for denoising the breast cancerA

5.1.1) according to ASAOrdering the medicines in a small-to-large order to obtain a first ordered medicine sequence R in a difference gene set A for denoising the breast cancerA

5.1.2) according to the correlation value vector AS of the difference gene set A and all the medicines for denoising the breast cancer in the step 4ACalculating a relevance value vector ASAAbsolute value of (a):

Figure BDA0002556935170000094

wherein the content of the first and second substances,represents the absolute value of the correlation score of the jth drug and the disease in the differential gene set A, wherein j is more than or equal to 1 and less than or equal to q, and q represents the total number of the drugs;

5.1.3) according to absolute value | ASAI rule from big to small, all the medicines are sequenced to obtain a second ordered medicine sequence AR of the difference gene set A for denoising the breast cancerA

5.1.4) downloading a drug set omega related to the breast cancer from a drug toxicity database CTD as a standard drug set related to the breast cancer;

5.1.5) two drug orderings R in the differential Gene set A for Breast cancer De-noisingAAnd ARAAnd the standard drug set omega of breast cancer, calculating the first drug ranking R of breast cancerAAccuracy P ofA1And second drug order AR for breast cancerAAccuracy P ofA2

Wherein M isA1Is shown in the drug order RANumber of top ranked drugs, UA1Represents MA1Number in drug set Ω, M, of individual drugsA2Expressed in the drug order ARANumber of top ranked drugs, UA2Represents MA2The number of drugs in the drug set Ω;

5.1.6) comparison of PA1And PA2If P is the size ofA1≥PA2The accuracy P of medicine prediction is carried out on the difference gene set A of breast cancer denoisingA=PA1Otherwise, PA=PA2

5.2) correlation value vector AS of disease gene set B and all drugs according to the differences and pathogenic genes in step 4BCalculating the accuracy P of the predicted drugB

5.2.1) correlation value vector AS of disease Gene set B with all drugs based on differences and disease genesBThe same procedure as in (5.1.1) to (5.1.5) was used to obtain the difference in breast cancer and the accuracy P of the disease gene set B of the causative genes in both ranksB1And PB2

5.2.2) comparison of PB1And PB2If P is the size ofB1≥PB2The difference of breast cancerAccuracy P of drug prediction of disease gene set B of heterogenous and pathogenic genesB=PB1Otherwise, PB=PB2

Step 6, comparing the accuracy P of the predicted medicine of the two inquiry gene sets A, B of the breast cancerAAnd PB

Predicting accuracy P of medicine by using difference gene set A for denoising breast cancerAAccuracy P of predicting drugs with disease Gene set B of differences and disease genesBAnd (3) comparison:

if PBHigher than PAIncreasing the values of the parameters k and m, deleting more noises, and repeating the steps 3-5 until PB≤PA

Otherwise, executing step 7;

step 7, recording the current difference gene set A for denoising the breast cancer as an optimal query gene set A 'of the breast cancer, and determining threshold values T at two ends of the optimal query gene set A' of the breast cancerupAnd Tdown

7.1) constructing two normal distributions for the gene set obtained in step 2:

7.1.1) a normal distribution symmetrical about logFC ═ 1.5 was constructed for genes in the gene set with logFC > 0:

(logFCg)down=-logFCg+3,logFCg>0,

therein, logFCgExpression Change value of Gene g, (logFC)g)downRepresents a value symmetrical about logFC of 1.5 constructed by an existing gene g;

7.1.2) constructing a normal distribution for the genes of logFC <0 in the gene set that is symmetric about logFC-1.5:

(logFCg)up=-logFCg-3,logFCg<0,

wherein (logFC)g)upRepresents a value symmetrical about logFC-1.5 constructed by an existing gene g;

7.2) calculating the mean value μ of the normal distribution in (7.1.1) symmetrical about logFC ═ 1.5downHebiaoTolerance sigmadown:

Figure BDA0002556935170000111

Wherein G represents the number of genes in a normal distribution;

7.3) calculating the mean value μ of the normal distribution in (7.1.2) symmetrical about logFC-1.5upAnd standard deviation σup

Figure BDA0002556935170000113

7.4) the maximum and minimum values of logFC in the Breast cancer optimal query Gene set A' were respectively recorded as (logFC)maxAnd (logFC)minCalculating corresponding threshold value T according to the results of (7.2) - (7.3)upAnd Tdown

In this example, the bottom gene threshold T of the optimal differential gene set A' determined for breast cancerupTop gene threshold T of optimal differential gene set a ═ 2.03down=-1.56;

And 8, determining whether to download other disease data according to the sum of the current top threshold number and the current bottom threshold number.

8.1) adding the current top threshold quantity and the bottom threshold quantity to obtain the total threshold quantity sum;

8.2) setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number to 20;

8.3) comparing the total threshold number sum with a set total threshold number minimum value of 20:

if sum<20, continuing to download other disease data from the TCGA database, and adopting the steps 3-7 to obtain two end thresholds of the newly downloaded diseaseAndexecuting step 9 until the total threshold number sum is not less than 20;

otherwise, step 9 is performed directly.

And 9, constructing a prediction model of the optimal difference gene set threshold value of the disease.

9.1) determining independent variable X and dependent variable Y of the prediction model:

9.1.1) randomly taking 100000 numerical points from the standard normal distribution as background values and marking as a vector E;

9.1.2) characterization of the distribution differences between differential expression values of disease genes and vector E using Kolmogorov-Smirnov test statistic J as independent variable X of the model;

9.1.3) two-terminal threshold T of optimal query Gene set for all diseasesupAnd TdownA dependent variable Y as a model;

9.2) according to the parameters obtained in (9.1), constructing a prediction model as a linear model with | Y | ═ aX + b, wherein a is the slope of the model, b is the intercept of the model, and the values of a and b are calculated by the following formula:

wherein the content of the first and second substances,

Figure BDA0002556935170000125

means representing the independent variable X

Figure BDA0002556935170000126

Means of absolute value of dependent variable Y

Figure BDA0002556935170000128

D represents the number of independent variables and dependent variables used to construct the prediction model, XdDenotes the d-th value, Y, in the argumentdRepresenting the d-th value in the dependent variable.

In this example, the prediction model is: y | ═ 14.889X + 1.182.

And step 10, predicting the optimal query gene set of the sarcoma and carrying out drug prediction.

10.1) predicting the optimal query gene set of the sarcoma by using the prediction model constructed in the step 9, taking the predicted optimal query gene set of the sarcoma as an input set of Kolmogorov-Smirnov, and calculating the accuracy rates P of M drugs before drug ranking through the steps 4-5M

10.2) selecting the first M medicaments of the medicament ranking in the step (10.1) to obtain all candidate medicaments with potential treatment effect on diseases, wherein M represents the number of the medicaments, and M is more than or equal to 10.

In this example, the accuracy is P20All the drugs listed in the top 20, M-20, were selected to obtain drug candidates with potential therapeutic effects on sarcomas.

The technical effects of the present invention will be described below with reference to simulation experiments.

1. Simulation conditions are as follows:

the computer hardware CPU of the simulation experiment is Intel Core (TM) i5, the memory of the computer hardware is 4G, and the computer software: RStudio integrated development software on a WINDOWS 7V system.

2. Simulation content:

screening differentially expressed genes of colorectal adenocarcinoma, breast cancer, lung adenocarcinoma, lung squamous cell carcinoma, hepatocellular carcinoma, prostate cancer, pancreatic ductal adenocarcinoma, bladder urothelial carcinoma, glioblastoma and head and neck squamous cell carcinoma by using the method to obtain upper and lower regulated gene thresholds of an optimal differential gene set, and constructing a model; and then the model is used for carrying out simulation experiments on the accuracy distribution of each ranking section of the prediction results of the potential therapeutic drugs for the sarcoma with the disease, and the results are shown in figure 2. The vertical axis in FIG. 2 represents the accuracy of each rank of the sarcoma drug-relocation prediction result, and the horizontal axis represents the number of drugs.

As can be seen from FIG. 2, the accuracy of the predicted results of the top 10 ranked drugs reaches 90%, and the accuracy of the predicted results of the top 20 ranked drugs reaches 70%. As can be seen from the accuracy of the sarcoma drug prediction result, the invention effectively improves the accuracy of the drug relocation prediction result.

18页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:基于多层网络表示学习的药物靶标相互作用预测方法

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!