Drug target interaction prediction method based on multilayer network representation learning

Document No.: 1075097 · Publication date: 2020-10-16

Summary: This technology, "Drug target interaction prediction method based on multilayer network representation learning" (基于多层网络表示学习的药物靶标相互作用预测方法), was designed by 鱼亮 (Yu Liang) and 尚奕帆 (Shang Yifan) on 2020-06-28. The invention discloses a drug target interaction prediction method based on multilayer network representation learning, which mainly addresses the low prediction accuracy of the prior art. The scheme is as follows: download data from drug and protein databases and construct multilayer similarity networks for the drugs and for the proteins; compute the diffusion state of each similarity network, and integrate the diffusion states to obtain the feature vectors of the drugs and of the proteins; using the known drug target interaction data as supervision information, project the drug and protein feature vectors into a common drug target space, and use a bilinear function to obtain the projection matrices of the drugs and of the proteins; from these two projection matrices, obtain the prediction score matrix of drug target interactions and rank its entries; treat the 8 top-ranked unknown drug target pairs as potential drug target interactions. The invention improves the prediction accuracy of drug target interactions and can be used to predict candidate drug target pairs.

1. A method for predicting drug target interaction based on multilayer network representation learning comprises the following steps:

(1) downloading data of the drug and the protein, and respectively constructing a drug similarity network and a protein similarity network:

(1a) downloading structure data CH_n of n drugs from any database related to the chemical structures of drugs, and constructing a drug chemical structure similarity network D_ch;

(1b) downloading n drugs and the side effect data corresponding to the n drugs from any database related to drug side effects to obtain a drug-side effect matrix M_se, and constructing a drug side effect similarity network D_se;

(1c) downloading n drugs and the drug-related disease data corresponding to the n drugs from any database related to drug diseases to obtain a drug-disease relation matrix M_di, and constructing a drug disease similarity network D_di;

(1d) downloading n drugs and the interactions among the n drugs from any database related to drug interactions, and constructing a drug-drug interaction similarity network D_dr;

(1e) downloading m proteins and the interactions among the m proteins from any database related to protein interactions, and constructing a protein-protein interaction similarity network T_pr;

(1f) downloading sequence data SEQ_m of m proteins from any database related to protein sequences, and constructing a protein sequence similarity network T_seq;

(1g) downloading m proteins and the disease data corresponding to the m proteins from any database related to protein diseases to obtain a protein-disease relation matrix M_pdi, and constructing a protein disease similarity network T_di;

(2) Downloading known drug target interaction data between n drugs and m proteins from any one database related to the drug targets to obtain a drug target interaction matrix P;

(3) calculating the network diffusion state for the similarity network:

(3a) for each of the 7 similarity network layers constructed in (1), independently capturing the topological structure of that layer with the random walk with restart (RWR) algorithm, to obtain the initial state matrix of the drug chemical structure similarity network D_ch and, likewise, of each of the other layers:

Figure FDA0002557041790000011

(3b) using the positive pointwise mutual information (PPMI) method to calculate the co-occurrence probability of the nodes of each initial state matrix, to obtain the corresponding diffusion state matrices:

Figure FDA0002557041790000024

(4) Using a multilayer network representation learning method to obtain feature vectors of the drug and the target:

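(4a) using the multilayer network representation learning method to integrate the diffusion state matrices of the drug layers (chemical structure, side effect, drug-drug interaction and disease) into a drug feature vector matrix X: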

Figure FDA0002557041790000025

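(4b) using the multilayer network representation learning method to integrate the diffusion state matrices of the protein layers (sequence, protein-protein interaction and disease) into a target feature vector matrix Y: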

Figure FDA0002557041790000029

(5) simultaneously projecting the drug feature vectors X and the target feature vectors Y into a drug target space Z, taking the drug target interaction matrix P as supervision information, and using an alternating minimization method to minimize the difference between the drug target interaction prediction matrix and the known matrix P, to obtain the final drug target interaction prediction matrix, whose entries are the predicted drug target interaction scores, thereby completing the prediction of drug target interactions.

2. The method of claim 1, wherein the drug chemical structure similarity network D_ch in (1a) is constructed as follows:

1a1) downloading the chemical structure data CH_882 of 882 drugs from a drug-related database;

1a2) based on the drug chemical structure data CH_882, obtaining the SMILES chemical structural formula feature vector of each drug using the R package rcdk;

1a3) based on the SMILES chemical structural formula feature vectors of the drugs, obtaining the drug chemical structure similarity network D_ch using the R package fingerprint, where any element [D_ch]_ij of the network is calculated as:

$$[D_{ch}]_{ij} = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2}$$

where A and B are the chemical structural formula feature vectors of two different drugs, · denotes the inner product of vectors, and ‖·‖_2 denotes the vector norm.
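
As a non-limiting illustration, the inner-product-over-norms similarity of 1a3) can be sketched in Python. The sketch assumes the fingerprints are already available as the rows of a binary matrix; the function name `cosine_similarity_matrix` and the three 4-bit fingerprints are hypothetical, and the claim itself relies on the R packages rcdk and fingerprint rather than on this code.

```python
import numpy as np

def cosine_similarity_matrix(F):
    """Pairwise cosine similarity between the rows of F (one fingerprint per row)."""
    F = np.asarray(F, dtype=float)
    norms = np.linalg.norm(F, axis=1)      # ||A||_2 for every row
    norms[norms == 0] = 1.0                # assumption: all-zero rows get similarity 0
    unit = F / norms[:, None]
    return unit @ unit.T                   # (A . B) / (||A||_2 ||B||_2)

# Hypothetical 3-drug, 4-bit fingerprints (illustrative values only).
fingerprints = np.array([[1, 0, 1, 1],
                         [1, 0, 1, 0],
                         [0, 1, 0, 0]])
D_ch = cosine_similarity_matrix(fingerprints)
```

Computing the whole matrix as one product of row-normalized vectors avoids an explicit double loop over the 882 x 882 drug pairs.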

3. The method of claim 1, wherein the drug side effect similarity network D_se in (1b) is constructed as follows:

1b1) downloading 882 drugs and the data on the 5439 side effects corresponding to these drugs from a drug side effect database to obtain a drug-side effect matrix M_se, which has 882 rows and 5439 columns, where rows represent drugs and columns represent side effects;

1b2) taking each row of the drug side effect matrix M_se as the side effect feature of one drug, to obtain the side effect feature set of the drugs:

Figure FDA0002557041790000031

1b3) based on the drug side effect feature set, obtaining the drug side effect similarity network D_se, any element of which is calculated from the feature vectors F_i^se and F_j^se, where F_i^se denotes the side effect feature vector of the i-th drug, F_j^se denotes the side effect feature of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882.
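
The element-wise formula for D_se is not legible in this text. Purely as a hedged illustration, the sketch below uses the Jaccard index, a common similarity for binary feature vectors; this is a swapped-in assumption and may differ from the claimed formula. The function name `jaccard_similarity_matrix` and the small matrix are hypothetical.

```python
import numpy as np

def jaccard_similarity_matrix(M):
    """Pairwise Jaccard index between the binary rows of M (assumed measure)."""
    M = np.asarray(M, dtype=bool).astype(int)
    inter = M @ M.T                                   # |F_i AND F_j|
    sizes = M.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter   # |F_i OR F_j|
    # Assumption: two drugs with no recorded side effects get similarity 0.
    return np.divide(inter, union,
                     out=np.zeros(inter.shape, dtype=float),
                     where=union > 0)

# Hypothetical drug-by-side-effect matrix (3 drugs, 4 side effects).
M_se = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 0, 0, 1]])
D_se = jaccard_similarity_matrix(M_se)
```

The same function would apply unchanged to a drug-disease or protein-disease matrix, since those are built from binary association rows in the same way.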

4. The method of claim 1, wherein the drug disease similarity network D_di in (1c) is constructed as follows:

1c1) downloading 882 drugs and the data on the 6902 diseases corresponding to these drugs from a drug disease database to obtain a drug-disease matrix M_di, which has 882 rows and 6902 columns, where rows represent drugs and columns represent diseases;

1c2) taking each row of the drug disease matrix M_di as the disease feature of one drug, to obtain the disease feature set of the drugs;

1c3) based on the drug disease feature set, obtaining the drug disease similarity network D_di, any element of which is calculated as follows:

Figure FDA0002557041790000036

where F_i^di denotes the disease feature of the i-th drug, F_j^di denotes the disease feature of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882.

5. The method of claim 1, wherein:

the drug interaction network D_dr in (1d) is constructed by first downloading the interactions among 882 drugs from a drug-related database, and then determining whether an interaction exists between drug i and drug j; if so, then

Figure FDA0002557041790000042

the protein interaction network T_pr in (1e) is constructed by first downloading the interactions among 1449 proteins from a protein-related database, and then determining whether an interaction exists between protein i and protein j; if so, then

Figure FDA0002557041790000044

the protein sequence similarity network T_seq in (1f) is constructed by first downloading the sequence data SEQ_1449 of 1449 proteins from a protein-related database, and then calculating the protein sequence similarity network T_seq with the Smith-Waterman algorithm.
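
As a non-limiting illustration of the Smith-Waterman step, the sketch below computes local alignment scores with a linear gap penalty and turns them into a similarity matrix. The scoring values (match 2, mismatch -1, gap -1), the normalization by the geometric mean of the self-alignment scores, the function names and the three short sequences are all assumptions; the source specifies neither a substitution matrix nor a normalization.

```python
import numpy as np

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between strings a and b (linear gap penalty)."""
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0,
                          H[i - 1, j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1, j] + gap,       # gap in b
                          H[i, j - 1] + gap)       # gap in a
    return H.max()

def sequence_similarity_matrix(seqs):
    """Assumed normalization: score(i, j) / sqrt(score(i, i) * score(j, j))."""
    raw = np.array([[smith_waterman_score(s, t) for t in seqs] for s in seqs])
    self_scores = np.sqrt(np.diag(raw))    # requires non-empty sequences
    return raw / np.outer(self_scores, self_scores)

# Hypothetical toy sequences (not taken from any database).
T_seq = sequence_similarity_matrix(["MKTAY", "MKSAY", "GGGGG"])
```

The quadratic dynamic program is adequate for a sketch; aligning all pairs of 1449 real protein sequences would normally use an optimized alignment library.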

6. The method of claim 1, wherein the protein disease similarity network T_di in (1g) is constructed as follows:

1g1) downloading 1449 proteins and the data on the 6902 diseases corresponding to these proteins from a protein disease database to obtain a protein-disease matrix M_pdi, which has 1449 rows and 6902 columns, where rows represent proteins and columns represent diseases;

1g2) taking each row of the protein disease matrix M_pdi as the disease feature of one protein, to obtain the disease feature set of the proteins:

Figure FDA0002557041790000046

1g3) based on the protein disease feature set, obtaining the protein disease similarity network T_di, any element of which is calculated from the feature vectors F_i^pdi and F_j^pdi, where F_i^pdi denotes the disease feature vector of the i-th protein, F_j^pdi denotes the disease feature of the j-th protein, and i, j = 1, 2, 3, ..., 1449.

7. The method of claim 1, wherein the initial state matrix of each similarity network layer in (3a) is calculated with the random walk with restart (RWR) algorithm, implemented as follows, taking the drug chemical structure similarity network D_ch as an example:

3a1) normalizing the similarity network D_ch to obtain a probability transition matrix:

Figure FDA0002557041790000051

3a2) capturing the structural features of the network from the probability transition matrix with the RWR algorithm:

Figure FDA0002557041790000053

where the equation relates the row vector of network node i after t steps of the walk to the initial one-hot encoded vector of that node, whose i-th entry is 1 and all other entries are 0, and α is the restart probability;

3a3) within t steps

Figure FDA0002557041790000056

Figure FDA0002557041790000057

where i ∈ [1, 2, ..., 882], T is the preset number of steps, and r_i is the vector of node i of the drug chemical structure similarity network;

3a4) assembling the global topological structure vectors r_i of the 882 nodes obtained in the previous step into the initial state matrix R_ch.
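
As a non-limiting illustration of steps 3a1) to 3a4), the sketch below runs a random walk with restart from every node at once. Three points are assumptions, because the corresponding equations are not legible here: the similarity network is row-normalized, the update takes the common form r(t+1) = (1 - α) r(t) D̂ + α r(0), and the vector r_i is taken as the sum of the distributions over the T steps. The function name `rwr_initial_state` and the 3-node network are hypothetical.

```python
import numpy as np

def rwr_initial_state(D, alpha=0.5, steps=3):
    """Random walk with restart from every node; row i is the vector r_i."""
    D = np.asarray(D, dtype=float)
    row_sums = D.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # isolated nodes keep an all-zero row
    D_hat = D / row_sums                   # assumed row normalization (3a1)
    n = D.shape[0]
    R0 = np.eye(n)                         # one-hot start vector for every node
    Rt = R0.copy()
    R = np.zeros((n, n))
    for _ in range(steps):
        # Assumed update (3a2): follow an edge with prob. 1 - alpha, restart with prob. alpha.
        Rt = (1 - alpha) * Rt @ D_hat + alpha * R0
        R += Rt                            # assumed aggregation over the T steps (3a3)
    return R                               # stacked rows form the initial state matrix (3a4)

# Hypothetical 3-node similarity network.
D_ch = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
R_ch = rwr_initial_state(D_ch, alpha=0.5, steps=3)
```

Because every row of the transition matrix sums to 1, each step vector remains a probability distribution, so under the summation assumption each row of R_ch sums to the number of steps.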

8. The method of claim 1, wherein the diffusion state matrix of each initial state matrix in (3b) is calculated with the positive pointwise mutual information (PPMI) method, implemented as follows, taking the initial state matrix R_ch of the drug chemical structure as an example:

3b1) calculating the co-occurrence probability of the nodes of the initial state matrix R_ch with the PPMI method:

Figure FDA0002557041790000058

where the equation uses the joint probability of nodes i and j in the initial state matrix R_ch of the drug chemical structure to give the co-occurrence probability of nodes i and j;

3b2) assembling the co-occurrence probabilities between every pair of nodes into the diffusion state matrix of the drug chemical structure similarity network.
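
As a non-limiting illustration of the PPMI step, the sketch below applies the standard positive pointwise mutual information transform to a non-negative matrix. It assumes the textbook definition, max(0, log(p(i, j) / (p(i) p(j)))), with probabilities estimated from the entries of the matrix; the claimed equation is not legible here and may differ in detail. The function name `ppmi` and the 2 x 2 input are hypothetical.

```python
import numpy as np

def ppmi(R):
    """Positive pointwise mutual information of a non-negative matrix R."""
    R = np.asarray(R, dtype=float)
    total = R.sum()
    row = R.sum(axis=1, keepdims=True)     # marginal mass of node i
    col = R.sum(axis=0, keepdims=True)     # marginal mass of node j
    expected = row @ col                   # row_i * col_j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(R * total / expected) # log( p(i,j) / (p(i) p(j)) )
    pmi[~np.isfinite(pmi)] = 0.0           # log(0) or 0/0 entries carry no information
    return np.maximum(pmi, 0.0)            # keep positive associations only

# Hypothetical 2-node initial state matrix.
S = ppmi(np.array([[2.0, 0.0],
                   [0.0, 2.0]]))
```

Clipping at zero discards pairs that co-occur less often than chance, which keeps the resulting diffusion state matrix non-negative.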

9. The method of claim 1, wherein (4a) integrating the drug multilayer diffusion state matrix using a multilayer network representation learning method to obtain a drug feature vector matrix X is implemented as follows:

4a1) for each diffusion state matrix of the drugs, computing its embedding feature:

Figure FDA0002557041790000061

where

Figure FDA0002557041790000063

4a2) concatenating all embedding features to obtain the common feature H_{c,1}:

Figure FDA0002557041790000066

where σ(·) is the activation function, W_c is the concatenation weight matrix, and B_c is the concatenation bias matrix;

4a3) applying L layers of encoding transformations to the common feature H_{c,1}, where the transformed encoding feature H_{c,p+1} is:

$$H_{c,p+1} = \sigma(W_p H_{c,p} + B_p),$$

where p ∈ {1, ..., L} indexes the encoding layers, W_p is the encoding transformation weight matrix, and B_p is the encoding transformation bias matrix;

4a4) applying L layers of decoding transformations to the encoded feature H_{c,L+1}, where the transformed decoding feature H_{c,p+L+1} is:

$$H_{c,p+L+1} = \sigma(W_{p,1} H_{c,p+L} + B_{p,1}),$$

where p ∈ {1, ..., L} indexes the decoding layers, W_{p,1} is the decoding transformation weight matrix, and B_{p,1} is the decoding transformation bias matrix;

4a5) using the transformed decoding feature H_{c,2L+1} to calculate the decoding embedding of the diffusion state matrix of each drug network layer:

where

Figure FDA0002557041790000069

4a6) using the decoding embedding features to restore the diffusion state matrix of each drug network layer:

Figure FDA00025570417900000614

where the first term is the restoration weight matrix,

Figure FDA00025570417900000616

4a7) minimizing the difference θ between the original diffusion state matrices and the restored diffusion state matrices:

Figure FDA0002557041790000071

where ℓ(·) is the sample-wise binary cross-entropy function, the minimization operator selects the minimum of the function, and the regularization term constrains all parameter matrices of the encoding and decoding process;

4a8) once θ reaches its minimum, the difference between the restored and the original diffusion state matrices is minimal; the corresponding middle-layer encoding feature H_{c,L+1} then captures the topological structure of the drugs in every similarity network layer and can represent the drugs, so the middle-layer encoding feature H_{c,L+1} is taken as the drug feature vector X.
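
As a non-limiting illustration of the structure in 4a1) to 4a8), the sketch below runs one forward pass of a multi-input autoencoder in NumPy and evaluates the binary cross-entropy reconstruction loss. It is a sketch under stated assumptions, not the claimed implementation: the weights are random and untrained, the activation is taken to be the sigmoid, the regularization term and the optimizer are omitted, the inputs are stand-in matrices scaled to [0, 1], and all names and layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(in_dim, out_dim):
    """Random (untrained) weight matrix and bias column for one layer."""
    return rng.normal(scale=0.1, size=(out_dim, in_dim)), np.zeros((out_dim, 1))

def layer(params, H):
    W, B = params
    return sigmoid(W @ H + B)              # H_next = sigma(W H + B), columns are nodes

def bce(S, S_hat, eps=1e-12):
    """Sample-wise binary cross-entropy, averaged over all entries."""
    S_hat = np.clip(S_hat, eps, 1 - eps)
    return -np.mean(S * np.log(S_hat) + (1 - S) * np.log(1 - S_hat))

# Hypothetical sizes: n nodes, k network layers, widths h, c, d; one encoding layer (L = 1).
n, k, h, c, d = 6, 2, 4, 4, 2
S_list = [rng.random((n, n)) for _ in range(k)]   # stand-ins for diffusion state matrices
enc_in = [dense(n, h) for _ in range(k)]          # 4a1: one embedding per network layer
concat = dense(k * h, c)                          # 4a2: W_c, B_c
enc_stack = [dense(c, d)]                         # 4a3: L encoding layers
dec_stack = [dense(d, c)]                         # 4a4: L decoding layers
dec_split = [dense(c, h) for _ in range(k)]       # 4a5: per-layer decoding embeddings
dec_out = [dense(h, n) for _ in range(k)]         # 4a6: restore each diffusion matrix

def forward(S_list):
    H = np.vstack([layer(p, S) for p, S in zip(enc_in, S_list)])
    H = layer(concat, H)                          # common feature H_{c,1}
    for p in enc_stack:
        H = layer(p, H)
    X = H.T                                       # middle-layer code, one row per node
    for p in dec_stack:
        H = layer(p, H)
    S_rec = [layer(q, layer(p, H)) for p, q in zip(dec_split, dec_out)]
    loss = sum(bce(S, R) for S, R in zip(S_list, S_rec))   # 4a7 without the regularizer
    return X, loss

X, loss = forward(S_list)
```

In practice the parameters would be fitted by a gradient-based optimizer until the loss stops decreasing, after which the rows of X serve as the feature vectors.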

10. The method of claim 1, wherein the drug target interaction prediction matrix in (5) is obtained as follows:

5a) projecting the drug feature vectors X into the drug target space Z to obtain the projection XG of X in Z, where G is the drug feature vector transformation matrix;

5b) projecting the target feature vectors Y into the drug target space Z to obtain the projection YH of Y in Z, where H is the target feature vector transformation matrix;

5c) from the drug projection XG and the target projection YH, transposing YH to (YH)^T to obtain the drug target interaction prediction matrix XG(YH)^T = XGH^TY^T, where H^T is the transpose of H and Y^T is the transpose of Y;

5d) taking the known drug target interactions P as supervision information, using an alternating minimization method to minimize the difference between the matrix P and the prediction matrix:

Figure FDA0002557041790000076

Figure FDA0002557041790000077

where P_ij is the interaction between drug i and target j, with P_ij = 1 indicating that an interaction exists and P_ij = 0 that none exists; x_i is the i-th row feature vector of the drug feature vector matrix X; the target feature vector is the j-th column of the target feature vector matrix Y^T; ‖·‖_F denotes the Frobenius norm of a matrix; and λ is a penalty parameter, so that the larger the Frobenius norms of the transformation matrices G and H, the larger the penalty, and vice versa;

5e) taking as the result the drug target interaction prediction matrix whose difference from the known drug target interaction matrix P is minimal:

Figure FDA0002557041790000082
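
As a non-limiting illustration of steps 5a) to 5e), the sketch below fits G and H by alternating exact minimization. The objective is assumed to be the squared Frobenius error between P and XGH^TY^T plus λ times the squared Frobenius norms of G and H; this matches the quantities named in 5d), but the exact weighting in the claimed equation is not legible here. Each block update solves its regularized normal equations through a Kronecker product, which is only practical for small feature dimensions. All function names and the random data are hypothetical.

```python
import numpy as np

def solve_block(L, R, T, lam):
    """argmin_W ||T - L W R^T||_F^2 + lam ||W||_F^2 via the vectorized normal equations."""
    A, C = L.T @ L, R.T @ R
    # vec(A W C) = (C^T kron A) vec(W) with column-major vec; C is symmetric.
    K = np.kron(C, A) + lam * np.eye(A.shape[0] * C.shape[0])
    rhs = (L.T @ T @ R).flatten(order="F")
    return np.linalg.solve(K, rhs).reshape(A.shape[0], C.shape[0], order="F")

def objective(P, X, Y, G, H, lam):
    resid = P - X @ G @ H.T @ Y.T
    return np.sum(resid ** 2) + lam * (np.sum(G ** 2) + np.sum(H ** 2))

def fit_bilinear(P, X, Y, k=2, lam=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(X.shape[1], k))
    H = rng.normal(size=(Y.shape[1], k))
    history = [objective(P, X, Y, G, H, lam)]
    for _ in range(iters):
        G = solve_block(X, Y @ H, P, lam)        # fix H, minimize over G
        H = solve_block(Y, X @ G, P.T, lam)      # fix G, minimize over H
        history.append(objective(P, X, Y, G, H, lam))
    return G, H, history

# Hypothetical data: 5 drugs and 4 targets with 3-dimensional feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(4, 3))
P = (rng.random((5, 4)) < 0.3).astype(float)
G, H, history = fit_bilinear(P, X, Y)
scores = X @ G @ H.T @ Y.T                       # predicted interaction scores
```

Because each update minimizes the objective exactly over one block with the other held fixed, the recorded objective values cannot increase from one iteration to the next.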

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a drug target interaction prediction method that can be used to provide candidate drug target interactions for drug repositioning experiments.

Background

A drug target is a biological macromolecule in the body, such as a protein or nucleic acid, that has a pharmacodynamic function and can be acted on by a drug. A drug target interaction means that a drug molecule binds to a biological macromolecule in the human body, here a protein, and exerts an effect. When a drug binds to different target proteins its effect differs, and if a predicted drug target is associated with a disease, the drug may have a potential therapeutic effect on that disease.

Predicting drug target interactions is an important step in drug repositioning. Its purpose is to predict the proteins on which a drug may act, i.e. its drug targets, and thereby discover potential therapeutic effects of the drug. Knowing more drug target interactions therefore improves the understanding of pharmacology and allows drugs to be used to fuller therapeutic effect. If a new target of an existing drug is identified and a new use is found, the cost of drug development can be greatly reduced, the development cycle shortened, and the risk that a new drug fails clinical trials because of side effects lowered.

Determining drug target interactions experimentally gives reliable results but at high cost, so it is impractical to test every drug target pair by experiment alone. Computational prediction of drug target interactions is therefore needed to narrow the experimental scope, reduce cost and save time.

Existing computational methods rest mainly on the following assumption: similar drugs may share the same targets, and vice versa. According to the type of data used, they fall into two categories: models that predict drug target interactions from single-type data, and models that do so by integrating multi-type data:

Prediction model based on single-type data

Prediction models based on single-type data differ in the data type they use: chemical structure, drug side effects, gene expression data, and so on.

Drugs are compound molecules with chemical structures, and different chemical structures produce different effects, so the similarity between drugs can be described through their chemical structures. The chemical structural formula of a drug is decomposed and a high-dimensional vector is used to represent its chemical structure features. For example, one dimension indicates whether the chemical structure of the drug contains a benzene ring: it is set to 1 if so and to 0 otherwise. The chemical structure of each drug is thus represented by a vector, and the similarity between drugs is measured by the distance between their vectors in order to predict drug targets.

Drug side effects also carry important clinical phenotypic information, and the similarity of phenotypic side effects is used to infer whether two drugs share a target. All side effects are represented as a list without repeats using the Adverse Event Reporting System (AERS) of the U.S. Food and Drug Administration (FDA). A high-dimensional vector represents the side effect features of a drug: an entry of 1 means that the drug has the adverse reaction at the corresponding position of the side effect list, and an entry of 0 means that it does not. Each drug is thus represented as a 0/1 feature vector, and similarity is measured by the distance between vectors.

When a drug acts it changes gene expression in the body, which is an important feature in transcriptomics, so drug target interactions can be predicted from the gene expression profile changes caused by drug treatment. CMap is a database of gene expression profiles under high-throughput compound perturbation. It provides gene expression data under the action of drugs, and with machine learning classification techniques these data can be used on their own to predict drug target interactions.

However, similarities calculated from a single type of data are biased, whereas different types of data carry different information, and many studies have proposed integrating multi-type data for drug target interaction prediction. Compared with methods based on single-type data, the different data types complement one another, so integrating them can ultimately improve the accuracy of drug target interaction prediction.

Prediction model based on multi-type data integration

Two kinds of approaches are mainly used to integrate multi-type data: network-based approaches and machine learning-based approaches.

Network-based methods construct several relationship networks from the different types of data and then integrate the multi-type information through the diffusion of information in the networks. Such a method first constructs several drug similarity networks from different types of drug data, which describe the relationships among drugs from multiple angles, and then fuses them with a network fusion method such as similarity network fusion (SNF). Unlike conventional methods, this fusion is nonlinear, so the similarity networks complement one another. Network-based random walk methods are also used; the core of a random walk is diffusion without any fixed rule, after which the relationships among the network nodes settle into a stable probability distribution. Traditional random walks operate on a homogeneous network; prior work has combined the drug and target similarity networks with the known drug target interaction network and applied a bidirectional random walk on the resulting heterogeneous network to infer potential drug target interactions.

Machine learning-based methods learn features of the drugs and targets and then predict drug target interactions from those features. Feature learning methods include collaborative matrix factorization on homogeneous networks, and Laplacian regularized sparse subspace learning (LRSSL), which adds a Laplacian regularization term to enforce the smoothness of the subspace. In these methods, several heterogeneous networks constructed from different types of data are projected into a common feature space, so that the networks are integrated into one, and features are then learned from the single integrated network.

However, these methods do not integrate multi-type data well. First, network fusion methods use the diffusion state directly as a feature or prediction score, which makes them susceptible to the noise in the different networks. Second, integrating several networks into one network, or projecting them into a common subspace, may lose network-specific information, because the information from the different data sources is mixed and can no longer be distinguished. These factors all reduce the accuracy of drug target interaction prediction.

Disclosure of Invention

The invention aims to overcome the above defects of the prior art and provides a drug target interaction prediction method based on multilayer network representation learning, so as to reduce both the loss of network-specific information and the noise of the multi-type data networks, and to improve prediction accuracy.

The technical idea of the invention is as follows: construct several similarity networks from multi-omics data of the drugs and proteins, and calculate the diffusion state of each similarity network to capture the topological structure features of the network; integrate the diffusion states of the networks with a multilayer network representation learning method to learn the feature vectors of the drugs and targets; project the drug and target feature vectors into a drug target space; and predict the drug target interaction scores with a matrix completion method.

According to the above thought, the implementation steps of the invention include the following:

(1) downloading data of the drug and the protein, and respectively constructing a drug similarity network and a protein similarity network:

(1a) downloading structure data CH_n of n drugs from any database related to the chemical structures of drugs, and constructing a drug chemical structure similarity network D_ch;

(1b) downloading n drugs and the side effect data corresponding to the n drugs from any database related to drug side effects to obtain a drug-side effect matrix M_se, and constructing a drug side effect similarity network D_se;

(1c) downloading n drugs and the drug-related disease data corresponding to the n drugs from any database related to drug diseases to obtain a drug-disease relation matrix M_di, and constructing a drug disease similarity network D_di;

(1d) downloading n drugs and the interactions among the n drugs from any database related to drug interactions, and constructing a drug-drug interaction similarity network D_dr;

(1e) downloading m proteins and the interactions among the m proteins from any database related to protein interactions, and constructing a protein-protein interaction similarity network T_pr;

(1f) downloading sequence data SEQ_m of m proteins from any database related to protein sequences, and constructing a protein sequence similarity network T_seq;

(1g) downloading m proteins and the disease data corresponding to the m proteins from any database related to protein diseases to obtain a protein-disease relation matrix M_pdi, and constructing a protein disease similarity network T_di;

(2) Downloading known drug target interaction data between n drugs and m proteins from any one database related to the drug targets to obtain a drug target interaction matrix P;

(3) calculating the network diffusion state for the similarity network:

(3a) for each of the 7 similarity network layers constructed in step (1), independently capturing the topological structure of that layer with the random walk with restart (RWR) algorithm, to obtain the initial state matrix of the drug chemical structure similarity network D_ch

Figure BDA0002557041800000041

the initial state matrix of the drug side effect similarity network D_se

Figure BDA0002557041800000042

the initial state matrix of the drug-drug interaction similarity network D_dr

Figure BDA0002557041800000043

the initial state matrix of the drug disease similarity network D_di and the initial state matrix of the protein sequence similarity network T_seq

Figure BDA0002557041800000045

the initial state matrix of the protein-protein interaction similarity network T_pr

Figure BDA0002557041800000046

and the initial state matrix of the protein disease similarity network T_di;

(3b) using the positive pointwise mutual information (PPMI) method to calculate the co-occurrence probability of the nodes of the initial state matrices

Figure BDA0002557041800000048

Figure BDA0002557041800000049

to obtain the corresponding diffusion state matrices

Figure BDA00025570418000000410

(4) Using a multilayer network representation learning method to obtain feature vectors of the drug and the target:

(4a) using the multilayer network representation learning method to integrate the drug chemical structure diffusion state matrix, the drug side effect diffusion state matrix

Figure BDA00025570418000000413

the drug interaction diffusion state matrix

Figure BDA00025570418000000414

and the drug disease diffusion state matrix

Figure BDA00025570418000000415

into a drug feature vector matrix X;

(4b) using the multilayer network representation learning method to integrate the protein sequence diffusion state matrix, the protein interaction diffusion state matrix and the protein disease diffusion state matrix

Figure BDA00025570418000000418

into a target feature vector matrix Y;

(5) simultaneously projecting the drug feature vectors X and the target feature vectors Y into a drug target space Z, taking the drug target interaction matrix P as supervision information, and using an alternating minimization method to minimize the difference between the drug target interaction prediction matrix and the known matrix P, to obtain the final drug target interaction prediction matrix

Figure BDA00025570418000000420

The entries of this matrix are the predicted drug target interaction scores, which completes the prediction of drug target interactions.
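
As a non-limiting illustration of how the predicted scores can be turned into candidates, the sketch below ranks the drug target pairs that are not already known interactions and returns the highest-scoring ones; the abstract treats the 8 top-ranked unknown pairs as potential interactions. The function name `top_unknown_pairs` and the small score matrix are hypothetical.

```python
import numpy as np

def top_unknown_pairs(scores, P, k=8):
    """Return up to k (drug, target) index pairs with the highest scores among unknown pairs."""
    scores = np.asarray(scores, dtype=float)
    masked = np.where(np.asarray(P) == 1, -np.inf, scores)   # exclude known interactions
    order = np.argsort(masked, axis=None)[::-1][:k]          # flat indices, best first
    pairs = [tuple(int(v) for v in np.unravel_index(idx, scores.shape)) for idx in order]
    return [(i, j) for i, j in pairs if np.isfinite(masked[i, j])]

# Hypothetical scores for 2 drugs and 3 targets, with one known interaction at (0, 0).
scores = np.array([[0.9, 0.2, 0.4],
                   [0.1, 0.8, 0.7]])
P = np.array([[1, 0, 0],
              [0, 0, 0]])
candidates = top_unknown_pairs(scores, P, k=2)
```

Masking the known pairs before sorting ensures that the returned candidates are all previously unobserved drug target pairs.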

Compared with the prior art, the invention has the following advantages:

1. The invention predicts drug target interactions by integrating multi-type data, which describes the similarity among drugs and among targets from multiple angles, avoids the bias of existing methods that use single-type data, and improves the accuracy of drug target interaction prediction.

2. The invention integrates the multilayer similarity networks with a multilayer network representation learning method and exploits the denoising property of the autoencoder to reduce the influence of noise in the multilayer networks, which improves the accuracy of drug target interaction prediction.

3. The invention uses the multilayer network representation learning method to generate the feature vectors of the drugs and proteins, and can capture nonlinear transformation features of the network topology to obtain high-quality drug and protein feature vectors.

4. The invention uses the multilayer network representation learning method to train on the data for drug target interaction prediction, which guards against overfitting the training data through an excess of parameters and improves the accuracy of drug target interaction prediction.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a graph comparing the accuracy of the present invention in predicting drug target interactions with the prior art multi-layer network representation learning method MDA.

Detailed Description

The following detailed description of specific embodiments and effects of the present invention is provided in conjunction with the accompanying drawings:

Referring to FIG. 1, the implementation steps of this example are as follows:

Step 1, downloading the drug and protein data, and constructing the drug similarity networks and the protein similarity networks.

1.1) Download drug chemical structure data. Databases of drug chemical structures include the DrugBank database, the CTD database and others; this embodiment uses, but is not limited to, the DrugBank database to construct the drug chemical structure similarity network $D_{ch}$:

1.1.1) Download the chemical structure data $CH_{882}$ of 882 drugs from the DrugBank database;

1.1.2) Based on the drug chemical structure data $CH_{882}$, obtain the SMILES chemical structural formula feature vector of each drug using the R package rcdk;

1.1.3) Based on the SMILES feature vectors of the drugs, obtain the drug chemical structure similarity network $D_{ch}$ using the R package fingerprint, where any element $[D_{ch}]_{ij}$ of the network is calculated as:

Figure BDA0002557041800000051

where A and B are the chemical structure feature vectors of two drugs, $\cdot$ denotes the vector inner product, and $\|\cdot\|_2$ denotes the vector norm;
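As an illustration of step 1.1.3), the sketch below builds a small similarity network from binary fingerprint vectors in Python rather than R. The text only states that the element formula combines a vector inner product with a vector norm, so two common formulas of that shape are shown, cosine similarity and the Tanimoto coefficient; which one the embodiment uses is an assumption here. The function names and the toy fingerprints are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def tanimoto_similarity(a, b):
    """Inner product divided by (|a|^2 + |b|^2 - inner product)."""
    dot = float(a @ b)
    denom = float(a @ a) + float(b @ b) - dot
    return dot / denom if denom > 0 else 0.0

def similarity_network(fingerprints, sim=tanimoto_similarity):
    """Symmetric matrix of pairwise similarities with ones on the diagonal."""
    n = len(fingerprints)
    network = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            network[i, j] = network[j, i] = sim(fingerprints[i], fingerprints[j])
    return network

# Toy binary fingerprints for three hypothetical drugs.
fps = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 0, 1, 1]], dtype=float)
D_ch = similarity_network(fps)
```

Passing `sim=cosine_similarity` switches the formula without changing the rest of the construction.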

1.2) Download drug side effect data. Databases related to drug side effects include the SIDER database, the CTD database and others; this embodiment uses, but is not limited to, the SIDER database to construct the drug side effect similarity network $D_{se}$:

1.2.1) Download 882 drugs and the 5439 side effects associated with them from the SIDER database to obtain the drug side effect matrix $M_{se}$, which has 882 rows and 5439 columns, where rows represent drugs and columns represent side effects;

1.2.2) Take each row of the drug side effect matrix $M_{se}$ as the side effect feature of one drug, giving the side effect feature set of the drugs;

1.2.3) Based on the drug side effect feature set, obtain the drug side effect similarity network $D_{se}$, where any element of the network is calculated as follows:

Figure BDA0002557041800000063

where $F_i^{se}$ represents the side effect feature of the i-th drug, $F_j^{se}$ represents the side effect feature of the j-th drug, $i = 1, 2, 3, \ldots, 882$ and $j = 1, 2, 3, \ldots, 882$;

1.3) Download drug disease data. Databases related to drug diseases include the DrugBank and CTD databases and others; this embodiment uses, but is not limited to, CTD to construct the drug disease similarity network $D_{di}$:

1.3.1) Download 882 drugs and the 6902 diseases associated with them from the CTD database to obtain the drug disease matrix $M_{di}$, which has 882 rows and 6902 columns, where rows represent drugs and columns represent diseases;

1.3.2) Take each row of the drug disease matrix $M_{di}$ as the disease feature of one drug, giving the disease feature set of the drugs;

1.3.3) Based on the drug disease feature set, obtain the drug disease similarity network $D_{di}$, where any element of the network is calculated as follows:

where $F_i^{di}$ represents the disease feature of the i-th drug, $F_j^{di}$ represents the disease feature of the j-th drug, $i = 1, 2, 3, \ldots, 882$ and $j = 1, 2, 3, \ldots, 882$;

1.4) Download drug interaction data. Drug interaction databases include the DrugBank and CTD databases and others; this embodiment uses, but is not limited to, DrugBank to construct the drug interaction network $D_{dr}$:

1.4.1) Download the interactions among the 882 drugs from the DrugBank database, then determine whether drug i and drug j interact: if they do, $[D_{dr}]_{ij} = 1$; otherwise $[D_{dr}]_{ij} = 0$. This finally yields the drug interaction similarity network $D_{dr}$;

1.5) Download protein interaction data. Protein interaction databases include the DrugBank and HPRD databases and others; this embodiment uses, but is not limited to, HPRD to construct the protein interaction network $T_{pr}$:

1.5.1) Download the interactions among the 1449 proteins from the HPRD database, then determine whether protein i and protein j interact: if they do, $[T_{pr}]_{ij} = 1$; otherwise $[T_{pr}]_{ij} = 0$. This finally yields the protein interaction network $T_{pr}$;

1.6) Download protein sequence data. Protein sequence databases include the DrugBank and HPRD databases and others; this embodiment uses, but is not limited to, HPRD to construct the protein sequence similarity network $T_{seq}$:

1.6.1) Download the sequence data $SEQ_{1449}$ of 1449 proteins from the HPRD database, then compute the protein sequence similarity network $T_{seq}$ with the Smith-Waterman algorithm;
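Step 1.6.1) can be illustrated with a minimal Smith-Waterman local alignment score. This is a sketch under stated assumptions: it uses a flat match/mismatch score and a linear gap penalty instead of a protein substitution matrix, and it normalizes each pairwise score by the geometric mean of the two self-alignment scores. The text specifies neither the scoring scheme nor any normalization, so both choices and all function names are hypothetical.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (linear gap penalty)."""
    cols = len(seq_b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(seq_a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def sequence_similarity_network(seqs):
    """Pairwise scores scaled into [0, 1] by the self-alignment scores (assumed)."""
    n = len(seqs)
    self_scores = [smith_waterman_score(s, s) for s in seqs]
    network = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            raw = smith_waterman_score(seqs[i], seqs[j])
            norm = (self_scores[i] * self_scores[j]) ** 0.5
            network[i][j] = network[j][i] = raw / norm if norm > 0 else 0.0
    return network

# Toy sequences standing in for the 1449 protein sequences.
T_seq = sequence_similarity_network(["ACGT", "ACGT", "TTTT"])
```

For real protein sequences a substitution matrix such as BLOSUM62 and affine gap penalties would normally replace the flat scores used here.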

1.7) Download protein disease data. Databases related to protein diseases include the DrugBank and CTD databases and others; this embodiment uses, but is not limited to, CTD to construct the protein disease similarity network $T_{di}$:

1.7.1) Download 1449 proteins and the 6902 diseases associated with them from the CTD database to obtain the protein disease matrix $M_{pdi}$, which has 1449 rows and 6902 columns, where rows represent proteins and columns represent diseases;

1.7.2) Take each row of the protein disease matrix $M_{pdi}$ as the disease feature of one protein, giving the disease feature set of the proteins;

1.7.3) Based on the protein disease feature set, obtain the protein disease similarity network $T_{di}$, where any element of the network is calculated as follows:

where $F_i^{pdi}$ represents the disease feature of the i-th protein, $F_j^{pdi}$ represents the disease feature of the j-th protein, $i = 1, 2, 3, \ldots, 1449$ and $j = 1, 2, 3, \ldots, 1449$.

Step 2: construct the drug target interaction matrix P.

2.1) Download known drug target interaction data. Databases related to drug targets include the DrugBank and CTD databases and others; this embodiment uses, but is not limited to, DrugBank, from which 3185 interactions between 882 drugs and 1449 proteins are downloaded;

2.2) Use these interactions to form the drug target interaction matrix P, in which each row is a drug and each column is a protein: if drug i and protein j interact, the corresponding entry is $P_{ij} = 1$; otherwise $P_{ij} = 0$.
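The construction in step 2.2) amounts to filling a binary matrix from a list of known pairs. A minimal sketch follows, assuming the downloaded interactions have already been mapped to zero-based (drug index, protein index) pairs; the function name and the toy pairs are hypothetical.

```python
import numpy as np

def build_interaction_matrix(pairs, n_drugs, n_targets):
    """Binary matrix with rows = drugs, columns = proteins, 1 = known interaction."""
    P = np.zeros((n_drugs, n_targets), dtype=np.int8)
    for drug_idx, target_idx in pairs:
        P[drug_idx, target_idx] = 1
    return P

# Toy example; the embodiment would use 3185 pairs, 882 drugs and 1449 proteins.
known_pairs = [(0, 2), (1, 0), (1, 3)]
P = build_interaction_matrix(known_pairs, n_drugs=3, n_targets=4)
```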

Step 3: compute the diffusion state matrix corresponding to each of the similarity networks $D_{ch}$, $D_{se}$, $D_{di}$, $D_{dr}$, $T_{pr}$, $T_{seq}$ and $T_{di}$.

3.1) Compute the initial state matrix of the drug chemical structure similarity network $D_{ch}$:

3.1.1) Normalize the drug chemical structure similarity network $D_{ch}$ to obtain a probability transition matrix;

3.1.2) Apply the random walk with restart (RWR) algorithm to the probability transition matrix to capture the structural features of the network. In the RWR update, each network node i has a row vector after t steps of the walk and an initial one-hot encoding vector whose i-th position is 1 and whose other positions are 0, and $\alpha$ is the restart probability;

3.1.3) Add up the walk vectors of the individual steps to capture the global topology of the network, where $i \in [1, 2, \ldots, 882]$, T is the preset number of steps, and $r_i$ is the resulting vector of node i of the drug chemical structure similarity network;

3.1.4) The global topology vectors $r_i$ of the 882 nodes obtained in 3.1.3) form the initial state matrix;
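Steps 3.1.1) to 3.1.4) can be sketched as follows. The update rule used here is the standard random walk with restart, new state = (1 - alpha) * state * transition matrix + alpha * one-hot start. The exact formula is not legible in this text, so treating it as the standard form is an assumption, as are the row normalization, the default parameter values and the function names.

```python
import numpy as np

def transition_matrix(similarity):
    """Row-normalize a similarity network into a probability transition matrix."""
    S = np.asarray(similarity, dtype=float)
    row_sums = S.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave isolated nodes as all-zero rows
    return S / row_sums

def rwr_initial_state(similarity, restart_prob=0.5, n_steps=3):
    """Sum of the RWR state vectors over n_steps; row i belongs to node i."""
    B = transition_matrix(similarity)
    n = B.shape[0]
    start = np.eye(n)                 # row i is the one-hot vector of node i
    state = start.copy()
    accumulated = np.zeros((n, n))
    for _ in range(n_steps):
        state = (1 - restart_prob) * (state @ B) + restart_prob * start
        accumulated += state          # step 3.1.3): add the walk vectors
    return accumulated

# Toy similarity network standing in for the 882 x 882 network.
D_toy = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
R = rwr_initial_state(D_toy, restart_prob=0.5, n_steps=3)
```

Running all start nodes at once as the rows of one matrix is a convenience; it is equivalent to walking from each node separately.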

3.2) Compute the diffusion state matrix of the initial state matrix. Methods for computing a diffusion state matrix include normalization, random walk, positive pointwise mutual information and others; this embodiment uses, but is not limited to, positive pointwise mutual information (PPMI), with the following steps:

3.2.1) Using the PPMI method, compute the co-occurrence probability of the nodes of the initial state matrix:

Figure BDA0002557041800000092

where the input is the association probability of nodes i and j in the initial state matrix of the drug chemical structure network, and the result is the co-occurrence probability of node i and node j;

3.2.2) The co-occurrence probabilities between every pair of nodes form the diffusion state matrix of the drug chemical structure similarity network;
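Step 3.2) can be illustrated with the usual definition of positive pointwise mutual information, PPMI(i, j) = max(0, log(M_ij * sum(M) / (rowsum_i * colsum_j))), applied to an initial state matrix M. The patent's own formula is not legible in this text, so using the textbook definition is an assumption; the function name is hypothetical.

```python
import numpy as np

def ppmi(state):
    """Positive pointwise mutual information of a non-negative matrix."""
    M = np.asarray(state, dtype=float)
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0      # zero entries give log(0) or 0/0; treat as 0
    return np.maximum(pmi, 0.0)       # keep only positive associations

# Two toy initial state matrices.
diag_case = ppmi(np.array([[2.0, 0.0], [0.0, 2.0]]))
uniform_case = ppmi(np.ones((3, 3)))
```

In the diagonal case each node co-occurs only with itself, so the diagonal entries are log 2; in the uniform case no pair is more associated than chance, so every entry is 0.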

3.3) For the other similarity networks, repeat steps 3.1) to 3.2) to compute the diffusion state matrices of the drug interaction network $D_{dr}$, the drug disease similarity network $D_{di}$, the drug side effect similarity network $D_{se}$, the protein sequence similarity network $T_{seq}$, the protein disease similarity network $T_{di}$ and the protein interaction network $T_{pr}$.

Step 4: use the multilayer network representation learning method to integrate the multilayer diffusion state matrices of the drugs into the drug feature vector X, and the multilayer diffusion state matrices of the proteins into the target feature vector Y:

Multilayer network representation learning methods include the multimodal deep autoencoder (MDA) and multilayer network embedding (MNE); this embodiment uses, but is not limited to, an improved MDA method to integrate the multilayer diffusion state matrices into feature vectors, with the following steps:

4.1) Integrate the drug multilayer diffusion state matrices with the improved MDA method to obtain the drug feature vector matrix X:

4.1.1) For each diffusion state matrix of the drugs, compute its nonlinear encoding embedding:

Figure BDA00025570418000000920

where the left-hand side is the embedded feature, the parameters are an encoding weight matrix and an encoding bias matrix, and $\sigma$ is the sigmoid activation function;

4.1.2) Concatenate all embedded features and compute the common feature $H_{c,1}$ of the integrated network:

Figure BDA0002557041800000103

where $[\cdot]$ denotes the concatenation of the embedded features, $W_c$ is the concatenation weight matrix, and $B_c$ is the concatenation bias matrix;

4.1.3) Apply L layers of encoding transformation to the common feature $H_{c,1}$; the transformed encoding feature $H_{c,p+1}$ is:

$$H_{c,p+1} = \sigma(W_p H_{c,p} + B_p),$$

where $p \in \{1, \ldots, L\}$ indexes the encoding layers, $W_p$ is the encoding transformation weight matrix, and $B_p$ is the encoding transformation bias matrix;

4.1.4) Apply L layers of decoding transformation to the transformed encoding feature $H_{c,L+1}$; the transformed decoding feature $H_{c,p+L+1}$ is:

$$H_{c,p+L+1} = \sigma(W_{p,1} H_{c,p+L} + B_{p,1}),$$

where $p \in \{1, \ldots, L\}$ indexes the decoding layers, $W_{p,1}$ is the decoding transformation weight matrix, and $B_{p,1}$ is the decoding transformation bias matrix;

4.1.5) From the decoded feature $H_{c,2L+1}$, compute the decoding embedding of the diffusion state matrix of each layer of the drug network:

where the result is the decoding embedded feature of each layer's diffusion state matrix, and the parameters are a decoding weight matrix and a decoding bias matrix;

4.1.6) Use the decoding embedded features to restore the diffusion state matrix of each layer of the drug network:

Figure BDA00025570418000001010

where the parameters are a restoration weight matrix and a restoration bias matrix;

4.1.7) Minimize the difference $\theta$ between the original diffusion state matrices and the restored diffusion state matrices:

Figure BDA00025570418000001015

where $l(\cdot)$ is the sample-wise binary cross-entropy function, the minimization operator selects the parameters that minimize the objective, and the final term is the regularization constraint on all parameter matrices of the encoding and decoding process;

4.1.8) Once $\theta$ is minimized, the difference between the restored and the original diffusion state matrices is minimal. The encoding feature $H_{c,L+1}$ of the corresponding middle layer then captures the topology of the drugs in every layer of the similarity networks and can represent the features of the drugs, so the middle-layer encoding feature $H_{c,L+1}$ is taken as the drug feature vector X;
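The data flow of steps 4.1.1) to 4.1.7) can be sketched as a single forward pass in NumPy. This is only an illustration under several assumptions: rows are nodes (so layers compute `A @ W + b` rather than the column form used in the formulas), the inputs are scaled to [0, 1] so that binary cross-entropy applies, the regularization term is a plain L2 penalty, and all sizes, names and the `dense` helper are hypothetical. Training, that is, minimizing the loss by gradient descent, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(n_in, n_out):
    """Hypothetical helper: small random weights and a zero bias."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def binary_cross_entropy(target, pred, eps=1e-9):
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def mda_forward(layers, params, weight_decay=1e-4):
    # 4.1.1: one nonlinear embedding per diffusion state matrix
    embedded = [sigmoid(A @ W + b) for A, (W, b) in zip(layers, params["embed"])]
    # 4.1.2: concatenate the embeddings and map them to the common feature
    W_c, b_c = params["common"]
    h = sigmoid(np.concatenate(embedded, axis=1) @ W_c + b_c)
    # 4.1.3: L encoding layers; the result is the middle-layer code
    for W, b in params["encode"]:
        h = sigmoid(h @ W + b)
    code = h
    # 4.1.4: L decoding layers
    for W, b in params["decode"]:
        h = sigmoid(h @ W + b)
    # 4.1.5 and 4.1.6: per-layer decoding embedding, then restoration
    restored = [sigmoid(sigmoid(h @ W1 + b1) @ W2 + b2)
                for (W1, b1), (W2, b2) in zip(params["unembed"], params["restore"])]
    # 4.1.7: reconstruction loss plus an L2 penalty on every weight matrix
    groups = ["embed", "encode", "decode", "unembed", "restore"]
    weights = [W for g in groups for W, _ in params[g]] + [params["common"][0]]
    loss = sum(binary_cross_entropy(A, R) for A, R in zip(layers, restored))
    loss += weight_decay * sum(float(np.sum(W ** 2)) for W in weights)
    return code, restored, loss

n, d_embed, d_common, d_code = 6, 4, 3, 2
layers = [rng.random((n, n)) for _ in range(2)]   # toy inputs scaled to [0, 1]
params = {
    "embed": [dense(n, d_embed) for _ in layers],
    "common": dense(d_embed * len(layers), d_common),
    "encode": [dense(d_common, d_code)],            # L = 1 in this toy setup
    "decode": [dense(d_code, d_common)],
    "unembed": [dense(d_common, d_embed) for _ in layers],
    "restore": [dense(d_embed, n) for _ in layers],
}
X_code, restored, theta = mda_forward(layers, params)
```

After training, the rows of `X_code` would play the role of the drug feature vectors described in 4.1.8).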

4.2) Integrate the protein multilayer diffusion state matrices with the improved MDA method to obtain the target feature vector Y:

This step is implemented in the same way as 4.1), except that the multilayer diffusion state matrices of the proteins are fed to the input; repeating 4.1.1) to 4.1.8) yields the target feature vector Y.

Step 5: obtain the drug target interaction prediction matrix from the drug feature vector X, the target feature vector Y and the known drug target interaction data P.

5.1) Project the drug feature vector X into the drug target space Z to obtain the projection XG of X in Z, where G is the drug feature vector transformation matrix;

5.2) Project the target feature vector Y into the drug target space Z to obtain the projection YH of Y in Z, where H is the target feature vector transformation matrix;

5.3) From the drug projection XG and the target projection YH, transpose YH, that is, take $(YH)^T$, to obtain the drug target interaction prediction matrix $XG(YH)^T = XGH^TY^T$, where $H^T$ is the transpose of the matrix H and $Y^T$ is the transpose of the matrix Y;

5.4) Using the known drug target interactions P as supervision information, minimize the difference between the matrix P and the prediction matrix by alternating minimization:

where $P_{ij}$ is the interaction between drug i and target j, with $P_{ij} = 1$ indicating that an interaction exists and $P_{ij} = 0$ that none exists; $x_i$ is the feature vector in the i-th row of the drug feature vector matrix X; the target term is the feature vector in the j-th column of the target feature vector matrix $Y^T$; the norm in the penalty term is the Frobenius norm of a matrix; and $\lambda$ is a penalty parameter: the larger the Frobenius norms of the transformation matrices G and H, the larger the penalty, and the smaller the norms, the smaller the penalty;

5.5) The drug target interaction prediction matrix that differs least from the known drug target interaction matrix P is the final drug target interaction prediction matrix.
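Steps 5.1) to 5.5) can be sketched as alternating ridge-regularized least squares. The objective is assumed here to be the squared difference between P and $XGH^TY^T$ plus $\lambda$ times the squared Frobenius norms of G and H; the exact objective is not legible in this text, so this form, the closed-form block update via a Kronecker product, and all function names are assumptions.

```python
import numpy as np

def solve_factor(X, P, B, lam):
    """Minimize ||P - X G B||_F^2 + lam * ||G||_F^2 over G in closed form.

    Setting the gradient to zero gives (X^T X) G (B B^T) + lam * G = X^T P B^T,
    which is a linear system in the row-major flattening of G.
    """
    d, k = X.shape[1], B.shape[0]
    A = X.T @ X                       # d x d
    C = B @ B.T                       # k x k, symmetric
    rhs = X.T @ P @ B.T               # d x k
    system = np.kron(A, C) + lam * np.eye(d * k)
    return np.linalg.solve(system, rhs.ravel()).reshape(d, k)

def objective(P, X, Y, G, H, lam):
    resid = P - X @ G @ H.T @ Y.T
    return float(np.sum(resid ** 2) + lam * (np.sum(G ** 2) + np.sum(H ** 2)))

def fit_bilinear(P, X, Y, k=2, lam=1.0, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.normal(0.0, 0.1, (X.shape[1], k))
    H = rng.normal(0.0, 0.1, (Y.shape[1], k))
    history = []
    for _ in range(n_iter):
        G = solve_factor(X, P, (Y @ H).T, lam)      # update G with H fixed
        H = solve_factor(Y, P.T, (X @ G).T, lam)    # update H with G fixed
        history.append(objective(P, X, Y, G, H, lam))
    return G, H, history

# Toy data: 5 drugs and 4 targets with 3-dimensional feature vectors.
rng = np.random.default_rng(1)
X = rng.random((5, 3))
Y = rng.random((4, 3))
P = (rng.random((5, 4)) < 0.3).astype(float)
G, H, history = fit_bilinear(P, X, Y)
scores = X @ G @ H.T @ Y.T            # prediction score matrix
```

Because each update solves its subproblem exactly, the objective cannot increase from one iteration to the next. The Kronecker system has size (d * k) squared, so this form is only practical for small feature dimensions.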

The technical effects of the invention are further illustrated by the following simulation experiments:

1. Simulation conditions

The simulation experiments were run with Python 3.6.5 on an Ubuntu platform with an Intel(R) Core(TM) i7-8700K CPU (3.70 GHz clock frequency) and 48 GB of memory.

2. Simulation content:

Simulation 1: on the same data set, drug target interactions are predicted with the method of the invention and with the existing multilayer network representation learning method MDA, and the prediction accuracy of each is calculated. The results are shown in FIG. 2, where AUROC is the area under the receiver operating characteristic (ROC) curve and AUPR is the area under the precision-recall curve. Both are measures of prediction accuracy, and larger values indicate higher accuracy.

FIG. 2 shows that the invention can effectively improve the prediction accuracy of drug target interaction.
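The two metrics of Simulation 1 can be computed from a list of true labels and predicted scores. The sketch below is an illustration, not the evaluation code behind FIG. 2: AUROC is computed as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision, a common step-wise estimate of the area under the precision-recall curve. The function names and the toy data are hypothetical.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()   # ties count as half
    return float(wins / (len(pos) * len(neg)))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Toy example: four drug target pairs, two of them true interactions.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)              # 3 of 4 pairs correct: 0.75
toy_aupr = average_precision(toy_labels, toy_scores)   # (1/1 + 2/3) / 2
```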

Simulation 2: potential drug target interactions are predicted with the invention by integrating multiple types of drug and protein data, giving predicted interaction scores between 882 drugs and 1449 targets. The predictions are ranked, and the 8 top-ranked unknown drug target pairs are taken for verification; the results are shown in Table 1.

TABLE 1

Figure BDA0002557041800000131

In Table 1, a √ indicates that the drug target interaction has been validated by the validation method named in that column.

Table 1 shows that, of the 8 predicted drug target interaction pairs, those ranked 31, 41, and 43 have been verified in all three ways (literature verification, disease association, and pathway enrichment analysis), which indicates that these 3 predictions are the most reliable. The other 5 pairs have each been verified in some of these ways, so those predictions are also fairly reliable. These results demonstrate the prediction accuracy and reliability of the invention.
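The ranking step of Simulation 2 can be sketched as follows. It is assumed here that pairs already present in the known interaction matrix P are excluded before ranking and that a higher score means a more likely interaction; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def top_unknown_pairs(scores, known, k=8):
    """Return up to k (drug, target, score) triples, highest score first."""
    scores = np.asarray(scores, dtype=float)
    # Known interactions are pushed to the bottom of the ranking.
    masked = np.where(np.asarray(known) == 1, -np.inf, scores)
    flat_order = np.argsort(-masked, axis=None, kind="stable")[:k]
    rows, cols = np.unravel_index(flat_order, scores.shape)
    return [(int(r), int(c), float(scores[r, c]))
            for r, c in zip(rows, cols) if np.isfinite(masked[r, c])]

# Toy example with 2 drugs and 2 targets; pair (0, 0) is already known.
toy_scores = np.array([[0.9, 0.2],
                       [0.4, 0.8]])
toy_known = np.array([[1, 0],
                      [0, 0]])
candidates = top_unknown_pairs(toy_scores, toy_known, k=2)
```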
