Drug target interaction prediction method based on multilayer network representation learning

Document No.: 1075097    Publication date: 2020-10-16

Reading note: This technology, "Drug target interaction prediction method based on multilayer network representation learning", was designed by Yu Liang and Shang Yifan on 2020-06-28. Main content: The invention discloses a drug target interaction prediction method based on multilayer network representation learning, which mainly addresses the low prediction accuracy of the prior art. The scheme is as follows: download data from drug and protein databases and construct multilayer similarity networks for the drugs and for the proteins; compute the diffusion state of each similarity network and integrate the diffusion states to obtain feature vectors for the drugs and the proteins; using the known drug target interaction data as supervision information, project the drug and protein feature vectors into the same drug-target space and obtain the projection matrices of the drugs and the proteins with a bilinear function; compute the drug target interaction prediction score matrix from the two projection matrices and rank its entries; treat the top-ranked 8 unknown drug-target pairs as potential drug target interactions. The invention improves the prediction accuracy of drug target interactions and can be used to predict candidate drug-target pairs.

1. A method for predicting drug target interactions based on multilayer network representation learning, comprising the following steps:

(1) downloading data of the drug and the protein, and respectively constructing a drug similarity network and a protein similarity network:

(1a) downloading the structure data CH_n of n drugs from any database related to drug chemical structures, and constructing the drug chemical structure similarity network D^ch;

(1b) downloading n drugs and the side effect data corresponding to the n drugs from any database related to drug side effects to obtain the drug-side effect matrix M^se, and constructing the drug side effect similarity network D^se;

(1c) downloading n drugs and the disease data corresponding to the n drugs from any database related to drug diseases to obtain the drug-disease relation matrix M^di, and constructing the drug disease similarity network D^di;

(1d) downloading n drugs and the interactions among them from any database related to drug interactions, and constructing the drug-drug interaction similarity network D^dr;

(1e) downloading m proteins and the interactions among them from any database related to protein interactions, and constructing the protein-protein interaction similarity network T^pr;

(1f) downloading the sequence data SEQ_m of m proteins from any database related to protein sequences, and constructing the protein sequence similarity network T^seq;

(1g) downloading m proteins and the disease data corresponding to the m proteins from any database related to protein diseases to obtain the protein-disease relation matrix M^pdi, and constructing the protein disease similarity network T^di;

(2) Downloading known drug target interaction data between n drugs and m proteins from any one database related to the drug targets to obtain a drug target interaction matrix P;

(3) calculating the network diffusion state for the similarity network:

(3a) for the 7 similarity networks constructed in (1), capturing the topological structure of each layer independently with the random walk with restart (RWR) algorithm, to obtain the initial state matrix R^ch of the drug chemical structure similarity network D^ch and, likewise, the initial state matrix of each of the other six networks;

(3b) using the positive pointwise mutual information (PPMI) method to calculate the co-occurrence probability of the nodes in each initial state matrix, obtaining the corresponding diffusion state matrices;

(4) Using a multilayer network representation learning method to obtain feature vectors of the drug and the target:

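(4a) using the multilayer network representation learning method to integrate the drug chemical structure diffusion state matrix, the drug side effect diffusion state matrix, the drug interaction diffusion state matrix, and the drug disease diffusion state matrix, obtaining the drug feature vector matrix X;

(4b) using the multilayer network representation learning method to integrate the protein sequence diffusion state matrix, the protein interaction diffusion state matrix, and the protein disease diffusion state matrix, obtaining the target feature vector matrix Y;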

(5) projecting the drug feature vector X and the target feature vector Y simultaneously into a drug-target space Z, taking the drug target interaction matrix P as supervision information, and minimizing the difference between the drug target interaction prediction matrix and the known matrix P with an alternating minimization method to obtain the final drug target interaction prediction matrix, whose entries are the predicted drug target interaction scores, thereby completing the prediction of drug target interactions.

2. The method of claim 1, wherein the drug chemical structure similarity network D^ch in (1a) is constructed as follows:

1a1) downloading the chemical structure data CH₈₈₂ of 882 drugs from a drug-related database;

1a2) based on the drug chemical structure data CH₈₈₂, obtaining the SMILES chemical structural formula feature vectors of the drugs with the R package rcdk;

1a3) based on the SMILES chemical structural formula feature vectors of the drugs, obtaining the drug chemical structure similarity network D^ch with the R package fingerprint, any element [D^ch]_ij of which is calculated as:

where A and B are the chemical structural formula feature vectors of two different drugs, · denotes the inner product of vectors, and ||·||² denotes the squared norm of a vector.

3. The method of claim 1, wherein the drug side effect similarity network D^se in (1b) is constructed as follows:

1b1) downloading 882 drugs and the 5439 side effects corresponding to them from a drug side effect database to obtain the drug-side effect matrix M^se, which has 882 rows and 5439 columns, where rows represent drugs and columns represent side effects;

1b2) taking each row of the drug-side effect matrix M^se as the side effect feature of one drug, obtaining the side effect feature set of the drugs:

1b3) based on the drug side effect feature set, obtaining the drug side effect similarity network D^se, any element of which is calculated as follows:

where F_i^se represents the side effect feature vector of the i-th drug, F_j^se represents the side effect feature vector of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882.

4. The method of claim 1, wherein the drug disease similarity network D^di in (1c) is constructed as follows:

1c1) downloading 882 drugs and the 6902 diseases corresponding to them from a drug disease database to obtain the drug-disease matrix M^di, which has 882 rows and 6902 columns, where rows represent drugs and columns represent diseases;

1c2) taking each row of the drug-disease matrix M^di as the disease feature of one drug, obtaining the disease feature set of the drugs:

1c3) based on the drug disease feature set, obtaining the drug disease similarity network D^di, any element of which is calculated as follows:

where F_i^di represents the disease feature of the i-th drug, F_j^di represents the disease feature of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882.

5. The method of claim 1, wherein:

the drug interaction network D^dr in (1d) is constructed by first downloading the interactions among 882 drugs from a drug-related database, then determining whether an interaction exists between drug i and drug j and setting the corresponding element of D^dr according to whether the interaction is present or absent;

the protein interaction network T^pr in (1e) is constructed by first downloading the interactions among 1449 proteins from a protein-related database, then determining whether an interaction exists between protein i and protein j and setting the corresponding element of T^pr according to whether the interaction is present or absent;

the protein sequence similarity network T^seq in (1f) is constructed by downloading the sequence data SEQ₁₄₄₉ of 1449 proteins from a protein-related database and then computing the protein sequence similarity network T^seq with the Smith-Waterman algorithm.

6. The method of claim 1, wherein the protein disease similarity network T^di in (1g) is constructed as follows:

1g1) downloading 1449 proteins and the 6902 diseases corresponding to them from a protein disease database to obtain the protein-disease matrix M^pdi, which has 1449 rows and 6902 columns, where rows represent proteins and columns represent diseases;

1g2) taking each row of the protein-disease matrix M^pdi as the disease feature of one protein, obtaining the disease feature set of the proteins:

1g3) based on the protein disease feature set, obtaining the protein disease similarity network T^di, any element of which is calculated as follows:

where F_i^pdi represents the disease feature vector of the i-th protein, F_j^pdi represents the disease feature vector of the j-th protein, i = 1, 2, 3, ..., 1449, and j = 1, 2, 3, ..., 1449.

7. The method according to claim 1, wherein the initial state matrix of each similarity network layer is calculated in (3a) using the random walk with restart (RWR) algorithm; taking the drug chemical structure similarity network D^ch as an example, the implementation is as follows:

3a1) normalizing the similarity network D^ch to obtain the probability transition matrix;

3a2) capturing the network structure features of the probability transition matrix with the random walk with restart (RWR) algorithm:

where the state of network node i after t steps of the walk is a row vector, the initial state is a one-hot encoding vector whose i-th position is 1 and whose other positions are 0, and α is the restart probability;

3a3) obtaining the global topological structure vector r_i of each node within T steps:

where i ∈ [1, 2, ..., 882], T is the set number of steps, and r_i is the vector of node i of the drug chemical structure similarity network;

3a4) forming the initial state matrix R^ch from the global topological structure vectors r_i of the 882 nodes obtained in the previous step.

8. The method of claim 1, wherein the diffusion state matrix of each initial state matrix is calculated in (3b) using the positive pointwise mutual information (PPMI) method; taking the initial state matrix R^ch of the drug chemical structure network as an example, the implementation is as follows:

3b1) for the initial state matrix R^ch, calculating the co-occurrence probability of the nodes with the PPMI method:

where the first quantity is the association probability of nodes i and j in the initial state matrix R^ch of the drug chemical structure network, and the second is the co-occurrence probability of nodes i and j;

3b2) forming the diffusion state matrix of the drug chemical structure similarity network from the co-occurrence probabilities between every two nodes.

9. The method of claim 1, wherein (4a) integrating the drug multilayer diffusion state matrix using a multilayer network representation learning method to obtain a drug feature vector matrix X is implemented as follows:

4a1) calculating the embedding features of all drug diffusion state matrices;

4a2) concatenating all embedding features to obtain the common feature H_c,1,

where σ(·) is the activation function, W_c is the concatenation weight matrix, and B_c is the concatenation bias matrix;

4a3) applying L layers of encoding transformation to the common feature H_c,1, the transformed encoding feature H_c,p+1 being:

H_c,p+1 = σ(W_p H_c,p + B_p),

where p ∈ {1, ..., L} indexes the encoding layers, W_p is the encoding transformation weight matrix, and B_p is the encoding transformation bias matrix;

4a4) applying L layers of decoding transformation to the encoded feature H_c,L+1, the transformed decoding feature H_c,p+L+1 being:

H_c,p+L+1 = σ(W_p,1 H_c,p+L + B_p,1),

where p ∈ {1, ..., L} indexes the decoding layers, W_p,1 is the decoding transformation weight matrix, and B_p,1 is the decoding transformation bias matrix;

4a5) using the transformed decoding feature H_c,2L+1 to calculate the decoding embedding of the diffusion state matrix of each drug network layer;

4a6) using the decoding embedding features to restore the diffusion state matrix of each drug network layer, the weight matrix used in this step being the restoration weight matrix;

4a7) minimizing the difference θ between the original diffusion state matrices and the restored diffusion state matrices:

where l(·) is the sample-wise binary cross-entropy function, and the objective also contains the minimization operator and a regularization constraint term over all parameter matrices of the encoding and decoding process;

4a8) when θ reaches its minimum, the difference between the restored and the original diffusion state matrices is smallest; the corresponding middle-layer encoding feature H_c,L+1 then captures the topological structure of the drugs in every similarity network layer and can represent the features of the drugs, so the middle-layer encoding feature H_c,L+1 is taken as the drug feature vector X.

10. The method of claim 1, wherein the drug target interaction prediction matrix in (5) is obtained as follows:

5a) projecting the drug feature vector X into the drug-target space Z to obtain the projection vector XG of X in Z, where G is the drug feature vector transformation matrix;

5b) projecting the target feature vector Y into the drug-target space Z to obtain the projection vector YH of Y in Z, where H is the target feature vector transformation matrix;

5c) from the drug projection vector XG and the transpose (YH)^T of the target projection vector YH, obtaining the drug target interaction prediction matrix XG(YH)^T = XGH^TY^T,

where H^T is the transpose of the matrix H and Y^T is the transpose of the matrix Y;

5d) taking the known drug target interaction matrix P as supervision information, minimizing the difference between the matrix P and the prediction matrix with an alternating minimization method:

where P_ij is the interaction between drug i and target j, P_ij = 1 indicating that an interaction exists and P_ij = 0 that none exists; x_i is the i-th row feature vector of the drug feature vector matrix X; the target feature vector is the j-th column of the target feature vector matrix Y^T; ||·||_F denotes the Frobenius norm of a matrix; and λ is a penalty parameter: the larger the Frobenius norms of the transformation matrices G and H, the larger the penalty, and the smaller the norms, the smaller the penalty;

5e) taking the drug target interaction prediction matrix whose difference from the known drug target interaction matrix P is minimal as the final prediction matrix.

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a drug target interaction prediction method that can be used to provide candidate drug target interactions for drug repositioning experiments.

Background

A drug target is a biological macromolecule in the body, such as a protein or a nucleic acid, that has a pharmacodynamic function and can be acted on by a drug. A drug target interaction means that a drug molecule binds to a biological macromolecule, i.e. a protein, in the human body and exerts an effect. When a drug binds to different target proteins, its effect differs, and if a predicted drug target is associated with a disease, the drug may have a potential therapeutic effect on that disease.

Predicting drug target interactions is an important step in drug repositioning. Its purpose is to predict the proteins on which a drug may act, i.e. its targets, and thereby to discover potential therapeutic effects of the drug. Knowing more drug target interactions therefore improves pharmacological understanding and helps exploit more of a drug's therapeutic potential. If a new target of an existing drug is identified and a new use is found, the cost of drug development can be greatly reduced, the development cycle shortened, and the risk that a new drug fails clinical trials because of side effects lowered.

Determining drug target interactions experimentally gives reliable results but is costly, so it is difficult to test every drug-target pair by experiment alone. Computational methods are therefore needed to predict drug target interactions, narrowing the experimental scope, reducing cost, and saving time.

Existing computational methods rest mainly on the following assumption: similar drugs may share the same targets, and vice versa. According to the type of data used, they fall into two categories: models that predict drug target interactions from single-type data, and models that integrate multi-type data:

Prediction models based on single-type data

Prediction models based on single-type data differ in the type of data they use: chemical structure, drug side effects, gene expression data, etc.

Drugs, being compound molecules, all have chemical structures, and different chemical structures have different effects, so the similarity between drugs can be described through their chemical structures. The chemical structural formula of a drug is decomposed, and a high-dimensional vector represents its chemical structure features. For example, one dimension indicates whether the structure contains a benzene ring: 1 if it does, 0 otherwise. Each drug's chemical structure is thus represented by a vector, and the similarity between drugs is measured by the distance between their vectors, from which drug targets are predicted.

Drug side effects also carry important clinical phenotype information, and the similarity of phenotypic side effects is used to infer whether two drugs share a target. Using the Adverse Event Reporting System (AERS) of the U.S. Food and Drug Administration (FDA), all side effects are represented as a list without duplicates. A high-dimensional vector then represents the side effect features of a drug: a 1 at a position means the drug causes the adverse reaction at that position in the list, and a 0 means it does not. Each drug is finally represented as a 0/1 feature vector, and the distance between vectors measures similarity.

When a drug acts, it changes gene expression in the body, an important feature in transcriptomics, so drug target interactions can be predicted from the gene expression profile changes caused by drug treatment. CMap is a high-throughput database of gene expression profiles under compound perturbation; it provides gene expression data under the action of drugs, and with machine learning classification techniques these data can be used on their own to predict drug target interactions.

However, similarity computed from a single type of data is biased, whereas different types of data carry different information, and many studies have proposed integrating multi-type data for drug target interaction prediction. Because the different data types complement one another, integrating them can improve the accuracy of drug target interaction prediction compared with methods based on single-type data.

Prediction model based on multi-type data integration

Two kinds of approaches are mainly used to integrate multi-type data: network-based approaches and machine learning-based approaches.

Network-based methods construct several relationship networks from different types of data and then integrate the multi-type information through the diffusion of information in the networks. Such a method first constructs several drug similarity networks from different types of drug data, which describe the relationships among drugs from multiple angles, and then fuses them with a network fusion method such as SNF. Unlike conventional methods, this fusion is nonlinear, so the similarity networks complement one another. Network-based random walk methods are also used; the core of a random walk is diffusion that follows no fixed rule, after which the relationships among network nodes settle into a stable probability distribution. Traditional random walks operate on a homogeneous network, whereas the prior art combines the drug and target similarity networks with the known drug target interaction network and applies a bidirectional random walk on the resulting heterogeneous network to infer potential drug target interactions.

Machine learning-based methods learn features of the drugs and targets and then predict drug target interactions from those features. Feature learning methods include collaborative matrix factorization on homogeneous networks, and Laplacian regularized sparse subspace learning (LRSSL), which adds a Laplacian regularization term to keep the subspace smooth. In these methods, several heterogeneous networks built from different types of data are projected into a common feature space so that they can be integrated into one network, and features are then learned from the single integrated network.

However, these methods do not integrate multi-type data well. First, network fusion methods use the diffusion state directly as the feature or prediction score, which makes them susceptible to noise in the different networks. Second, integrating several networks into one, or projecting them into a common subspace, can lose network-specific information, because information from the different data sources is mixed and can no longer be distinguished. Both factors reduce the accuracy of drug target interaction prediction.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a drug target interaction prediction method based on multilayer network representation learning that reduces the loss of network-specific information and the noise of multi-type data networks, and thereby improves prediction accuracy.

The technical idea of the invention is as follows: construct several similarity networks from multi-omics data of drugs and proteins, and compute the diffusion state of each similarity network to capture its topological structure features; integrate the network diffusion states with a multilayer network representation learning method to learn feature vectors of the drugs and targets; project the drug and target feature vectors into a drug-target space; and predict the drug target interaction scores with a matrix completion method.

According to the above thought, the implementation steps of the invention include the following:

(1) downloading data of the drug and the protein, and respectively constructing a drug similarity network and a protein similarity network:

(1a) downloading the structure data CH_n of n drugs from any database related to drug chemical structures, and constructing the drug chemical structure similarity network D^ch;

(1b) downloading n drugs and the side effect data corresponding to the n drugs from any database related to drug side effects to obtain the drug-side effect matrix M^se, and constructing the drug side effect similarity network D^se;

(1c) downloading n drugs and the disease data corresponding to the n drugs from any database related to drug diseases to obtain the drug-disease relation matrix M^di, and constructing the drug disease similarity network D^di;

(1d) downloading n drugs and the interactions among them from any database related to drug interactions, and constructing the drug-drug interaction similarity network D^dr;

(1e) downloading m proteins and the interactions among them from any database related to protein interactions, and constructing the protein-protein interaction similarity network T^pr;

(1f) downloading the sequence data SEQ_m of m proteins from any database related to protein sequences, and constructing the protein sequence similarity network T^seq;

(1g) downloading m proteins and the disease data corresponding to the m proteins from any database related to protein diseases to obtain the protein-disease relation matrix M^pdi, and constructing the protein disease similarity network T^di;

(2) Downloading known drug target interaction data between n drugs and m proteins from any one database related to the drug targets to obtain a drug target interaction matrix P;

(3) calculating the network diffusion state for the similarity network:

(3a) for the 7 similarity networks constructed in step (1), capturing the topological structure of each layer independently with the random walk with restart (RWR) algorithm, to obtain the initial state matrices of the drug chemical structure similarity network D^ch, the drug side effect similarity network D^se, the drug-drug interaction similarity network D^dr, the drug disease similarity network D^di, the protein sequence similarity network T^seq, the protein-protein interaction similarity network T^pr, and the protein disease similarity network T^di;

(3b) using the positive pointwise mutual information (PPMI) method to calculate the co-occurrence probability of the nodes in each initial state matrix, obtaining the corresponding diffusion state matrices;
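As an illustration of step (3), the following minimal Python sketch runs a random walk with restart on one similarity network and converts the result into a PPMI matrix. The update rule s(t+1) = (1 - α)·s(t)·B + α·s(0), the summation of the walk states over the steps, and the names `rwr`, `ppmi`, `alpha` and `steps` are assumptions made for this sketch; the formulas of the method itself may differ in detail.

```python
import numpy as np

def rwr(similarity, alpha=0.5, steps=3):
    """Random walk with restart on one similarity network (assumed form).

    Row i of the result sums the walk distributions started from node i
    over `steps` steps; alpha is the restart probability.
    """
    S = np.asarray(similarity, dtype=float)
    row_sums = S.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # leave isolated nodes as zero rows
    B = S / row_sums                         # row-normalized transition matrix
    n = B.shape[0]
    start = np.eye(n)                        # one-hot start vector per node
    state = start.copy()
    total = np.zeros((n, n))
    for _ in range(steps):
        state = (1 - alpha) * state @ B + alpha * start
        total += state
    return total

def ppmi(R):
    """Positive pointwise mutual information of a non-negative matrix."""
    R = np.asarray(R, dtype=float)
    total = R.sum()
    row = R.sum(axis=1, keepdims=True)
    col = R.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(R * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0             # log(0) and 0/0 entries become 0
    return np.maximum(pmi, 0.0)              # keep only positive associations
```

Calling `ppmi(rwr(D))` on each of the seven similarity networks in turn would give one diffusion state matrix per layer under these assumptions.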

(4) Using a multilayer network representation learning method to obtain feature vectors of the drug and the target:

(4a) using the multilayer network representation learning method to integrate the drug chemical structure diffusion state matrix, the drug side effect diffusion state matrix, the drug interaction diffusion state matrix, and the drug disease diffusion state matrix, obtaining the drug feature vector matrix X;

(4b) using the multilayer network representation learning method to integrate the protein sequence diffusion state matrix, the protein interaction diffusion state matrix, and the protein disease diffusion state matrix, obtaining the target feature vector matrix Y;
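To make the integration in step (4) more concrete, the sketch below runs a single untrained forward pass of a multi-input autoencoder in Python: each diffusion state matrix is embedded separately, the embeddings are concatenated into a common feature, encoded to a middle layer, decoded, and used to reconstruct each input, with a binary cross-entropy reconstruction loss. It is only a sketch under stated assumptions: one encoding and one decoding layer (L = 1), sigmoid activations, rows treated as nodes, inputs already scaled to [0, 1], random weights, and no training loop or regularization term. All function names and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero bias for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(mats, embed_dim=8, common_dim=6, code_dim=4, seed=0):
    """One untrained forward pass; returns (middle-layer code, reconstructions)."""
    rng = np.random.default_rng(seed)
    n = mats[0].shape[0]
    # Embed each diffusion state matrix separately (rows are nodes).
    embedded = []
    for Q in mats:
        W, b = init_layer(rng, n, embed_dim)
        embedded.append(sigmoid(Q @ W + b))
    # Concatenate the embeddings and map them to a common feature.
    W_c, b_c = init_layer(rng, embed_dim * len(mats), common_dim)
    common = sigmoid(np.hstack(embedded) @ W_c + b_c)
    # One encoding layer down to the middle code, one decoding layer back up.
    W_e, b_e = init_layer(rng, common_dim, code_dim)
    code = sigmoid(common @ W_e + b_e)
    W_d, b_d = init_layer(rng, code_dim, common_dim)
    decoded = sigmoid(code @ W_d + b_d)
    # Per-network decoding embedding, then reconstruction of each input.
    recon = []
    for _ in mats:
        W_u, b_u = init_layer(rng, common_dim, embed_dim)
        W_r, b_r = init_layer(rng, embed_dim, n)
        recon.append(sigmoid(sigmoid(decoded @ W_u + b_u) @ W_r + b_r))
    return code, recon

def reconstruction_loss(mats, recon, eps=1e-9):
    """Mean binary cross-entropy between inputs (in [0, 1]) and reconstructions."""
    total = 0.0
    for Q, Q_hat in zip(mats, recon):
        Q_hat = np.clip(Q_hat, eps, 1 - eps)
        total += -np.mean(Q * np.log(Q_hat) + (1 - Q) * np.log(1 - Q_hat))
    return float(total / len(mats))
```

In a trained model the weights would be fitted by minimizing this loss plus a regularization term, and the rows of `code` would then play the role of the feature vectors X (for drugs) or Y (for targets).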

(5) projecting the drug feature vector X and the target feature vector Y simultaneously into a drug-target space Z, taking the drug target interaction matrix P as supervision information, and minimizing the difference between the drug target interaction prediction matrix and the known matrix P with an alternating minimization method to obtain the final drug target interaction prediction matrix, whose entries are the predicted drug target interaction scores, thereby completing the prediction of drug target interactions.
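Step (5) can be sketched in Python as follows, assuming the objective is the squared Frobenius norm of P - X·G·Hᵀ·Yᵀ plus λ times the squared Frobenius norms of G and H. That exact objective, the rank `k`, and the names `solve_factor`, `fit_bilinear` and `objective` are assumptions of this sketch. With one factor fixed, the other has a closed-form minimizer, which the helper computes through two eigendecompositions.

```python
import numpy as np

def solve_factor(X, P, A, lam):
    """Exact minimizer over G of ||P - X G A||_F^2 + lam * ||G||_F^2.

    Solves (X^T X) G (A A^T) + lam * G = X^T P A^T by diagonalizing
    the two symmetric matrices X^T X and A A^T.
    """
    a, U = np.linalg.eigh(X.T @ X)
    b, V = np.linalg.eigh(A @ A.T)
    C = U.T @ (X.T @ P @ A.T) @ V
    return U @ (C / (np.outer(a, b) + lam)) @ V.T

def fit_bilinear(X, Y, P, k=2, lam=0.1, iters=20, seed=0):
    """Alternate exact updates of G (H fixed) and H (G fixed)."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(Y.shape[1], k))             # random start for H
    for _ in range(iters):
        G = solve_factor(X, P, (Y @ H).T, lam)       # update G with H fixed
        H = solve_factor(Y, P.T, (X @ G).T, lam)     # update H with G fixed
    return G, H

def objective(X, Y, P, G, H, lam):
    """Regularized squared error of the bilinear prediction X G H^T Y^T."""
    resid = P - X @ G @ H.T @ Y.T
    return float((resid ** 2).sum() + lam * ((G ** 2).sum() + (H ** 2).sum()))
```

The score matrix is then `X @ G @ H.T @ Y.T`; ranking its entries at positions where P is 0 gives the candidate drug-target pairs. Because each update is an exact minimizer, the objective cannot increase from one iteration to the next.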

Compared with the prior art, the invention has the following advantages:

1. The invention predicts drug target interactions by integrating multi-type data, which describes the similarity among drugs and among targets from multiple angles, avoids the bias of existing methods that use single-type data, and improves the accuracy of drug target interaction prediction.

2. The invention integrates the multilayer similarity networks with a multilayer network representation learning method, and the denoising property of the autoencoder reduces the influence of noise in the multilayer networks, improving the accuracy of drug target interaction prediction.

3. The invention generates the drug and protein feature vectors with a multilayer network representation learning method, which captures nonlinear transformation features of the network topology and yields high-quality drug and protein feature vectors.

4. The invention trains on the data with the multilayer network representation learning method to predict drug target interactions, which guards against overfitting the training data through an excess of parameters and improves the accuracy of drug target interaction prediction.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a graph comparing the accuracy of the present invention in predicting drug target interactions with the prior art multi-layer network representation learning method MDA.

Detailed Description

The following detailed description of specific embodiments and effects of the present invention is provided in conjunction with the accompanying drawings:

referring to fig. 1, the implementation steps of this example are as follows:

step 1, downloading data of drugs and proteins, and constructing a drug similarity network and a protein similarity network.

1.1) Downloading drug chemical structure data: databases related to drug chemical structures include the DrugBank database, the CTD database, and others; this embodiment uses, but is not limited to, the DrugBank database to construct the drug chemical structure similarity network D^ch:

1.1.1) downloading the chemical structure data CH₈₈₂ of 882 drugs from the DrugBank database;

1.1.2) based on the drug chemical structure data CH₈₈₂, obtaining the SMILES chemical structural formula feature vectors of the drugs with the R package rcdk;

1.1.3) based on the SMILES chemical structural formula feature vectors of the drugs, obtaining the drug chemical structure similarity network D^ch with the R package fingerprint, any element [D^ch]_ij of which is calculated as:

where A and B are the chemical structure feature vectors of two drugs, · denotes the inner product of vectors, and ||·||² denotes the squared norm of a vector;
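The similarity formula itself is not reproduced in the text above. One similarity that uses only the inner product and the squared norm, matching the symbols defined, is the Tanimoto coefficient A·B / (||A||² + ||B||² - A·B). The Python sketch below computes it for all pairs of feature vectors; treating Tanimoto as the intended formula is an assumption, and `tanimoto_matrix` is a hypothetical name.

```python
import numpy as np

def tanimoto_matrix(features):
    """Pairwise Tanimoto similarity between the rows of `features`.

    sim[i, j] = A.B / (||A||^2 + ||B||^2 - A.B) for rows A and B;
    pairs with a zero denominator (two all-zero rows) get similarity 0.
    """
    F = np.asarray(features, dtype=float)
    dot = F @ F.T                            # inner products A.B
    sq = np.diag(dot)                        # squared norms ||A||^2
    denom = sq[:, None] + sq[None, :] - dot
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(denom > 0, dot / denom, 0.0)
    return sim
```

For binary fingerprints this reduces to the number of shared bits divided by the number of bits set in either vector, so two identical fingerprints have similarity 1.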

1.2) Downloading drug side effect data: databases related to drug side effects include the SIDER database, the CTD database, and others; this embodiment uses, but is not limited to, the SIDER database to construct the drug side effect similarity network D^se:

1.2.1) downloading 882 drugs and the 5439 side effects corresponding to them from the SIDER database to obtain the drug-side effect matrix M^se, which has 882 rows and 5439 columns, where rows represent drugs and columns represent side effects;

1.2.2) taking each row of the drug-side effect matrix M^se as the side effect feature of one drug, obtaining the side effect feature set of the drugs:

1.2.3) based on the drug side effect feature set, obtaining the drug side effect similarity network D^se, any element of which is calculated as follows:

where F_i^se represents the side effect feature of the i-th drug, F_j^se represents the side effect feature of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882;

1.3) downloading drug disease data: drug disease databases include the DrugBank and CTD databases, and the like; this embodiment uses, but is not limited to, CTD to construct the drug disease similarity network D^di:

1.3.1) downloading 882 drugs and the 6902 diseases associated with them from the CTD database to obtain the drug disease matrix M^di, which has 882 rows and 6902 columns, where rows represent drugs and columns represent diseases;

1.3.2) taking each row of the drug disease matrix M^di as the disease feature of one drug to obtain the disease feature set of the drugs:

1.3.3) based on the drug disease feature set, obtaining the drug disease similarity network D^di, where any element of the network is calculated as follows:

where F_i^di represents the disease feature of the i-th drug, F_j^di represents the disease feature of the j-th drug, i = 1, 2, 3, ..., 882, and j = 1, 2, 3, ..., 882;

1.4) downloading drug interaction data: drug interaction databases include the DrugBank and CTD databases, and the like; this embodiment uses, but is not limited to, DrugBank to construct the drug interaction network D^dr:

1.4.1) downloading the interactions among the 882 drugs from the DrugBank database, and then determining whether an interaction exists between drug i and drug j:

if so, [D^dr]_ij = 1; otherwise, [D^dr]_ij = 0;

finally obtaining the drug interaction network D^dr;

1.5) downloading protein interaction data: protein interaction databases include the DrugBank and HPRD databases, and the like; this embodiment uses, but is not limited to, HPRD to construct the protein interaction network T^pr:

1.5.1) downloading the interactions among 1449 proteins from the HPRD database, and then determining whether an interaction exists between protein i and protein j:

if so, [T^pr]_ij = 1; otherwise, [T^pr]_ij = 0;

finally obtaining the protein interaction network T^pr;

1.6) downloading protein sequence data: protein sequence databases include the DrugBank and HPRD databases, and the like; this embodiment uses, but is not limited to, HPRD to construct the protein sequence similarity network T^seq:

1.6.1) downloading the sequence data SEQ_1449 of 1449 proteins from the HPRD database, and then calculating the protein sequence similarity network T^seq using the Smith-Waterman algorithm;
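To illustrate 1.6.1), the following is a minimal Smith-Waterman local alignment scorer. It is a simplified sketch: the match, mismatch, and gap scores are hypothetical placeholders (protein alignment normally uses a substitution matrix such as BLOSUM62), the toy sequences are invented, and dividing by the geometric mean of the two self-alignment scores is an assumed normalization that is not stated in this embodiment.

```python
import math
from typing import List


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def normalized_similarity(a: str, b: str) -> float:
    """Alignment score scaled into [0, 1] by the two self-alignment scores (assumed)."""
    denom = math.sqrt(smith_waterman(a, a) * smith_waterman(b, b))
    return smith_waterman(a, b) / denom if denom else 0.0


# Hypothetical short sequences standing in for the 1449 proteins.
sequences = ["MKTAYIAK", "MKTAYLAK", "GGGSGGGS"]
T_seq: List[List[float]] = [[normalized_similarity(p, q) for q in sequences] for p in sequences]
```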

1.7) downloading protein disease data: protein disease databases include the DrugBank and CTD databases, and the like; this embodiment uses, but is not limited to, CTD to construct the protein disease similarity network T^di:

1.7.1) downloading 1449 proteins and the 6902 diseases associated with them from the CTD database to obtain the protein disease matrix M^pdi, which has 1449 rows and 6902 columns, where rows represent proteins and columns represent diseases;

1.7.2) taking each row of the protein disease matrix M^pdi as the disease feature of one protein to obtain the disease feature set of the proteins:

1.7.3) based on the protein disease feature set, obtaining the protein disease similarity network T^di, where any element of the network is calculated as follows:

where F_i^pdi represents the disease feature of the i-th protein, F_j^pdi represents the disease feature of the j-th protein, i = 1, 2, 3, ..., 1449, and j = 1, 2, 3, ..., 1449.

Step 2, constructing the drug target interaction matrix P.

2.1) downloading known drug target interaction data: drug target databases include the DrugBank and CTD databases, and the like; this embodiment uses, but is not limited to, DrugBank, from which 3185 interactions between the 882 drugs and the 1449 proteins are downloaded;

2.2) using these interactions to form the drug target interaction matrix P, in which each row corresponds to a drug and each column to a protein: if drug i and protein j interact, the corresponding entry is P_ij = 1; otherwise, P_ij = 0.
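Step 2.2) amounts to filling a binary matrix from a list of known pairs. A minimal sketch, assuming the downloaded interactions have already been mapped to zero-based (drug index, protein index) pairs; the function name and the toy pairs are hypothetical.

```python
from typing import Iterable, List, Tuple


def interaction_matrix(pairs: Iterable[Tuple[int, int]], n_drugs: int, n_proteins: int) -> List[List[int]]:
    """Binary matrix with P[i][j] = 1 when drug i is known to interact with protein j."""
    P = [[0] * n_proteins for _ in range(n_drugs)]
    for i, j in pairs:
        P[i][j] = 1
    return P


# Hypothetical toy data; the embodiment has 3185 pairs over 882 drugs and 1449 proteins.
known_pairs = [(0, 1), (2, 0), (2, 3)]
P = interaction_matrix(known_pairs, n_drugs=3, n_proteins=4)
```

Entries left at 0 are pairs with no recorded interaction, which is why they remain candidates for prediction rather than confirmed negatives.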

Step 3, calculating the diffusion state matrix corresponding to each of the similarity networks D^ch, D^se, D^di, D^dr, T^pr, T^seq, and T^di.

3.1) calculating the initial state matrix of the drug chemical structure similarity network D^ch:

3.1.1) normalizing the drug chemical structure similarity network D^ch to obtain a probability transition matrix;

3.1.2) applying the random walk with restart (RWR) algorithm to the probability transition matrix to capture the structural features of the network:

3.1.3) summing the vectors obtained at each step to capture the global topology of the network:

where i ∈ [1, 2, ..., 882], T is the set number of steps, and r_i is the vector of node i of the drug chemical structure similarity network;

3.1.4) forming the initial state matrix from the global topology vectors r_i of the 882 nodes obtained in 3.1.3).
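Steps 3.1.1) to 3.1.4) can be sketched with NumPy as below. The update rule used here, r ← (1 − c) · r · B + c · e_i with restart probability c, is the standard form of random walk with restart and is an assumption about the exact formula; the restart value, the number of steps, and the toy similarity matrix are hypothetical.

```python
import numpy as np


def transition_matrix(S):
    """3.1.1) Row-normalise a non-negative similarity matrix into transition probabilities."""
    S = np.asarray(S, dtype=float)
    row_sums = S.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave isolated nodes as all-zero rows
    return S / row_sums


def rwr_initial_state(S, steps=3, restart=0.5):
    """3.1.2)-3.1.4) Run RWR from every node at once and sum the per-step vectors."""
    B = transition_matrix(S)
    n = B.shape[0]
    start = np.eye(n)          # row i is the one-hot start vector e_i of node i
    R = start.copy()
    total = np.zeros((n, n))
    for _ in range(steps):
        R = (1.0 - restart) * (R @ B) + restart * start
        total += R             # 3.1.3) accumulate over the T steps
    return total               # row i is the global topology vector r_i


# Hypothetical 3-node similarity network standing in for the 882-node D^ch.
S_toy = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
M_init = rwr_initial_state(S_toy, steps=3, restart=0.5)
```

Because every per-step vector is a probability distribution, each row of M_init sums to the number of steps when no node is isolated.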

3.2) calculating the diffusion state matrix of the initial state matrix:

Methods for calculating the diffusion state matrix include normalization, random walk, positive pointwise mutual information (PPMI), and the like; this embodiment uses, but is not limited to, the PPMI method, with the following steps:

3.2.1) using the PPMI method to calculate the co-occurrence probability of the nodes of the initial state matrix:

where the two quantities are, respectively, the association probability of nodes i and j in the initial state matrix of the drug chemical structure network and the co-occurrence probability of node i and node j;

3.2.2) forming the diffusion state matrix of the drug chemical structure similarity network from the co-occurrence probabilities between all pairs of nodes.
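The PPMI transform of step 3.2) can be sketched as follows, assuming the usual definition PPMI_ij = max(0, log(M_ij · Σ M / (row_i · col_j))), where row_i and col_j are the row and column sums of the initial state matrix M. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def ppmi(M):
    """Positive pointwise mutual information of a non-negative matrix."""
    M = np.asarray(M, dtype=float)
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)   # shape (n, 1)
    col = M.sum(axis=0, keepdims=True)   # shape (1, n)
    expected = row @ col / total         # value expected if rows and columns were independent
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M / expected)
    pmi[~np.isfinite(pmi)] = 0.0         # log(0) and 0/0 carry no information
    return np.maximum(pmi, 0.0)          # keep only positive associations


# Hypothetical 2-node initial state matrix with all mass on the diagonal.
M_toy = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
D_hat = ppmi(M_toy)
```

Entries that occur no more often than chance would predict are clipped to 0, so a matrix with uniform entries maps to all zeros.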

3.3) repeating steps 3.1) to 3.2) for each of the other similarity networks to obtain the diffusion state matrices of the drug interaction network D^dr, the drug disease similarity network D^di, the drug side effect similarity network D^se, the protein sequence similarity network T^seq, the protein disease similarity network T^di, and the protein interaction network T^pr.

Step 4, using a multilayer network representation learning method to integrate the drug multilayer diffusion state matrices into the drug feature vector X and the protein multilayer diffusion state matrices into the target feature vector Y:

Multilayer network representation learning methods include the multimodal deep autoencoder (MDA) and multilayer network embedding (MNE); this embodiment uses, but is not limited to, an improved MDA method to integrate the multilayer diffusion state matrices into feature vectors, with the following steps:

4.1) integrating the drug multilayer diffusion state matrices with the improved MDA method to obtain the drug feature vector matrix X:

4.1.1) calculating the nonlinear encoding embedding of every drug diffusion state matrix:

where the symbols denote, in order, the embedding feature, the encoding weight matrix, and the encoding bias matrix, and σ is the sigmoid activation function;

4.1.2) concatenating all embedding features and calculating the common feature H_c,1 of the integrated network:

where [·] denotes the splicing (concatenation) of the embedding features, W_c is the splicing weight matrix, and B_c is the splicing bias matrix;

4.1.3) applying L layers of encoding transformation to the common feature H_c,1, where the transformed encoding feature H_c,p+1 is:

H_c,p+1 = σ(W_p H_c,p + B_p),

where p ∈ {1, ..., L} indexes the encoding layers, W_p is the encoding transformation weight matrix, and B_p is the encoding transformation bias matrix;

4.1.4) applying L layers of decoding transformation to the transformed encoding feature H_c,L+1, where the transformed decoding feature H_c,p+L+1 is:

H_c,p+L+1 = σ(W_p,1 H_c,p+L + B_p,1),

where p ∈ {1, ..., L} indexes the decoding layers, W_p,1 is the decoding transformation weight matrix, and B_p,1 is the decoding transformation bias matrix;

4.1.5) from the decoded feature H_c,2L+1, calculating the decoding embedding of the diffusion state matrix of each layer of the drug network:

where the symbols denote, in order, the decoding embedding feature of each layer's diffusion state matrix, the decoding weight matrix, and the decoding bias matrix;

4.1.6) using the decoding embedding features to reconstruct the diffusion state matrix of each layer of the drug network:

where the symbols denote, in order, the reconstruction weight matrix and the reconstruction bias matrix;

4.1.7) minimizing the difference θ between the original diffusion state matrices and the reconstructed diffusion state matrices:

where l(·) is the sample-wise binary cross-entropy function, and the remaining symbols denote, in order, the minimization operator and the regularization term over all parameter matrices of the encoding and decoding process;

4.1.8) once θ has been minimized, the difference between the reconstructed and original diffusion state matrices is smallest; the encoding feature H_c,L+1 of the corresponding middle layer then captures the topology of the drugs in every layer of the similarity network and can represent the features of the drugs, so the middle-layer encoding feature H_c,L+1 is taken as the drug feature vector X;
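The data flow of 4.1.1) to 4.1.7) can be illustrated with an untrained forward pass in NumPy. This is a sketch under several assumptions: L = 1, all layer sizes are hypothetical, the weights are random rather than learned, the inputs are random stand-ins scaled into [0, 1], the regularization term is omitted, and rows are samples so each layer computes σ(H W + B). Training the weights to minimize θ would normally be done with a deep learning framework and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dense(n_in, n_out):
    """Random weight matrix and zero bias for one layer (untrained placeholder)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


def layer(H, params):
    W, B = params
    return sigmoid(H @ W + B)


def bce(target, recon, eps=1e-12):
    """Mean binary cross-entropy between a matrix and its reconstruction."""
    recon = np.clip(recon, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(recon) + (1 - target) * np.log(1 - recon)))


n, k_nets, d_embed, d_common, d_code = 6, 4, 5, 8, 3
# Toy stand-ins for the four drug diffusion state matrices.
nets = [rng.random((n, n)) for _ in range(k_nets)]

enc = [dense(n, d_embed) for _ in nets]           # 4.1.1) per-network encoders
common = dense(k_nets * d_embed, d_common)        # 4.1.2) layer applied after concatenation
to_code = dense(d_common, d_code)                 # 4.1.3) with L = 1
from_code = dense(d_code, d_common)               # 4.1.4) with L = 1
dec = [dense(d_common, d_embed) for _ in nets]    # 4.1.5) per-network decoders
rec = [dense(d_embed, n) for _ in nets]           # 4.1.6) reconstruction layers


def forward(inputs):
    H = [layer(M, p) for M, p in zip(inputs, enc)]
    H_c1 = layer(np.concatenate(H, axis=1), common)
    code = layer(H_c1, to_code)                   # H_c,L+1: the middle-layer feature
    H_dec = layer(code, from_code)                # H_c,2L+1
    recon = [layer(layer(H_dec, d), r) for d, r in zip(dec, rec)]
    theta = sum(bce(M, R) for M, R in zip(inputs, recon))  # 4.1.7) without regularization
    return code, recon, theta


X, recon, theta = forward(nets)
```

After training, the rows of X would serve as the drug feature vectors; running the same architecture on the protein diffusion state matrices would give Y.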

4.2) integrating the protein multilayer diffusion state matrices with the improved MDA method to obtain the target feature vector Y:

this step is implemented in the same way as 4.1), except that the protein multilayer diffusion state matrices are supplied as input; repeating 4.1.1) to 4.1.8) yields the target feature vector Y.

Step 5, obtaining a drug target interaction prediction matrix based on the drug feature vector X, the target feature vector Y and the known drug target interaction data P

5.1) putting the drug feature vector X into a drug target space Z to obtain a projection vector XG of X in the Z space, wherein G is a drug feature vector conversion matrix;

5.2) putting the target feature vector Y into the drug target space Z to obtain the projection vector YH of Y in the Z space, where H is the target feature vector conversion matrix;

5.3) transposing the target projection vector YH to give (YH)^T and multiplying it by the drug projection vector XG to obtain the drug target interaction prediction matrix XG(YH)^T = XGH^TY^T,

where H^T is the transpose of the H matrix and Y^T is the transpose of the Y matrix;

5.4) using the known drug target interactions P as supervision information, minimizing the difference between the matrix P and the prediction matrix by the alternating minimization method:

where the symbols denote, in order, the feature vector of the j-th column of the target feature matrix Y^T and the Frobenius norm of a matrix; λ is a penalty parameter: the larger the Frobenius norms of the conversion matrices G and H, the larger the penalty, and vice versa;

5.5) the prediction matrix with the smallest difference from the known drug target interaction matrix P is the final drug target interaction prediction matrix.
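Steps 5.1) to 5.5) can be sketched as alternating minimization of the assumed objective ‖P − X G H^T Y^T‖_F² + λ(‖G‖_F² + ‖H‖_F²), which matches the description of a Frobenius-norm difference with a λ penalty on G and H but is not a verbatim copy of the formula in this embodiment. With H fixed, the optimal G satisfies (A^T A) G (B^T B) + λG = A^T P B with A = X and B = YH, which the sketch solves exactly in the eigenbases of the two Gram matrices; the H update is symmetric. The dimension k, the value of λ, the iteration count, and the toy data are hypothetical.

```python
import numpy as np


def solve_factor(A, B, P, lam):
    """Exact minimiser over G of ||P - A G B^T||_F^2 + lam * ||G||_F^2."""
    s, U = np.linalg.eigh(A.T @ A)       # A^T A = U diag(s) U^T
    t, V = np.linalg.eigh(B.T @ B)       # B^T B = V diag(t) V^T
    C = U.T @ (A.T @ P @ B) @ V
    return U @ (C / (np.outer(s, t) + lam)) @ V.T


def objective(P, X, Y, G, H, lam):
    resid = P - X @ G @ H.T @ Y.T
    return float(np.sum(resid ** 2) + lam * (np.sum(G ** 2) + np.sum(H ** 2)))


def fit_bilinear(P, X, Y, k=2, lam=0.1, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.normal(scale=0.1, size=(Y.shape[1], k))
    history = []
    for _ in range(iters):
        G = solve_factor(X, Y @ H, P, lam)      # fix H, update G
        H = solve_factor(Y, X @ G, P.T, lam)    # fix G, update H
        history.append(objective(P, X, Y, G, H, lam))
    return G, H, history


# Hypothetical toy data: 5 drugs and 4 targets, each with 3-dimensional features.
rng = np.random.default_rng(1)
X = rng.random((5, 3))
Y = rng.random((4, 3))
P = (rng.random((5, 4)) < 0.3).astype(float)

G, H, history = fit_bilinear(P, X, Y)
scores = X @ G @ H.T @ Y.T   # predicted drug target interaction scores
```

Each update solves its subproblem exactly, so the objective cannot increase from one iteration to the next; this is a convenient sanity check when experimenting with λ and k.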

The technical effects of the invention are further explained by combining simulation experiments as follows:

1. Simulation conditions:

Simulation experiments were run with Python 3.6.5 on an Ubuntu platform with an Intel(R) Core(TM) i7-8700K CPU (clock frequency 3.70 GHz) and 48 GB of memory.

2. Simulation content:

Simulation 1: on the same data set, drug target interactions were predicted with the method of the invention and with the existing multilayer network representation learning method MDA, and the prediction accuracy of each was calculated. The results are shown in FIG. 2, where AUROC is the area under the receiver operating characteristic (ROC) curve and AUPR is the area under the precision-recall curve; both are measures of prediction accuracy, and larger values indicate higher accuracy.
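For reference, both measures can be computed from a list of true labels and predicted scores without any library. The sketch below computes AUROC as the fraction of (positive, negative) pairs ranked correctly, counting ties as half, and estimates AUPR by average precision, which is one common estimator and is an assumption about how the figures in this embodiment were obtained. The labels and scores are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels (1 = known interaction) and predicted scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
roc_value = auroc(labels, scores)
pr_value = average_precision(labels, scores)
```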

FIG. 2 shows that the invention can effectively improve the prediction accuracy of drug target interaction.

Simulation 2: potential drug target interactions were predicted by integrating the multiple types of drug and protein data with the method of the invention, giving predicted interaction scores between the 882 drugs and the 1449 targets. The predictions were ranked, and the 8 top-ranked unknown drug target pairs were taken for verification, with the results shown in Table 1.

TABLE 1

In Table 1, √ indicates that the drug target interaction has been verified by the method shown in that column.

Table 1 shows that, of the 8 predicted drug target interactions, those ranked 31, 41, and 43 were verified in all three ways (literature verification, disease association, and pathway enrichment analysis), so these 3 predictions are the most reliable; the other 5 interactions were each verified in some of the ways and are therefore also reasonably reliable. These results demonstrate the prediction accuracy and reliability of the invention.
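Selecting candidate pairs from the prediction matrix can be sketched as follows: pairs already recorded in P are skipped and the remaining pairs are sorted by score. This is an illustrative reading of the ranking step; the function name, the toy scores, and the choice to filter known pairs before ranking are assumptions.

```python
from typing import List, Sequence, Tuple


def top_unknown_pairs(scores: Sequence[Sequence[float]],
                      P: Sequence[Sequence[int]],
                      k: int = 8) -> List[Tuple[int, int, float]]:
    """Return the k highest-scoring (drug, target, score) triples with P[i][j] == 0."""
    candidates = [
        (scores[i][j], i, j)
        for i in range(len(P))
        for j in range(len(P[0]))
        if P[i][j] == 0
    ]
    candidates.sort(key=lambda item: -item[0])
    return [(i, j, s) for s, i, j in candidates[:k]]


# Hypothetical 2-drug x 3-target example; the pair (0, 0) is already known.
toy_scores = [[0.9, 0.2, 0.7], [0.4, 0.8, 0.1]]
toy_P = [[1, 0, 0], [0, 0, 0]]
top2 = top_unknown_pairs(toy_scores, toy_P, k=2)
```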
