End-to-end learning-based compound and protein interaction and affinity prediction method

Document No.: 1923592 · Publication date: 2021-12-03

Reading note: this technology, an end-to-end learning-based compound and protein interaction and affinity prediction method (一种基于端到端学习的化合物和蛋白质相互作用与亲和力预测方法), was designed by Li Min (李敏) and Lu Changli (卢长利) on 2021-09-06. Main content: The invention discloses a method for predicting the interaction and affinity of a compound and a protein based on end-to-end learning, comprising: converting the molecular formula of the compound into an atom adjacency graph and learning a characterization vector for each atom of the compound with a graph attention network; segmenting the amino acid sequence of the protein into a residue sequence and learning characterization vectors of the residues with a convolutional neural network model; constructing a bidirectional attention network model to fuse the characterization vectors of all atoms and residues into a compound feature vector and a protein feature vector; and predicting the interaction and the affinity between the compound and the protein with neural networks based on the compound and protein feature vectors. The method can be used both to predict whether a compound and a protein interact and to predict the binding affinity between them, with good prediction accuracy.

1. A method for predicting compound and protein interactions and affinities based on end-to-end learning, comprising:

obtaining a molecular formula of a compound, converting the molecular formula into an atom adjacency graph, taking the atom adjacency graph and randomly initialized atom characterization vectors as the input of a graph attention network model, and updating and learning to obtain the characterization vectors of all atoms in the compound;

acquiring an amino acid sequence of a protein, extracting residues with fixed length from the amino acid sequence by adopting a sliding window method, and updating and learning a randomly initialized residue characterization vector by using a convolutional neural network model;

calculating attention coefficients of each atom for the residue and each residue for the atom in two directions by the constructed bidirectional attention network model according to the characterization vectors of all atoms in the compound and the characterization vectors of all residues in the protein; then, performing weighted fusion on all the atom characterization vectors and all the residue characterization vectors by using the obtained attention coefficients to obtain fused compound feature vectors and protein feature vectors;

performing outer product operation on the compound characteristic vector and the protein characteristic vector, and expanding an operation result into a one-dimensional column vector which is used as the input of a first neural network model and is used for predicting whether interaction exists between the compound and the protein; for the sample with interaction, a one-dimensional column vector expanded by the operation result of the outer product is used as the input of a second neural network model for predicting the affinity magnitude between the compound and the protein.

2. The method of claim 1, wherein the RDKit tool is used to convert the molecular formula of the compound into an atom adjacency graph $G = \{V, E\}$; where $V$ is the set of nodes of the atom adjacency graph, the nodes corresponding one-to-one to the atoms of the compound, and $v_i \in V$ represents the $i$-th atom of the compound; $E$ is the set of edges of the atom adjacency graph, and $e_{ij} \in E$ represents the chemical bond between the $i$-th atom and the $j$-th atom.

3. The method according to claim 1, wherein the atom adjacency graph and the randomly initialized atom feature vectors are used as input of a graph attention network model, and feature vectors of all atoms in the compound are obtained through updating and learning, specifically:

A1, calculating, according to the attention formula of the graph attention network model, the attention coefficient $\alpha_{ij}$ between every two atoms $v_i$, $v_j$, the inputs of the formula being the randomly initialized characterization vectors of atoms $v_i$ and $v_j$ and the attention parameters of the graph attention network model;

A2, for each atom $v_i$ of the compound, updating the characterization vector of atom $v_i$ by weighted summation over the characterization vectors of all its neighbor nodes $v_j$, each weighted by the attention coefficient $\alpha_{ij}$ between $v_i$ and that neighbor node, the summation being taken over $j \in N_i$; wherein $N_i$ is the set of all neighbor nodes of atom $v_i$, and every atom of the compound that is chemically bonded to atom $v_i$ is a neighbor node of $v_i$.

4. The method of claim 3, wherein steps A1-A2 are repeated K times, and the K characterization vectors obtained for each atom are fused to obtain the final characterization vector of each atom of the compound.

5. The method of claim 1, wherein each extracted residue comprises 3 adjacent amino acids of the amino acid sequence, so that the residue sequence $R = \{r_1, r_2, \ldots, r_l\}$ is extracted from the amino acid sequence of the protein $S = \{s_1, s_2, \ldots, s_m\}$; wherein $s_i$ ($i = 1, 2, \ldots, m$) represents the $i$-th amino acid of the protein, $r_i$ ($i = 1, 2, \ldots, l$) represents the $i$-th residue of the residue sequence $R$, and $l = m - 2$.

6. The method of claim 1, wherein the attention coefficients in both directions, of each atom for the residues and of each residue for the atoms, are calculated by the constructed bidirectional attention network model according to the characterization vectors of all atoms in the compound and the characterization vectors of all residues in the protein; and all the atom characterization vectors and all the residue characterization vectors are then fused by weighting with the obtained attention coefficients to obtain the fused compound feature vector and the fused protein feature vector, specifically comprising the following steps:

B1, converting the atom characterization vectors of the compound and the residue characterization vectors of the protein into a uniform vector dimension $d$, and representing them as a compound feature matrix $C$ and a protein feature matrix $P$, respectively;

B2, fusing the compound feature matrix $C$ and the protein feature matrix $P$ to obtain an interaction matrix $A$, calculated as:

$$A = C U P^{T}$$

wherein $U$ is a parameter matrix for fusing the features of the compound and the protein, $U \in \mathbb{R}^{d \times d}$;

B3, calculating the compound information $I_c$ transferred from residues to atoms and the protein information $I_p$ transferred from atoms to residues, calculated as:

$$I_c = A P W_{r2a}$$

$$I_p = A P W_{a2r}$$

wherein $W_{r2a}$ and $W_{a2r}$ are the parameters for calculating the information transfer in the two directions, $W_{r2a} \in \mathbb{R}^{d \times d}$, $W_{a2r} \in \mathbb{R}^{d \times d}$;

B4, calculating the attention coefficient $\alpha_{a2r}$ of atoms to residues and the attention coefficient $\alpha_{r2a}$ of residues to atoms, calculated as:

$$\alpha_{a2r} = [C W_c \,\|\, I_c]\, a_{a2r}$$

$$\alpha_{r2a} = [P W_p \,\|\, I_p]\, a_{r2a}$$

wherein $W_c$ and $W_p$ are the parameters of the spatial transformation of the compound and protein vectors, respectively, $W_c \in \mathbb{R}^{d \times d}$, $W_p \in \mathbb{R}^{d \times d}$; $\|$ denotes vector concatenation; $a_{a2r}$ and $a_{r2a}$ are the parameters for calculating the attention coefficients in the two directions, $a_{a2r} \in \mathbb{R}^{d \times d}$, $a_{r2a} \in \mathbb{R}^{d \times d}$;

B5, fusing the characterization vectors of the compound atoms and of the protein residues, weighted by the corresponding attention coefficients, to obtain the compound feature vector and the protein feature vector.

7. The method of claim 5, wherein steps B1-B5 are repeated L times, the compound feature vector and the protein feature vector obtained in each repetition being the result of one independent bidirectional attention network model, and the results of the L independent bidirectional attention network models are fused to obtain the final compound feature vector and the final protein feature vector.

8. The method of claim 5, wherein, before the compound feature vector and the protein feature vector are obtained by attention-weighted fusion in step B5, the attention coefficients $\alpha_{a2r}$ and $\alpha_{r2a}$ calculated in step B4 are each normalized, and the compound feature vector and the protein feature vector are then calculated by weighted fusion in step B5.

9. The method of claim 1, wherein the first neural network adopts a two-class neural network structure, and the training sample label has only two values of 1 and 0, which respectively represent the presence and absence of interaction; the second neural network adopts a regression analysis type neural network structure, and values of all training sample labels cover the whole affinity value range.

Technical Field

The invention belongs to the field of drug prediction and analysis, and particularly relates to a method for predicting the interaction and affinity of a compound and a protein based on end-to-end learning.

Background

In drug development, identifying the target proteins associated with a specific disease is the foundation, and finding compound molecules that can interact with a specific target protein is the key. A target is a biological macromolecule that is closely related to the occurrence of a disease in vivo and can bind specifically to a drug to produce a therapeutic effect; targets mainly include receptors, nucleic acids, genes and the like. The compound molecules in a drug cure or relieve the corresponding disease by regulating the biological activity of the target. The interaction between a drug compound and a target protein is in fact a specific binding relationship, and the strength of that binding is referred to as the binding affinity. Identifying compound-protein interactions and determining the binding affinity between the two are therefore key steps in drug development. Conventional experimental methods for doing so suffer from long experimental cycles and high cost, and cannot be applied on a large scale. Effective computational methods for predicting compound-protein interactions and binding affinities can reduce the amount of expensive, time-consuming and blind biochemical experimentation by focusing it on a smaller set of more promising compound molecules and target proteins, thereby greatly shortening the drug development cycle, lowering development costs, and reducing the risk of development failure.
With the continuous progress of genomics, proteomics, systems biology and related technologies, data on compounds and proteins have grown explosively, providing massive data resources for data-driven computational methods.

Traditional computational methods can analyze the binding mode of a compound-protein interaction and calculate the binding affinity; they mainly include ligand-based methods, structure-based methods and molecular dynamics models. Each has limitations: ligand-based methods are limited by the number of known ligands of the target, structure-based methods rely heavily on three-dimensional structural data of the target protein, and molecular dynamics models are limited by high computational cost. Most current mainstream computational methods focus on binary prediction of compound-protein interaction, that is, predicting whether a given compound and protein interact, and neglect the important information about interaction strength, namely the magnitude of the binding affinity. Although some methods for predicting compound-protein binding affinity exist, their biological interpretability and prediction accuracy still need to be improved.

Disclosure of Invention

The invention provides a method for predicting the interaction and affinity of a compound and a protein based on end-to-end learning, which can be used for predicting the interaction of the compound and the protein and predicting the binding affinity between the compound and the protein, and has better biological interpretability and prediction accuracy.

In order to achieve the technical purpose, the invention adopts the following technical scheme:

a method for predicting compound and protein interactions and affinities based on end-to-end learning, comprising:

obtaining a molecular formula of a compound, converting the molecular formula into an atom adjacency graph, taking the atom adjacency graph and randomly initialized atom characterization vectors as the input of a graph attention network model, and updating and learning to obtain the characterization vectors of all atoms in the compound;

acquiring an amino acid sequence of a protein, extracting residues with fixed length from the amino acid sequence by adopting a sliding window method, and updating and learning a randomly initialized residue characterization vector by using a convolutional neural network model;

calculating attention coefficients of each atom for the residue and each residue for the atom in two directions by the constructed bidirectional attention network model according to the characterization vectors of all atoms in the compound and the characterization vectors of all residues in the protein; then, performing weighted fusion on all the atom characterization vectors and all the residue characterization vectors by using the obtained attention coefficients to obtain fused compound feature vectors and protein feature vectors;

performing outer product operation on the compound characteristic vector and the protein characteristic vector, and expanding an operation result into a one-dimensional column vector which is used as the input of a first neural network model and is used for predicting whether interaction exists between the compound and the protein; for the sample with interaction, a one-dimensional column vector expanded by the operation result of the outer product is used as the input of a second neural network model for predicting the affinity magnitude between the compound and the protein.

In a more preferred embodiment, the RDKit tool is used to convert the molecular formula of the compound into an atom adjacency graph $G = \{V, E\}$; where $V$ is the set of nodes of the atom adjacency graph, the nodes corresponding one-to-one to the atoms of the compound, and $v_i \in V$ represents the $i$-th atom of the compound; $E$ is the set of edges of the atom adjacency graph, and $e_{ij} \in E$ represents the chemical bond between the $i$-th atom and the $j$-th atom.

In a more preferred technical solution, the atom adjacency graph and the randomly initialized atom characterization vectors are used as inputs of a graph attention network model, and the characterization vectors of all atoms in the compound are obtained through updating and learning, specifically:

A1, calculating, according to the attention formula of the graph attention network model, the attention coefficient $\alpha_{ij}$ between every two atoms $v_i$, $v_j$, the inputs of the formula being the randomly initialized characterization vectors of atoms $v_i$ and $v_j$ and the attention parameters of the graph attention network model;

A2, for each atom $v_i$ of the compound, updating the characterization vector of atom $v_i$ by weighted summation over the characterization vectors of all its neighbor nodes $v_j$, each weighted by the attention coefficient $\alpha_{ij}$ between $v_i$ and that neighbor node, the summation being taken over $j \in N_i$; wherein $N_i$ is the set of all neighbor nodes of atom $v_i$, and every atom of the compound that is chemically bonded to atom $v_i$ is a neighbor node of $v_i$.

In a more preferable technical scheme, steps A1-A2 are repeated K times, and the K characterization vectors obtained for each atom are fused to obtain the final characterization vector of each atom of the compound.

In a more preferred embodiment, each extracted residue comprises 3 adjacent amino acids of the amino acid sequence, so that the residue sequence $R = \{r_1, r_2, \ldots, r_l\}$ is extracted from the amino acid sequence of the protein $S = \{s_1, s_2, \ldots, s_m\}$; wherein $s_i$ ($i = 1, 2, \ldots, m$) represents the $i$-th amino acid of the protein, $r_i$ ($i = 1, 2, \ldots, l$) represents the $i$-th residue of the residue sequence $R$, and $l = m - 2$.

In a more preferred technical scheme, the attention coefficients of each atom for the residue and each residue for the atom in two directions are calculated according to the characterization vectors of all atoms in the compound and the characterization vectors of all residues in the protein through a constructed bidirectional attention network model; and then, respectively carrying out weighted fusion on all the atom characterization vectors and all the residue characterization vectors by using the obtained attention coefficients to obtain fused compound feature vectors and fused protein feature vectors, which specifically comprise the following steps:

B1, converting the atom characterization vectors of the compound and the residue characterization vectors of the protein into a uniform vector dimension $d$, and representing them as a compound feature matrix $C$ and a protein feature matrix $P$, respectively;

B2, fusing the compound feature matrix $C$ and the protein feature matrix $P$ to obtain an interaction matrix $A$, calculated as:

$$A = C U P^{T}$$

wherein $U$ is a parameter matrix for fusing the features of the compound and the protein, $U \in \mathbb{R}^{d \times d}$;

B3, calculating the compound information $I_c$ transferred from residues to atoms and the protein information $I_p$ transferred from atoms to residues, calculated as:

$$I_c = A P W_{r2a}$$

$$I_p = A P W_{a2r}$$

wherein $W_{r2a}$ and $W_{a2r}$ are the parameters for calculating the information transfer in the two directions, $W_{r2a} \in \mathbb{R}^{d \times d}$, $W_{a2r} \in \mathbb{R}^{d \times d}$;

B4, calculating the attention coefficient $\alpha_{a2r}$ of atoms to residues and the attention coefficient $\alpha_{r2a}$ of residues to atoms, calculated as:

$$\alpha_{a2r} = [C W_c \,\|\, I_c]\, a_{a2r}$$

$$\alpha_{r2a} = [P W_p \,\|\, I_p]\, a_{r2a}$$

wherein $W_c$ and $W_p$ are the parameters of the spatial transformation of the compound and protein vectors, respectively, $W_c \in \mathbb{R}^{d \times d}$, $W_p \in \mathbb{R}^{d \times d}$; $\|$ denotes vector concatenation; $a_{a2r}$ and $a_{r2a}$ are the parameters for calculating the attention coefficients in the two directions, $a_{a2r} \in \mathbb{R}^{d \times d}$, $a_{r2a} \in \mathbb{R}^{d \times d}$;

B5, fusing the characterization vectors of the compound atoms and of the protein residues, weighted by the corresponding attention coefficients, to obtain the compound feature vector and the protein feature vector.

In a more preferred embodiment, steps B1-B5 are repeated L times, the compound feature vector and the protein feature vector obtained in each repetition being the result of one independent bidirectional attention network model, and the results of the L independent bidirectional attention network models are fused to obtain the final compound feature vector and the final protein feature vector.

In a more preferred embodiment, before the compound feature vector and the protein feature vector are obtained by attention-weighted fusion in step B5, the attention coefficients $\alpha_{a2r}$ and $\alpha_{r2a}$ calculated in step B4 are each normalized, and the compound feature vector and the protein feature vector are then calculated by weighted fusion in step B5.

In a more preferred technical scheme, the first neural network adopts a neural network structure of two classes, and training sample labels only have two values of 1 and 0, which respectively represent the existence of interaction and the absence of interaction; the second neural network adopts a regression analysis type neural network structure, and values of all training sample labels cover the whole affinity value range.

Advantageous effects

Compared with the prior art, the method for predicting the interaction and affinity of a compound and a protein based on end-to-end learning provided by the invention has the following beneficial effects: it can predict both whether a compound and a protein interact and the binding affinity between them; fusing the characterization vectors of all atoms in the compound with the characterization vectors of all residues in the protein through a bidirectional attention network model improves the biological interpretability of the prediction method; extensive experiments show that the method achieves better prediction accuracy in both interaction prediction and binding affinity prediction; and the method can assist virtual drug screening and drug repositioning, reduce blind experimental work, save the time and cost of drug research and development, and relieve the pressure of drug research and development.

Drawings

FIG. 1 is a flow chart of a prediction method of the present invention;

FIG. 2 is a graph comparing AUC and AUPR values on a human data set for the present invention and a comparative method;

FIG. 3 is a graph comparing AUC and AUPR values on the C. elegans data set for the present invention and the comparative methods;

FIG. 4 is a graph comparing RMSE and PCC values on the binding affinity data sets for the present invention and the comparative methods.

Detailed Description

In order to make the above objects, features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the drawings. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; it is capable of modification in various respects without departing from its spirit and scope.

As shown in fig. 1, the embodiment of the present invention specifically discloses a method for predicting the interaction and affinity of a compound and a protein based on end-to-end learning, which comprises the following steps:

step 1, obtaining a molecular formula of a compound, converting the molecular formula into an atom adjacency graph, taking the atom adjacency graph and randomly initialized atom characterization vectors as input of a graph attention network model, and updating and learning to obtain the characterization vectors of all atoms in the compound.

Specifically, the RDKit tool can be used to convert the molecular formula of the compound into an atom adjacency graph $G = \{V, E\}$; where $V$ is the set of nodes of the atom adjacency graph, the nodes corresponding one-to-one to the atoms of the compound, and $v_i \in V$ represents the $i$-th atom of the compound; $E$ is the set of edges of the atom adjacency graph, and $e_{ij} \in E$ represents the chemical bond between the $i$-th atom and the $j$-th atom.
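
The graph structure can be illustrated with a minimal sketch in plain Python. The atom list and bond list for ethanol are hard-coded here as an assumption; in the method they would be produced by the RDKit tool. The function name `build_adjacency` is hypothetical and serves only for illustration.

```python
from collections import defaultdict

def build_adjacency(num_atoms, bonds):
    """Return {atom index: sorted list of bonded atom indices} for G = {V, E}."""
    neighbors = defaultdict(set)
    for i, j in bonds:
        if not (0 <= i < num_atoms and 0 <= j < num_atoms):
            raise ValueError(f"bond ({i}, {j}) refers to a missing atom")
        # Chemical bonds are undirected, so record the edge in both directions.
        neighbors[i].add(j)
        neighbors[j].add(i)
    return {i: sorted(neighbors[i]) for i in range(num_atoms)}

# Heavy atoms of ethanol (C-C-O), hard-coded for illustration only.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
graph = build_adjacency(len(atoms), bonds)
print(graph)  # {0: [1], 1: [0, 2], 2: [1]}
```

Each key is a node of $V$ and each listed neighbor corresponds to an edge of $E$, which is the neighborhood information the graph attention network needs.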

The graph attention network model can assign different weights to neighbor nodes of each node in the atom adjacency graph, and can extract characteristic information from the neighbor nodes of each atom in the compound to update and learn the characterization vector of each atom in the compound. The method specifically comprises the following steps:

Step A1, calculating, according to the attention formula of the graph attention network model, the attention coefficient $\alpha_{ij}$ between every two atoms $v_i$, $v_j$ ($1 \le i \le n$, $1 \le j \le n$), the inputs of the formula being the randomly initialized characterization vectors of atoms $v_i$ and $v_j$ and the attention parameters of the graph attention network model.

Step A2, for each atom $v_i$ of the compound, updating the characterization vector of atom $v_i$ by weighted summation over the characterization vectors of all its neighbor nodes $v_j$, each weighted by the attention coefficient $\alpha_{ij}$ between $v_i$ and that neighbor node, the summation being taken over $j \in N_i$; wherein $N_i$ is the set of all neighbor nodes of atom $v_i$, and every atom of the compound that is chemically bonded to atom $v_i$ is a neighbor node of $v_i$.

In a more preferred embodiment, the graph attention network model adopts a multi-head attention mechanism: fusing the results of several independent graph attention computations yields more accurate atom characterization vectors. Specifically, steps A1-A2 are repeated K times, and the K characterization vectors obtained for each atom are fused to obtain the final characterization vector of each atom of the compound.
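
Steps A1-A2 and the K-fold repetition can be sketched as follows. This is a sketch under stated assumptions, not the formula of the invention: the scoring function `LeakyReLU(a · [h_i || h_j])` followed by a softmax over the neighbors is borrowed from the standard graph attention network, and the K results are fused by averaging. Both choices, and all function names, are assumptions made for illustration.

```python
import math
import random

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_update(h, neighbors, a):
    """One A1-A2 pass: attention over bonded neighbors, then weighted sum."""
    updated = []
    for i, h_i in enumerate(h):
        nbrs = neighbors[i]
        if not nbrs:                       # isolated atom: keep its vector
            updated.append(list(h_i))
            continue
        # A1 (assumed scoring): unnormalized score from the spliced pair [h_i || h_j].
        scores = [leaky_relu(sum(w * x for w, x in zip(a, h_i + h[j]))) for j in nbrs]
        # Softmax over the neighbors so the coefficients alpha_ij sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # A2: weighted sum of the neighbors' characterization vectors.
        updated.append([sum(al * h[j][k] for al, j in zip(alphas, nbrs))
                        for k in range(len(h_i))])
    return updated

def multi_head_update(h, neighbors, heads):
    """Repeat A1-A2 once per head and average the K results (assumed fusion)."""
    results = [attention_update(h, neighbors, a) for a in heads]
    k = len(results)
    return [[sum(r[i][c] for r in results) / k for c in range(len(h[0]))]
            for i in range(len(h))]

random.seed(0)
dim, K = 4, 3
neighbors = {0: [1], 1: [0, 2], 2: [1]}    # C-C-O chain
h = [[random.gauss(0, 1) for _ in range(dim)] for _ in neighbors]
heads = [[random.gauss(0, 1) for _ in range(2 * dim)] for _ in range(K)]
h_new = multi_head_update(h, neighbors, heads)
```

In this toy chain the first atom has a single neighbor, so its attention coefficient is 1 and its updated vector equals that neighbor's vector; the middle atom receives a convex combination of its two neighbors.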

Step 2, obtaining the amino acid sequence of the protein, extracting residues of fixed length from the amino acid sequence by a sliding window method, and updating and learning the randomly initialized residue characterization vectors with a convolutional neural network model.

Using a sliding window of fixed length 3 and step size 1, residues are extracted in sequence from the amino acid sequence of the protein, $S = \{s_1, s_2, \ldots, s_m\}$. Each extracted residue comprises 3 adjacent amino acids of the amino acid sequence, and all residues together form the residue sequence $R = \{r_1, r_2, \ldots, r_l\}$, where $s_i$ ($i = 1, 2, \ldots, m$) denotes the $i$-th amino acid of the protein, $r_i$ ($i = 1, 2, \ldots, l$) denotes the $i$-th residue of the residue sequence $R$, and $l = m - 2$. For example, a protein with the amino acid sequence "MRPSG…FIGA" is split into the subsequences "MRP", "RPS", "PSG", …, "FIG", "IGA", each of which is a residue. Each residue is randomly initialized and expressed as a vector, its residue characterization vector; all residue characterization vectors of the protein together form a two-dimensional matrix, which is input into the convolutional neural network for convolution and pooling operations. This amounts to computing on and transforming the input residue characterization vectors and learning the intrinsic relationships among them in depth.
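
The sliding-window extraction can be sketched in a few lines of Python. The function name `extract_residues` is hypothetical; the window length 3 and step size 1 follow the description above.

```python
def extract_residues(sequence, window=3, step=1):
    """Split an amino acid sequence into overlapping fixed-length residues."""
    if len(sequence) < window:
        return []
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, step)]

residues = extract_residues("MRPSG")
print(residues)  # ['MRP', 'RPS', 'PSG']
```

For a sequence of m amino acids the function returns m - 2 residues, matching $l = m - 2$.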

The hyper-parameters of the convolutional neural network mainly comprise the number of convolutional layers and the size and number of filters. The learned residue characterization vectors are input into the bidirectional attention network model of step 3.

Step 3, calculating attention coefficients of each atom to the residue and each residue to the atom in two directions through a constructed bidirectional attention network model according to the characterization vectors of all atoms in the compound and the characterization vectors of all residues in the protein; and performing weighted fusion on all the atom characterization vectors and all the residue characterization vectors by using the obtained attention coefficients to obtain fused compound feature vectors and protein feature vectors. The method specifically comprises the following steps:

B1, converting the atom characterization vectors of the compound and the residue characterization vectors of the protein into a uniform vector dimension $d$, and representing them as a compound feature matrix $C$ and a protein feature matrix $P$, respectively;

B2, fusing the compound feature matrix $C$ and the protein feature matrix $P$ to obtain an interaction matrix $A$, calculated as:

$$A = C U P^{T}$$

wherein $U$ is a parameter matrix for fusing the features of the compound and the protein, $U \in \mathbb{R}^{d \times d}$;

B3, calculating the compound information $I_c$ transferred from residues to atoms and the protein information $I_p$ transferred from atoms to residues, calculated as:

$$I_c = A P W_{r2a}$$

$$I_p = A P W_{a2r}$$

wherein $W_{r2a}$ and $W_{a2r}$ are the parameters for calculating the information transfer in the two directions, $W_{r2a} \in \mathbb{R}^{d \times d}$, $W_{a2r} \in \mathbb{R}^{d \times d}$;

B4, calculating the attention coefficient $\alpha_{a2r}$ of atoms to residues and the attention coefficient $\alpha_{r2a}$ of residues to atoms, calculated as:

$$\alpha_{a2r} = [C W_c \,\|\, I_c]\, a_{a2r}$$

$$\alpha_{r2a} = [P W_p \,\|\, I_p]\, a_{r2a}$$

wherein $W_c$ and $W_p$ are the parameters of the spatial transformation of the compound and protein vectors, respectively, $W_c \in \mathbb{R}^{d \times d}$, $W_p \in \mathbb{R}^{d \times d}$; $\|$ denotes vector concatenation; $a_{a2r}$ and $a_{r2a}$ are the parameters for calculating the attention coefficients in the two directions, $a_{a2r} \in \mathbb{R}^{d \times d}$, $a_{r2a} \in \mathbb{R}^{d \times d}$;

B5, fusing the characterization vectors of the compound atoms and of the protein residues, weighted by the corresponding attention coefficients, to obtain the compound feature vector and the protein feature vector.

The above parameters $U$, $W_{r2a}$, $W_{a2r}$, $W_c$, $W_p$, $a_{a2r}$, $a_{r2a}$ and so on are obtained by initialization of the bidirectional attention network model and are updated through learning.

In a more preferred embodiment, the bidirectional attention network model adopts a multi-head attention mechanism: fusing the results of several independent bidirectional attention network models yields more accurate feature vectors of the compound and the protein. Specifically, steps B2-B5 are repeated L times, the compound feature vector and the protein feature vector obtained in each repetition being the result of one independent bidirectional attention network model, and the results of the L independent bidirectional attention network models are fused to obtain the final compound feature vector and the final protein feature vector.

In a more preferred embodiment, the attention coefficients $\alpha_{a2r}$ and $\alpha_{r2a}$ obtained in step B4 are each normalized by a softmax function, and the compound feature vector and the protein feature vector are then calculated by weighted fusion in step B5.
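
Steps B2-B5 with softmax normalization can be sketched for a single attention head in plain Python. Several points are assumptions made so that the shapes agree, and are not taken from the text: $C$ has one row per atom ($n \times d$) and $P$ one row per residue ($l \times d$); the atom-to-residue information is computed as $A^{T} C W_{a2r}$ so that it has one row per residue; the attention parameters $a_{a2r}$ and $a_{r2a}$ are vectors of length $2d$; and step B5 is taken to be the attention-weighted sum of the rows of $C$ and of $P$. All function names are hypothetical.

```python
import math
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(X, W, info, a):
    """[X W || info] a -> softmax -> attention-weighted sum of the rows of X."""
    XW = matmul(X, W)
    spliced = [xw + inf for xw, inf in zip(XW, info)]       # row-wise splice
    scores = [sum(c * w for c, w in zip(row, a)) for row in spliced]
    alpha = softmax(scores)                                 # normalization before B5
    pooled = [sum(al * row[k] for al, row in zip(alpha, X)) for k in range(len(X[0]))]
    return alpha, pooled

def bidirectional_attention(C, P, U, W_r2a, W_a2r, W_c, W_p, a_a2r, a_r2a):
    A = matmul(matmul(C, U), transpose(P))                  # B2: A = C U P^T, n x l
    I_c = matmul(matmul(A, P), W_r2a)                       # B3: residues -> atoms, n x d
    I_p = matmul(matmul(transpose(A), C), W_a2r)            # B3 (assumed form): l x d
    alpha_a2r, c_vec = attention_pool(C, W_c, I_c, a_a2r)   # B4 + B5, compound side
    alpha_r2a, p_vec = attention_pool(P, W_p, I_p, a_r2a)   # B4 + B5, protein side
    return alpha_a2r, alpha_r2a, c_vec, p_vec

random.seed(0)
n, l, d = 3, 5, 4                                           # atoms, residues, dimension
C, P = rand_matrix(n, d), rand_matrix(l, d)
U, W_r2a, W_a2r, W_c, W_p = (rand_matrix(d, d) for _ in range(5))
a_a2r = [random.gauss(0, 0.5) for _ in range(2 * d)]
a_r2a = [random.gauss(0, 0.5) for _ in range(2 * d)]
alpha_a2r, alpha_r2a, c_vec, p_vec = bidirectional_attention(
    C, P, U, W_r2a, W_a2r, W_c, W_p, a_a2r, a_r2a)
```

The sketch yields one normalized coefficient per atom and one per residue, and a compound feature vector and protein feature vector of the same dimension $d$. A multi-head variant would call `bidirectional_attention` L times with independent parameters and fuse the L results.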

Step 4, performing outer product operation on the compound characteristic vector and the protein characteristic vector, and expanding an operation result into a one-dimensional column vector which is used as the input of a first neural network model and is used for predicting whether interaction exists between the compound and the protein; for the sample with interaction, a one-dimensional column vector expanded by the operation result of the outer product is used as the input of a second neural network model for predicting the affinity magnitude between the compound and the protein.

The first neural network adopts a two-classification neural network structure, and the training sample label has only two values of 1 and 0, which respectively represent the existence of interaction and the absence of interaction.

The second neural network adopts a regression analysis type neural network structure, and the values of all the training sample labels cover the whole affinity value range, so that the affinity obtained by actual prediction of a given compound and protein can be any value in the affinity value range.
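
The outer-product step and the two-stage prediction can be sketched as follows. Single linear layers stand in for the first and second neural networks, the weights are made-up numbers, and skipping the affinity prediction when the predicted interaction probability falls below a threshold is an assumption about how the two models are chained at prediction time. All names are hypothetical.

```python
import math

def outer_flatten(c_vec, p_vec):
    """Outer product of the two feature vectors, unrolled row by row."""
    return [c * p for c in c_vec for p in p_vec]

def linear(x, weights, bias):
    return sum(w * v for w, v in zip(weights, x)) + bias

def predict(c_vec, p_vec, clf, reg, threshold=0.5):
    x = outer_flatten(c_vec, p_vec)
    # First model: probability that the compound and the protein interact.
    prob = 1.0 / (1.0 + math.exp(-linear(x, *clf)))
    if prob < threshold:
        return prob, None              # no interaction predicted: skip affinity
    # Second model: regression on the same unrolled outer product.
    return prob, linear(x, *reg)

clf = ([0.1] * 6, -1.0)                # hypothetical (weights, bias) of the classifier
reg = ([0.5] * 6, 0.0)                 # hypothetical (weights, bias) of the regressor
prob, affinity = predict([1.0, 2.0], [3.0, 4.0, 5.0], clf, reg)
```

With feature vectors of dimension $d$, the unrolled outer product has $d^2$ entries, so every pairwise product of a compound feature and a protein feature is available to both models.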

Experimental verification

To verify the effectiveness of the prediction method of the present invention, compound and protein interaction prediction and compound and protein binding affinity prediction were performed on two different types of data sets, respectively, and comparative analysis was performed with the different prediction methods.

For compound and protein interaction prediction, two interaction data sets, human and C. elegans, were used; they were collected and collated from the DrugBank, Matador and STITCH databases. The invention was compared with 5 other methods (BLM-NII, NetLapRLS, CMF, NRLMF and Tsubaki et al.). To evaluate the accuracy of the invention in predicting compound-protein interactions, two indexes, AUC and AUPR, were used for comparison, and three positive-to-negative sample ratios (1:1, 1:3 and 1:5) were set to evaluate the robustness of the method. The AUC value is the area under the ROC curve and the AUPR value is the area under the PR curve; the higher the two values, the better the prediction accuracy. Because the AUC value is insensitive to the ratio of positive to negative samples, the AUPR value gives a more realistic comparison on unbalanced data sets. The AUC and AUPR results on the human data set are shown in FIG. 2, and those on the C. elegans data set are shown in FIG. 3. It can be seen that the prediction method of the present invention achieved the highest AUC and AUPR values on both the human and the C. elegans data sets. In addition, as the proportion of negative samples increases, the AUC values of most prediction methods remain almost unchanged or increase slightly, while the AUPR values generally decrease; the AUPR value of the prediction method of the invention nevertheless remains significantly higher than those of the other prediction methods. The prediction method provided by the invention therefore performs well in predicting the interaction between compounds and proteins.
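
The two evaluation indexes can be illustrated with a small sketch. AUC is computed here from its pairwise-ranking interpretation, and AUPR is approximated by average precision, a common step-wise estimate of the area under the PR curve; the choice of estimator and the toy labels and scores are assumptions, not the evaluation code used in the experiments.

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive sample")
    return sum(precisions) / len(precisions)

labels = [1, 0, 1, 0]                  # toy data for illustration only
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))             # 0.75
```

Adding more negative samples leaves the pairwise AUC largely unaffected but lowers precision at each positive, which is why AUPR is the more informative index on unbalanced data sets.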

For compound-protein binding affinity prediction, four binding affinity data sets (IC50, Ki, Kd and EC50), compiled from the BindingDB database, were used, and the method was compared with 5 other methods (Ridge Regression, Lasso Regression, Random Forest, DeepAffinity and MONN). To evaluate the accuracy of the present invention in predicting the binding affinity of compounds and proteins, two indexes, Root Mean Square Error (RMSE) and Pearson Correlation Coefficient (PCC), are used for comparison. The root mean square error measures the error between the predicted value and the true value; the smaller the value, the smaller the prediction error and the better the performance of the prediction model. The Pearson correlation coefficient is a linear correlation coefficient that reflects the degree of linear correlation between the predicted value and the true value. Its value lies between -1 and 1: a value greater than 0 indicates positive correlation, a value less than 0 indicates negative correlation, and the closer the value is to 1, the stronger the positive correlation between the predicted value and the true value and the better the performance of the prediction model. The RMSE and PCC results are shown in Fig. 4. It can be seen that the prediction method of the present invention achieved the lowest RMSE and the highest PCC values on the two larger data sets, IC50 and Ki, and the lowest RMSE and the same PCC value as the MONN method on the EC50 data set. On the Kd data set, which has the fewest samples, it achieved the second-best RMSE and PCC values (slightly worse than the MONN method), because the smaller sample size makes the learned characterization vectors of compounds and proteins less accurate. Therefore, the prediction method provided by the invention performs well in predicting the binding affinity of compounds and proteins.
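The two affinity metrics described above can be written out as a short sketch. The function names `rmse` and `pearson` and the toy affinity values are assumptions made for illustration only; they are not taken from the invention or from the BindingDB data sets.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(y_true, y_pred):
    """Pearson correlation coefficient, a value in [-1, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical affinity values (for example, negative log-transformed measurements).
y_true = [5.0, 6.0, 7.0, 8.0]
y_shifted = [5.5, 6.5, 7.5, 8.5]   # every prediction is off by +0.5
y_reversed = [8.0, 7.0, 6.0, 5.0]  # predictions in the opposite order

print(rmse(y_true, y_shifted), pearson(y_true, y_shifted))
print(rmse(y_true, y_reversed), pearson(y_true, y_reversed))
```

The shifted predictions show why both metrics are reported together: they are perfectly linearly correlated with the true values, yet still carry a nonzero error, so PCC alone would not reveal the offset.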

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.
