Protein-ligand binding site prediction algorithm based on deep learning

Document No.: 1629540; publication date: 2020-01-14.

Reading note: this technique, "Protein-ligand binding site prediction algorithm based on deep learning", was designed and created by Xia Chunqiu, Yang Yang and Shen Hongbin on 2019-09-18. Its main content is as follows: the invention discloses a deep-learning-based protein-ligand binding site prediction algorithm. For a protein to be predicted, its sequence features and distance matrix are first extracted; the sequence features are then assigned to each residue by a sliding-window method; the features corresponding to each residue are fed one by one into a residual neural network and a hybrid neural network, and the outputs of the two networks are fed into a Logistic regression classifier; the final result is the binding probability of each residue in the protein. The invention fuses a classical bidirectional long short-term memory network with a residual neural network; the fused network can process heterogeneous protein sequence and structure data simultaneously and exploits the complementarity of sequence and structural features. Compared with existing methods, the invention achieves higher prediction accuracy and generalizes well across datasets of different ligands.

1. A deep learning based protein-ligand binding site prediction algorithm comprising the steps of:

step 1) first extracting sequence features from a protein structure data set, then computing the Euclidean distance between each residue pair from the three-dimensional spatial coordinates of the residues of the protein and constructing a distance matrix, and finally cutting out a feature tensor for each residue with a sliding-window method;

step 2) treating each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all positive samples into a training subset, and repeating this several times to obtain multiple training subsets; randomly up-sampling positive samples when constructing each mini-batch;

step 3) constructing a residual neural network from residual blocks and training it on the distance matrices;

step 4) integrating the residual neural network and a bidirectional long short-term memory network through a fully connected layer to construct a hybrid neural network, and training it on the sequence features and the distance matrices;

step 5) training a Logistic regression classifier on the outputs of the residual neural network and the hybrid neural network;

and step 6) for the protein to be predicted, first extracting its sequence features and distance matrix, then assigning the sequence features to each residue by the sliding-window method, feeding the features corresponding to each residue one by one into the residual neural network and the hybrid neural network, and feeding the outputs of the two networks into the Logistic regression classifier, wherein the final result is the binding probability of each residue in the protein.

2. The deep learning-based protein-ligand binding site prediction algorithm according to claim 1, wherein the extraction method of the sequence feature and distance matrix in step 1) is as follows:

step 1.1) for a protein of length L, its position-specific scoring matrix PSSM is obtained by the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in the ith row and jth column indicates the likelihood of the ith residue mutating into the jth amino acid type, the total number of amino acid types being 20;

each p_ij is then normalized; the normalization formula is provided only as an image in the original document;

step 1.2) for a protein of length L, a scoring matrix HHM is obtained by the HHblits algorithm, the HHM encoding the evolutionary information of the protein sequence; the HHM has size L × 30, where the first 20 columns are the emission probabilities of the 20 amino acids, columns 21-27 are transition probabilities, and columns 28-30 are local diversities;

each element h_ij of the HHM is normalized; the normalization formula is provided only as an image in the original document;

step 1.3) the secondary structure information and relative solvent accessibility of the protein of length L are predicted with the SCRATCH algorithm; the secondary structure information is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the ith residue is a helix, a strand or other; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the ith residue is exposed or buried;

step 1.4) for the protein of length L, the binding propensity of each residue is predicted by the S-SITE algorithm and the result is represented as an L × 2 matrix, in which the elements q_i0 and q_i1 represent the binding probability and non-binding probability of the ith residue, respectively, and q_i0 + q_i1 = 1;

step 1.5) for a protein of length L whose atomic spatial coordinates are known, the Euclidean distance between the Cα atoms of the ith and jth residues is computed and denoted d_ij;

the distance matrix D = {d_ij} of size L × L is constructed in sequence order and then rescaled to size L × 400 by interpolation;

step 1.6) the sequence feature matrices obtained in steps 1.1) to 1.4) are concatenated, residue by residue, into an L × 57 sequence feature matrix, and a sliding window of size W is applied to each residue to obtain a W × 57 feature matrix; the distance matrix is likewise cut with a sliding window of size W, giving each residue a W × 400 distance matrix.

3. The deep learning based protein-ligand binding site prediction algorithm according to claim 1 or 2, wherein the random down-sampling in step 2) and the up-sampling in the mini-batch satisfy the following condition:

1) in random down-sampling, each negative sample is selected from the original data set with probability 20%, and the selected negative samples are combined with all positive samples into a training subset; N_set training subsets are obtained in this manner;

2) in the up-sampling within a mini-batch, N_p positive samples and N_n negative samples are cyclically drawn from the set of all positive samples and the set of all negative samples, where N_p is given by the following formula:

N_p = [0.3 × N_b]

where N_b is the size of the mini-batch, [·] is the rounding symbol, and N_n = N_b − N_p.

4. The deep learning based protein-ligand binding site prediction algorithm of claim 3, wherein the definition of the residual block and the construction of the residual neural network are as follows:

in a neural network, the convolutional layer can be represented as Conv (X, W, H, D), where X is the input variable, W and H are the width and height of the convolutional kernels, respectively, and D is the number of convolutional kernels; the residual block is formed by stacking three convolution layers as shown in the following formula:

Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)

where σ is an activation function; the residual neural network is formed by stacking several residual blocks and is optimized by the Adam algorithm, and its input is the distance matrix of each residue;

on said N_set subsets, N_res independent residual neural networks can be trained for each residue of the protein, where N_res ≤ N_set.

5. The deep learning based protein-ligand binding site prediction algorithm of claim 4, wherein the hybrid neural network in step 4) integrates a residual neural network and a BiLSTM and is optimized by the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;

on the remaining subsets, N_hybrid independent hybrid networks can be trained for each residue of the protein, where N_hybrid = N_set − N_res.

6. The deep learning-based protein-ligand binding site prediction algorithm of claim 5, wherein in step 5) the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; this vector is used as input to train a Logistic regression classifier in a cross-validation manner; an l1 regularization term is added to the loss function of the Logistic classifier to prevent overfitting.

7. The deep learning-based protein-ligand binding site prediction algorithm of claim 6, wherein in step 6), for a protein to be predicted of length L with known Cα spatial coordinates, its sequence features and distance matrix are first extracted, the sequence features are then assigned to each residue by a sliding-window method of size W, the features corresponding to each residue are fed one by one into the multiple residual neural networks and hybrid neural networks, the outputs of these networks are fed into the Logistic regression classifier, and the final result obtained is the binding probability of each residue in the protein.

Technical Field

The invention relates to the field of protein biology and pattern recognition, in particular to a protein-ligand binding site prediction algorithm based on deep learning.

Background

The interaction of proteins with ligands plays an important role in biological processes such as signal transduction, post-translational modification, and antigen-antibody interaction. In addition, drug discovery and design rely heavily on analysis of the mechanisms of protein-ligand interaction. To explore the mechanisms behind protein-ligand interactions further, identifying the binding sites is a critical step. As protein design techniques have emerged, more and more new proteins are appearing whose properties and functions have not yet been explored, so the need for fast and accurate binding site identification tools has become increasingly urgent. The current approach of identifying protein binding sites through wet-lab experiments has the drawback of being time-consuming and costly.

Protein-ligand interactions can be classified into protein-protein interactions, protein-DNA/RNA interactions, and protein-small molecule interactions, depending on the type of ligand. At this stage, there are many computational methods based on sequence information (protein primary structure) or structural information (protein tertiary structure) that can predict protein-ligand binding sites.

Sequence-based methods can make site predictions for proteins with unknown three-dimensional structures using some purely sequence-based features such as evolutionary information and predicted secondary structures. However, since the position of the binding site is mainly determined by the tertiary structure of the protein, the prediction accuracy of the sequence-based method is relatively low.

Structure-based methods all require the three-dimensional coordinates of every atom in the protein as input, but they follow different criteria: for example, POCKET assumes that binding sites are more likely to be located in depressed regions of the protein surface, SITEHOUND uses an energy function to calculate the force field between the protein and the ligand, and TM-SITE is a template-based matching method.

Disclosure of Invention

The invention aims to provide a deep-learning-based protein-ligand binding site prediction algorithm that addresses the low prediction accuracy of prior-art algorithms and thereby solves the problems in the prior art.

For the application scenario of protein-ligand binding site recognition, the invention provides a more accurate prediction method by fusing deep learning with domain knowledge of protein structure, and it also offers effective solutions to several sub-problems, such as data imbalance and the difficulty of aligning three-dimensional structures.

The technical problem addressed by the invention is solved by the following technical scheme:

a deep learning based protein-ligand binding site prediction algorithm comprising the steps of:

step 1) first extracting sequence features from a protein structure data set, then computing the Euclidean distance between each residue pair from the three-dimensional spatial coordinates of the residues of the protein and constructing a distance matrix, and finally cutting out a feature tensor for each residue with a sliding-window method;

step 2) treating each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all positive samples into a training subset, and repeating this several times to obtain multiple training subsets; randomly up-sampling positive samples when constructing each mini-batch;

step 3) constructing a residual neural network from residual blocks and training it on the distance matrices;

step 4) integrating the constructed residual neural network and the bidirectional long short-term memory network through a fully connected layer to build a hybrid neural network, and training it on the sequence features and the distance matrices;

step 5) training a Logistic regression classifier on the outputs of the residual neural network and the hybrid neural network;

and step 6) for the protein to be predicted, first extracting its sequence features and distance matrix, then assigning the sequence features to each residue by the sliding-window method, feeding the features corresponding to each residue one by one into the residual neural network and the hybrid neural network, and feeding the outputs of the two networks into the Logistic regression classifier, wherein the final result is the binding probability of each residue in the protein.

Further, the method for extracting the sequence feature and the distance matrix in the step 1) is as follows:

step 1.1) for a protein of length L, its position-specific scoring matrix PSSM is obtained by the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in the ith row and jth column indicates the likelihood of the ith residue mutating into the jth amino acid type, the total number of amino acid types being 20;

Each p_ij is then normalized; the normalization formula is provided only as an image in the original document.

step 1.2) for a protein of length L, a scoring matrix HHM is obtained by the HHblits algorithm, the HHM encoding the evolutionary information of the protein sequence; the HHM has size L × 30, where the first 20 columns are the emission probabilities of the 20 amino acids, columns 21-27 are transition probabilities, and columns 28-30 are local diversities;

Each element h_ij of the HHM is normalized; the normalization formula is provided only as an image in the original document.

step 1.3) the secondary structure information and relative solvent accessibility of the protein of length L are predicted with the SCRATCH algorithm; the secondary structure information is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the ith residue is a helix, a strand or other; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the ith residue is exposed or buried;

step 1.4) for the protein of length L, the binding propensity of each residue is predicted by the S-SITE algorithm and the result is represented as an L × 2 matrix, in which the elements q_i0 and q_i1 represent the binding probability and non-binding probability of the ith residue, respectively, and q_i0 + q_i1 = 1;

step 1.5) for a protein of length L whose atomic spatial coordinates are known, the Euclidean distance between the Cα atoms of the ith and jth residues is computed and denoted d_ij;

the distance matrix D = {d_ij} of size L × L is constructed in sequence order and then rescaled to size L × 400 by interpolation;

step 1.6) the sequence feature matrices obtained in steps 1.1) to 1.4) are concatenated, residue by residue, into an L × 57 sequence feature matrix, and a sliding window of size W is applied to each residue to obtain a W × 57 feature matrix; the distance matrix is likewise cut with a sliding window of size W, giving each residue a W × 400 distance matrix.
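
The per-residue assembly of the 57-column sequence feature matrix can be illustrated with a minimal sketch, assuming NumPy; the function and variable names are illustrative, and the column counts simply follow the feature description above (PSSM 20 + HHM 30 + SS 3 + RSA 2 + S-SITE 2 = 57).

```python
import numpy as np

def concat_sequence_features(pssm, hhm, ss, rsa, ssite):
    """Concatenate the per-residue feature matrices of one protein.
    pssm: (L, 20), hhm: (L, 30), ss: (L, 3), rsa: (L, 2), ssite: (L, 2)."""
    feat = np.concatenate([pssm, hhm, ss, rsa, ssite], axis=1)  # (L, 57)
    assert feat.shape[1] == 57, "unexpected feature width"
    return feat
```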

Further, the random down-sampling in the step 2) and the up-sampling in the mini-batch need to satisfy the following conditions:

1) in random down-sampling, each negative sample is selected from the original data set with probability 20%, and the selected negative samples are combined with all positive samples into a training subset; N_set training subsets are obtained in this manner;

2) in the up-sampling within a mini-batch, N_p positive samples and N_n negative samples are cyclically drawn from the set of all positive samples and the set of all negative samples, where N_p is given by the following formula:

N_p = [0.3 × N_b]

where N_b is the size of the mini-batch, [·] is the rounding symbol, and N_n = N_b − N_p.

Further, the definition of the residual block and the construction process of the residual neural network are as follows:

in a neural network, the convolutional layer can be represented as Conv (X, W, H, D), where X is the input variable, W and H are the width and height of the convolutional kernels, respectively, and D is the number of convolutional kernels; the residual block is formed by stacking three convolution layers as shown in the following formula:

Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)

where σ is an activation function; the residual neural network is formed by stacking several residual blocks and is optimized by the Adam algorithm, and its input is the distance matrix of each residue;

On said N_set subsets, N_res independent residual neural networks can be trained for each residue of the protein, where N_res ≤ N_set.

Further, the hybrid neural network in step 4) integrates a residual neural network and a BiLSTM, and is optimized by the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;

in said NsetOn subsets, N can be trained for each residue in the proteinhybridA separate hybrid network, wherein Nhybrid=Nset-Nres

Further, in step 5) the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; this vector is used as input to train a Logistic regression classifier in a cross-validation manner; an l1 regularization term is added to the loss function of the Logistic classifier to prevent overfitting.

Further, in step 6), for a protein to be predicted of length L with known Cα spatial coordinates, its sequence features and distance matrix are first extracted, the sequence features are then assigned to each residue by a sliding-window method of size W, the features corresponding to each residue are fed one by one into the multiple residual neural networks and hybrid neural networks, the outputs of these networks are fed into the Logistic regression classifier, and the final result obtained is the binding probability of each residue in the protein.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention provides a novel hybrid neural network that fuses a classical bidirectional long short-term memory network with a residual neural network; the fused network can process heterogeneous protein sequence and structure data simultaneously and exploits the complementarity of sequence and structural features.

2. The invention uses random down-sampling together with an ensemble strategy to address the imbalance between positive and negative samples, and up-samples positive samples batch by batch to further reduce the impact of this imbalance when data are fed into the neural network as mini-batches.

3. Compared with existing methods, the invention achieves higher prediction accuracy and generalizes well across datasets of different ligands.

Drawings

FIG. 1 is a flow chart of the deep learning-based protein-ligand binding site prediction algorithm of the present invention.

FIG. 2 is a schematic diagram of the hybrid neural network of the present invention, comprising (a) the hybrid neural network architecture, (b) the sequence feature and distance matrix extraction module, and (c) the bidirectional long short-term memory network module.

FIG. 3 is a schematic diagram of a random sampling and integration method according to the present invention.

FIG. 4 is a schematic diagram of an implementation of a residual block in the residual neural network of the present invention.

Detailed Description

To make the technical means, characteristics, objectives and effects of the invention easier to understand, the invention is further described below with reference to specific embodiments.

Referring to fig. 1, the present invention provides a deep learning-based protein-ligand binding site prediction algorithm, which comprises the following steps:

step 1) for a given protein structure data set, first extracting evolutionary information, secondary structure information, relative solvent accessibility and binding propensity using the PSI-BLAST, HHblits, SCRATCH and S-SITE algorithms, respectively, and normalizing the evolutionary information; next, computing the Euclidean distance between each residue pair from the three-dimensional spatial coordinates of the residues of the protein and constructing a distance matrix; finally, cutting out the feature tensor for each residue using a sliding-window strategy;

step 2) treating each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all positive samples into a training subset, and repeating this several times to obtain multiple training subsets; then randomly up-sampling positive samples when constructing each mini-batch;

step 3) constructing a residual neural network (ResNet) from residual blocks, and training it on the distance matrices obtained in step 1);

step 4) integrating the residual network of step 3) with a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to construct a hybrid neural network, and training it on the sequence features and distance matrices obtained in step 1);

step 5) training a Logistic regression classifier using the outputs of the residual neural network of step 3) and the hybrid network of step 4);

and step 6) for a protein to be predicted, first extracting its sequence features and distance matrix, then assigning the features to each residue by the sliding-window method, feeding them one by one into the residual network and the hybrid neural network, and then feeding the outputs into the Logistic regression classifier, wherein the final result is the binding probability of each residue in the protein.

Wherein the specific process of the step 1) is as follows:

step 1.1) for a protein of length L, its position-specific scoring matrix PSSM is obtained by the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in the ith row and jth column indicates the likelihood of the ith residue mutating into the jth amino acid type, the total number of amino acid types being 20;

Each p_ij is then normalized; the normalization formula is provided only as an image in the original document.

step 1.2) for a protein of length L, a scoring matrix HHM is obtained by the HHblits algorithm, the HHM encoding the evolutionary information of the protein sequence; the HHM has size L × 30, where the first 20 columns are the emission probabilities of the 20 amino acids, columns 21-27 are transition probabilities, and columns 28-30 are local diversities;

Each element h_ij of the HHM is normalized; the normalization formula is provided only as an image in the original document.

step 1.3) the secondary structure information and relative solvent accessibility of the protein of length L are predicted with the SCRATCH algorithm; the secondary structure information is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the ith residue is a helix, a strand or other; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the ith residue is exposed or buried;

step 1.4) for the protein of length L, the binding propensity of each residue is predicted by the S-SITE algorithm and the result is represented as an L × 2 matrix, in which the elements q_i0 and q_i1 represent the binding probability and non-binding probability of the ith residue, respectively, and q_i0 + q_i1 = 1;

step 1.5) for a protein of length L whose atomic spatial coordinates are known, the Euclidean distance between the Cα atoms of the ith and jth residues is computed and denoted d_ij;

the distance matrix D = {d_ij} of size L × L is constructed in sequence order and then rescaled to size L × 400 by interpolation;

step 1.6) the sequence feature matrices obtained in steps 1.1) to 1.4) are concatenated, residue by residue, into an L × 57 sequence feature matrix, and a sliding window of size W is applied to each residue, finally yielding a W × 57 feature matrix. As shown in part (a) of FIG. 2, these features are divided into two groups that are fed to two BiLSTMs: one group contains only PSSM, SS (the secondary structure information predicted by SCRATCH), RSA (the relative solvent accessibility predicted by SCRATCH) and SST (the binding propensity predicted by S-SITE), and the other contains only HHM, SS, RSA and SST. The distance matrix is likewise cut with a sliding window of the same size W, giving each residue a W × 400 distance matrix.
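
The construction of the per-residue feature tensors in steps 1.5) and 1.6) can be sketched as follows. This is a minimal illustration assuming NumPy/SciPy; the helper names, the zero padding at the chain ends, and the use of scipy.ndimage.zoom for the L × 400 interpolation are assumptions of this sketch and are not specified in the original text.

```python
import numpy as np
from scipy.ndimage import zoom

def build_distance_matrix(ca_coords):
    """ca_coords: (L, 3) C-alpha coordinates -> (L, L) Euclidean distance matrix."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def rescale_distance_matrix(dist, width=400):
    """Interpolate the (L, L) distance matrix to roughly (L, width), as in step 1.5)."""
    L = dist.shape[0]
    return zoom(dist, (1.0, width / L), order=1)  # linear interpolation along the columns

def window_features(seq_feat, dist_feat, w):
    """Cut a w x 57 sequence window and a w x 400 distance window around each residue.
    seq_feat: (L, 57), dist_feat: (L, 400); rows outside the chain are zero-padded."""
    L = seq_feat.shape[0]
    half = w // 2
    seq_pad = np.pad(seq_feat, ((half, half), (0, 0)))
    dist_pad = np.pad(dist_feat, ((half, half), (0, 0)))
    seq_win = np.stack([seq_pad[i:i + w] for i in range(L)])    # (L, w, 57)
    dist_win = np.stack([dist_pad[i:i + w] for i in range(L)])  # (L, w, 400)
    return seq_win, dist_win
```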

The random down-sampling and the up-sampling in the mini-batch in the step 2) are shown in fig. 3, and the following conditions need to be satisfied:

1) in random down-sampling, each negative sample is selected from the original data set with probability 20%, and the selected negative samples are combined with all positive samples into a training subset; N_set training subsets are obtained in this manner;

2) in the up-sampling within a mini-batch, N_p positive samples and N_n negative samples are cyclically drawn from the set of all positive samples and the set of all negative samples, where N_p is given by the following formula:

N_p = [0.3 × N_b]

where N_b is the size of the mini-batch, [·] is the rounding symbol, and N_n = N_b − N_p.
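
The down-sampling and in-batch up-sampling can be sketched as follows, assuming NumPy; the function names, seed handling and per-pass reshuffling are illustrative assumptions, since the original only fixes the 20% selection probability and N_p = [0.3 × N_b].

```python
import numpy as np

def build_training_subsets(pos_idx, neg_idx, n_set, keep_prob=0.2, seed=0):
    """Random down-sampling: each negative sample is kept with probability 20% and
    combined with all positive samples; repeated n_set times to give n_set subsets."""
    rng = np.random.default_rng(seed)
    neg_idx = np.asarray(neg_idx)
    return [np.concatenate([pos_idx, neg_idx[rng.random(len(neg_idx)) < keep_prob]])
            for _ in range(n_set)]

def mini_batches(pos_idx, neg_idx, n_b, rng):
    """In-batch up-sampling: each mini-batch holds N_p = round(0.3 * N_b) positives and
    N_n = N_b - N_p negatives, drawn cyclically (reshuffled whenever a set is exhausted)."""
    n_p = int(round(0.3 * n_b))
    n_n = n_b - n_p
    pos, neg, p, n = rng.permutation(pos_idx), rng.permutation(neg_idx), 0, 0
    while True:  # endless generator; the training loop decides how many batches to draw
        if p + n_p > len(pos):
            pos, p = rng.permutation(pos_idx), 0
        if n + n_n > len(neg):
            neg, n = rng.permutation(neg_idx), 0
        yield np.concatenate([pos[p:p + n_p], neg[n:n + n_n]])
        p, n = p + n_p, n + n_n
```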

Further, in step 3, the definition of the residual block and the construction of the residual network are as follows:

As shown in FIG. 4, a residual block generally consists of several convolutional layers and an identity mapping, with an activation function providing the nonlinear mapping between convolutional layers. FIG. 4 shows a generic residual block on the left and a bottleneck residual block on the right; the bottleneck form reduces the number of parameters while maintaining performance. The present invention uses the bottleneck residual block, described as follows:

Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)

where σ is an activation function, Conv(X, W, H, D) is a convolution operation, X is the input variable, W and H are the width and height of the convolution kernels, respectively, and D is the number of convolution kernels;

In the invention, the residual network is formed by stacking several residual blocks, as shown in FIG. 2(b), and is optimized by the Adam algorithm; the input of the network is the distance matrix of each residue. The specific network architecture is summarized in Table 1, and a minimal code sketch of the bottleneck block is given after the table.

TABLE 1. Residual neural network module architecture (the table itself is provided only as an image in the original document).

a) The convolutional-layer settings list, in order, the size of the convolution kernels, the number of kernels, and the stride.

b) The stride of the bottleneck residual block is 1.
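
A minimal PyTorch sketch of the bottleneck residual block defined by the formula above is given below; PyTorch itself, ReLU as the activation σ, and 'same' padding for the 3 × 3 convolution are assumptions of this sketch, and the input is assumed to already carry 4 × D channels so that the identity shortcut is dimensionally valid.

```python
import torch.nn as nn

class BottleneckResBlock(nn.Module):
    """Res(X) = sigma(Conv(sigma(Conv(sigma(Conv(X,1,1,D)), 3,3,D)), 1,1,4D) + X)."""
    def __init__(self, d):
        super().__init__()
        self.conv1 = nn.Conv2d(4 * d, d, kernel_size=1)         # 1x1 convolution, D kernels
        self.conv2 = nn.Conv2d(d, d, kernel_size=3, padding=1)  # 3x3 convolution, D kernels
        self.conv3 = nn.Conv2d(d, 4 * d, kernel_size=1)         # 1x1 convolution, 4D kernels
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.conv1(x))
        out = self.act(self.conv2(out))
        out = self.conv3(out)
        return self.act(out + x)  # identity shortcut, then the outer activation
```

Stacking several such blocks (with stride 1, per footnote b of Table 1) and optimizing with Adam gives the residual branch whose input is the W × 400 distance window of each residue.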

On said N_set subsets, N_res independent residual neural networks can be trained for each residue of the protein, where N_res ≤ N_set.

Further, in step 4), the hybrid neural network integrates the residual network of step 3) with the BiLSTM through a fully connected layer and is optimized by the Adam algorithm; the overall architecture of the hybrid neural network is shown in FIG. 2. As described in step 1.6), the inputs of the two BiLSTMs are the two groups of sequence features, respectively.

On the remaining subsets, N_hybrid independent hybrid networks can be trained for each residue of the protein, where N_hybrid = N_set − N_res.
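
A minimal PyTorch sketch of the hybrid network of step 4) is given below: two BiLSTMs over the two groups of sequence features and a small convolutional branch standing in for the residual network of step 3), fused through a fully connected layer. The hidden sizes, the pooling, and the simplified residual branch are illustrative assumptions; only the input widths (27 = 20 + 3 + 2 + 2 and 37 = 30 + 3 + 2 + 2 columns) follow the feature description above.

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self, d=16, hidden=64):
        super().__init__()
        # group 1: PSSM + SS + RSA + SST (20 + 3 + 2 + 2 = 27 columns)
        self.lstm1 = nn.LSTM(27, hidden, batch_first=True, bidirectional=True)
        # group 2: HHM + SS + RSA + SST (30 + 3 + 2 + 2 = 37 columns)
        self.lstm2 = nn.LSTM(37, hidden, batch_first=True, bidirectional=True)
        # stand-in for the stacked residual blocks of step 3), applied to the W x 400 window
        self.res_branch = nn.Sequential(
            nn.Conv2d(1, 4 * d, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(4 * d, 4 * d, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(4 * hidden + 4 * d, 2)  # fusion layer -> binding / non-binding logits

    def forward(self, feat1, feat2, dist_win):
        # feat1: (B, W, 27), feat2: (B, W, 37), dist_win: (B, 1, W, 400)
        _, (h1, _) = self.lstm1(feat1)
        _, (h2, _) = self.lstm2(feat2)
        s1 = torch.cat([h1[0], h1[1]], dim=1)     # forward + backward final states
        s2 = torch.cat([h2[0], h2[1]], dim=1)
        r = self.res_branch(dist_win).flatten(1)  # (B, 4*d)
        return self.fc(torch.cat([s1, s2, r], dim=1))
```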

In step 5), the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; this vector is used as input to train a Logistic regression classifier in a cross-validation manner, the specific form of which is shown in FIG. 3; an l1 regularization term is added to the loss function of the Logistic classifier to prevent overfitting.
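
A minimal scikit-learn sketch of this stacking step follows, assuming the per-residue scores of the N_res residual networks and N_hybrid hybrid networks have already been collected into an (n_residues, N_set) matrix; LogisticRegression with penalty='l1' and the liblinear solver, and cross_val_predict for the cross-validation, are illustrative choices rather than the original implementation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacking_classifier(network_outputs, labels):
    """network_outputs: (n_residues, N_set) matrix of per-network binding scores;
    labels: 0/1 binding annotation of each residue."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    # cross-validated probabilities, e.g. for later selection of the decision threshold T
    cv_prob = cross_val_predict(clf, network_outputs, labels,
                                cv=5, method="predict_proba")[:, 1]
    clf.fit(network_outputs, labels)
    return clf, cv_prob
```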

In step 6), for a protein to be predicted of length L with known Cα spatial coordinates, its sequence features and distance matrix are first extracted; the sequence features are then assigned to each residue by a sliding-window method of size W; the features of each residue are fed one by one into the multiple residual neural networks and hybrid neural networks; the outputs of these networks are then fed into the Logistic regression classifier, and the final result obtained is the binding probability of each residue in the protein.

The binding probability is then thresholded with an optimal threshold T ∈ (0, 1) learned on the training set: if the binding probability is greater than T, the residue is considered a binding site; otherwise, it is considered a non-binding site.
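
A minimal sketch of this final decision rule is shown below; searching for T by maximizing the Matthews correlation coefficient on the training set is an assumption of this sketch, since the original only states that an optimal T ∈ (0, 1) is learned on the training set.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def pick_threshold(train_prob, train_labels):
    """Grid-search a threshold T in (0, 1) on the training set."""
    grid = np.linspace(0.01, 0.99, 99)
    scores = [matthews_corrcoef(train_labels, (train_prob > t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

def predict_sites(binding_prob, threshold):
    """Residues whose binding probability exceeds T are predicted as binding sites."""
    return binding_prob > threshold
```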
