Method and system for predicting protein-polypeptide binding site

Document No.: 193414    Publication date: 2021-11-02

Reading note: this technology, "Method and system for predicting protein-polypeptide binding site", was designed and created by Wei Leyi, Wang Ruheng, Cui Lizhen, and Su Ran on 2021-08-09. Its main content is as follows: The invention discloses a method and a system for predicting protein-polypeptide binding sites, comprising: acquiring protein-polypeptide sequence data to be predicted, inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site. The neural network model based on the pre-trained model BERT and contrastive learning first converts each amino acid in the original protein-polypeptide sequence into an embedding matrix; the embedding matrix is passed through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid. During BERT encoding, a contrastive loss is constructed as a constraint to generate discriminative representations of binding and non-binding sites. The invention uses the pre-trained model BERT as an encoder for the original protein sequence, which can extract features automatically, thereby avoiding the problems caused by external prediction tools.

1. A method of predicting a protein-polypeptide binding site, comprising:

acquiring protein-polypeptide sequence data to be predicted, inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site;

wherein the neural network model based on the pre-trained model BERT and contrastive learning first converts each amino acid in the original protein-polypeptide sequence into an embedding matrix, and the embedding matrix is passed through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid; during BERT encoding, a contrastive loss is constructed as a constraint to generate discriminative representations of binding and non-binding sites.

2. The method of claim 1, wherein converting each amino acid in the original protein-polypeptide sequence into an embedding matrix comprises:

encoding the original protein sequence as a vector of numerical values; the encoded vectors are embedded by an embedding layer that is pre-trained over a large number of protein sequences to generate an initial embedding matrix.

3. The method of claim 2, wherein the original protein sequence is encoded as a vector of numerical values, in particular: each amino acid letter in the original protein sequence is first capitalized and translated into a numeric sequence according to a defined lexical dictionary, where each amino acid in the sequence is considered a word in a sentence and is mapped to a numeric value.

4. The method of claim 1, wherein the BERT encoding is performed by:

learning a multi-angle context representation of the protein sequence through a multi-head attention mechanism, and adding a feed-forward network with an activation function to extract a better context representation; then applying residual connections and layer normalization to obtain the BERT encoding output.

5. The method of claim 1, wherein the BERT encoding is constrained by constructing a contrastive loss, comprising:

collecting a set number of representation matrices to obtain sufficient site-level data for contrastive learning;

constructing the contrastive loss as the loss function for batch data, such that samples of the same class have similar representations and samples of different classes have different representations.

6. The method of claim 1, wherein the site representation vector generated from the original protein sequence x is fed into a multi-layer perceptron (MLP), and the feature vector is converted into a site-level class output; and the above process is trained using a cross-entropy loss function.

7. The method of claim 1, wherein Recall, Specificity, Precision, and the Matthews correlation coefficient are selected as the evaluation indices of the neural network model based on the pre-trained model BERT and contrastive learning, to evaluate the neural network model.

8. A system for predicting a protein-polypeptide binding site, comprising:

the data acquisition module is used for acquiring protein-polypeptide sequence data to be predicted;

the binding site prediction module is used for inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site;

wherein the neural network model based on the pre-trained model BERT and contrastive learning comprises:

a sequence embedding module for converting each amino acid in the original protein-polypeptide sequence into an embedding matrix;

a BERT-based encoder module for passing the embedding matrix through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid;

the contrastive learning module is used for applying a constraint by constructing a contrastive loss during BERT encoding;

and the output module is used for generating discriminative representations of binding and non-binding sites.

9. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method of predicting a protein-polypeptide binding site of any one of claims 1-7.

10. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the method of predicting a protein-polypeptide binding site of any one of claims 1-7.

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a method and a system for predicting protein-polypeptide binding sites.

Background

Protein-polypeptide interactions are one of the important classes of protein interactions and play a crucial role in many essential cellular processes, such as DNA repair, replication, gene expression, and metabolism. Studies have also found that abnormal protein interactions are involved in cellular behaviors that can induce a variety of diseases, and approximately 40% of these interactions are mediated by relatively small polypeptides. Thus, recognizing the binding sites involved in protein-polypeptide interactions is essential both for understanding protein function and for drug discovery.

Many experimental approaches have been developed to help find binding sites of protein-polypeptide interactions by determining the complex structure of the protein, and advances in structural biology have produced a large amount of protein complex structural data. However, on the one hand, such experiments are often expensive and time-consuming; on the other hand, polypeptides are small, weak in affinity, and highly flexible, so finding protein-polypeptide binding sites through biological experiments remains challenging. Therefore, reliable computational methods are needed to study the protein-polypeptide binding problem.

Currently, computational methods for predicting protein-polypeptide binding sites can generally be divided into two categories, structure-based and sequence-based. Structure-based methods include PepSite, Peptimap, SPRINT-Str, and PepNN-Struct, among others. Sequence-based methods include SPRINT-Seq, PepBind, Visual, and PepNN-Seq, among others. Although many of the above-mentioned highly efficient computational methods have been proposed to solve the problem of predicting protein-polypeptide binding sites, the following aspects may not be fully considered in the actual prediction process:

First, binding site prediction methods based on protein structure cannot make predictions in the absence of a related peptide-binding protein structure. In fact, most proteins have accurate sequence information but no determined structural data. Therefore, prediction methods that rely solely on the protein sequence are more versatile and applicable to most proteins.

Second, features predicted by other tools from protein sequences, such as the Position-Specific Scoring Matrix (PSSM), have proved beneficial for models predicting binding sites, so most current methods rely on these hand-crafted features. However, using these tools also poses many problems, such as incorrect installation of software packages, long processing times, and especially the inability to predict binding sites in bulk directly from raw sequence data.

Third, many current machine-learning-based bioinformatics models achieve good results on classification tasks but tend to perform poorly on imbalanced data, and protein-polypeptide data sets typically contain many more non-binding sites than binding sites. Therefore, to avoid the influence of a severely skewed data distribution, current approaches generally either construct a balanced data set by undersampling, or simply give the minority samples a higher weight so that the model focuses more on them. Undersampling does not make full use of the majority samples; and since suitable weights are closely tied to the particular data set, arbitrarily assigning higher weights to minority classes cannot be considered a general solution to this problem.

Disclosure of Invention

In view of this, the invention provides a method and a system for predicting protein-polypeptide binding sites that are based on the pre-trained model BERT and contrastive learning and introduce a self-designed contrastive loss. They can better mine the associations among different classes of data, address the class-imbalance problem in protein site prediction, and effectively predict protein-polypeptide binding sites.

In order to achieve the above purpose, in some embodiments, the following technical solutions are adopted in the present invention:

a method of predicting a protein-polypeptide binding site, comprising:

acquiring protein-polypeptide sequence data to be predicted, inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site;

wherein the neural network model based on the pre-trained model BERT and contrastive learning first converts each amino acid in the original protein-polypeptide sequence into an embedding matrix, and the embedding matrix is passed through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid; during BERT encoding, a contrastive loss is constructed as a constraint to generate discriminative representations of binding and non-binding sites.

In other embodiments, the invention adopts the following technical scheme:

a system for predicting a protein-polypeptide binding site, comprising:

the data acquisition module is used for acquiring protein-polypeptide sequence data to be predicted;

the binding site prediction module is used for inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site;

wherein the neural network model based on the pre-trained model BERT and contrastive learning comprises:

a sequence embedding module for converting each amino acid in the original protein-polypeptide sequence into an embedding matrix;

a BERT-based encoder module for passing the embedding matrix through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid;

the contrastive learning module is used for applying a constraint by constructing a contrastive loss during BERT encoding;

and the output module is used for generating discriminative representations of binding and non-binding sites.

In other embodiments, the invention adopts the following technical scheme:

a terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; a computer readable storage medium stores a plurality of instructions adapted to be loaded by a processor and to perform the method for predicting a protein-polypeptide binding site described above.

In other embodiments, the invention adopts the following technical scheme:

a computer readable storage medium having stored thereon a plurality of instructions, wherein the instructions are adapted to be loaded by a processor of a terminal device and to perform the method for predicting a protein-polypeptide binding site as described above.

The invention has the beneficial effects that:

1. The present invention proposes a prediction method using only protein sequences, which outperforms the latest protein structure-based prediction methods on many evaluation indices.

2. Compared with traditional methods based on hand-crafted features, the method can extract features automatically rather than relying on prior experience, thereby avoiding the problems caused by external prediction tools.

3. The invention provides a novel contrastive-learning-based approach to the imbalanced classification problem. It can adaptively learn high-quality representations of binding and non-binding sites, and compared with conventional undersampling methods it can make full use of the majority samples.

Drawings

FIG. 1 is a schematic structural diagram of a deep neural network based on the pre-trained model BERT and contrastive learning according to an embodiment of the present invention;

FIG. 2 is a graph of MCC results compared to prior art methods in accordance with an embodiment of the invention;

FIGS. 3(a) - (b) are ROC plots comparing an embodiment of the present invention with a prior art method;

FIG. 4 shows the results of ablating the contrastive learning module in an embodiment of the present invention;

FIG. 5 is a case visualization result diagram in an embodiment of the present invention;

FIGS. 6(a) - (b) are graphs showing the results of specificity experiments in examples of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and specific embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

In one or more embodiments, a method of predicting protein-polypeptide binding sites (PepBCL) is disclosed, comprising the following steps:

acquiring protein-polypeptide sequence data to be predicted, inputting the data into a trained neural network model based on the pre-trained model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is a binding site;

The neural network model based on the pre-trained model BERT and contrastive learning first encodes the original protein sequence as a vector of numeric values. The specific method is as follows: each amino acid letter in the original protein sequence is first capitalized and translated into a numeric sequence according to a defined vocabulary, where each amino acid in the sequence is treated as a word in a sentence and mapped to a numeric value.

The encoded vector of numeric values is embedded by an embedding layer pre-trained on a large number of protein sequences to generate an initial embedding matrix. After each amino acid in the original protein-polypeptide sequence is converted into the embedding matrix, a multi-angle context representation of the protein sequence is learned through a multi-head attention mechanism, and a feed-forward network with an activation function is added to extract a better context representation; residual connections and layer normalization are then applied to obtain the BERT encoding output.

The specific process of BERT encoding is as follows:

a multi-angle context representation of the protein sequence is learned through a multi-head attention mechanism, and a feed-forward network with an activation function is added to extract a better context representation; residual connections and layer normalization are then applied to obtain the BERT encoding output. The embedding matrix is passed through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix for each amino acid.

In many of the contrastive learning frameworks that have been proposed, using more negative examples can greatly improve model performance. In view of this, during BERT encoding a contrastive loss is constructed as a constraint: a set number of representation matrices are collected so as to obtain sufficient site-level data for contrastive learning, and the contrastive loss is constructed as the loss function for batch data, such that samples of the same class have similar representations and samples of different classes have different representations. Finally, discriminative representations of binding and non-binding sites are generated.

Specifically, with reference to FIG. 1, in this embodiment the neural network model based on the pre-trained model BERT and contrastive learning comprises a sequence embedding module, a BERT-based encoder module, a contrastive learning module, and an output module.

In the sequence embedding module, each amino acid in the original protein sequence is converted into a pre-trained embedding vector, so the entire input protein sequence is transformed into an embedding matrix. In the BERT-based encoder module, the embedding matrix of the input sequence is first encoded by the deep pre-trained model BERT, generating a high-dimensional feature representation carrying mutual attention information. Then, through a layer of FNNs (fully connected neural networks), a better low-dimensional representation of each amino acid in the protein sequence is obtained. In many of the contrastive learning frameworks that have been proposed, using more negative examples can greatly improve model performance. In view of this, this embodiment proposes a new contrastive learning module that computes the contrastive loss between positive-positive, negative-negative, and positive-negative sample pairs over a set amount of data, constraining the encoder module to generate more discriminative representations of binding and non-binding sites. Finally, the output module generates site-level polypeptide binding probabilities and determines whether individual sites in the input sequence bind.

In this embodiment, a specific method for constructing the sequence embedding module includes:

Each amino acid letter in the original protein sequence is first capitalized and translated into a numeric sequence according to a defined vocabulary, where each amino acid in the sequence is treated as a word in a sentence and mapped to a numeric value. For example, S (serine) corresponds to the number 11 and L (leucine) corresponds to the number 6. Note that rare amino acids are uniformly replaced with the number 26 in the dictionary. Given that the data set is not large, and especially to avoid the performance degradation caused by over-padding, we did not pad the protein sequences to the same length. Thus, the original protein sequence is encoded as a vector of numeric values. The encoded vector is then embedded by an embedding layer pre-trained on a large number of protein sequences, generating an initial embedding that is better than one from a generic embedding layer.
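The encoding step above can be sketched as follows. Only the ids S → 11 and L → 6 and the shared rare-amino-acid id 26 are stated in the text; every other id in this vocabulary is a hypothetical placeholder.

```python
# Sketch of the sequence-encoding step. Only S -> 11, L -> 6 and the shared
# rare-amino-acid id 26 come from the text; all other ids are placeholders
# (and may collide with 11 or 6 -- only S and L matter for this illustration).
STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"
AA_VOCAB = {aa: i + 1 for i, aa in enumerate(STANDARD_AAS)}  # placeholder ids
AA_VOCAB["S"], AA_VOCAB["L"] = 11, 6  # ids stated in the text
RARE_ID = 26  # rare amino acids are uniformly mapped to one id

def encode_sequence(seq: str) -> list[int]:
    """Capitalize the sequence and map each amino acid letter to a numeric id."""
    return [AA_VOCAB.get(ch, RARE_ID) for ch in seq.upper()]
```

For example, a lowercase input is handled by the capitalization step, and any letter outside the standard twenty falls back to the rare-amino-acid id.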

In this embodiment, a specific method for constructing a BERT-based encoder module includes:

The basic unit of the BERT model is the encoder block, which consists of a multi-head attention mechanism, a feed-forward network, and residual connections. The multi-head attention mechanism consists of a number of independent self-attention modules for learning multi-angle context representations of protein sequences. The self-attention mechanism is described as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,  with  Q = XW_Q, K = XW_K, V = XW_V

where X ∈ R^(L×d_m) is the output of the sequence embedding module and is converted by the linear layers W_Q, W_K, W_V into the query matrix Q, the key matrix K, and the value matrix V, respectively. L is the length of the input protein sequence, d_m is the initial embedding dimension, and d_k is the dimension of the matrices Q, K, and V.
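The scaled dot-product self-attention described above can be sketched in pure Python as a toy illustration, operating directly on small row-lists and omitting the learned weight matrices W_Q, W_K, W_V:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # scaled dot-product scores between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the attention weights are uniform, so the output is simply the mean of the value rows; that property is an easy sanity check for the sketch.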

The multi-head attention mechanism is built on the self-attention mechanism and can be expressed as follows:

MultiHead(X) = Concat(head_1, ..., head_h) W_O,  head_i = Attention(XW_Q^i, XW_K^i, XW_V^i)

where W_Q^i, W_K^i, W_V^i are the linear transformation layers corresponding to the query, key, and value matrices of the i-th head, and h is the number of heads. W_O is a linear transformation layer that maps the output dimension of multi-head attention back to the initial embedding dimension of the embedding module. A residual connection and Layer Normalization (LN) are then applied, and X_MultiHead = LN(X + MultiHead(X)) is the final output of the multi-head attention module.

The Feed-Forward Network (FFN) is added to extract a better representation through an activation function, described mathematically as follows:

FFN(X_MultiHead) = gelu(X_MultiHead W^(1)) W^(2)

where X_MultiHead is the output of the multi-head attention mechanism, and W^(1) ∈ R^(d_m×d_f) and W^(2) ∈ R^(d_f×d_m) are two linear layers shared across all positions. d_m is the initial embedding dimension and d_f is the hidden-layer dimension of the feed-forward network. gelu (Gaussian Error Linear Unit) is a nonlinear activation function; a residual connection and layer normalization are also applied to the output of the feed-forward network.
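The feed-forward sublayer with its residual connection and layer normalization can be sketched per position as follows (toy-sized weights; the common tanh approximation of gelu is used, which is an implementation choice not stated in the text):

```python
import math

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def linear(x, W):
    """Multiply vector x (length d_in) by weight matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ffn_sublayer(x, W1, W2):
    """LN(x + gelu(x W1) W2) for a single d_m-dimensional position vector."""
    h = [gelu(v) for v in linear(x, W1)]
    return layer_norm([a + b for a, b in zip(x, linear(h, W2))])
```

With zero weights the FFN contributes nothing and the sublayer reduces to layer-normalizing the input, which makes the residual path easy to verify in isolation.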

Since the BERT model has many encoder blocks, the final encoding process of BERT can be expressed as follows:

X^(i) = FFN(MultiHead(X^(i-1))), i = 1, ..., n    (5)

where X^(i) is the output of the i-th encoder block and n is the total number of encoder blocks. X^(0) is the initial input embedding matrix; for convenience, both multi-head attention and the FFN are taken here to include the residual connections and LN.

After BERT encoding, we obtain the output X^(n) of the last encoder block, whose dimensionality is still high. Therefore, to avoid redundant dimensions, FNNs (fully connected neural networks) are used as follows to better extract representations of the amino acids in the input sequence while reducing dimensionality.

X_Encode = elu(X^(n) W^(3)) W^(4)    (6)

where W^(3) ∈ R^(d_m×d_1) and W^(4) ∈ R^(d_1×d_2) are the linear layers of the FNN, and elu (Exponential Linear Unit) is a popular nonlinear activation function. d_1 and d_2 are the hidden-layer dimensions of the first and second FNN layers, respectively. In this way, a better low-dimensional representation of each amino acid in the input sequence is obtained.

In this embodiment, a specific method for constructing the contrastive learning module is as follows:

This embodiment proposes a novel contrastive learning module based on supervised data, such that representations of inputs of the same class are mapped to nearby points in the representation space while inputs of different classes are mapped far apart. Specifically, considering that the protein sequences are not padded to the same length, this embodiment first collects a set number of representation matrices from the encoder module; in this way, sufficient site-level data can be obtained for contrastive learning. Then, so that samples of the same class have similar representations and samples of different classes have different representations, this embodiment constructs the contrastive loss as the loss function of our model for batch data. For a pair of site representations, the loss is defined as follows:

L_1(z_1, z_2) = (1 − y) · D(z_1, z_2)^2 + 3 · y · max(0, D_max − D(z_1, z_2))^2

where the distance between a pair of site representations z_1, z_2 is measured by D(z_1, z_2). If the pair of sites belong to different classes, y equals 1, meaning one site is binding and the other is not; if the pair belong to the same class, y equals 0. D_max is the upper bound of D(z_1, z_2), here equal to 2. Notably, by giving the higher weight of 3 to pairs of sites from different classes, the model is indirectly made to focus more on the minority class.
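This pairwise loss can be sketched as follows. Since the exact formula is not reproduced in the extracted text, this is the standard margin-based contrastive loss written under the stated conventions: y = 0 for a same-class pair (pulled together), y = 1 for a different-class pair (pushed apart up to D_max = 2, weighted by 3).

```python
import math

def pair_contrastive_loss(z1, z2, y, d_max=2.0, weight=3.0):
    """(1 - y) * D^2 + weight * y * max(0, d_max - D)^2, with Euclidean D(z1, z2).

    Sketch of the described loss; the margin form is an assumption based on
    the stated conventions (y semantics, D_max = 2, weight 3), not a verbatim
    reproduction of the patent formula.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    return (1 - y) * d ** 2 + weight * y * max(0.0, d_max - d) ** 2
```

A same-class pair with identical representations incurs zero loss; a different-class pair with identical representations incurs the maximal penalty 3 · D_max², and the penalty vanishes once the pair is separated by at least D_max.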

In this embodiment, a specific method for constructing the output module includes:

The site representation vector z generated from the original protein sequence x by the preceding modules is fed into a multi-layer perceptron (MLP), which converts the feature vectors into site-level class outputs y_p, that is,

x_Encode = BERT-based-Encoder(Sequence-Embedding(x)),  y_p,i = MLP(x_Encode,i), i = 1, ..., n

where Sequence-Embedding denotes the sequence embedding module and BERT-based-Encoder denotes the BERT-based encoder module. x_Encode is the encoded sequence-level representation consisting of a number of site feature vectors, x_Encode,i is the i-th site in the sequence, and n is the number of sites in the sequence.

Here the output module is trained using a cross-entropy loss function to improve prediction performance, i.e.,

L_2 = −(1/N) Σ_{i=1}^{N} Σ_{k∈{0,1}} 1[y_i = k] · log p_k

where k = 0 or 1 denotes a non-binding site or a binding site, and p_k is the probability of classifying a site as class k. N is the number of sites, y_i is the label of site i, and L_2 denotes the cross-entropy loss over a set amount of data.
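The site-level cross-entropy training objective described above can be sketched in its binary form (probs[i] is the model's predicted probability that site i is binding):

```python
import math

def site_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over N sites.

    labels are 0 (non-binding) or 1 (binding); probabilities are clamped
    away from 0 and 1 to avoid log(0).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

A maximally uncertain prediction (p = 0.5) costs log 2 per site, while a correct confident prediction costs essentially nothing.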

To prevent the back-propagation of the L_2 loss from interfering with the representation learning module, and to avoid the vanishing-gradient problem caused by the deep BERT model, the optimization of the representation learning part and the prediction part is separated. Specifically, the parameters in the BERT-based encoder module are frozen while the output module is trained. The overall loss of the model can then be described as the combination of the contrastive loss and the cross-entropy loss, L = L_1 + L_2.

in this embodiment, in order to better evaluate the overall performance of the method proposed in this embodiment, four indexes commonly used in the unbalanced classification task are selected and used, including Recall (Recall), Specificity (Specificity), accuracy (Precision), and Mausoleum Correlation Coefficient (MCC). Their calculation formula is as follows:

where TP (true positives) and TN (true negatives) denote the numbers of correctly predicted binding and non-binding residues, and FP (false positives) and FN (false negatives) denote the numbers of incorrectly predicted binding and non-binding residues. Recall is the proportion of binding residues the model predicts correctly, and Specificity is the proportion of non-binding residues the model predicts correctly. Precision indicates the accuracy of the residues predicted as binding. MCC is a comprehensive metric that considers the prediction of both binding and non-binding residues and is widely used on imbalanced data sets. In addition, the AUC (area under the ROC (receiver operating characteristic) curve) is also calculated to measure the overall performance of the neural network model.
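These four indices can be computed directly from the confusion-matrix counts, for example:

```python
import math

def binding_site_metrics(tp, tn, fp, fn):
    """Recall, Specificity, Precision and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn)            # fraction of binding residues found
    specificity = tn / (tn + fp)       # fraction of non-binding residues found
    precision = tp / (tp + fp)         # accuracy among predicted binders
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Recall": recall, "Specificity": specificity,
            "Precision": precision, "MCC": mcc}
```

A perfect predictor yields 1.0 on all four indices, while a predictor no better than chance yields an MCC near zero, which is why MCC is preferred on imbalanced data.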

The performance of the method of this example is verified by the following experiments.

To evaluate the performance of the method PepBCL of this example, two data sets widely used by previous methods were first constructed, and experiments were performed using the constructed neural network model based on the pre-trained model BERT and contrastive learning.

The specific data set is as follows:

(1) reference data set

The data set proposed in the SPRINT-Seq method, which contains 1,279 peptide-binding proteins with 16,749 polypeptide-binding and 290,943 non-binding residues, was selected as our baseline data set. Specifically, the data set was obtained through the following two steps:

① obtaining and collecting protein-polypeptide data from the BioLiP database;

② clustering and screening with the "blastclust" program in the BLAST package to remove proteins with sequence identity > 30%.

(2) Data sets for comparative experiments:

Preparing Dataset1 and Dataset2: the test set (denoted TS125) is collected from the protein structure-based method SPRINT-Str, and the remainder of the reference data set forms the training set (denoted TR1154), so TR1154 and TS125 serve as the training and test sets of Dataset1. To further compare the performance of the method PepBCL proposed in this example against the latest methods (PepBind, PepNN-Seq, PepNN-Struct), we also obtained the same training set (denoted TR640) and test set (denoted TS639) as the PepBind method, which serve as the training and test sets of Dataset2.

(3) Specificity experiment data sets

From the article "A comprehensive view of sequence-based predictors of DNA- and RNA-binding residues", 30 DNA-binding proteins (designated DNA30) and 30 RNA-binding proteins (designated RNA30) were randomly selected;

from the article "StackCBPred: A stacking based prediction of protein-carbohydrate binding proteins from sequence" 30 carbohydrate binding proteins (named CBH30) were randomly selected. The three data sets obtained (DNA30, RNA30, and CBH30) were used as the data sets for our specific experiments.

Based on the data sets acquired above, we compared the method PepBCL of this embodiment with existing methods, including conventional machine learning methods and several of the latest methods. The evaluation indices are AUC and MCC, which represent the comprehensive performance of the model; the final prediction evaluation results on the test sets are shown in FIG. 2 and FIG. 3. FIG. 2 is a line graph of MCC on test set TS125 for PepBCL and other existing methods; FIG. 3(a) shows ROC curves on test set TS125 for PepBCL and other existing methods; FIG. 3(b) shows ROC curves of PepBCL and the latest method PepBind on test set TS639. To verify that the contrastive learning module proposed in this embodiment helps the model extract higher-quality features, an ablation experiment was performed: first, the same neural network model based on the pre-trained model BERT and contrastive learning as in this embodiment was constructed; then an ablated network lacking the contrastive learning module was constructed; and the two networks were tested on the comparative-experiment data sets Dataset1 and Dataset2.

For the complete network, we optimize the network parameters by minimizing both the contrastive loss function and the cross-entropy loss function; for the ablated network, we optimize the parameters by minimizing only the cross-entropy loss function. Finally, the high-dimensional features obtained by the two networks on the test set are reduced in dimension and visualized with the t-SNE tool, with samples of different classes marked in different colors; the more clearly the two colors separate, the better and higher-quality the features learned by the model.

FIG. 4 shows t-SNE visualizations of the feature-space distribution of the complete PepBCL model and of the model without the contrastive learning module. (A) and (B) show the t-SNE visualizations of PepBCL on Dataset1 with and without the contrastive module; (C) and (D) show the t-SNE visualizations of PepBCL on Dataset2 with and without the contrastive module. The results in FIG. 4 demonstrate that the contrastive learning framework proposed in this embodiment learns high-quality representations and improves prediction performance.
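The visualization step can be sketched with scikit-learn's t-SNE; the synthetic 128-dimensional features below merely stand in for the encoder outputs, and the perplexity value is an illustrative choice:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Synthetic stand-ins for high-dimensional residue features:
# 50 "binding" and 50 "non-binding" residues, 128-dimensional.
feats = np.vstack([rng.normal(0.0, 1.0, (50, 128)),
                   rng.normal(2.0, 1.0, (50, 128))])
labels = np.array([0] * 50 + [1] * 50)

# Reduce to 2-D for plotting; points are then colored by label to
# inspect how well the two classes separate.
emb = TSNE(n_components=2, perplexity=10,
           random_state=42).fit_transform(feats)
print(emb.shape)  # (100, 2)
```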

To further illustrate the advantages of the neural network model of this embodiment, two proteins were randomly selected from the test set, with PDB IDs 4l3oA and 1fchA. Comparison experiments were then performed on the two proteins using the neural network model of this embodiment and the comparison method PepBind, and the predicted results were rendered with a visualization tool. As shown in FIG. 5, two different colors represent binding and non-binding residues; the more closely a prediction matches the true binding residues obtained from biological experiments, the better the prediction.

FIG. 5 shows a visual representation of the prediction results of PepBCL and the prior-art method on two randomly selected proteins (PDB IDs: 4l3oA and 1fchA). (A)-(C) show, for protein 4l3oA, the actual binding residues obtained from biological experiments, the binding residues predicted by PepBCL, and the binding residues predicted by PepBind, respectively; (D)-(F) show the same for protein 1fchA.

To verify the specificity of the neural network model of this embodiment for recognizing protein-polypeptide binding sites, comparison experiments were performed on four data sets, namely Dataset1 and the data sets DNA30, RNA30, and CBH30, using the neural network model of this embodiment, and the results were evaluated with the evaluation indexes.

FIGS. 6(a)-(b) show the prediction performance of PepBCL of this embodiment on binding sites of proteins that bind different ligands (polypeptide, DNA, RNA, carbohydrate). FIG. 6(a) shows the recall, precision, and MCC of PepBCL of this embodiment on the different ligand-binding protein data sets. FIG. 6(b) shows ROC plots of the method of this embodiment on the four different ligand-binding protein data sets.

The results shown in FIGS. 6(a) - (b) demonstrate that the model PepBCL of this example is specific for the recognition of protein-polypeptide binding sites.

This embodiment is the first to apply contrastive learning to the problem of predicting protein-polypeptide binding sites and, combined with the pre-trained model BERT as the encoder of protein sequences, achieves good results on multiple test sets. In addition, a number of comparison experiments were carefully designed, effectively verifying the comprehensive performance of the method.

Example two

In one or more embodiments, a system for predicting a protein-polypeptide binding site is disclosed, comprising:

the data acquisition module is used for acquiring protein-polypeptide sequence data to be predicted;

a binding site prediction module for inputting the data into a trained neural network model based on the pre-training model BERT and contrastive learning, outputting site-level polypeptide binding probabilities, and determining whether each site in the input sequence is bound;

wherein the neural network model based on the pre-training model BERT and contrastive learning comprises:

a sequence embedding module for converting each amino acid in the original protein-polypeptide sequence into an embedding matrix;

a BERT-based encoder module for passing the embedding matrix through BERT encoding and a fully connected neural network layer to obtain a low-dimensional representation matrix of each amino acid;

a contrastive learning module for applying a constraint, constructed as a contrastive loss, during BERT encoding;

and an output module for generating discriminative binding and non-binding site representation features.

The specific implementation of each module follows the method disclosed in the first embodiment and is not repeated here.
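The module pipeline above (sequence embedding, encoder, fully connected output) can be sketched in miniature as follows. A single randomly initialized self-attention layer stands in for the BERT encoder, the embedding size and the 0.5 decision threshold are illustrative assumptions, and the untrained weights mean the probabilities themselves are meaningless:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"                 # 20 standard amino acids
rng = np.random.default_rng(0)
d = 16                                      # embedding size (hypothetical)
emb_table = rng.normal(size=(len(AA), d))   # sequence embedding module
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # attention weights
Wo = rng.normal(size=(d, 1))                # fully connected output layer

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict(seq):
    """Per-residue polypeptide-binding probabilities for a protein sequence."""
    x = emb_table[[AA.index(c) for c in seq]]        # embed each amino acid
    att = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))
    h = att @ (x @ Wv)                               # encoder representation
    return 1 / (1 + np.exp(-(h @ Wo).ravel()))       # sigmoid -> site scores

probs = predict("MKTAYIAK")                          # hypothetical sequence
binds = probs >= 0.5                                 # per-site binding decision
print(probs.round(3), binds)
```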

EXAMPLE III

In one or more embodiments, a terminal device is disclosed, comprising a server that includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for predicting a protein-polypeptide binding site of the first embodiment. For brevity, details are not repeated here.

It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.

The method for predicting a protein-polypeptide binding site in the first embodiment can be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the present invention; those skilled in the art should understand that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention remain within its scope of protection.
