Method and device for predicting ligand-protein interaction

Document No.: 170913. Publication date: 2021-10-29.

Reading note: This technology, "Method and device for predicting ligand-protein interaction" (一种配体-蛋白质相互作用的预测方法及装置), was created by 蒋华良 (Jiang Hualiang), 郑明月 (Zheng Mingyue), and 陈立凡 (Chen Lifan) on 2020-04-29. The invention discloses a method and a device for predicting ligand-protein interaction, comprising: processing the primary sequence of a target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors; acquiring a plurality of atomic characteristic sequences of a target ligand based on a molecular fingerprint of the target ligand; and predicting, by using a preset prediction model and based on the plurality of protein characteristic sequences and the plurality of atomic characteristic sequences, the probability that the target protein and the target ligand interact. In the embodiments of the invention, to predict whether a given protein and a given ligand can interact, it is only necessary to obtain the protein characteristic sequences of the protein and the atomic characteristic sequences of the ligand; the prediction model can then predict which amino acid fragments in the protein can interact with which atoms in the ligand, from which the probability of interaction between the protein and the ligand can be calculated.

1. A method for predicting a ligand-protein interaction, comprising the steps of:

processing a primary sequence of a target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors;

acquiring a plurality of atomic characteristic sequences of a target ligand based on a molecular fingerprint of the target ligand;

and predicting by using a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atom characteristic sequences to obtain the interaction probability of the target protein and the target ligand.

2. The method according to claim 1, wherein the processing of the primary sequence of the target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors comprises:

dividing the primary sequence of the target protein into a plurality of sequence segments by taking a continuous preset number of amino acids as a group;

and coding each sequence segment by adopting a preset algorithm to obtain a plurality of protein characteristic sequences consisting of characteristic vectors corresponding to each sequence segment.

3. The method of claim 1, wherein acquiring a plurality of atomic characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand comprises:

processing the SMILES string of the target ligand by using a cheminformatics package to obtain a molecular fingerprint of the target ligand;

and processing the molecular fingerprint by using a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.

4. The method of claim 1, wherein the predicting based on the plurality of protein characteristic sequences and the plurality of atomic characteristic sequences using a preset prediction model to obtain the probability of the target protein interacting with the target ligand comprises:

processing the protein characteristic sequences and the atom characteristic sequences by adopting an attention mechanism to determine a target characteristic sequence capable of interacting;

and calculating to obtain the probability of the target protein being combined with the target ligand based on the target characteristic sequence.

5. The method of claim 1, wherein the method further comprises: training by adopting a deep learning method to obtain the prediction model, and specifically comprising the following steps:

acquiring experimental data;

determining a true value for a sample protein-sample ligand interaction based on the experimental data;

obtaining a plurality of protein characteristic sequences of sample protein, and obtaining a plurality of atom characteristic sequences of a sample ligand;

and carrying out model training based on the plurality of protein characteristic sequences of the sample protein, the plurality of atomic characteristic sequences of the sample ligand and the true value to obtain the prediction model.

6. The method of claim 5, wherein the model training based on the plurality of protein characteristic sequences of the sample protein, the plurality of atomic characteristic sequences of the sample ligand, and the true value to obtain the prediction model comprises:

processing a plurality of protein characteristic sequences of the sample protein and a plurality of atomic characteristic sequences of the sample ligand by using a self-attention mechanism to obtain a plurality of sample sequences containing interaction information;

calculating the plurality of sample sequences by using a preset calculation formula to obtain interaction characteristics;

processing the interaction characteristics by utilizing a fully-connected neural network to obtain a predicted value of the interaction between the sample protein and the sample ligand;

calculating a cross entropy based on the predicted value and the true value;

and taking the cross entropy as the loss function of the prediction model, and training by a stochastic gradient descent method to obtain the prediction model.

7. A prediction device for ligand-protein interaction, comprising:

the first acquisition module is used for processing the primary sequence of the target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors;

the second acquisition module is used for acquiring a plurality of atomic characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand;

and the prediction module is used for predicting by utilizing a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atom characteristic sequences to obtain the interaction probability of the target protein and the target ligand.

8. The apparatus of claim 7, wherein the first obtaining module is specifically configured to:

dividing the primary sequence of the target protein into a plurality of sequence segments by taking a continuous preset number of amino acids as a group;

and coding each sequence segment by adopting a preset algorithm to obtain a plurality of protein characteristic sequences consisting of characteristic vectors corresponding to each sequence segment.

9. The apparatus of claim 7, wherein the second obtaining module is specifically configured to:

processing the SMILES string of the target ligand by using a cheminformatics package to obtain a molecular fingerprint of the target ligand;

and processing the molecular fingerprint by using a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.

10. The apparatus of claim 7, wherein the prediction module is specifically configured to:

processing the protein characteristic sequences and the atom characteristic sequences by adopting an attention mechanism to determine a target characteristic sequence capable of interacting;

and calculating to obtain the probability of the target protein being combined with the target ligand based on the target characteristic sequence.

Technical Field

The invention relates to the field of drug screening, in particular to a method and a device for predicting ligand-protein interaction.

Background

Virtual screening is an important task in early-stage drug development and falls into three categories: structure-based virtual screening, ligand-based virtual screening, and chemogenomics-based virtual screening. Structure-based virtual screening requires the crystal structure of the protein; because the crystal structures of many potential target proteins have not been solved, it cannot be applied to drug screening for those targets. Ligand-based virtual screening requires a sufficient amount of ligand information, and for many targets too few active small molecules have been reported to build an accurate and reliable model. Ligand-based virtual screening also limits the discovery and design of active small molecules with novel structures. In view of the limitations of both approaches, a number of chemogenomics-based machine learning methods have been proposed to predict ligand-protein interactions; their drawback is that descriptors for proteins and small molecules must be defined manually.

Because such machine learning models require manually defined descriptors for proteins and small molecules, they cannot autonomously learn the characteristics of proteins and small molecules from data end to end; moreover, conventional machine learning learns poorly from large samples.

In addition, existing deep learning models do not extract the real interaction characteristics, so they are misled by statistical regularities irrelevant to the task. As a result, they perform poorly in practical applications and cannot accurately predict ligand-protein interactions.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for predicting ligand-protein interaction, which are used for solving the problem that the interaction relation of ligand and protein cannot be accurately predicted in the prior art.

In order to solve the technical problem, the embodiment of the application adopts the following technical scheme: a method for predicting ligand-protein interactions, comprising the steps of:

processing a primary sequence of a target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors;

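acquiring a plurality of atomic characteristic sequences of a target ligand based on a molecular fingerprint of the target ligand;

and predicting by using a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atom characteristic sequences to obtain the interaction probability of the target protein and the target ligand.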

Optionally, the processing the primary sequence of the target protein to obtain a plurality of protein feature sequences composed of feature vectors specifically includes:

dividing the primary sequence of the target protein into a plurality of sequence segments by taking a continuous preset number of amino acids as a group;

and coding each sequence segment by adopting a preset algorithm to obtain a plurality of protein characteristic sequences consisting of characteristic vectors corresponding to each sequence segment.

Optionally, the obtaining of a plurality of atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand specifically includes:

processing the SMILES string of the target ligand by using a cheminformatics package to obtain a molecular fingerprint of the target ligand;

and processing the molecular fingerprint by using a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.

Optionally, the predicting based on the plurality of protein feature sequences and the plurality of atomic feature sequences by using a preset prediction model to obtain the probability of the interaction between the target protein and the target ligand specifically includes:

processing the protein characteristic sequences and the atom characteristic sequences by adopting an attention mechanism to determine a target characteristic sequence capable of interacting;

and calculating to obtain the probability of the target protein being combined with the target ligand based on the target characteristic sequence.

Optionally, the method further includes: training by adopting a deep learning method to obtain the prediction model, and specifically comprising the following steps:

acquiring experimental data;

determining a true value for a sample protein-sample ligand interaction based on the experimental data;

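obtaining a plurality of protein characteristic sequences of sample protein, and obtaining a plurality of atom characteristic sequences of a sample ligand;

and carrying out model training based on the plurality of protein characteristic sequences of the sample protein, the plurality of atomic characteristic sequences of the sample ligand and the true value to obtain the prediction model.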

Optionally, the model training is performed based on the plurality of protein feature sequences of the sample protein, the plurality of atomic feature sequences of the sample ligand, and the true value, so as to obtain the prediction model, which specifically includes:

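processing a plurality of protein characteristic sequences of the sample protein and a plurality of atomic characteristic sequences of the sample ligand by using a self-attention mechanism to obtain a plurality of sample sequences containing interaction information;

calculating the plurality of sample sequences by using a preset calculation formula to obtain interaction characteristics;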

processing the interaction characteristics by utilizing a fully-connected neural network to obtain a predicted value of the interaction between the sample protein and the sample ligand;

calculating a cross entropy based on the predicted value and the true value;

and taking the cross entropy as the loss function of the prediction model, and training by a stochastic gradient descent method to obtain the prediction model.

In order to solve the technical problem, the embodiment of the application adopts the following technical scheme: a prediction device of ligand-protein interaction comprising:

the first acquisition module is used for processing the primary sequence of the target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors;

the second acquisition module is used for acquiring a plurality of atomic characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand;

and the prediction module is used for predicting by using a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atom characteristic sequences to obtain the probability of the interaction between the target protein and the target ligand.

Optionally, the first obtaining module is specifically configured to:

dividing the primary sequence of the target protein into a plurality of sequence segments by taking a continuous preset number of amino acids as a group;

and coding each sequence segment by adopting a preset algorithm to obtain a plurality of protein characteristic sequences consisting of characteristic vectors corresponding to each sequence segment.

Optionally, the second obtaining module is specifically configured to:

processing the SMILES string of the target ligand by using a cheminformatics package to obtain a molecular fingerprint of the target ligand;

and processing the molecular fingerprint by using a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.

Optionally, the prediction module is specifically configured to:

processing the protein characteristic sequences and the atom characteristic sequences by adopting an attention mechanism to determine a target characteristic sequence capable of interacting;

and calculating to obtain the probability of the target protein being combined with the target ligand based on the target characteristic sequence.

The embodiment of the invention has the beneficial effects that: the prediction model is obtained through pre-training, so that when the interaction between a certain protein and a certain ligand needs to be predicted, only the protein characteristic sequences of the protein and the atomic characteristic sequences of the ligand need to be obtained, and the prediction model can be used for predicting which protein characteristic sequences in the protein can interact with which atomic characteristic sequences in the ligand, so that the probability of the interaction between the protein and the ligand can be calculated, and the prediction of the interaction between the protein and the ligand is more accurate.

Drawings

FIG. 1 is a flow chart of a method for predicting ligand-protein interactions in an embodiment of the invention.

FIG. 2 is a schematic diagram of the prediction of ligand-protein interactions in an embodiment of the present invention.

FIG. 3 is a flowchart illustrating the specific steps for obtaining the interaction characteristic sequence according to an embodiment of the present invention.

FIG. 4 is a block diagram showing the structure of a prediction apparatus for ligand-protein interaction in the embodiment of the present invention.

Detailed Description

Various aspects and features of the present application are described herein with reference to the drawings.

It will be understood that various modifications may be made to the embodiments of the present application. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.

These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.

It should also be understood that, although the present application has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of application, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.

The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.

Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely exemplary of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail to avoid obscuring the application with unnecessary detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.

The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.

The embodiment of the invention provides a method for predicting ligand-protein interaction, as shown in figure 1, comprising the following steps:

Step S101, processing the primary sequence of the target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors.

In the specific implementation of this step, the word vector embedding method (word2vec) from natural language processing can be used to process the amino acid sequence of the protein into a group of sequences consisting of feature vectors; that is, a plurality of protein characteristic sequences $p_1, p_2, \ldots, p_b$ are obtained.

Step S102, acquiring a plurality of atom characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand.

In the specific implementation of this step, the graph-based molecular fingerprint of the target ligand can be encoded by using the cheminformatics package RDKit, and a plurality of atomic characteristic sequences $c_1, c_2, \ldots, c_a$ of the target ligand can be learned through a graph convolution network.

Step S103, predicting by using a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atom characteristic sequences to obtain the probability of the interaction between the target protein and the target ligand.

In the specific implementation of this step, after the protein characteristic sequences $p_1, p_2, \ldots, p_b$ of the protein and the atomic characteristic sequences $c_1, c_2, \ldots, c_a$ of the ligand are obtained, they can be encoded and decoded (within the prediction model) by the Transformer architecture from natural language processing, which outputs the target characteristic sequence of the interaction, $x_1, x_2, \ldots, x_a$. A calculation is then performed on the target characteristic sequence to obtain the probability of the target protein binding to the target ligand.

In the embodiment of the invention, when it is required to predict whether a certain protein and a certain ligand can interact, each protein characteristic sequence of the protein and the atomic characteristic sequence of the ligand are only required to be obtained, and by using a prediction model, which protein characteristic sequences can interact with which atomic characteristic sequences can be predicted, so that the probability of the interaction between the protein and the ligand can be calculated.

In another embodiment, the present invention provides a method for predicting a ligand-protein interaction, comprising the steps of:

Step S201, dividing the primary sequence of the target protein into a plurality of sequence segments by taking a continuous preset number of amino acids as a group; and coding each sequence segment by adopting a preset algorithm to obtain a plurality of protein characteristic sequences consisting of characteristic vectors corresponding to each sequence segment.

In the specific implementation of this step, the amino acid sequence of the target protein may be divided into b overlapping segments by taking three consecutive amino acids as a group (b = amino acid length - 2), and the b amino acid segments are then encoded into the characteristic sequence $p_1, p_2, \ldots, p_b$ by using the word2vec algorithm.
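The segmentation described in this step can be sketched as follows. This is a minimal illustration, assuming a sliding window with a stride of one residue, which is what b = amino acid length - 2 implies for groups of three. The function name `split_into_ngrams` and the example sequence are hypothetical, and the word2vec embedding of each segment is not shown.

```python
def split_into_ngrams(sequence: str, n: int = 3) -> list:
    """Split a protein primary sequence into overlapping n-residue segments.

    A sequence of length L yields L - n + 1 segments (L - 2 for n = 3).
    """
    if len(sequence) < n:
        return []
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Hypothetical 10-residue sequence, used only for illustration.
segments = split_into_ngrams("MKTAYIAKQR")
print(len(segments))  # 8, i.e. 10 - 2
print(segments[:3])   # ['MKT', 'KTA', 'TAY']
```

Each segment would then be looked up in a trained word2vec model to obtain its feature vector, so a protein of length 200 gives 198 vectors.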

Step S202, processing the SMILES string of the target ligand by using a cheminformatics package to obtain a molecular fingerprint of the target ligand; and processing the molecular fingerprint by using a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.

In the specific implementation of this step, the SMILES string of the molecule can be processed with the RDKit package, with each atom encoded as a 34-dimensional feature vector, to obtain the molecular fingerprint of the small molecule; the graph-based molecular fingerprint is then processed by a graph convolution neural network to obtain the atomic characteristic sequence $c_1, c_2, \ldots, c_a$ (a is the number of non-hydrogen atoms in the molecule).
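To make the graph convolution step concrete, the sketch below applies one message-passing update to a toy molecular graph in plain Python. The text does not specify the exact form of the graph convolution, so the update rule used here (sum of an atom's own features and its neighbours' features, followed by a linear map and ReLU) is an assumption. The function name `graph_conv_layer`, the 2-dimensional placeholder features, and the identity weight matrix are likewise illustrative only. In practice the per-atom features and the adjacency matrix would come from RDKit.

```python
def graph_conv_layer(features, adjacency, weight):
    """One illustrative graph-convolution step (assumed form, not from the source).

    Each atom sums its own feature vector with those of its bonded neighbours,
    then applies a linear map followed by a ReLU.
    """
    n_atoms = len(features)
    in_dim = len(features[0])
    out_dim = len(weight[0])
    updated = []
    for i in range(n_atoms):
        # Aggregate the atom itself plus its bonded neighbours.
        agg = list(features[i])
        for j in range(n_atoms):
            if adjacency[i][j]:
                agg = [a + f for a, f in zip(agg, features[j])]
        # Linear projection followed by ReLU.
        row = [max(0.0, sum(agg[k] * weight[k][o] for k in range(in_dim)))
               for o in range(out_dim)]
        updated.append(row)
    return updated


# Toy ligand with three heavy atoms in a chain (C-C-O), using 2-dimensional
# placeholder features instead of the 34-dimensional vectors in the text.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity map, chosen for readability
atom_sequence = graph_conv_layer(features, adjacency, weight)
print(atom_sequence)  # [[2.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
```

The output has one vector per non-hydrogen atom, which matches the shape of the atomic characteristic sequence (a vectors).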

Step S203, processing the plurality of protein characteristic sequences and the plurality of atom characteristic sequences by adopting a self-attention mechanism to determine a target characteristic sequence capable of interacting; and calculating to obtain the probability of the target protein being combined with the target ligand based on the target characteristic sequence.

In the specific implementation of this step, the target characteristic sequence is calculated with a preset calculation formula to obtain the interaction characteristics, and the interaction characteristics are then processed by a fully-connected neural network to obtain the predicted value (probability) of the target protein-target ligand interaction. More specifically, after the protein characteristic sequences $p_1, p_2, \ldots, p_b$ of the protein and the atomic characteristic sequences $c_1, c_2, \ldots, c_a$ of the ligand are obtained, they can be encoded and decoded by the Transformer architecture from natural language processing, which outputs the target characteristic sequence of the interaction, $x_1, x_2, \ldots, x_a$; the target characteristic sequence is then calculated with the preset calculation formula to obtain the interaction characteristics; finally, the interaction characteristics are processed by the fully-connected neural network to obtain the binding probability of the target protein and the target ligand.

The embodiment provides a method for predicting ligand-protein interaction, which further comprises training to obtain a prediction model by using a deep learning method before predicting the interaction between a target protein and a target ligand. The method specifically comprises the following steps:

Step S301, acquiring experimental data;

Step S302, determining the true value of the sample protein-sample ligand interaction based on the experimental data;

In the specific implementation of this step, the true value y of the interaction can be obtained from actual experimental data and results. The true value y is either 1 or 0, where 1 indicates that the protein and the ligand can interact and 0 indicates that they cannot.

Step S303, acquiring a plurality of protein characteristic sequences of sample protein, and acquiring a plurality of atomic characteristic sequences of a sample ligand;

In the specific implementation of this step, the primary sequence of the sample protein may be processed to obtain a plurality of protein characteristic sequences composed of feature vectors. For example, taking three consecutive amino acids as a group, the amino acid sequence of the sample protein is divided into b fragments (b = amino acid length - 2), and the b amino acid fragments are then encoded, using the word vector embedding method (word2vec) from natural language processing, into a group of sequences $p_1, p_2, \ldots, p_b$ composed of feature vectors. This group includes a plurality of protein characteristic sequences; for example, $p_1$ represents one protein characteristic sequence. Specifically, a protein with an amino acid length of 200 can be selected from the experimental data, which yields a protein characteristic sequence of dimension 198 × 100.

In this step, the atomic characteristic sequences of the sample ligand can be obtained based on the molecular fingerprint of the sample ligand. More specifically, the cheminformatics package RDKit may be used to process the SMILES string of the sample ligand, with each atom encoded as a 34-dimensional feature vector (as shown in Table 1), to obtain the molecular fingerprint of the ligand; the molecular fingerprint is then processed by the graph convolution network to obtain a plurality of atomic characteristic sequences $c_1, c_2, \ldots, c_a$ of the sample ligand (a is the number of non-hydrogen atoms in the molecule). Specifically, a sample ligand with 20 non-hydrogen atoms can be selected from the experimental data, which yields an atomic characteristic sequence of dimension 20 × 64.

TABLE 1

Step S304, model training is carried out based on the plurality of protein characteristic sequences of the sample protein, the plurality of atom characteristic sequences of the sample ligand and the true value, and the prediction model is obtained.

In the specific implementation, this step can be divided into the following sub-steps:

Step S3041, a self-attention mechanism is used to process the protein characteristic sequences of the sample protein and the atomic characteristic sequences of the sample ligand, and a plurality of sample sequences capable of interacting are obtained through prediction.

More specifically, as shown in FIG. 2, the sample protein characteristic sequence (i.e., the protein characteristic sequence of the sample protein), namely $p_1, p_2, \ldots, p_b$ with dimension b × 100, is input into the encoder, which outputs the encoded sample protein characteristic sequence, namely $p_1, p_2, \ldots, p_b$ with dimension b × 64. Then the atomic characteristic sequence of the sample ligand, namely $c_1, c_2, \ldots, c_a$ with dimension a × 64, and the encoded sample protein characteristic sequence $p_1, p_2, \ldots, p_b$ with dimension b × 64 are input into the decoder; through the learning of the Transformer decoder, the interaction characteristic sequence (i.e., the plurality of sample sequences) $x_1, x_2, \ldots, x_a$ with dimension a × 64 is finally output;

Step S3042, calculating the plurality of sample sequences by using a preset calculation formula to obtain interaction characteristics;

In the specific implementation of this step, three calculation formulas are used to obtain the interaction characteristics, in which $\lVert x_i \rVert$ is the norm (modulus) of the vector $x_i$, $\alpha_i$ is the weight of the vector $x_i$, $x_i$ denotes the i-th interaction characteristic sequence, and $y_{interaction}$ denotes the interaction characteristics.
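The formulas themselves are not reproduced in this text, so the following is only a sketch of one pooling scheme that is consistent with the definitions above. It assumes that the weight of each vector is a softmax over the Euclidean norms of the vectors and that the interaction characteristics are the weighted sum of the vectors. Both of these choices, and the function name `interaction_feature`, are assumptions rather than the formulas of the invention.

```python
import math


def interaction_feature(xs):
    """Pool a sequence of vectors into one vector using norm-based weights.

    Assumed scheme: alpha_i = softmax(||x_i||), y = sum_i alpha_i * x_i.
    """
    norms = [math.sqrt(sum(v * v for v in x)) for x in xs]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(norms)
    exps = [math.exp(n - m) for n in norms]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(xs[0])
    y = [sum(a * x[d] for a, x in zip(alphas, xs)) for d in range(dim)]
    return alphas, y


# Two toy interaction vectors; the one with the larger norm gets more weight.
alphas, y_interaction = interaction_feature([[3.0, 4.0], [0.0, 0.0]])
print(alphas[0] > alphas[1])  # True
```

Under this assumed scheme, vectors with larger norms contribute more to the pooled result, and the output dimension equals that of each $x_i$ (64 in the embodiment).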

Step S3043, processing the interaction characteristics by using a fully-connected neural network to obtain a predicted value of the interaction between the sample protein and the sample ligand;

In this step, after the interaction characteristics $y_{interaction}$ are obtained, $y_{interaction}$ can be input into a fully-connected neural network, which outputs a predicted value.

Step S3044, calculating a cross entropy based on the predicted value and the true value;

In this step, after the predicted value is obtained, the cross entropy between the predicted value and the true value y is calculated.
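As an illustration of this step, the sketch below computes the cross entropy for a single sample, assuming the usual binary form for a true value y in {0, 1} and a predicted probability in (0, 1). The function name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices, not taken from the source.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy between a 0/1 label and a predicted probability."""
    # Clip the prediction so that log(0) is never evaluated.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# A confident correct prediction gives a smaller loss than a confident wrong one.
loss_good = binary_cross_entropy(1, 0.9)
loss_bad = binary_cross_entropy(1, 0.1)
print(loss_good < loss_bad)  # True
```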

Step S3045, taking the cross entropy as the loss function of the prediction model, and training by a stochastic gradient descent method to obtain the prediction model.

The stochastic gradient descent method used for training the model in this step is a common model training method and is not described here again.
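For readers unfamiliar with the training loop, the following toy sketch shows stochastic gradient descent with a cross-entropy loss on a single sigmoid unit. It stands in for the full prediction model only for illustration. The function names `sgd_train` and `predict`, the learning rate, the epoch count, and the two-sample data set are all hypothetical.

```python
import math
import random


def predict(w, b, x):
    """Sigmoid output of a single linear unit."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sgd_train(samples, lr=0.5, epochs=200, seed=0):
    """Train the unit with per-sample (stochastic) gradient updates."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, y in data:
            # For a sigmoid output with cross-entropy loss, dLoss/dz = p - y.
            grad = predict(w, b, x) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Toy data: one feature, label 1 when the feature is 1 and label 0 when it is 0.
w, b = sgd_train([([0.0], 0), ([1.0], 1)])
print(predict(w, b, [1.0]) > 0.5, predict(w, b, [0.0]) < 0.5)  # True True
```

In the full model the same update rule is applied to all learnable parameters, with the gradients obtained by backpropagation.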

In this example, when the sample protein characteristic sequence (i.e., the protein characteristic sequence of the sample protein), namely $p_1, p_2, \ldots, p_b$ with dimension b × 100, is input into the encoder and the encoded sample protein characteristic sequence is output, it is processed with the following formula in the encoder:

$$h_l(X) = (X * W_1 + s) \otimes \sigma(X * W_2 + t)$$

where $X \in \mathbb{R}^{n \times m_1}$ is the input of layer $h_l$; $W_1 \in \mathbb{R}^{k \times m_1 \times m_2}$, $s \in \mathbb{R}^{m_2}$, $W_2 \in \mathbb{R}^{k \times m_1 \times m_2}$, and $t \in \mathbb{R}^{m_2}$ are learnable parameters; $n$ is the length of the sequence; $m_1$ and $m_2$ are the dimensions of the input and hidden-layer features, respectively; $k$ is the size of the convolution kernel; $\sigma$ is the sigmoid function; and $\otimes$ is the Hadamard product of matrices. The parameters are set as follows: $k = 7$, $m_1 = 100$ (the dimension of the input-layer features), and $m_2 = 64$ (the dimension of the hidden-layer features). That is, the input $X$ is passed through a one-dimensional convolution and a gated linear unit to compute $h_l(X)$, the protein characteristic sequence $p_1, p_2, \ldots, p_b$ is updated, and finally the encoded protein characteristic sequence $p_1, p_2, \ldots, p_b$ is output.
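The encoder layer can be sketched in plain Python as follows, assuming the gated-linear-unit form h(X) = (X * W1 + s) ⊗ σ(X * W2 + t), where * is a one-dimensional convolution along the sequence. The zero padding that keeps the output length equal to the input length is an assumption, as are the function names `conv1d_same` and `glu_conv_layer` and the toy sizes (n = 4, m1 = 2, m2 = 1, k = 3 instead of k = 7, m1 = 100, m2 = 64).

```python
import math


def conv1d_same(x, w, bias):
    """1-D convolution along the sequence axis with zero padding.

    x: n rows of m1 features; w: k x m1 x m2 kernel; bias: m2 values.
    Returns n rows of m2 features.
    """
    n, k = len(x), len(w)
    m1, m2 = len(w[0]), len(w[0][0])
    pad = k // 2
    out = []
    for pos in range(n):
        row = list(bias)
        for dk in range(k):
            src = pos + dk - pad
            if 0 <= src < n:  # positions outside the sequence count as zero
                for i in range(m1):
                    for o in range(m2):
                        row[o] += x[src][i] * w[dk][i][o]
        out.append(row)
    return out


def glu_conv_layer(x, w1, s, w2, t):
    """Gated linear unit: (X * W1 + s) multiplied elementwise by sigmoid(X * W2 + t)."""
    lin = conv1d_same(x, w1, s)
    gate = conv1d_same(x, w2, t)
    return [[a / (1.0 + math.exp(-g)) for a, g in zip(ar, gr)]
            for ar, gr in zip(lin, gate)]


# Toy input: 4 positions with 2 features each, projected to 1 hidden feature.
x = [[1.0, 1.0]] * 4
w1 = [[[1.0], [1.0]] for _ in range(3)]  # all-ones kernel
w2 = [[[0.0], [0.0]] for _ in range(3)]  # zero kernel, so the gate is sigmoid(0) = 0.5
encoded = glu_conv_layer(x, w1, [0.0], w2, [0.0])
print(encoded)  # [[2.0], [3.0], [3.0], [2.0]]
```

The gate lets the layer scale each convolution output between 0 and 1, which is the role the sigmoid term plays in the formula.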

In this example, the atomic characteristic sequence of the sample ligand ($c_1, c_2, \ldots, c_a$, with dimension a × 64) and the encoded sample protein characteristic sequence ($p_1, p_2, \ldots, p_b$, with dimension b × 64) are input into the decoder for learning, and the interaction characteristic sequence (i.e., the plurality of sample sequences) $x_1, x_2, \ldots, x_a$ is output. Specifically, the attention value is calculated with the formula of the self-attention layer:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the scaling factor, equal to the dimension of the hidden-layer features (64 in this embodiment), and $T$ denotes matrix transposition. Specifically, as shown in FIG. 3, the atomic characteristic sequence of the sample ligand is first used as the input of the first layer (a self-attention layer, i.e., the formula above), which computes the attention values of the atomic characteristic sequence and performs weighted summation and normalization; here $Q = K = V = c_1, c_2, \ldots, c_a$. The result is then used, together with the protein characteristic sequence, as the input of the second layer (a self-attention layer), in which the attention values between the atomic characteristic sequence and the protein characteristic sequence are calculated by the self-attention mechanism, followed by weighted summation and normalization; here $Q = c_1, c_2, \ldots, c_a$ and $K = V = p_1, p_2, \ldots, p_b$. Finally, the result is used as the input of the third layer (i.e., it is input into the convolutional neural network), where the weighted summation and normalization are performed a third time, yielding the interaction characteristic sequence (i.e., the plurality of sample sequences) $x_1, x_2, \ldots, x_a$.
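The two attention layers of the decoder can be sketched with the standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below is a single-head version in plain Python and leaves out the normalization, residual connections, and the third (convolutional) layer. The function names and the 2-dimensional toy vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of equal-width vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


# Toy sizes: a = 2 ligand atoms, b = 3 protein segments, hidden dimension 2.
c = [[1.0, 0.0], [0.0, 1.0]]
p = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

h = attention(c, c, c)  # first layer: Q = K = V = atomic sequence
x = attention(h, p, p)  # second layer: queries from atoms, K = V = protein sequence
print(len(x), len(x[0]))  # 2 2
```

Because the queries come from the ligand atoms, the output has one vector per atom (a vectors), while the attention weights over the protein segments indicate which segments each atom attends to.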

In the embodiment of the invention, the end-to-end deep learning model TransformerCPI achieves the best results reported to date on three public benchmark data sets. TransformerCPI also achieves the best results in label reversal experiments, with a marked improvement over other models, which shows that the method is capable of learning genuine interaction features. Moreover, because TransformerCPI has good interpretability, it can indicate which amino acid fragments in the protein are most likely to bind which atomic characteristic sequences in the ligand, and which atoms (atomic characteristic sequences) in the ligand molecule contribute most to the binding, thereby providing guidance for further modification of the molecular structure.

In another embodiment, the present invention provides a device for predicting ligand-protein interaction, as shown in FIG. 4, comprising:

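a first acquisition module, configured to process the primary sequence of the target protein to obtain a plurality of protein characteristic sequences consisting of characteristic vectors;

a second acquisition module, configured to acquire a plurality of atomic characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand; and

a prediction module, configured to perform prediction with a preset prediction model based on the plurality of protein characteristic sequences and the plurality of atomic characteristic sequences, to obtain the probability of interaction between the target protein and the target ligand.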

In this embodiment, the first acquisition module is specifically configured to: divide the primary sequence of the target protein into a plurality of sequence segments, taking a preset number of consecutive amino acids as a group; and encode each sequence segment with a preset algorithm to obtain a plurality of protein characteristic sequences consisting of the characteristic vectors corresponding to the sequence segments.
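The segmentation step can be illustrated with a short Python sketch. The group size of 3, the stride of 1 (overlapping groups) and the example sequence are assumptions chosen for illustration, since this passage only specifies "a preset number" of consecutive amino acids. The preset encoding algorithm is likewise not named here, so the integer index built below is only a hypothetical placeholder for the key used to look up each segment's characteristic vector.

```python
def split_into_segments(sequence, size=3, stride=1):
    """Return consecutive groups of `size` amino acids taken every `stride` residues."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, stride)]

def build_vocabulary(segments):
    """Map each distinct segment to an integer index, in order of first appearance."""
    vocab = {}
    for seg in segments:
        vocab.setdefault(seg, len(vocab))
    return vocab

segments = split_into_segments("MKTAYIAK")   # made-up primary sequence
vocab = build_vocabulary(segments)
indices = [vocab[seg] for seg in segments]   # lookup keys for the characteristic vectors
print(segments)  # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
```

Calling split_into_segments("MKTAYIAK", size=3, stride=3) instead yields non-overlapping groups, which is the other natural reading of the step.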

In this embodiment, the second acquisition module is specifically configured to: process the SMILES string of the target ligand with a cheminformatics package to obtain a molecular fingerprint of the target ligand; and process the molecular fingerprint with a graph convolution network to obtain a plurality of atomic characteristic sequences of the target ligand.
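A single graph-convolution step over the atoms of a ligand might look like the following Python sketch. It assumes NumPy and the commonly used symmetric normalization with self-loops, which is one possible variant and is not stated in the embodiment. To stay self-contained, the bond structure of ethanol (SMILES "CCO") is written out by hand rather than parsed by a cheminformatics package, and the one-hot atom features and random weights are hypothetical.

```python
import numpy as np

def gcn_layer(adjacency, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)

# Ethanol, SMILES "CCO": heavy atoms C0-C1-O2, bonds written out by hand.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
# Hypothetical initial atom features: one-hot element type [C, O].
features = np.array([[1, 0],
                     [1, 0],
                     [0, 1]], dtype=float)
rng = np.random.default_rng(0)
weight = rng.normal(size=(2, 64))    # maps each atom to a 64-dimensional vector
atoms = gcn_layer(adjacency, features, weight)
print(atoms.shape)  # (3, 64)
```

Each output row mixes an atom's own features with those of its bonded neighbours, giving one 64-dimensional vector per atom, which matches the a × 64 shape of the atomic characteristic sequence used by the decoder.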

Specifically, the prediction module is configured to: process the protein characteristic sequences and the atomic characteristic sequences with an attention mechanism to determine a target characteristic sequence capable of interaction; and calculate, based on the target characteristic sequence, the probability that the target protein binds the target ligand.

This embodiment further includes a training module that trains the prediction model by a deep learning method. The training module is configured to:

acquire experimental data;

determine a true value of the sample protein-sample ligand interaction based on the experimental data;

obtain a plurality of protein characteristic sequences of the sample protein and a plurality of atomic characteristic sequences of the sample ligand;

In a specific implementation, the training module is configured to:

process the plurality of protein characteristic sequences of the sample protein and the plurality of atomic characteristic sequences of the sample ligand with an attention mechanism to predict a plurality of sample sequences capable of interaction;

calculate interaction characteristics from the plurality of sample sequences by using a preset calculation formula;

process the interaction characteristics with a fully connected neural network to obtain a predicted value of the interaction between the sample protein and the sample ligand;

calculate a cross entropy based on the predicted value and the true value; and

take the cross entropy as the loss function of the prediction model and train with a stochastic gradient descent method to obtain the prediction model.
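The last three training steps listed above can be sketched in plain Python. This is a toy illustration under stated assumptions: the fully connected network is reduced to a single layer followed by a sigmoid, the interaction characteristics are two-dimensional made-up vectors rather than the output of the preset calculation formula, and the learning rate and step count are arbitrary. The function names predict, cross_entropy and sgd_step are hypothetical.

```python
import math
import random

def predict(weights, bias, feature):
    """Single fully connected layer followed by a sigmoid, giving P(interaction)."""
    z = sum(w * f for w, f in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_pred, y_true, eps=1e-12):
    """Binary cross entropy between a predicted probability and a 0/1 true value."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

def sgd_step(weights, bias, feature, y_true, lr=0.1):
    """One stochastic gradient descent update on a single sample."""
    # For a sigmoid output with cross-entropy loss, d(loss)/dz = prediction - label.
    error = predict(weights, bias, feature) - y_true
    new_weights = [w - lr * error * f for w, f in zip(weights, feature)]
    return new_weights, bias - lr * error

# Toy data: (interaction characteristic, true value). Values are made up.
samples = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.8], 0), ([0.2, 1.0], 0)]
weights, bias = [0.0, 0.0], 0.0

def mean_loss():
    """Average cross entropy over all samples with the current parameters."""
    return sum(cross_entropy(predict(weights, bias, f), y) for f, y in samples) / len(samples)

loss_before = mean_loss()
random.seed(0)
for _ in range(200):
    feature, label = random.choice(samples)   # one randomly chosen sample per step
    weights, bias = sgd_step(weights, bias, feature, label)
loss_after = mean_loss()
print(round(loss_before, 3), round(loss_after, 3))
```

With all parameters at zero the model predicts 0.5 for every sample, so the initial mean loss is ln 2; the loss after training is lower because each update moves the prediction for the sampled pair toward its true value.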

In the embodiment of the invention, the probability of interaction between the protein and the ligand can be accurately predicted, and it can be determined which amino acid sequences in the protein bind which atoms in the ligand, thereby providing guidance for further modification of the molecular structure.

The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.
