Method for predicting compound protein affinity based on edge attention mechanism, computer device and storage medium

Document No.: 831799    Publication date: 2021-03-30

Reading note: this technology, "Method for predicting compound protein affinity based on edge attention mechanism, computer device and storage medium", was designed and created by 王淑栋, 刘嘉丽, 宋弢 and 杜珍珍 on 2020-12-18. Its main content is as follows: The invention discloses a method for predicting compound-protein affinity based on an edge attention mechanism. The method comprises a bidirectional gated recurrent unit (BiGRU) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU-CNN, and an edge attention mechanism (Marginalized_Attention) is added to the BiGRU/BiGRU model. The inputs to the model are the compound sequence and the protein sequence, both of which are fed into the BiGRU/BiGRU model. The compound sequence is represented as a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#, and the protein sequence is encoded from the structural attributes of the protein. The BiGRU/BiGRU output is the compound feature vector and the protein feature vector produced by the edge attention model. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.

1. A method for predicting compound-protein affinity based on an edge attention mechanism, characterized in that it comprises a bidirectional gated recurrent unit (BiGRU) model and a convolutional neural network (CNN) model, the overall network architecture being BiGRU/BiGRU-CNN, wherein an edge attention mechanism (Marginalized_Attention) is added to the BiGRU/BiGRU model. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one taking a forward input and the other a reverse input, forming a bidirectional recurrent neural network whose units have only a reset gate and an update gate. The inputs to the model are the compound sequence and the protein sequence, both of which are fed into the BiGRU/BiGRU model. The compound sequence is represented as a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#, and the protein sequence is encoded from the structural attributes of the protein. The BiGRU/BiGRU output is the compound feature vector and the protein feature vector produced by the edge attention model. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.

2. The method of claim 1, wherein the bidirectional gated recurrent unit (BiGRU) model allows data to be input simultaneously from the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps; this effectively increases the sequence information available to the network at a given time and makes full use of the historical data, thereby making the prediction more accurate. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future information for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies. Specifically, two gating states are first obtained from the previously transmitted state and the input of the current node, namely a reset gate (reset gate) and an update gate (update gate); after the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function; finally, the state is updated by the update gate, whose signal lies in the range 0 to 1, and the closer the gating signal is to 1, the more information is memorized.

3. The method of claim 2, wherein the edge attention mechanism (Marginalized_Attention) places attention on the edges of the rows or columns of the compound-protein interaction matrix, so as to address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle.

4. The method of claim 3, wherein the convolutional neural network (CNN) model consists of convolution, activation and pooling structures; the CNN output is a specific feature space corresponding to the compound and the protein, and this feature space is then fed into a fully connected layer or fully connected neural network (FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.

5. The method of claim 4, wherein the inputs to the method are two selected variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database; wherein the protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein, and the compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.

6. The method of claim 1, wherein the BiGRU/BiGRU-CNN model with the edge attention mechanism is trained on a large number of known compound-protein affinity values to obtain refined model parameters.

7. A computer device comprising a memory, a graphics card, a central processing unit, and an executable program stored in the memory that can be executed in parallel by the central processing unit and the graphics card, wherein the central processing unit implements the following steps when executing the program: constructing a target detection and target prediction model comprising a feature extraction network and a prediction network; first, the feature extraction network extracts features from the input compound SMILES# sequence and protein structure attribute sequence; then the extracted feature vector matrix is passed to the target prediction model, which applies convolution, pooling and fully connected operations to the feature vector matrix and outputs the root mean square error between the predicted and actual affinity values.

8. The computer device of claim 7, wherein the bidirectional gated recurrent unit model allows data to be input simultaneously from the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps; this effectively increases the sequence information available to the network at a given time and makes full use of the historical data, thereby making the prediction more accurate. The gated recurrent unit performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies, specifically as follows: two gating states are first obtained from the previously transmitted state and the input of the current node, namely a gate controlling reset and a gate controlling update; after the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function; finally, the state is updated by the update gate, which performs the functions of forgetting and memorizing and whose signal lies in the range 0 to 1, and the closer the gating signal is to 1, the more information is memorized.

9. The computer device of claim 7, wherein the inputs to the method are two selected variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database; wherein the protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein, and the compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.

10. A storage medium storing a computer program, characterized in that, when the computer program is executed by a central processing unit, the following steps are carried out: a bidirectional gated recurrent unit (BiGRU) model and a convolutional neural network (CNN) model are provided, the overall network architecture being BiGRU/BiGRU-CNN, wherein an edge attention mechanism (Marginalized_Attention) is added to the BiGRU/BiGRU model. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one taking a forward input and the other a reverse input, forming a bidirectional recurrent neural network whose units have only a reset gate and an update gate. The inputs to the model are the compound sequence and the protein sequence, both of which are fed into the BiGRU/BiGRU model. The compound sequence is represented as a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#, and the protein sequence is encoded from the structural attributes of the protein. The BiGRU/BiGRU output is the compound feature vector and the protein feature vector produced by the edge attention model. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.

Technical Field

The invention relates to the field of the molecular structures and properties of compounds and proteins, and in particular to a method for predicting compound-protein affinity based on an edge attention mechanism, a computer device and a storage medium.

Background

Identifying compound-protein interactions is of great importance for identifying key compounds. Traditional virtual screening methods, such as structure-based virtual screening and ligand-based virtual screening, have been studied for over a decade and have achieved great success in drug discovery. However, when the three-dimensional structure of the protein is unknown or the data set of known ligands is too small, conventional virtual screening methods are not applicable. Researchers later introduced the chemogenomics perspective, which allows compound-protein interactions to be identified without regard to the three-dimensional structure of the protein. Since then, various machine learning algorithms have been proposed that consider both compound and protein information within a unified model.

With the rapid development of deep learning, many types of end-to-end frameworks have been used for compound-protein studies. Compared with traditional machine learning algorithms, end-to-end learning combines representation learning and model training in a unified architecture, and descriptors do not need to be defined and calculated before modeling. Deep learning has proven to provide some of the best models for predicting drug-target binding affinity. Its main advantage is that, by performing nonlinear transformations in each layer, such models can better represent the raw data and thus more easily learn the patterns hidden in it. However, many models represent a compound using only a molecular fingerprint or a single SMILES string. The encoded compound features can therefore lose important information inherent in the compound, resulting in inaccuracies in the final prediction of the compound-protein affinity value.

Disclosure of Invention

The invention aims to solve problems such as the loss of important information about compound molecules and the need to improve prediction accuracy, and provides a method, a computer device and a storage medium for predicting compound-protein affinity based on an edge attention mechanism. The structural properties of the compound molecule are encoded into the SMILES string so that more information about the molecule is extracted, and an attention model, namely the edge attention mechanism, is added to the feature representation process of the compound and the protein to obtain more accurate feature representation vectors, thereby improving the accuracy with which compound-protein affinity is predicted by the deep learning method.

According to a first aspect of embodiments of the present invention, there is provided a method for predicting protein affinity of a compound based on the edge attention mechanism.

In some optional embodiments, the method comprises a bidirectional gated recurrent unit (BiGRU) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU-CNN, wherein an edge attention mechanism (Marginalized_Attention) is added to the BiGRU/BiGRU model. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one taking a forward input and the other a reverse input, forming a bidirectional recurrent neural network whose units have only a reset gate and an update gate. The inputs to the model are the compound sequence and the protein sequence, both of which are fed into the BiGRU/BiGRU model. The compound sequence is represented as a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#, and the protein sequence is encoded from the structural attributes of the protein. The BiGRU/BiGRU output is the compound feature vector and the protein feature vector produced by the edge attention model. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.

Optionally, the bidirectional gated recurrent unit (BiGRU) model allows data to be fed in from both the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps. This effectively increases the sequence information available to the network at any given time and makes full use of the historical data, which in turn makes the prediction more accurate. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future information for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies. Specifically, two gating states are first obtained from the previously transmitted state and the input of the current node, namely a reset gate (reset gate) and an update gate (update gate). After the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function. Finally, the state is updated by the update gate, whose signal lies in the range 0 to 1; the closer the gating signal is to 1, the more information is retained.
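
To make the gate computations above concrete, the following is a minimal sketch of a single GRU step written in PyTorch. The weight names, the toy dimensions and the omission of bias terms are simplifications for illustration only; a full BiGRU runs this recurrence both forward and backward over the sequence and concatenates the two hidden states.

```python
import torch

def gru_cell(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    """One GRU step: reset gate, update gate, tanh candidate, state update (biases omitted)."""
    r_t = torch.sigmoid(x_t @ W_r + h_prev @ U_r)           # reset gate, values in (0, 1)
    z_t = torch.sigmoid(x_t @ W_z + h_prev @ U_z)           # update gate, values in (0, 1)
    h_tilde = torch.tanh(x_t @ W_h + (r_t * h_prev) @ U_h)  # candidate state, squashed to (-1, 1)
    return (1.0 - z_t) * h_prev + z_t * h_tilde             # gate near 1 -> more new information kept

# toy dimensions, just to show the shapes involved
d_in, d_hid = 8, 16
params = [torch.randn(d_in, d_hid) if i % 2 == 0 else torch.randn(d_hid, d_hid) for i in range(6)]
h1 = gru_cell(torch.randn(1, d_in), torch.zeros(1, d_hid), *params)   # h1 has shape (1, 16)
```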

Optionally, the edge attention mechanism (Marginalized_Attention) places attention on the edges of the rows or columns of the compound-protein interaction matrix, and can thus address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle.
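
The patent does not spell out an explicit formula for Marginalized_Attention, so the sketch below is only one plausible reading of "attention on the edges of the rows or columns of the interaction matrix": an interaction matrix is formed from the compound and protein hidden states, and attention weights are obtained by marginalizing it over its columns and rows. The bilinear scoring matrix and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MarginalizedAttention(nn.Module):
    """Hypothetical edge attention over a compound-protein interaction matrix."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(dim, dim) * 0.01)  # scoring matrix (assumed form)

    def forward(self, H_c, H_p):
        # H_c: (L_c, dim) compound hidden states, H_p: (L_p, dim) protein hidden states
        M = H_c @ self.bilinear @ H_p.T               # (L_c, L_p) interaction matrix
        a_c = torch.softmax(M.sum(dim=1), dim=0)      # marginalize over columns -> compound weights
        a_p = torch.softmax(M.sum(dim=0), dim=0)      # marginalize over rows -> protein weights
        v_c = a_c @ H_c                               # attended compound feature vector
        v_p = a_p @ H_p                               # attended protein feature vector
        return v_c, v_p
```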

Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a specific feature space corresponding to the compound and the protein; this feature space is then fed into a fully connected layer or fully connected neural network (FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.
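
A hedged sketch of such a CNN prediction head is given below: convolution, activation and pooling layers followed by fully connected layers that map the concatenated compound and protein feature vectors to a single affinity value. Kernel sizes, channel counts and layer widths are illustrative choices, not values taken from the patent.

```python
import torch
import torch.nn as nn

class AffinityCNN(nn.Module):
    """Convolution + pooling + fully connected layers mapping feature vectors to an affinity value."""
    def __init__(self, feat_dim):
        # feat_dim is the combined length of the compound and protein feature vectors
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * (feat_dim // 2), 128), nn.ReLU(),
            nn.Linear(128, 1),                                      # predicted affinity value
        )

    def forward(self, compound_vec, protein_vec):
        x = torch.cat([compound_vec, protein_vec], dim=-1).unsqueeze(1)  # (batch, 1, feat_dim)
        x = self.conv(x).flatten(1)
        return self.fc(x)
```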

Optionally, the method takes two selected input variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein. The compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.
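
The patent states what information goes into each input but not the exact token layout, so the helper functions below are purely illustrative: SMILES# is sketched as the SMILES string with the topological polar surface area and complexity appended, and the protein structure attribute sequence as a concatenation of the listed attributes. The example values are arbitrary.

```python
def make_smiles_sharp(smiles: str, tpsa: float, complexity: float) -> str:
    """Append compound-level physicochemical descriptors to the SMILES string (assumed layout)."""
    return f"{smiles}#TPSA={tpsa:.1f}#CPLX={complexity:.1f}"

def encode_protein_attributes(secondary_structure: str, length: int,
                              polarity: str, acidity: str,
                              solvent_accessibility: str) -> str:
    """Concatenate the protein structural attributes into one attribute sequence (assumed layout)."""
    return f"{secondary_structure}|{length}|{polarity}|{acidity}|{solvent_accessibility}"

# illustrative example with arbitrary descriptor values
smiles_sharp = make_smiles_sharp("CC(=O)OC1=CC=CC=C1C(=O)O", tpsa=63.6, complexity=212.0)
protein_attr = encode_protein_attributes("HHHEECC", length=7, polarity="polar",
                                         acidity="acidic", solvent_accessibility="exposed")
```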

Optionally, the BiGRU/BiGRU-CNN model with the edge attention mechanism is trained on a large number of known compound-protein affinity values to obtain refined model parameters.

According to a second aspect of embodiments of the present invention, there is provided a computer device.

In some optional embodiments, the computer device includes a memory, a graphics card, a central processing unit, and an executable program stored in the memory that can be executed in parallel by the central processing unit and the graphics card, and the central processing unit implements the following steps when executing the program: a target detection and target prediction model is constructed, comprising a feature extraction network and a prediction network. First, the feature extraction network extracts features from the input compound SMILES# sequence and protein structure attribute sequence; the extracted feature vector matrix is then passed to the target prediction model, which applies convolution, pooling and fully connected operations to the feature vector matrix and outputs the root mean square error between the predicted and actual affinity values.

Optionally, the bidirectional gated recurrent unit (BiGRU) model allows data to be fed in from both the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps. This effectively increases the sequence information available to the network at any given time and makes full use of the historical data, which in turn makes the prediction more accurate. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future information for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies. Specifically, two gating states are first obtained from the previously transmitted state and the input of the current node, namely a reset gate (reset gate) and an update gate (update gate). After the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function. Finally, the state is updated by the update gate, whose signal lies in the range 0 to 1; the closer the gating signal is to 1, the more information is retained.

Optionally, the edge attention mechanism (Marginalized_Attention) places attention on the edges of the rows or columns of the compound-protein interaction matrix, and can thus address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle.

Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a specific feature space corresponding to the compound and the protein; this feature space is then fed into a fully connected layer or fully connected neural network (FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.

Optionally, the method takes two selected input variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein. The compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.

Intelligently processing spatio-temporal sequence data in the field of medicine with artificial intelligence techniques can alleviate problems such as the high cost, long development time and safety risks of new drugs. The emerging ability to screen for new drugs and therapeutic targets among old drugs and abandoned compounds that have already been established as safe is changing the landscape of drug development and creating a drug repositioning model for new drug discovery.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

FIG. 1 is a detailed flow diagram of the bidirectional GRU of the present invention.

FIG. 2 is a diagram of the overall system scheme of the present invention.

Detailed Description

The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. The scope of embodiments of the invention encompasses the full ambit of the claims, as well as all available equivalents of the claims. Embodiments may be referred to herein, individually or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. The use of the phrase "including a" does not exclude the presence of other, identical elements in a process, method or device that includes the recited elements, unless expressly stated otherwise. The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. As for the methods, products and the like disclosed by the embodiments, the description is simple because the methods correspond to the method parts disclosed by the embodiments, and the related parts can be referred to the method parts for description.

FIG. 2 shows an alternative implementation architecture of the method for predicting the protein affinity of a compound based on the edge attention mechanism.

In this alternative example, the method includes a bidirectional gated recurrent unit (BiGRU) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU-CNN, wherein an edge attention mechanism (Marginalized_Attention) is added to the BiGRU/BiGRU model. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one taking a forward input and the other a reverse input, forming a bidirectional recurrent neural network whose units have only a reset gate and an update gate. The inputs to the model are the compound sequence and the protein sequence, both of which are fed into the BiGRU/BiGRU model. The compound sequence is represented as a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#, and the protein sequence is encoded from the structural attributes of the protein. The BiGRU/BiGRU output is the compound feature vector and the protein feature vector produced by the edge attention model. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.
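
As a consolidated illustration of the architecture in FIG. 2, the sketch below wires together the MarginalizedAttention and AffinityCNN sketches given earlier: one BiGRU encodes the compound SMILES# tokens, another encodes the protein structure attribute tokens, edge attention produces the two feature vectors, and the CNN head outputs the affinity prediction. The vocabulary sizes, embedding size and the protein-side projection layer are assumptions; the hidden sizes 128 and 256 follow the training embodiment described below.

```python
import torch
import torch.nn as nn

class BiGRUBiGRUCNN(nn.Module):
    """Sketch of the BiGRU/BiGRU-CNN model (reuses MarginalizedAttention and AffinityCNN above)."""
    def __init__(self, comp_vocab, prot_vocab, emb=128, hid_c=128, hid_p=256):
        super().__init__()
        self.comp_emb = nn.Embedding(comp_vocab, emb)
        self.prot_emb = nn.Embedding(prot_vocab, emb)
        self.comp_gru = nn.GRU(emb, hid_c, batch_first=True, bidirectional=True)  # compound BiGRU
        self.prot_gru = nn.GRU(emb, hid_p, batch_first=True, bidirectional=True)  # protein BiGRU
        self.proj_p = nn.Linear(2 * hid_p, 2 * hid_c)          # align widths for the joint matrix
        self.attn = MarginalizedAttention(2 * hid_c)           # edge attention over both sequences
        self.head = AffinityCNN(feat_dim=4 * hid_c)            # CNN + fully connected prediction head

    def forward(self, comp_tokens, prot_tokens):
        # comp_tokens: (1, L_c) SMILES# token ids, prot_tokens: (1, L_p) protein attribute token ids
        h_c, _ = self.comp_gru(self.comp_emb(comp_tokens))     # (1, L_c, 2*hid_c)
        h_p, _ = self.prot_gru(self.prot_emb(prot_tokens))     # (1, L_p, 2*hid_p)
        h_p = self.proj_p(h_p)
        v_c, v_p = self.attn(h_c.squeeze(0), h_p.squeeze(0))   # attended feature vectors
        return self.head(v_c.unsqueeze(0), v_p.unsqueeze(0))   # predicted affinity, shape (1, 1)
```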

Optionally, the bidirectional gated recurrent unit (BiGRU) model allows data to be fed in from both the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps. This effectively increases the sequence information available to the network at any given time and makes full use of the historical data, which in turn makes the prediction more accurate. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future information for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies. Specifically, two gating states are first obtained from the previously transmitted state and the input of the current node, namely a reset gate (reset gate) and an update gate (update gate). After the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function. Finally, the state is updated by the update gate, whose signal lies in the range 0 to 1; the closer the gating signal is to 1, the more information is retained.

Optionally, the edge attention mechanism (Marginalized_Attention) places attention on the edges of the rows or columns of the compound-protein interaction matrix, and can thus address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle.

Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a specific feature space corresponding to the compound and the protein; this feature space is then fed into a fully connected layer or fully connected neural network (FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.

Optionally, the method takes two selected input variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein. The compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.

Optionally, the model further includes a training process for the bidirectional gated recurrent unit model; a specific embodiment of this training process is provided below.

In this embodiment, during training of the target detection and target prediction model, the compound molecular sequence is first input into one BiGRU model and the protein sequence into the other BiGRU model, and the two BiGRU models are then fused into the CNN model to form the training data. The numbers of units of the compound BiGRU model and the protein BiGRU model are set to 128 cells and 256 cells, respectively. The two BiGRU models, i.e. the BiGRU/BiGRU model, are then trained together with the CNN model; to reduce the complexity of the model, the BiGRU/BiGRU model is trained in advance to fix its parameters, and the networks are then trained together to determine the parameters of the CNN model. The BiGRU/BiGRU model uses the edge attention mechanism (Marginalized_Attention), which places attention on the edges of the rows or columns of the compound-protein interaction matrix and can address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle. The initial learning rate for training the whole model is 0.0001, and the loss function is set to the mean absolute error (MAE) loss. During training, the error between the predicted and true values is computed, the network parameters are adjusted with the Adam optimizer, and the model weights are updated; through continuous iteration the loss value keeps decreasing until the network finally converges.
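
The training settings just described can be summarised in a short, hedged sketch (reusing the model sketch above): Adam with an initial learning rate of 0.0001, the mean absolute error loss, and the BiGRU/BiGRU parameters frozen after pre-training so that the remaining updates mainly determine the CNN head. The toy batch standing in for the STITCH/UniRef training pairs, the vocabulary sizes and the number of iterations are placeholders.

```python
import torch
import torch.nn as nn

model = BiGRUBiGRUCNN(comp_vocab=64, prot_vocab=32)            # vocabulary sizes are illustrative
# ... pre-train the BiGRU/BiGRU part here, then fix its parameters ...
for p in list(model.comp_gru.parameters()) + list(model.prot_gru.parameters()):
    p.requires_grad = False                                    # BiGRU/BiGRU parameters are frozen

criterion = nn.L1Loss()                                        # mean absolute error (MAE) loss
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)  # initial learning rate 0.0001

# toy batch standing in for one compound-protein pair and its measured affinity
comp_tokens = torch.randint(0, 64, (1, 50))
prot_tokens = torch.randint(0, 32, (1, 200))
affinity = torch.tensor([5.0])

for step in range(3):                                          # iterate until convergence in practice
    pred = model(comp_tokens, prot_tokens)
    loss = criterion(pred.view(-1), affinity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```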

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory, containing instructions executable by a processor to carry out the following steps. A bidirectional gated recurrent unit (BiGRU) model is constructed, comprising a sequence processing model consisting of two gated recurrent units (GRUs), one taking a forward input and the other a reverse input, forming a bidirectional recurrent neural network whose units have only a reset gate and an update gate. The inputs to the model are the compound feature representation and the protein feature representation. The compound feature representation is a SMILES string extended with the physicochemical properties of the compound molecule, called SMILES#; these physicochemical properties include the topological polar surface area of the compound, the complexity of the compound, and so on. The protein feature representation is encoded from the structural attributes of the protein. The final outputs are a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and its inputs are the compound feature vector and the protein feature vector; the final output of the BiGRU/BiGRU-CNN model is the root mean square error of the predicted compound-protein affinity values.

Optionally, the bidirectional gated recurrent unit (BiGRU) model allows data to be fed in from both the forward and backward directions, so that the information at each time step includes sequence information from both earlier and later time steps. This effectively increases the sequence information available to the network at any given time and makes full use of the historical data, which in turn makes the prediction more accurate. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future information for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on the multivariate time series and continuously learns its long-term dependencies. Specifically, two gating states are first obtained from the previously transmitted state and the input of the current node, namely a reset gate (reset gate) and an update gate (update gate). After the gating signals are obtained, the reset gate produces the reset data, which is concatenated with the input of the current node and squashed to the range -1 to 1 by a hyperbolic tangent function. Finally, the state is updated by the update gate, whose signal lies in the range 0 to 1; the closer the gating signal is to 1, the more information is retained.

Optionally, the edge attention mechanism (Marginalized_Attention) places attention on the edges of the rows or columns of the compound-protein interaction matrix, and can thus address aspects of compound-protein interaction that an ordinary attention mechanism alone cannot handle.

Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a specific feature space corresponding to the compound and the protein; this feature space is then fed into a fully connected layer or fully connected neural network (FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.

Optionally, the method takes two selected input variables: a protein structure attribute sequence from the UniRef database and a compound SMILES# sequence from the STITCH database. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein. The compound SMILES# sequence is encoded from the SMILES string, the topological polar surface area of the compound and the complexity of the compound.

Optionally, the BiGRU/BiGRU-CNN model with the edge attention mechanism is trained on a large number of known compound-protein affinity values to obtain refined model parameters.

The non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic tape, an optical storage device, or the like.

Intelligently processing spatio-temporal sequence data in the field of medicine with artificial intelligence techniques can alleviate problems such as the high cost, long development time and safety risks of new drugs. The emerging ability to screen for new drugs and therapeutic targets among old drugs and abandoned compounds that have already been established as safe is changing the landscape of drug development and creating a drug repositioning model for new drug discovery.

Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments disclosed herein, it should be understood that the disclosed methods, articles of manufacture (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. The present invention is not limited to the procedures and structures that have been described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
