DNA binding residue prediction method based on deep convolutional neural network

Document No.: 1075098. Publication date: 2020-10-16.

Reading note: This technology, a DNA binding residue prediction method based on a deep convolutional neural network, was designed by Hu Jun (胡俊), Bai Yansong (白岩松), Fan Xueqiang (樊学强), Zheng Linlin (郑琳琳) and Zhang Guijun (张贵军) on 2020-06-12. Main content: A DNA binding residue prediction method based on a deep convolutional neural network. First, for an input protein sequence of L residues whose ligand binding residues are to be predicted, the psi-blast and PSSpred programs are used to obtain the matrices PSSM and PSS. Next, the two matrices are combined into one feature matrix F. The protein sequence is then processed into residue samples. After that, a deep convolutional neural network is built, a data set is constructed from protein sequences with known binding residues and divided into M data subsets, and M network models are trained on these M subsets. Finally, the protein sequence to be predicted is processed into residue samples and fed to the M trained network models, and the predictions of the M models are combined to decide whether each residue in the sequence is a binding residue. The method has low computational cost and high prediction accuracy.

1. A DNA binding residue prediction method based on a deep convolutional neural network, characterized by comprising the following steps:

1) inputting a protein sequence S of L residues whose DNA binding residues are to be predicted;

2) for the protein sequence S, searching the protein sequence database swissprot with the psi-blast program to generate a position-specific scoring matrix of size L × 20, denoted PSSM;

3) for the protein sequence S, searching the protein sequence database nr with the PSSpred program to generate a protein secondary structure matrix of size L × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;

5) padding F with 8 all-zero rows before its first row and 8 all-zero rows after its last row; starting from the 9th row and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target and the 8 adjacent rows on each side as that residue's feature matrix; each residue is one sample, so the protein sequence S yields L residue samples;

6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network has eight layers: the first seven are convolutional layers and the last is a fully connected layer; each convolutional layer consists of a two-dimensional convolution, a normalization layer and a pooling layer; the output of each layer is the input of the next; and the fully connected layer applies a sigmoid activation function so that the network output lies in the range (0, 1);

7) generating residue samples from protein sequences with known binding residues by steps 2) to 5), repeating this to build a training set, and dividing the training set into M training subsets, wherein every training subset contains all positive residue samples of the training set, and negative samples are added to each subset at random to give a positive-to-negative sample ratio of 1:2;

8) training the deep convolutional neural network built in step 6) on each of the M training subsets from step 7), each training run using a binary cross-entropy loss function to adjust the network parameters, to obtain M deep convolutional neural network models in total, the binary cross-entropy loss function being:

$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$

where u is the true label of the residue to be predicted in the protein sequence, $\hat{u}$ is the predicted output value of the network model, and Y is the difference between the predicted output and the true label;

9) inputting the residue samples generated from the protein sequence S into the M models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value exceeds threshold is predicted by that model to be a binding residue; each residue sample of S is thus predicted by the M models, giving M prediction results, and the majority among the M prediction results is taken as the final prediction.

Technical Field

The invention relates to the fields of bioinformatics, pattern recognition and computer application, in particular to a DNA binding residue prediction method based on a deep convolutional neural network.

Background

Protein-ligand interactions are ubiquitous and indispensable in life processes and play a very important role in the recognition and signaling of biomolecules. DNA is one such ligand. Accurately identifying the DNA binding residues in a protein sequence helps to understand protein function, to analyze the interaction mechanism between proteins and DNA, and to design drug target proteins, and is therefore of important biological significance.

A survey of the literature shows that many methods have been proposed for predicting DNA binding residues in protein sequences, for example: DISPLAR (Tjong H, Zhou H. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces [J]. Nucleic Acids Research, 2007, 35(5): 1465-1477), DELIA (Xia C, Pan X, Shen H, et al. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data [J]. Bioinformatics), a further method published in Bioinformatics, 2016, 32(12), and ENSEMBLE-CNN (Zhang Y, Qiao S, Ji S, et al. Predicting DNA Binding Sites in Proteins Sequences by an Ensemble Deep Learning Method [C]. International Conference on Intelligent Computing, 2018: 301-306). Although these methods can predict DNA binding residues in a protein sequence, they generally rely on large amounts of experimental data together with machine learning algorithms, so their cost is high; moreover, because noise in the training set receives too little attention, their prediction accuracy is not guaranteed to be optimal and needs further improvement.

In conclusion, existing DNA binding residue prediction methods fall well short of practical requirements in both computational cost and prediction accuracy, and improvement is urgently needed.

Disclosure of Invention

To overcome the shortcomings of existing DNA binding residue prediction methods in computational cost and prediction accuracy, the invention provides a DNA binding residue prediction method based on a deep convolutional neural network that has low computational cost and high prediction accuracy.

The technical solution adopted by the invention to solve this technical problem is as follows:

A DNA binding residue prediction method based on a deep convolutional neural network, the method comprising the following steps:

1) inputting a protein sequence S of L residues whose DNA binding residues are to be predicted;

2) for the protein sequence S, searching the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) with the psi-blast program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to generate a position-specific scoring matrix of size L × 20, denoted PSSM;

3) for the protein sequence S, searching the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) with the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to generate a protein secondary structure matrix of size L × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;

5) padding F with 8 all-zero rows before its first row and 8 all-zero rows after its last row; starting from the 9th row and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target and the 8 adjacent rows on each side as that residue's feature matrix; each residue is one sample, so the protein sequence S yields L residue samples;

6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network has eight layers: the first seven are convolutional layers and the last is a fully connected layer; each convolutional layer consists of a two-dimensional convolution, a normalization layer and a pooling layer; the output of each layer is the input of the next; and the fully connected layer applies a sigmoid activation function so that the network output lies in the range (0, 1);

7) generating residue samples from protein sequences with known binding residues by steps 2) to 5), repeating this to build a training set, and dividing the training set into M training subsets, wherein every training subset contains all positive residue samples of the training set, and negative samples are added to each subset at random to give a positive-to-negative sample ratio of 1:2;

8) training the deep convolutional neural network built in step 6) on each of the M training subsets from step 7), each training run using a binary cross-entropy loss function to adjust the network parameters, to obtain M deep convolutional neural network models in total, the binary cross-entropy loss function being:

$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$

where u is the true label of the residue to be predicted in the protein sequence, $\hat{u}$ is the predicted output value of the network model, and Y is the difference between the predicted output and the true label;

9) inputting the residue samples generated from the protein sequence S into the M models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value exceeds threshold is predicted by that model to be a binding residue; each residue sample of S is thus predicted by the M models, giving M prediction results, and the majority among the M prediction results is taken as the final prediction.
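
Steps 4) and 5) can be illustrated with a minimal sketch. It assumes that one window is cut for every residue of S, which is what gives the L samples the step states, and it uses plain Python lists; the function names `combine_features` and `residue_samples` are hypothetical and do not come from the patent.

```python
from typing import List

Matrix = List[List[float]]


def combine_features(pssm: Matrix, pss: Matrix) -> Matrix:
    """Join the L x 20 PSSM and the L x 3 PSS row by row into an L x 23 matrix F."""
    if len(pssm) != len(pss):
        raise ValueError("PSSM and PSS must have the same number of rows")
    return [list(a) + list(b) for a, b in zip(pssm, pss)]


def residue_samples(f: Matrix, flank: int = 8) -> List[Matrix]:
    """Zero-pad F with `flank` rows on each side and cut one window per residue.

    Each window has 2 * flank + 1 rows; its middle row is the target residue.
    """
    if not f:
        return []
    width = len(f[0])
    head = [[0.0] * width for _ in range(flank)]
    tail = [[0.0] * width for _ in range(flank)]
    padded = head + [list(row) for row in f] + tail
    window = 2 * flank + 1
    # Window i starts at padded row i, so its middle row is original row i.
    return [padded[i:i + window] for i in range(len(f))]


if __name__ == "__main__":
    L = 73  # length used in the embodiment; the values below are dummy data
    pssm = [[1.0] * 20 for _ in range(L)]
    pss = [[2.0] * 3 for _ in range(L)]
    F = combine_features(pssm, pss)
    samples = residue_samples(F)
    print(len(samples), len(samples[0]), len(samples[0][0]))
```

With the flank of 8 rows from step 5), every sample is a 17 × 23 matrix, and the windows of the first and last residues consist partly of padding rows.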

The technical concept of the invention is as follows. First, for an input protein sequence of L residues whose ligand binding residues are to be predicted, the psi-blast and PSSpred programs are used to obtain the matrices PSSM and PSS. Next, the two matrices are combined into one feature matrix F. The protein sequence is then processed into residue samples. After that, a deep convolutional neural network is built, a data set is constructed from protein sequences with known binding residues and divided into ten data subsets, and ten network models are trained on these ten subsets. Finally, the protein sequence to be predicted is processed into residue samples and fed to the ten trained network models, and the predictions of the ten models are combined to decide whether each residue in the sequence is a binding residue.
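
The construction of the training subsets in step 7) might be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: negatives are drawn independently for each subset and without replacement inside a subset, samples are referred to by index, and the name `build_training_subsets` and its `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def build_training_subsets(
    labels: Sequence[int],
    m: int = 10,
    neg_per_pos: int = 2,
    seed: int = 0,
) -> List[List[int]]:
    """Return m lists of sample indices.

    Every list holds all positive samples (label 1) plus randomly chosen
    negative samples (label 0), giving a 1:neg_per_pos positive-to-negative ratio.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Cap the draw if the training set has too few negatives (assumption).
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    subsets = []
    for _ in range(m):
        chosen = rng.sample(negatives, n_neg)  # a fresh random draw per subset
        subset = positives + chosen
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets


if __name__ == "__main__":
    labels = [1] * 5 + [0] * 50  # dummy labels: 5 binding, 50 non-binding
    for subset in build_training_subsets(labels):
        print(sorted(subset))
```

Because every subset keeps all positives and only a sample of the far more numerous negatives, each of the ten models sees a less skewed class balance than the full training set provides.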

The beneficial effects of the invention are as follows. On the one hand, starting from a feature matrix built from sequence information, the protein sequence is processed into residue samples and a deep convolutional network model is built, which lays the groundwork for improving prediction accuracy. On the other hand, ten data subsets are constructed and used to train ten network models, and the predictions of the ten models are combined, which further improves the efficiency and accuracy of DNA binding residue prediction.

Drawings

FIG. 1 is a schematic diagram of a deep convolutional neural network-based DNA binding residue prediction method.

FIG. 2 shows the result of DNA binding residue prediction of protein sequence 1X3C using a deep convolutional neural network-based prediction method.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:

1) inputting a protein sequence S of L residues whose DNA binding residues are to be predicted;

2) for the protein sequence S, searching the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) with the psi-blast program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to generate a position-specific scoring matrix of size L × 20, denoted PSSM;

3) for the protein sequence S, searching the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) with the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to generate a protein secondary structure matrix of size L × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;

5) padding F with 8 all-zero rows before its first row and 8 all-zero rows after its last row; starting from the 9th row and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target and the 8 adjacent rows on each side as that residue's feature matrix; each residue is one sample, so the protein sequence S yields L residue samples;

6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network has eight layers: the first seven are convolutional layers and the last is a fully connected layer; each convolutional layer consists of a two-dimensional convolution, a normalization layer and a pooling layer; the output of each layer is the input of the next; and the fully connected layer applies a sigmoid activation function so that the network output lies in the range (0, 1);

7) generating residue samples from protein sequences with known binding residues by steps 2) to 5), repeating this to build a training set, and dividing the training set into M training subsets (M = 10 here), wherein every training subset contains all positive residue samples of the training set, and negative samples are added to each subset at random to give a positive-to-negative sample ratio of 1:2;

8) training the deep convolutional neural network built in step 6) on each of the M training subsets from step 7), each training run using a binary cross-entropy loss function to adjust the network parameters, to obtain M deep convolutional neural network models in total, the binary cross-entropy loss function being:

$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$

where u is the true label of the residue to be predicted in the protein sequence, $\hat{u}$ is the predicted output value of the network model, and Y is the difference between the predicted output and the true label;

9) inputting the residue samples generated from the protein sequence S into the M models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value exceeds threshold is predicted by that model to be a binding residue; each residue sample of S is thus predicted by the M models, giving M prediction results, and the majority among the M prediction results is taken as the final prediction.
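
The binary cross-entropy loss named in step 8) can be written out numerically as below. This is a sketch of the standard definition, assuming the loss is averaged over a batch of residue samples (the text does not say how samples are aggregated); `u_hat` stands for the predicted output value and `eps` is a hypothetical clipping constant added only to avoid taking log(0).

```python
import math
from typing import Sequence


def binary_cross_entropy(
    u: Sequence[int], u_hat: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between true labels u and predictions u_hat."""
    if len(u) != len(u_hat) or len(u) == 0:
        raise ValueError("u and u_hat must be non-empty and of equal length")
    total = 0.0
    for label, prob in zip(u, u_hat):
        # Clip so that log() never receives exactly 0.
        prob = min(max(prob, eps), 1.0 - eps)
        total += -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return total / len(u)


if __name__ == "__main__":
    # A confident, correct prediction costs less than a confident, wrong one.
    print(binary_cross_entropy([1, 0], [0.9, 0.1]))
    print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```

The loss grows without bound as a prediction approaches the wrong extreme, which is why the sigmoid output in (0, 1) from step 6) pairs naturally with it.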

In this embodiment, DNA binding residue prediction for the protein sequence 1X3C is taken as an example. The DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:

1) inputting the protein 1X3C, which has 73 residues and whose DNA binding residues are to be predicted, denoted S;

2) for the protein sequence S, searching the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) with the psi-blast program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to generate a position-specific scoring matrix of size 73 × 20, denoted PSSM;

3) for the protein sequence S, searching the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) with the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to generate a protein secondary structure matrix of size 73 × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into a 73 × 23 feature matrix, denoted F;

5) padding F with 8 all-zero rows before its first row and 8 all-zero rows after its last row; starting from the 9th row and ending at the 64th row of F, taking the residue corresponding to the middle row as the prediction target and the 8 adjacent rows on each side as that residue's feature matrix; each residue is one sample, so the protein sequence S yields 73 residue samples;

6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network has eight layers: the first seven are convolutional layers and the last is a fully connected layer; each convolutional layer consists of a two-dimensional convolution, a normalization layer and a pooling layer; the output of each layer is the input of the next; and the fully connected layer applies a sigmoid activation function so that the network output lies in the range (0, 1);

7) generating residue samples from protein sequences with known binding residues by steps 2) to 5), repeating this to build a training set, and dividing the training set into ten training subsets, wherein every training subset contains all positive residue samples of the training set, and negative samples are added to each subset at random to give a positive-to-negative sample ratio of 1:2;

8) training the deep convolutional neural network built in step 6) on each of the ten training subsets from step 7), each training run using a binary cross-entropy loss function to adjust the network parameters, to obtain ten deep convolutional neural network models in total, the binary cross-entropy loss function being:

$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$

where u is the true label of the residue to be predicted in the protein sequence, $\hat{u}$ is the predicted output value of the network model, and Y is the difference between the predicted output and the true label;

9) inputting the residue samples generated from the protein sequence S into the ten models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value exceeds threshold is predicted by that model to be a binding residue; each residue sample of S is thus predicted by the ten models, giving ten prediction results, and the majority among the ten prediction results is taken as the final prediction.
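
The voting in step 9) could look like the sketch below. Two points are assumptions because the text leaves them open: the threshold value (0.5 is used only as a placeholder default) and the handling of a 5-to-5 tie among ten models (treated here as non-binding). The name `ensemble_predict` is hypothetical.

```python
from typing import List, Sequence


def ensemble_predict(
    probabilities: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Combine per-model outputs by majority vote.

    probabilities[k][i] is the output of model k for residue i.
    Returns 1 (binding) or 0 (non-binding) for each residue.
    """
    if not probabilities:
        return []
    n_models = len(probabilities)
    n_residues = len(probabilities[0])
    if any(len(row) != n_residues for row in probabilities):
        raise ValueError("every model must score the same number of residues")
    final = []
    for i in range(n_residues):
        votes = sum(1 for k in range(n_models) if probabilities[k][i] > threshold)
        # Strict majority; an even split is treated as non-binding (assumption).
        final.append(1 if 2 * votes > n_models else 0)
    return final


if __name__ == "__main__":
    # Dummy outputs of ten models for three residues.
    outputs = (
        [[0.9, 0.9, 0.1]] * 5 + [[0.9, 0.1, 0.1]] + [[0.1, 0.1, 0.1]] * 4
    )
    print(ensemble_predict(outputs))
```

In the dummy data, the first residue receives six of ten votes and is labeled binding, while the second receives exactly five and falls on the non-binding side of the assumed tie rule.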

The above describes the prediction result obtained by the invention, taking DNA binding residue prediction for the protein sequence 1X3C as an example. It is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.
