Lactic acid bacteria antibacterial peptide prediction method based on graph neural network

Document No.: 170922. Published: 2021-10-29.

Summary: This technology, "Lactic acid bacteria antibacterial peptide prediction method based on graph neural network", was created by 董改芳, 孙志宏, 翟冰, 左永春, 刘江平 and 扎木苏 on 2021-09-14. The invention discloses a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network. Positive samples are established by searching for known lactic acid bacteria antibacterial peptides, negative samples are established by collecting sequences of length 5-255 from protein databases, and redundant and highly similar sequences are removed. Features are extracted from the positive and negative samples to obtain feature vectors and an initial input graph, on which a graph neural network model is built. The model is trained, evaluated and iteratively optimized to determine parameters such as the optimal number of layers, the optimal number of training rounds and the learning rate. Finally, the model is used to predict on data from strains suspected of having antibacterial activity. By replacing laboratory wet-experiment screening with computer model prediction, the method shortens the time needed to judge lactic acid bacteria antibacterial peptide protein sequences, enables accurate and efficient batch identification, and provides an effective alternative for screening lactic acid bacteria strains with antibacterial properties.

1. A lactic acid bacteria antibacterial peptide prediction method based on a graph neural network is characterized by comprising the following steps:

S1, collecting data and establishing a positive sample and a negative sample, wherein the positive sample is a set of lactic acid bacteria antibacterial peptide sequences extracted from known international antibacterial peptide databases, the negative sample is a set of non-repetitive protein sequences from a protein database that meet a specific length requirement and have similarity lower than 80%, and a sample set is established from the positive sample and the negative sample;

S2, preprocessing data: performing word segmentation on the peptide sequences, establishing two types of nodes from the words and the peptide sequences, establishing edges from the co-occurrence relation between words and the membership relation between words and sequences, the nodes and edges together forming the initial input graph of the graph neural network; and establishing a feature vector for each word by a word embedding technique, the feature vectors serving as the input feature vectors of the graph neural network;

S3, constructing the graph neural network model: calculating the adjacency matrix of the initial input graph, and constructing a multilayer graph convolutional neural network from the adjacency matrix and the input feature vectors;

S4, training the graph neural network model: calculating the loss with a cross-entropy loss function, adjusting the weight matrix of each layer of the graph neural network model according to the loss value, recalculating the loss with the adjusted weight matrices, and repeating the process until the loss value reaches its minimum;

S5, evaluating and tuning the graph neural network model: evaluating the model against the evaluation indexes, adjusting the number of layers, the number of training rounds and the learning rate of the model according to each evaluation index, and retraining the model until the parameter combination is found that gives the highest accuracy together with relatively optimal values of the other evaluation indexes;

and S6, identifying strains: after protein sequencing of suspected lactic acid bacteria strains, using the model to screen the sequences in batches and identify whether the strains have antibacterial activity.

2. The lactic acid bacteria antibacterial peptide prediction method based on a graph neural network according to claim 1, wherein the word embedding technique in step S2 includes, but is not limited to, BERT, FastText and ELMo.

3. The lactic acid bacteria antibacterial peptide prediction method based on a graph neural network according to claim 1, wherein the evaluation indexes in step S5 include, but are not limited to, sensitivity, specificity, accuracy and the Matthews correlation coefficient.

4. The lactic acid bacteria antibacterial peptide prediction method based on a graph neural network according to claim 1, wherein the specific process of step S5 is as follows:

S51, fixing the number of model layers to 2 and the learning rate to 0.001, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds for this setting;

S52, fixing the number of model layers to 2, changing the learning rate from 0.0001 to 0.01 in steps of 0.0001 and, for each learning rate, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds each time;

S53, changing the number of model layers from 3 to 6 in turn, and repeating the above process;

and S54, finding the optimal number of model layers, the optimal number of training rounds and the optimal learning rate by summarizing the results of the above three steps.

Technical Field

The invention relates to the field of identification of biological antibacterial peptides, in particular to a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network.

Background

Existing techniques for identifying biological antibacterial peptides fall mainly into two groups:

The first is the bacteriostasis experiment by the agar well diffusion method, which is time-consuming and cannot achieve high-throughput identification. The second is identification by machine learning, or by deep learning techniques such as long short-term memory (LSTM) and convolutional neural networks. Although these can process many amino acid sequences at once, they capture only the local semantic information of an antibacterial peptide sequence and do not readily grasp the characteristic information of the peptide from the perspective of its overall structure, so identification accuracy and other indexes still need to be improved.

Disclosure of Invention

In order to solve the above problems and achieve accurate, high-throughput identification of antibacterial peptides, the invention provides the following technical scheme:

A lactic acid bacteria antibacterial peptide prediction method based on a graph neural network comprises the following steps:

S1, collecting data and establishing a positive sample and a negative sample, wherein the positive sample is a set of lactic acid bacteria antibacterial peptide sequences extracted from more than 20 known international antibacterial peptide databases, the negative sample is a set of non-repetitive protein sequences from an international protein database (such as UniProt) with length 5-255 and similarity lower than 80%, and a sample set is established from the positive sample and the negative sample;

S2, preprocessing data: performing word segmentation on the peptide sequences, establishing two types of nodes from the words and the peptide sequences, establishing edges from the co-occurrence relation between words and the membership relation between words and sequences, the nodes and edges together forming the initial input graph of the graph neural network; and establishing a feature vector for each word by a word embedding technique, the feature vectors serving as the input feature vectors of the graph neural network;

S3, constructing the graph neural network model: calculating the adjacency matrix of the initial input graph, and constructing a multilayer graph convolutional neural network from the adjacency matrix and the input feature vectors;

S4, training the graph neural network model: calculating the loss with a cross-entropy loss function, adjusting the weight matrix of each layer of the graph neural network model according to the loss value and an optimization function, recalculating the loss with the adjusted weight matrices, and repeating the process until the loss value reaches its minimum;

S5, evaluating and tuning the graph neural network model: evaluating the model against the evaluation indexes, adjusting the number of layers, the number of training rounds and the learning rate of the model according to each evaluation index, and retraining the model until the parameter combination is found that gives the highest accuracy together with relatively optimal values of the other evaluation indexes;

and S6, identifying strains: after protein sequencing of suspected lactic acid bacteria strains, using the model to screen the sequences in batches and identify whether the strains have antibacterial activity.

Preferably, the word embedding technique in step S2 includes, but is not limited to, BERT, FastText and ELMo.

Preferably, the evaluation indexes in step S5 include, but are not limited to, sensitivity, specificity, accuracy and the Matthews correlation coefficient.

Preferably, the specific process of step S5 is as follows:

S51, fixing the number of model layers to 2 and the learning rate to 0.001, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds for this setting;

S52, fixing the number of model layers to 2, changing the learning rate from 0.0001 to 0.01 in steps of 0.0001 and, for each learning rate, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds each time;

S53, changing the number of model layers from 3 to 6 in turn, and repeating the above process;

and S54, finding the optimal number of model layers, the optimal number of training rounds and the optimal learning rate by summarizing the results of the above three steps.

With this prediction method and graph neural network technology, the conserved amino acid structures of antibacterial peptide sequences are represented as nodes of a graph and the co-occurrence relations between conserved structures as its edges, so that the problem of identifying antibacterial peptides is converted into a node classification problem on the graph. Because a graph is a global structure, it can capture and mine the characteristic information of antibacterial peptide sequences from a global perspective, which supports accurate classification of the nodes. Compared with the prior art, identification accuracy is improved and batch identification is achieved.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention;

FIG. 2 is a diagram illustrating an implementation of data collection in an embodiment of the present invention;

FIG. 3 shows part of the positive sample data in an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the drawings and the embodiment.

The lactic acid bacteria antibacterial peptide prediction method based on a graph neural network has four main parts: data collection, model building, model optimization and model prediction.

Specifically, it can be subdivided into the following steps:

S1, collecting data and establishing a positive sample and a negative sample

The positive sample is a set of lactic acid bacteria antibacterial peptide sequences extracted from the general and specialized antibacterial peptide databases identified by survey, the negative sample is a set of protein sequences meeting the length requirement of 5-255, and a sample set is established from the positive sample and the negative sample.

As shown in FIG. 2, lactic acid bacteria antibacterial peptides are extracted from antibacterial peptide databases such as APD3, ADAM and DRAMP to establish the positive sample; protein sequences with a length of 5-255 are extracted from public databases such as PDB and UniProt to establish the negative sample. Both the positive and the negative samples are processed with the CD-HIT and CD-HIT-2D software to remove redundant sequences and sequences with similarity greater than 80%, and are then combined into a sample set. The model is evaluated by 10-fold cross-validation.
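The sample-set construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the embodiment: the function names are hypothetical, exact-duplicate removal stands in for the 80% similarity clustering that the text delegates to CD-HIT and CD-HIT-2D, and the fold split is a plain shuffled 10-fold partition.

```python
import random

MIN_LEN, MAX_LEN = 5, 255  # length bounds stated in the text


def filter_sequences(seqs):
    """Keep sequences within the length bounds and drop exact duplicates.

    Assumption: similarity-based redundancy removal (the 80% threshold)
    is done separately with CD-HIT; this only removes identical strings.
    """
    seen, kept = set(), []
    for s in seqs:
        s = s.strip().upper()
        if MIN_LEN <= len(s) <= MAX_LEN and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept


def build_sample_set(positives, negatives):
    """Label positives 1 and negatives 0; negatives that also occur as positives are dropped."""
    pos = filter_sequences(positives)
    pos_set = set(pos)
    neg = [s for s in filter_sequences(negatives) if s not in pos_set]
    return [(s, 1) for s in pos] + [(s, 0) for s in neg]


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Each fold in turn would serve as the held-out set while the model is trained on the remaining nine.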

S2, preprocessing data

The length range of the lactic acid bacteria antibacterial peptide sequences and the distribution of amino acid proportions are analyzed statistically, and the Chinese word segmentation techniques used in natural language processing are surveyed. Conserved amino acid structure combinations can be determined by methods such as multiple sequence alignment, single amino acids and dipeptides, and the word segmentation scheme is chosen by combining this information.

The words are vectorized with word embedding techniques such as BERT, FastText and ELMo to form word feature vectors, which serve as the input feature vectors of the graph neural network. Nodes are established from the words of the peptide sequences and from the peptide sequences themselves, edges are established from the co-occurrence relation between words and the membership relation between words and sequences, and the nodes and edges together form the initial input graph of the graph neural network.

A "word" here is a region of a protein sequence that may be conserved. There are 20 standard amino acids in nature (see the table of one-letter amino acid abbreviations for details); several amino acids form a peptide chain, and one or more peptide chains may form a protein. A word may consist of a single amino acid, a pair of amino acids, or a possibly conserved subsequence in the structure of an antibacterial peptide. Words and sequences are used as the nodes of the graph neural network and the relations between them are established as edges, so that a graph neural network model can be used for identification.
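As a rough illustration of how words and sequences can become a graph, the sketch below segments each peptide into overlapping k-mers (k=1 for single amino acids, k=2 for dipeptides) and builds word-sequence and word-word edges. The function names, the sliding `window` used to decide co-occurrence, and the use of raw counts as edge weights are all assumptions made for the example; the text does not specify how edges are weighted, and word embedding features are not covered here.

```python
from collections import defaultdict


def segment(seq, k=2):
    """Split a peptide into overlapping k-mers ("words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(sequences, k=2, window=3):
    """Return (nodes, edges) for a graph with word nodes and sequence nodes.

    nodes is a list of ("seq", index) or ("word", text) keys; a node's id is
    its position in the list. edges maps a sorted id pair to a count:
      - word/sequence pairs count occurrences of the word in the sequence
      - word/word pairs count co-occurrences within `window` consecutive words
    """
    nodes, index = [], {}
    edges = defaultdict(int)

    def node_id(key):
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    for n, seq in enumerate(sequences):
        s_id = node_id(("seq", n))
        words = segment(seq, k)
        for i, w in enumerate(words):
            w_id = node_id(("word", w))
            edges[tuple(sorted((s_id, w_id)))] += 1      # membership edge
            for v in words[i + 1:i + window]:            # co-occurrence edges
                v_id = node_id(("word", v))
                if v_id != w_id:
                    edges[tuple(sorted((w_id, v_id)))] += 1
    return nodes, dict(edges)
```

The edge dictionary can be expanded into a symmetric adjacency matrix by writing each count to both `(i, j)` and `(j, i)`.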

S3, constructing the graph neural network model

The adjacency matrix of the initial input graph is calculated, and a multilayer graph convolutional neural network is constructed from the adjacency matrix and the feature vectors.

A multilayer graph convolutional neural network may be constructed according to equation (1).

$$Z(A, X) = \mathrm{softmax}\big(A' \cdots \mathrm{ReLU}(A' X W_0) \cdots W_n\big) \tag{1}$$

where $A$ is the adjacency matrix, $X$ is the matrix of input feature vectors, ReLU is the activation function, and $W_0, \ldots, W_n$ are the weight matrices, whose number is determined by the number of layers of the graph convolutional neural network.

$A'$ is obtained from $A$ by the Laplacian normalization of equation (2).

$D$ is the degree matrix of the graph and $I$ is the identity matrix; $D$ is calculated by equation (3).
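Equations (2) and (3) are referenced but not reproduced in this text. The sketch below therefore assumes the widely used renormalized form $A' = D^{-1/2}(A + I)D^{-1/2}$ with $D_{ii} = \sum_j (A + I)_{ij}$, which is consistent with the symbols $D$ and $I$ named above but may differ from the patent's own formulas. It implements the forward pass of equation (1) in plain Python for small matrices; all function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Assumed form: A' = D^{-1/2} (A + I) D^{-1/2}, D_ii = sum_j (A + I)_ij.

    Edge weights are assumed non-negative, so every degree is at least 1.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(adj, features, weights):
    """Z = softmax(A' ... ReLU(A' X W_0) ... W_n) for a list [W_0, ..., W_n]."""
    a_norm = normalize_adjacency(adj)
    h = features
    for w in weights[:-1]:                      # hidden layers use ReLU
        h = relu(matmul(matmul(a_norm, h), w))
    return softmax_rows(matmul(matmul(a_norm, h), weights[-1]))
```

Each row of the result is a class-probability vector for one node; only the rows belonging to sequence nodes are relevant for deciding whether a peptide is antibacterial.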

S4, training the graph neural network model

The loss value is calculated with a cross-entropy loss function, the weight matrices $W_0$ to $W_n$ are adjusted by the Adam optimizer according to the loss value, the loss value is recalculated with the adjusted weight matrices, and the process is repeated until the loss value reaches its minimum.
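The loop of computing a cross-entropy loss, updating weights with Adam and recomputing the loss can be sketched as follows. To stay short, the example trains only a single softmax output layer on fixed node features `h` (standing in for the hidden representation that reaches the last layer); a full graph convolutional network would obtain the gradients of the earlier layers by backpropagation. The function names and default hyperparameters are assumptions, and a fixed number of epochs replaces the "until the loss reaches its minimum" stopping rule.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h, w):
    """Class probabilities softmax(h W) for every row of h."""
    n_classes = len(w[0])
    return [softmax([sum(x * w[k][c] for k, x in enumerate(row)) for c in range(n_classes)])
            for row in h]


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def gradient(h, w, labels):
    """dLoss/dW for softmax + cross entropy: H^T (P - Y) / n."""
    probs, n = predict(h, w), len(h)
    grad = [[0.0] * len(w[0]) for _ in w]
    for row, p, y in zip(h, probs, labels):
        for c in range(len(p)):
            err = p[c] - (1.0 if c == y else 0.0)
            for k, x in enumerate(row):
                grad[k][c] += x * err / n
    return grad


def train_adam(h, labels, n_classes, epochs=200, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Train W with Adam; returns the weights and the loss after each epoch."""
    d = len(h[0])
    w = [[0.0] * n_classes for _ in range(d)]
    m = [[0.0] * n_classes for _ in range(d)]   # first-moment estimate
    v = [[0.0] * n_classes for _ in range(d)]   # second-moment estimate
    losses = []
    for t in range(1, epochs + 1):
        g = gradient(h, w, labels)
        for k in range(d):
            for c in range(n_classes):
                m[k][c] = b1 * m[k][c] + (1 - b1) * g[k][c]
                v[k][c] = b2 * v[k][c] + (1 - b2) * g[k][c] ** 2
                m_hat = m[k][c] / (1 - b1 ** t)  # bias correction
                v_hat = v[k][c] / (1 - b2 ** t)
                w[k][c] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        losses.append(cross_entropy(predict(h, w), labels))
    return w, losses
```

In practice the loop would stop when the loss on held-out data stops decreasing rather than run for a fixed count, which is also what the tuning of the number of training rounds in step S5 is looking for.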

S5, evaluating and tuning the graph neural network model

(1) Evaluation

The graph neural network model is evaluated against the evaluation indexes to verify its accuracy. The evaluation indexes comprise sensitivity, specificity, accuracy and the Matthews correlation coefficient.

The scheme is evaluated on these four indexes.

Sensitivity (SN) is the proportion of all antibacterial peptides that are correctly predicted; specificity (SP) is the proportion of all non-antibacterial peptides that are correctly predicted; accuracy (ACC) is the proportion of all samples that are correctly predicted and, being regarded as the most important of the evaluation indexes, is taken as the primary measure of how well the prediction model performs; the Matthews correlation coefficient (MCC) measures the correlation between the predicted and the actual results and is used to evaluate classification performance.

True positive (TP) is the number of antibacterial peptides predicted to be antibacterial peptides; true negative (TN) is the number of non-antibacterial peptides predicted to be non-antibacterial peptides; false positive (FP) is the number of non-antibacterial peptides predicted to be antibacterial peptides; false negative (FN) is the number of antibacterial peptides predicted to be non-antibacterial peptides.
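The four indexes follow from the confusion counts by their standard definitions: SN = TP/(TP+FN), SP = TN/(TN+FP), ACC = (TP+TN)/(TP+TN+FP+FN), and MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes them from label lists, taking a false positive to be a non-antibacterial peptide predicted as antibacterial and a false negative to be the reverse; the function names and the convention of returning 0 when a denominator is zero are choices made for the example.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with 1 = antibacterial peptide, 0 = non-antibacterial."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Return sensitivity, specificity, accuracy and MCC as a dict."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn) if tp + fn else 0.0,
        "SP": tn / (tn + fp) if tn + fp else 0.0,
        "ACC": (tp + tn) / total if total else 0.0,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

For example, with TP = 3, TN = 3, FP = 1 and FN = 1, SN, SP and ACC are all 0.75 and MCC is (9 − 1)/16 = 0.5.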

(2) Tuning

Parameters such as the number of model layers, the number of training rounds (epochs) and the learning rate are tuned through the following steps.

S51, based on experience in building deep learning models, fixing the number of model layers to 2 and the learning rate to 0.001, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds for this setting;

S52, fixing the number of model layers to 2, changing the learning rate from 0.0001 to 0.01 in steps of 0.0001 and, for each learning rate, changing the number of training rounds from 50 to 500 in steps of 10, drawing an evaluation index change curve, and finding the best number of training rounds each time;

S53, changing the number of model layers from 3 to 6 in turn, and repeating the above process;

and S54, finding the optimal number of model layers, the optimal number of training rounds and the optimal learning rate by summarizing the results of the above three steps.
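Taken together, steps S51 to S54 amount to a search over layers 2 to 6, learning rates 0.0001 to 0.01 in steps of 0.0001, and training rounds 50 to 500 in steps of 10. The sketch below runs that search as one exhaustive grid, which is a simplification of the staged procedure; `train_and_score` is a hypothetical callback that trains a model with the given settings and returns its accuracy, and selecting on accuracy alone ignores the other evaluation indexes that the text also takes into account.

```python
from itertools import product

LAYERS = range(2, 7)                                        # 2, 3, ..., 6
EPOCHS = range(50, 501, 10)                                 # 50, 60, ..., 500
# Built from integers to avoid floating-point drift when stepping by 0.0001.
LEARNING_RATES = [round(i * 0.0001, 4) for i in range(1, 101)]


def tune(train_and_score, layers=LAYERS, learning_rates=LEARNING_RATES, epochs=EPOCHS):
    """Return (best_score, n_layers, learning_rate, n_epochs) over the grid.

    train_and_score(n_layers, learning_rate, n_epochs) is assumed to train a
    model with those settings and return a score where higher is better.
    """
    best = None
    for n_layers, lr, n_epochs in product(layers, learning_rates, epochs):
        score = train_and_score(n_layers, lr, n_epochs)
        if best is None or score > best[0]:
            best = (score, n_layers, lr, n_epochs)
    return best
```

The full grid has 5 × 100 × 46 = 23,000 combinations, so in practice a single training run per setting with the score recorded after every tenth round would cover the whole training-round axis at once.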

S6, identifying strains

After protein sequencing, the sequences of suspected lactic acid bacteria strains are screened with the model to identify those with antibacterial activity. The model can also be deployed to smart devices in various forms, such as an app, a desktop client, an H5 mini-program or a Web application, so that undetermined strains can be screened and identified at any time.

The above is a specific embodiment of the present invention, but the scope of the present invention should not be limited thereto. Any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention, and therefore, the protection scope of the present invention is subject to the protection scope defined by the appended claims.
