Key protein identification method based on capsule neural network and ensemble learning

Document No.: 1273790. Publication date: 2020-08-25.

Abstract: The present technology, a key protein identification method based on a capsule neural network and ensemble learning, was designed by Peng Wei, Li Xia, and Dai Wei on 2020-04-01. The invention discloses a key protein identification method based on a capsule neural network and ensemble learning, comprising the following steps. Step 1: extract eight biological features of each protein in the protein interaction network by using the Cytoscape tool. Step 2: use the capsule neural network to extract deeper, enhanced features from the eight biological features. Step 3: concatenate the biological features with the enhanced features of the protein. Step 4: feed the concatenated features obtained in step 3 into the integrated model Multi-ensemble, train the model, and use the trained integrated model to predict new key proteins. Step 5: output the result. Compared with the initial biological features, the enhanced features extracted by the capsule neural network improve the accuracy with which some machine learning models predict key proteins. Fusing the initial biological features with the enhanced features further improves the accuracy with which machine learning models predict key proteins.

1. A key protein identification method based on a capsule neural network and ensemble learning, characterized in that the method comprises the following steps:

Step 1: extracting eight biological features of each protein in a protein interaction network by using the Cytoscape tool, wherein the proteins are divided into non-key proteins and key proteins;

Step 2: extracting deeper enhanced features from the eight biological features by using the capsule neural network, and selecting the second row of the matrix output by the final layer of the capsule neural network as the enhanced features of the protein; wherein the convolution layer of the capsule neural network has 32 convolution kernels of size 1×2 with a stride of 1 and uses the ReLU activation function; the capsule layer of the capsule neural network uses 32 channels of convolutional 8-dimensional capsules; and, from the capsule layer to the final layer, the capsule neural network uses a nonlinear activation function to perform the dynamic routing process;

Step 3: concatenating the initial biological features obtained in step 1 with the enhanced features of the protein obtained in step 2;

Step 4: feeding the concatenated features obtained in step 3 into the integrated model Multi-ensemble, training the model, and predicting new key proteins by using the trained integrated model;

Step 5: outputting the result: sorting the proteins in descending order of the score given by the integrated model Multi-ensemble, and outputting the sorted result.
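As an illustration only, the convolution layer described in step 2 can be sketched on the eight-feature vector of a single protein. The sketch assumes a 1×8 input with no padding; the function name `conv1x2_relu`, the random kernel weights, and the feature values are hypothetical placeholders, not values from the invention.

```python
import random

def conv1x2_relu(features, kernels, stride=1):
    """Slide each 1x2 kernel along the feature vector, then apply ReLU.

    features: list of floats (the eight biological features of one protein)
    kernels:  list of (w0, w1, bias) tuples, one per convolution kernel
    Returns one feature map (a list of floats) per kernel.
    """
    maps = []
    for w0, w1, bias in kernels:
        fmap = []
        # A window of width 2 can start at positions 0 .. len(features) - 2.
        for i in range(0, len(features) - 1, stride):
            z = w0 * features[i] + w1 * features[i + 1] + bias
            fmap.append(max(0.0, z))  # ReLU activation
        maps.append(fmap)
    return maps

# Hypothetical data: 32 randomly initialised kernels and one toy feature vector.
rng = random.Random(0)
kernels = [(rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0) for _ in range(32)]
features = [0.12, 0.40, 0.33, 0.05, 0.27, 0.61, 0.18, 0.09]

feature_maps = conv1x2_relu(features, kernels)
# 32 feature maps, each of length 8 - 2 + 1 = 7
print(len(feature_maps), len(feature_maps[0]))
```

With a stride of 1 and a kernel width of 2, each of the 32 kernels yields 7 activations per protein under these assumptions; these activations are what the capsule layer would consume.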

2. The key protein identification method based on a capsule neural network and ensemble learning of claim 1, wherein the nonlinear activation function in step 2 is a squashing function.
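Assuming the nonlinear activation of claim 2 is the squashing nonlinearity commonly used in capsule networks with dynamic routing, it can be sketched as below. The formula v = (‖s‖² / (1 + ‖s‖²)) · s / ‖s‖ is the standard form from the capsule-network literature; the function name `squash` and the small constant `eps` are assumptions added for illustration.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction.

    v = (|s|^2 / (1 + |s|^2)) * s / |s|
    eps (an added assumption) guards against division by zero for a zero vector.
    """
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

v = squash([3.0, 4.0])               # input length is 5
length = math.sqrt(sum(x * x for x in v))
print(v, length)                     # output length is close to 25 / 26
```

Short input vectors are shrunk towards zero and long ones approach unit length, so the length of a capsule's output can be read as a probability-like score.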

3. The key protein identification method based on a capsule neural network and ensemble learning of claim 1, characterized in that: the integrated model Multi-ensemble adopted in step 4 comprises data division, sample selection, and integration of weak classifiers; wherein the data division step divides the training set obtained by partitioning into a data set P and a data set R, and samples the data set P with replacement to generate m different data sets {P₁, P₂, …, P_m}, which serve as the initial training sets of the m weak classifiers; the data set R is divided into n mutually exclusive subsets R₁, R₂, …, R_n, which serve as the test sets of an iterative process; in each iteration, if most of the other weak classifiers regard a sample in R_j as a high-quality sample, that sample is added to the training set of the weak classifier for its next iteration; wherein "most of the other weak classifiers" means that the number of other classifiers regarding the sample as a high-quality sample is two thirds of the total number of weak classifiers, and j = 1, 2, …, n.
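The data division step of claim 3 can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function name `divide_data`, the split ratio `p_fraction` between P and R, and the fixed random seed are all assumptions, since the text does not state them.

```python
import random

def divide_data(train, m, n, p_fraction=0.5, seed=0):
    """Split a training set into P and R, then derive P_1..P_m and R_1..R_n.

    p_fraction (assumed) is the share of the training set assigned to P.
    """
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * p_fraction)
    P, R = shuffled[:cut], shuffled[cut:]

    # Sampling with replacement: each P_i has the same size as P.
    P_sets = [[rng.choice(P) for _ in range(len(P))] for _ in range(m)]

    # n mutually exclusive subsets of R (every sample lands in exactly one).
    R_subsets = [R[j::n] for j in range(n)]
    return P, R, P_sets, R_subsets

# Toy usage with integer sample identifiers.
P, R, P_sets, R_subsets = divide_data(list(range(100)), m=5, n=4)
print(len(P_sets), len(R_subsets))
```

Each P_i would seed one weak classifier, while the subsets R_1 to R_n are consumed one per iteration during sample selection.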

Technical Field

The invention relates to a key protein identification method based on a capsule neural network and ensemble learning, and belongs to the field of systems biology.

Background

Proteins are deeply involved in the vital activities of an organism. A key protein is a protein whose removal by a knockout mutation causes the loss of function of the related protein complex and the death of the cell. Key proteins are an essential part of the vital activities of cells. Therefore, how to accurately predict key proteins has become a focus of research in the field of proteomics.

In early studies of key proteins, biologists mainly used biological experiments to examine how an organism was affected by the loss of a given protein, and thereby judged whether that protein was a key protein. Although this approach gives good results, it is time-consuming and costly. Some researchers have therefore turned to computational approaches, and the rapid development of high-throughput proteomic technologies, together with increasingly complete protein interaction data, has made it possible to identify key proteins by computational methods. Jeong et al. proposed the "centrality-lethality" rule, which refers to nodes with high degree in the protein network, i.e., nodes with many adjacent protein nodes, as hubs. Hubs are usually located at the center of the network and have a significant impact on the topology of the entire network. Deleting hubs can be devastating for the entire network, which suggests to some extent that deleting a hub, like deleting a key protein, can have a dramatic effect on biological activity. Based on this rule and on protein interaction data, a series of centrality measures defined on protein interaction networks were derived to characterize proteins in the network and identify key proteins. Degree Centrality (DC) is the number of neighbors of a node in the network; it is simple and easy to use, but it predicts relatively few key proteins. Betweenness Centrality (BC) is the number of shortest paths between other nodes that pass through a node; it reflects how pivotal the position of the node is, but its computational complexity is high. Closeness Centrality (CC) considers how much a node depends on other nodes for the propagation of information. Subgraph Centrality (SC) measures the criticality of a protein node by the total number of closed loops formed by the node and other nodes in the network. Eigenvector Centrality (EC) measures the criticality of a protein node by the component of the corresponding vertex in the principal eigenvector of the network adjacency matrix. Information Centrality (IC) measures the criticality of each protein node by the average sum of the paths that end at the corresponding vertex. These centrality measures take into account the topological properties of proteins in protein interaction networks, but ignore the biological properties of the proteins themselves.
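Two of the measures above, DC and CC, can be illustrated on a toy network. The sketch uses the common textbook definitions (number of neighbors, and the number of reachable nodes divided by the sum of shortest-path distances to them); it is not the Cytoscape workflow used by the invention, and the node names and edges are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """DC: number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """CC: (number of reachable nodes) / (sum of shortest-path distances to them)."""
    result = {}
    for source in adj:
        # Breadth-first search gives shortest-path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

# Hypothetical toy interaction network: A is a hub, E is peripheral.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

dc = degree_centrality(adj)
cc = closeness_centrality(adj)
print(dc["A"], cc["A"], cc["E"])
```

In this toy network the hub A scores highest on both measures, which is the intuition behind the centrality-lethality rule.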

To better predict key proteins, Li et al. and Tang et al. combined protein interaction networks with gene expression information and proposed the key protein prediction methods PeC and WDC, respectively. Peng et al. proposed the ION method, which combines protein interaction networks with protein homology information. Meanwhile, some studies adopt supervised learning and use machine learning algorithms such as SVM, decision trees, and naive Bayes to predict key proteins. Gustafson et al. predicted key proteins by combining genomic and protein features of differing predictive power and applying naive Bayes. Hwang et al. constructed SVM classifiers to predict key proteins based on biological features, such as open reading frames and protein conservation, and on features of proteins in a protein interaction network, such as DC, BC, and CC. Zhong et al. proposed a key protein prediction method based on GEP that integrates the features of a protein in the protein interaction network (DC, BC, CC, EC, IC, SC, NC) with computed features that incorporate biological properties (PeC, WDC, and ION). Some ensemble learning algorithms have also been applied to identify key proteins. Deng et al. [2] integrated a naive Bayes classifier, a C4.5 decision tree, CN2 rules, and a logistic regression model to predict key proteins. Chen et al. [3] integrated Support Vector Machines (SVM) and ANN to predict key proteins. Zhong et al. [4] fused multiple XGBoost classifiers to predict key proteins. Although the above methods combine some biological properties of proteins and apply ensemble learning algorithms to identify key proteins, they still use too few features and do not mine deep features. In addition, their ensemble algorithms only compute a weighted average of the results of a few weak classifiers to obtain the final output.

There is therefore a need to develop more efficient feature extraction methods and more efficient ensemble learning methods to improve the predictive performance of key proteins.

Disclosure of Invention

The invention provides a key protein identification method based on a capsule neural network and ensemble learning, which can effectively extract deep features of key proteins based on the capsule neural network and can effectively improve the accuracy and sensitivity of key protein identification by combining a Multi-ensemble model.

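The technical scheme of the invention is as follows: a key protein identification method based on a capsule neural network and ensemble learning comprises the following steps:

Step 1: extracting eight biological features of each protein in a protein interaction network by using the Cytoscape tool, wherein the proteins are divided into non-key proteins and key proteins;

Step 2: extracting deeper enhanced features from the eight biological features by using the capsule neural network, and selecting the second row of the matrix output by the final layer of the capsule neural network as the enhanced features of the protein; wherein the convolution layer of the capsule neural network has 32 convolution kernels of size 1×2 with a stride of 1 and uses the ReLU activation function; the capsule layer of the capsule neural network uses 32 channels of convolutional 8-dimensional capsules; and, from the capsule layer to the final layer, the capsule neural network uses a nonlinear activation function to perform the dynamic routing process;

Step 3: concatenating the initial biological features obtained in step 1 with the enhanced features of the protein obtained in step 2;

Step 4: feeding the concatenated features obtained in step 3 into the integrated model Multi-ensemble, training the model, and predicting new key proteins by using the trained integrated model;

Step 5: outputting the result: sorting the proteins in descending order of the score given by the integrated model Multi-ensemble, and outputting the sorted result.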
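The feature concatenation of step 3 and the ranked output of step 5 can be sketched as below. Everything in the example is a placeholder: the protein identifiers, the feature values, the length of the enhanced feature vector, and the scores (which in the method would come from the trained Multi-ensemble model) are hypothetical.

```python
def concatenate_features(biological, enhanced):
    """Step 3: append each protein's enhanced features to its biological features."""
    return {p: biological[p] + enhanced[p] for p in biological}

def rank_proteins(scores):
    """Step 5: protein identifiers sorted by score, highest first."""
    return sorted(scores, key=lambda p: scores[p], reverse=True)

# Hypothetical inputs: eight biological features and a short enhanced vector per protein.
biological = {
    "P1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "P2": [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "P3": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
enhanced = {"P1": [0.05, 0.95], "P2": [0.60, 0.40], "P3": [0.30, 0.70]}

combined = concatenate_features(biological, enhanced)

# Placeholder scores standing in for the output of the trained ensemble.
scores = {"P1": 0.2, "P2": 0.9, "P3": 0.5}
ranking = rank_proteins(scores)
print(len(combined["P1"]), ranking)
```

Proteins at the top of the ranking are the candidates predicted most likely to be key proteins.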

The nonlinear activation function in step 2 is a squashing function.

The integrated model Multi-ensemble adopted in step 4 comprises data division, sample selection, and integration of weak classifiers. The data division step divides the training set obtained by partitioning into a data set P and a data set R, and samples the data set P with replacement to generate m different data sets {P₁, P₂, …, P_m}, which serve as the initial training sets of the m weak classifiers. The data set R is divided into n mutually exclusive subsets R₁, R₂, …, R_n, which serve as the test sets of an iterative process. In each iteration, if most of the other weak classifiers regard a sample in R_j as a high-quality sample, that sample is added to the training set of the weak classifier for its next iteration. Here, "most of the other weak classifiers" means that the number of other classifiers regarding the sample as a high-quality sample is two thirds of the total number of weak classifiers, and j = 1, 2, …, n.
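The sample selection rule can be sketched as a voting step over one subset R_j. The sketch takes as given a table of per-classifier judgments, because the text here does not define how an individual weak classifier decides that a sample is high quality; it also reads "two thirds" as "at least two thirds". The function name `select_high_quality` and both of these readings are assumptions.

```python
def select_high_quality(votes):
    """For each weak classifier, pick the samples of R_j to add to its training set.

    votes[i][k] is True if weak classifier i regards sample k as high quality.
    Sample k is picked for classifier i when the number of OTHER classifiers
    voting True is at least two thirds of the total number of classifiers.
    """
    m = len(votes)
    threshold = (2 * m + 2) // 3  # integer ceiling of 2m / 3
    n_samples = len(votes[0])
    selected = []
    for i in range(m):
        picked = []
        for k in range(n_samples):
            others = sum(1 for c in range(m) if c != i and votes[c][k])
            if others >= threshold:
                picked.append(k)
        selected.append(picked)
    return selected

# Toy judgments of 3 weak classifiers on 3 samples of one subset R_j.
votes = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
selected = select_high_quality(votes)
print(selected)
```

Because each classifier is excluded from the vote on its own training set, a sample can be added for one classifier and not for another, as happens with the second sample in the toy data.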

The beneficial effects of the invention are as follows. The invention uses the initial biological features of proteins in the protein interaction network, extracts enhanced features from those biological features through the capsule neural network, combines the initial biological features with the enhanced features, and on that basis predicts key proteins with an effective ensemble method. The experimental results show that, compared with previous methods for predicting key proteins based on machine learning and ensemble learning, the method of the invention improves the accuracy of key protein identification and can provide biologists with valuable reference information for experiments and further research on key protein identification. Compared with the initial biological features, the enhanced features extracted by the capsule neural network improve the accuracy with which some machine learning models predict key proteins. Fusing the initial biological features with the enhanced features further improves the accuracy with which machine learning models predict key proteins.

Drawings

FIG. 1 is a flow diagram of the method of the present invention, CapsME;

FIG. 2 is a structural diagram of an integrated model Multi-ensemble in CapsME according to the method of the present invention.

Detailed Description
