Deep neural network-based cerebrospinal fluid protein prediction method

Document No.: 1536704    Publication date: 2020-02-14

Note: This technology, "Deep neural network-based cerebrospinal fluid protein prediction method" (基于深层神经网络的脑脊液蛋白质的预测方法), was designed and created by 邵丹, 王岩, 黄岚, 何凯, 崔薛腾 and 张双全 on 2019-11-06. Its main content is as follows: the deep neural network-based cerebrospinal fluid protein prediction method belongs to the technical field of artificial intelligence and big data. The invention takes the list of proteins in cerebrospinal fluid that have already been verified by biological experiments, drawn from the existing literature and databases, as positive samples for model training. The protein family information corresponding to the positive samples is deleted from the Pfam protein family information database, protein families containing more than 10 proteins are searched for in the remaining protein family information database, and 10 proteins are randomly selected from these families as negative samples for model training. The positive and negative sample data are divided into a training set, a validation set and a test set. Feature selection is performed on the protein features, a model is built, the training set is used to train the model, the validation set is used for parameter tuning, and the test set is used for performance evaluation. The input is the protein features, and the output is the prediction result. The accuracy of cerebrospinal fluid protein prediction is improved, and prediction of cerebrospinal fluid proteins is finally realized.

1. A deep neural network-based cerebrospinal fluid protein prediction method, characterized by comprising the following steps, carried out in sequence:

step one, taking proteins that have been verified by biological experiments to occur in cerebrospinal fluid as positive samples for model training, and storing the protein information data of the positive samples;

step two, deleting the protein family information corresponding to the positive samples of step one from the Pfam protein family information database, searching the remaining protein family information database for protein families containing more than 10 proteins, randomly selecting 10 proteins from these protein families as negative samples for model training, and storing the protein information data of the negative samples;

step three, dividing the positive sample information data and the negative sample information data into an 80% training set, a 10% validation set and a 10% test set;

step four, classifying the protein features to preliminarily obtain a protein feature vector;

step five, filtering the protein feature vector preliminarily obtained in step four by using a t test, and performing feature selection on the filtered protein feature vector by using support vector machine recursive feature elimination (SVM-RFE) to obtain the protein feature vector used for model fitting and training;

step six, establishing a classifier model based on a deep neural network, wherein the input of the classifier model is the feature vector obtained in step five, and the output of the classifier model is cerebrospinal fluid protein or non-cerebrospinal fluid protein;

step seven, fitting the classifier model on the training set, using the rectified linear unit (ReLU) activation function and a cross-entropy loss function, to obtain a trained classifier model;

step eight, evaluating the classifier model

Inputting the protein information of the positive and negative samples in the validation set, together with the feature vectors obtained in step five, into the classifier model trained in step seven for validation, and using sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC) as evaluation indexes of the model validation effect;

if the obtained AUC is less than 90%, step seven is repeated to fit and train the classifier model until the AUC reaches 90% or more;

step nine, verifying the classification accuracy of the classifier model validated in step eight by using the test set; if the classification accuracy is less than 90%, steps seven and eight are repeated until the classification accuracy reaches 90% or more, completing the establishment of the classifier model;

step ten, inputting the feature vector of the protein to be predicted into the classifier model established in step nine, and realizing the deep neural network-based prediction of cerebrospinal fluid proteins through the output prediction result.

2. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein the protein features in step four are classified into 4 major categories, respectively: 1) sequence properties, 2) structural properties, 3) domain and motif properties, 4) physicochemical properties.

3. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein the t test in step five adopts a significance level threshold of p-value < 0.005.

4. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein in step five the selection criterion DJ(i) of support vector machine recursive feature elimination (SVM-RFE) is defined as follows:

DJ(i) = (1/2)·α^T·H·α − (1/2)·α^T·H(−i)·α

wherein y_i is the label of sample x_i, y_j is the label of sample x_j, K(x_i, x_j) is the kernel function measuring the similarity of x_i and x_j, α is the coefficient vector obtained after SVM training, T denotes the matrix transpose, H denotes the matrix with elements y_i·y_j·K(x_i, x_j), and H(−i) denotes the same matrix computed with the i-th feature removed.

5. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein in step six the deep neural network is defined as follows:

Y=W·X+b

where Y represents the output of the hidden layer, X represents the input value of the hidden layer, W represents the connection weight between the hidden layer and the output of the previous layer, and b represents the bias term of the fully connected layer.

6. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein the structure of the deep neural network in step six comprises an input layer, hidden layers and an output layer; there are 4 hidden layers, the number of hidden-layer neurons is 500, and the activation function used by the hidden layers is ReLU; the number of neurons in the output layer is 1, and the activation function used by the output layer is Sigmoid.

7. The deep neural network-based cerebrospinal fluid protein prediction method of claim 6, wherein the activation functions ReLU and Sigmoid are defined as follows:

ReLU(z) = max(0, z)

Sigmoid(z) = 1 / (1 + e^(−z))

where z is the weighted input sum of the neuron, max is the maximum function, and e is the natural constant.

8. The deep neural network-based cerebrospinal fluid protein prediction method of claim 6, wherein the overall structure of the deep neural network is defined as follows:

Output = Out(Hidden(Hidden(Hidden(Hidden(X)))))

where Hidden represents the Hidden layer and Out represents the output layer.

9. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein the cross-entropy loss function is the binary cross entropy, defined as follows:

Loss = −(1/m)·Σ_{i=1…m} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i represents the true category of the i-th sample, ŷ_i represents the predicted category of the i-th sample, log is the logarithmic function, and m is the number of samples.

10. The deep neural network-based cerebrospinal fluid protein prediction method of claim 1, wherein in step eight the sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are respectively given as:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / N

Precision = TP / (TP + FP)

MCC = (TP·TN − FP·FN) / √((TP + FP)·(TP + FN)·(TN + FP)·(TN + FN))

and AUC is the area enclosed under the ROC curve,

wherein TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, FN represents the number of false negative samples, and N represents the number of all training samples.

Technical Field

The invention belongs to the technical field of big data and artificial intelligence, and particularly relates to a deep neural network-based cerebrospinal fluid protein prediction method.

Background

Cerebrospinal fluid is a colorless, transparent liquid produced by the choroid plexus in the ventricles of the brain. It circulates over the surface of the brain and spinal cord and communicates with the systemic circulation through the intracerebral venous system. Its main functions are: ① protecting the brain and spinal cord from external shock injury; ② regulating changes in intracranial pressure; ③ supplying nutrients to the brain and spinal cord and carrying away metabolites; ④ regulating the alkali reserve of the nervous system and maintaining a normal pH value.

When brain tissue or the spinal cord is diseased or injured, various changes in the cerebrospinal fluid may occur. By predicting the proteins present in cerebrospinal fluid, disease-related proteins can be found for pathological analysis, which can promote the early diagnosis of diseases such as neurodegenerative diseases, multiple sclerosis and traumatic brain injury. However, no computational method for predicting cerebrospinal fluid proteins is currently known.

Therefore, there is a need in the art for a new solution to solve this problem.

Disclosure of Invention

The technical problem to be solved by the invention is the current lack of a computational method for predicting cerebrospinal fluid proteins; it is solved by providing a deep neural network-based cerebrospinal fluid protein prediction method.

The deep neural network-based cerebrospinal fluid protein prediction method comprises the following steps, which are carried out in sequence:

step one, taking proteins that have been verified by biological experiments to occur in cerebrospinal fluid as positive samples for model training, and storing the protein information data of the positive samples;

step two, deleting the protein family information corresponding to the positive samples of step one from the Pfam protein family information database, searching the remaining protein family information database for protein families containing more than 10 proteins, randomly selecting 10 proteins from these protein families as negative samples for model training, and storing the protein information data of the negative samples;

step three, dividing the positive sample information data and the negative sample information data into an 80% training set, a 10% validation set and a 10% test set;

step four, classifying the protein features to preliminarily obtain a protein feature vector;

step five, filtering the protein feature vector preliminarily obtained in step four by using a t test, and performing feature selection on the filtered protein feature vector by using support vector machine recursive feature elimination (SVM-RFE) to obtain the protein feature vector used for model fitting and training;

step six, establishing a classifier model based on a deep neural network, wherein the input of the classifier model is the feature vector obtained in step five, and the output of the classifier model is cerebrospinal fluid protein or non-cerebrospinal fluid protein;

step seven, fitting the classifier model on the training set, using the rectified linear unit (ReLU) activation function and a cross-entropy loss function, to obtain a trained classifier model;

step eight, evaluating the classifier model

Inputting the protein information of the positive and negative samples in the validation set, together with the feature vectors obtained in step five, into the classifier model trained in step seven for validation, and using sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC) as evaluation indexes of the model validation effect;

if the obtained AUC is less than 90%, step seven is repeated to fit and train the classifier model until the AUC reaches 90% or more;

step nine, verifying the classification accuracy of the classifier model validated in step eight by using the test set; if the classification accuracy is less than 90%, steps seven and eight are repeated until the classification accuracy reaches 90% or more, completing the establishment of the classifier model;

step ten, inputting the feature vector of the protein to be predicted into the classifier model established in step nine, and realizing the deep neural network-based prediction of cerebrospinal fluid proteins through the output prediction result.

The protein features in step four are classified into 4 major categories, respectively: 1) sequence properties, 2) structural properties, 3) domain and motif properties, 4) physicochemical properties.

In step five, the t test adopts a significance level threshold of p-value < 0.005.

In step five, the selection criterion DJ(i) of support vector machine recursive feature elimination (SVM-RFE) is defined as follows:

DJ(i) = (1/2)·α^T·H·α − (1/2)·α^T·H(−i)·α

wherein y_i is the label of sample x_i, y_j is the label of sample x_j, K(x_i, x_j) is the kernel function measuring the similarity of x_i and x_j, α is the coefficient vector obtained after SVM training, T denotes the matrix transpose, H denotes the matrix with elements y_i·y_j·K(x_i, x_j), and H(−i) denotes the same matrix computed with the i-th feature removed.

In the sixth step, the deep neural network is defined as follows:

Y=W·X+b

where Y represents the output of the hidden layer, X represents the input value of the hidden layer, W represents the connection weight between the hidden layer and the output of the previous layer, and b represents the bias term of the fully connected layer.

The structure of the deep neural network in step six comprises an input layer, hidden layers and an output layer; there are 4 hidden layers, the number of hidden-layer neurons is 500, and the activation function used by the hidden layers is ReLU; the number of neurons in the output layer is 1, and the activation function used by the output layer is Sigmoid.

The definitions of the activation functions ReLU and Sigmoid are as follows:

ReLU(z)=max(0,z)

Sigmoid(z) = 1 / (1 + e^(−z))

where z is the weighted input sum of the neuron, max is the maximum function, and e is the natural constant.

The structure of the deep neural network is defined as follows:

Output=Out(Hidden(Hidden(Hidden(Hidden(X)))))

where Hidden represents the Hidden layer and Out represents the output layer.

The cross-entropy loss function is the binary cross entropy (binary cross entropy), defined as follows:

Loss = −(1/m)·Σ_{i=1…m} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i represents the true category of the i-th sample, ŷ_i represents the predicted category of the i-th sample, log is the logarithmic function, and m is the number of samples.

In step eight, the sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are respectively given as:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / N

Precision = TP / (TP + FP)

MCC = (TP·TN − FP·FN) / √((TP + FP)·(TP + FN)·(TN + FP)·(TN + FN))

and AUC is the area enclosed under the ROC curve,

wherein TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, FN represents the number of false negative samples, and N represents the number of all training samples.

Through the design scheme, the invention can bring the following beneficial effects:

the invention takes the list of proteins in cerebrospinal fluid that have already been verified by biological experiments, drawn from the existing literature and databases, as positive samples for model training. The protein family information corresponding to the positive samples of step one is deleted from the Pfam protein family information database, protein families containing more than 10 proteins are searched for in the remaining protein family information database, and 10 proteins are randomly selected from these protein families as negative samples for model training. Feature selection is performed on the protein features using the t test and SVM-RFE, removing noise and irrelevant features. A model based on a deep neural network is built, whose input is the protein features and whose output is the prediction result; the training set is used to train the model, the validation set is used for parameter tuning, and the test set is used for performance evaluation. This improves the accuracy of cerebrospinal fluid protein prediction and finally realizes the prediction of proteins in cerebrospinal fluid by a computational method.

Detailed Description

The deep neural network-based cerebrospinal fluid protein prediction method comprises the following steps:

1. creation of data sets

(1) Positive sample data set collection

Protein information that has been verified by biological experiments to occur in cerebrospinal fluid is acquired by searching the relevant biological literature and existing databases, used as positive samples for model training, and entered into the computer.

(2) Negative sample data set collection

The protein family information corresponding to the positive samples collected in (1) is deleted from the Pfam protein family information database; protein families containing more than 10 proteins are searched for in the remaining protein family information database, and 10 proteins are randomly selected from these protein families as negative samples for model training.
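A minimal Python sketch of this negative-sample construction is given below, assuming the Pfam family-to-protein mapping has already been parsed into a dictionary; the helper name, the random seed and the choice of drawing 10 proteins from each qualifying family (one reading of the selection rule above) are illustrative assumptions, not part of the claimed method.

```python
import random

def build_negative_samples(pfam_families, positive_proteins, per_family=10, seed=42):
    """Select negative samples from Pfam families unrelated to the positive (CSF) proteins.

    pfam_families: dict mapping family accession -> set of protein IDs
    positive_proteins: set of protein IDs verified to occur in cerebrospinal fluid
    """
    rng = random.Random(seed)
    # Drop every family that contains at least one positive protein.
    remaining = {fam: members for fam, members in pfam_families.items()
                 if not members & positive_proteins}
    negatives = set()
    for fam, members in remaining.items():
        # Keep only families with more than 10 member proteins, as described above.
        if len(members) > per_family:
            negatives.update(rng.sample(sorted(members), per_family))
    return negatives
```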

(3) Model training dataset segmentation

All sample data of the positive and negative samples are divided into an 80% training set, a 10% validation set and a 10% test set.
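For illustration, the 80%/10%/10% split can be obtained with two calls to scikit-learn's train_test_split; the placeholder data, variable names and the stratified option are assumptions added for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative placeholder data: rows are proteins, columns are protein features.
X = np.random.rand(1000, 300)            # feature matrix
y = np.random.randint(0, 2, size=1000)   # 1 = CSF protein, 0 = non-CSF protein

# First split off 80% for training, then split the remainder in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
# Result: 80% training set, 10% validation set, 10% test set.
```

Stratifying keeps the proportion of positive and negative samples roughly equal across the three subsets, which is a common choice for binary classification data.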

2. Protein feature selection

(1) Feature collection

Protein features are classified into 4 broad categories, from which approximately 3000 features can be obtained, as shown in Table 1:

Table 1. Protein feature classification

(The table image, which lists the individual protein features under each of the 4 categories, is not reproduced here.)

(2) Feature selection

First, the features are filtered using a t test to remove irrelevant features, adopting a significance level threshold of p-value < 0.005; then feature selection is carried out using support vector machine recursive feature elimination (SVM-RFE) to obtain the feature vector for model training. The selection criterion DJ(i) is defined as follows:

DJ(i) = (1/2)·α^T·H·α − (1/2)·α^T·H(−i)·α

wherein y_i is the label of sample x_i, y_j is the label of sample x_j, K(x_i, x_j) is the kernel function measuring the similarity of x_i and x_j, α is the coefficient vector obtained after SVM training, T denotes the matrix transpose, H denotes the matrix with elements y_i·y_j·K(x_i, x_j), and H(−i) denotes the same matrix computed with the i-th feature removed.
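A sketch of this two-stage selection under the above definitions is shown below, using SciPy's independent two-sample t test and scikit-learn's RFE wrapper around a linear-kernel SVM as a stand-in for SVM-RFE; the number of retained features and the use of Welch's variant of the t test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def select_features(X, y, p_threshold=0.005, n_keep=200):
    """Filter features with a t test, then rank the survivors with SVM-RFE."""
    pos, neg = X[y == 1], X[y == 0]
    # Stage 1: keep features whose class means differ significantly (p-value < 0.005).
    _, p_values = ttest_ind(pos, neg, axis=0, equal_var=False)
    t_mask = p_values < p_threshold
    X_filtered = X[:, t_mask]
    # Stage 2: recursive feature elimination driven by the weights of a linear SVM.
    n_keep = min(n_keep, X_filtered.shape[1])
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=n_keep, step=1)
    rfe.fit(X_filtered, y)
    return X_filtered[:, rfe.support_], t_mask, rfe.support_
```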

3. Training based on deep neural network classifier

(1) Neural network model fitting training

A classifier model is established based on a deep neural network; the training set is used to train the model, the validation set is used to adjust the parameters, and the test set is used to evaluate the performance.

The deep neural network is composed of an input layer, hidden layers and an output layer, wherein the input layer simply represents the input data, and each hidden layer is defined as follows:

Y=W·X+b

where Y represents the output of the hidden layer, X represents the input value of the hidden layer, W represents the connection weight between the hidden layer and the output of the previous layer, and b represents the bias term of the fully connected layer.

There are 4 hidden layers, the number of hidden-layer neurons is 500, and their activation function is ReLU; the number of neurons in the output layer is 1 and its activation function is Sigmoid.

The activation functions ReLU and Sigmoid are defined as follows, respectively

ReLU(z)=max(0,z)

Sigmoid(z) = 1 / (1 + e^(−z))

Where z is the weighted input sum of the neuron, max is the maximum function, and e is the natural constant.

The loss function used to train the deep neural network is a binary cross entropy (binary cross entropy) defined as follows:

Loss = −(1/m)·Σ_{i=1…m} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i represents the true category of the i-th sample, ŷ_i represents the predicted category of the i-th sample, log is the logarithmic function, and m is the number of samples.
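As a concrete check of the formula above, the same quantity can be computed directly with NumPy; the small clipping constant is added only to avoid log(0) and is not part of the definition.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over m samples, matching the formula above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Example: true labels and predicted probabilities for 4 samples.
print(binary_cross_entropy(np.array([1, 0, 1, 0]), np.array([0.9, 0.2, 0.7, 0.1])))
```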

The structure of the deep neural network is defined as follows:

Output=Out(Hidden(Hidden(Hidden(Hidden(X)))))

where Hidden represents the Hidden layer and Out represents the output layer.
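A minimal sketch of this architecture is given below in Keras, assuming an input dimension equal to the number of selected features; the framework, optimizer, epoch count and batch size are illustrative choices not specified in the text.

```python
import tensorflow as tf

def build_csf_model(n_features):
    """Four fully connected hidden layers of 500 ReLU units and one Sigmoid output unit,
    i.e. Output = Out(Hidden(Hidden(Hidden(Hidden(X)))))."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for _ in range(4):
        # Each hidden layer computes Y = W·X + b followed by ReLU.
        model.add(tf.keras.layers.Dense(500, activation="relu"))
    # Output layer: a single neuron with Sigmoid activation.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    # Binary cross entropy as the training loss; the optimizer is an illustrative choice.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Illustrative usage with the split from the earlier sketches:
# model = build_csf_model(n_features=X_train.shape[1])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=32)
```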

4. Model performance assessment

The protein information of the positive and negative samples in the validation set, together with the feature vectors for model training obtained in the steps above, is input into the trained classifier model for validation. Sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC) are used as indexes for evaluating the model validation effect; if the obtained AUC is less than 90%, the classifier model is fitted and trained again until the AUC reaches 90% or more.

The sensitivity (Sensitivity), specificity (Specificity), accuracy (Accuracy), precision (Precision), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are respectively given as:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / N

Precision = TP / (TP + FP)

MCC = (TP·TN − FP·FN) / √((TP + FP)·(TP + FN)·(TN + FP)·(TN + FN))

and AUC is the area enclosed under the ROC curve,

wherein TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, FN represents the number of false negative samples, and N represents the number of all training samples.
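For illustration, these indexes can be computed from the validation-set predictions with scikit-learn; the 0.5 decision threshold on the Sigmoid output is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, accuracy, precision, MCC and AUC for one validation run."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }

# Example with toy labels and predicted probabilities:
# print(evaluate(np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.3, 0.8, 0.4, 0.2])))
```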

Finally, the classification accuracy of the validated classifier model is checked on the test set; if the classification accuracy is less than 90%, classifier fitting training and model validation are carried out again until the classification accuracy reaches 90% or more, completing the establishment of the deep neural network-based cerebrospinal fluid protein prediction model.

The input of the model is the protein feature vector, and the output is the prediction result. The accuracy of cerebrospinal fluid protein prediction is improved, and the prediction of cerebrospinal fluid proteins is finally realized. Protein prediction in cerebrospinal fluid is thus achieved by a computational method, and disease-related proteins can be found among the predicted proteins.
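Once the model has been established, predicting a new protein reduces to a single forward pass through the network; the small usage sketch below assumes the trained Keras model from the earlier sketch and a 0.5 decision threshold.

```python
import numpy as np

def predict_csf(model, feature_vector, threshold=0.5):
    """Classify one protein feature vector with a trained classifier model."""
    x = np.asarray(feature_vector, dtype=float).reshape(1, -1)
    probability = float(model.predict(x, verbose=0)[0, 0])
    label = ("cerebrospinal fluid protein" if probability >= threshold
             else "non-cerebrospinal fluid protein")
    return label, probability
```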
