Prediction method of protein secreted into bronchoalveolar lavage fluid

Document No.: 1244128 Publication date: 2020-08-18

Abstract: A method for predicting proteins secreted into bronchoalveolar lavage fluid, designed by 邵丹, 黄岚, 王岩, and 何凯 (dated 2020-04-26). The method belongs to the technical field of artificial intelligence detection. Proteins reported in the existing literature and databases as verified by biological experiments in bronchoalveolar lavage fluid serve as the samples for model training, the protein sequence serves as the model input, and a model built from an RNN and LSTM predicts which proteins are secreted into bronchoalveolar lavage fluid. The invention thereby predicts bronchoalveolar lavage fluid proteins computationally; disease-related proteins can then be identified among the predicted proteins for pathological analysis, supporting early diagnosis of disease.

1. A method for predicting proteins secreted into bronchoalveolar lavage fluid, characterized by comprising the following steps, carried out in sequence:

step one, taking proteins verified by biological experiments in bronchoalveolar lavage fluid as positive samples for model training, and storing the positive-sample protein information data;

step two, deleting the protein families corresponding to the positive samples of step one from the Pfam protein family database, extracting from the remaining families those containing more than 5 proteins, selecting 5 proteins from those families as negative samples for model training, and storing the negative-sample protein information data;

step three, balancing the numbers of positive and negative samples by random undersampling to obtain balanced positive and negative samples;

step four, randomly splitting the protein information data of the positive and negative samples into an 80% training set, a 10% validation set, and a 10% test set;

step five, calculating the position-specific scoring matrix (PSSM) of each protein sequence in the samples using position-specific iterated BLAST (PSI-BLAST);

step six, establishing a classifier model by combining a recurrent neural network (RNN) with long short-term memory (LSTM), wherein the input of the classifier model is the PSSM obtained in step five, and the output of the classifier model is bronchoalveolar lavage fluid protein or non-bronchoalveolar lavage fluid protein;

step seven, fitting the classifier model on the training set of step four, using the activation functions and a cross-entropy loss function, to obtain a trained classifier model;

step eight, inputting the protein information of the positive and negative samples in the validation set of step four, together with the PSSM obtained in step five, into the classifier model trained in step seven for validation, to obtain a validated classifier model; the validation results are evaluated using sensitivity, specificity, accuracy, precision, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC);

step nine, verifying the classification accuracy of the classifier model validated in step eight using the test set of step four; if the classification accuracy is below 90%, repeating steps six and seven until the classification accuracy reaches 90% or above, thereby completing the establishment of the classifier model;

step ten, inputting the protein sequences of an independent validation set into the classifier model established in step nine, the output prediction indicating whether each protein is secreted into bronchoalveolar lavage fluid.

2. The method of claim 1, characterized in that the classifier model combining the recurrent neural network (RNN) and long short-term memory (LSTM) in step six is established as follows:

$$Y_{(t)} = \phi\left(X_{(t)} W_x + Y_{(t-1)} W_y + b\right) = \phi\left(\left[X_{(t)} \;\; Y_{(t-1)}\right] W + b\right), \qquad W = \begin{bmatrix} W_x \\ W_y \end{bmatrix}$$

wherein $Y_{(t)}$ is the output of the current layer at time $t$, $\phi$ is the activation function, $X_{(t)}$ is the input of the current layer at time $t$, $W_x$ is the weight of the current input, $Y_{(t-1)}$ is the output of the current layer at the previous time step, $W_y$ is the weight of the output at the previous time step, $b$ is the bias term of the current layer, and $W$ is the matrix formed by combining $W_x$ and $W_y$;

the long short-term memory (LSTM) is a bidirectional LSTM.

3. The method of claim 1, characterized in that the activation functions in step seven include Tanh and Sigmoid, defined as

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

wherein $z$ is the weighted sum of the neuron's inputs and $e$ is the natural constant;

the cross-entropy loss function $L$ in step seven is

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$

wherein $y_i$ is the true class of the $i$-th sample, $\hat{y}_i$ is the prediction for the $i$-th sample, $\log$ is the logarithm function, and $m$ is the number of samples.

4. The method of claim 1, characterized in that the validation results output by the classifier model validated in step eight are evaluated using sensitivity, specificity, precision, accuracy, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC).

5. The method of claim 4, characterized in that, for the validation results output by the classifier model,

the sensitivity is

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

the specificity is

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

the precision is

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

the accuracy is

$$\mathrm{Accuracy} = \frac{TP + TN}{N}$$

the Matthews correlation coefficient (MCC) is

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, FN is the number of false negative samples, and N is the total number of samples (N = TP + TN + FP + FN).

Technical Field

The invention belongs to the technical field of artificial intelligence detection, and in particular relates to a method for predicting proteins secreted into bronchoalveolar lavage fluid.

Background

Bronchoalveolar lavage fluid is obtained by lavaging the lung segments and subsegments below the bronchi through a fiberoptic bronchoscope and collecting the fluid lining the alveolar surface. It is used clinically for a range of lung diseases, such as alveolitis, pulmonary fibrosis, asbestosis, lung cancer, pulmonary cysticercosis, and pulmonary alveolar proteinosis, in clinical diagnosis, differential diagnosis, research on etiology and pathogenesis, evaluation of treatment efficacy, and prognosis.

Analyzing protein markers in bronchoalveolar lavage fluid can support the early diagnosis of lung diseases. However, no computational method for predicting bronchoalveolar lavage fluid proteins is currently known.

Therefore, there is a need in the art for a new solution to solve this problem.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for predicting proteins secreted into bronchoalveolar lavage fluid, addressing the lack of any currently known computational method for predicting bronchoalveolar lavage fluid proteins.

The method for predicting proteins secreted into bronchoalveolar lavage fluid comprises the following steps, carried out in sequence:

step one, taking proteins verified by biological experiments in bronchoalveolar lavage fluid as positive samples for model training, and storing the positive-sample protein information data;

step two, deleting the protein families corresponding to the positive samples of step one from the Pfam protein family database, extracting from the remaining families those containing more than 5 proteins, selecting 5 proteins from those families as negative samples for model training, and storing the negative-sample protein information data;

step three, balancing the numbers of positive and negative samples by random undersampling to obtain balanced positive and negative samples;

step four, randomly splitting the protein information data of the positive and negative samples into an 80% training set, a 10% validation set, and a 10% test set;

step five, calculating the position-specific scoring matrix (PSSM) of each protein sequence in the samples using position-specific iterated BLAST (PSI-BLAST);

step six, establishing a classifier model by combining a recurrent neural network (RNN) with long short-term memory (LSTM), wherein the input of the classifier model is the PSSM obtained in step five, and the output of the classifier model is bronchoalveolar lavage fluid protein or non-bronchoalveolar lavage fluid protein;

step seven, fitting the classifier model on the training set of step four, using the activation functions and a cross-entropy loss function, to obtain a trained classifier model;

step eight, inputting the protein information of the positive and negative samples in the validation set of step four, together with the PSSM obtained in step five, into the classifier model trained in step seven for validation, to obtain a validated classifier model; the validation results are evaluated using sensitivity, specificity, accuracy, precision, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC);

step nine, verifying the classification accuracy of the classifier model validated in step eight using the test set of step four; if the classification accuracy is below 90%, repeating steps six and seven until the classification accuracy reaches 90% or above, thereby completing the establishment of the classifier model;

step ten, inputting the protein sequences of an independent validation set into the classifier model established in step nine, the output prediction indicating whether each protein is secreted into bronchoalveolar lavage fluid.

The classifier model combining the recurrent neural network (RNN) and long short-term memory (LSTM) in step six is established as follows:

$$Y_{(t)} = \phi\left(X_{(t)} W_x + Y_{(t-1)} W_y + b\right) = \phi\left(\left[X_{(t)} \;\; Y_{(t-1)}\right] W + b\right), \qquad W = \begin{bmatrix} W_x \\ W_y \end{bmatrix}$$

wherein $Y_{(t)}$ is the output of the current layer at time $t$, $\phi$ is the activation function, $X_{(t)}$ is the input of the current layer at time $t$, $W_x$ is the weight of the current input, $Y_{(t-1)}$ is the output of the current layer at the previous time step, $W_y$ is the weight of the output at the previous time step, $b$ is the bias term of the current layer, and $W$ is the matrix formed by combining $W_x$ and $W_y$;

the long short-term memory (LSTM) is a bidirectional LSTM.

The activation functions in step seven include Tanh and Sigmoid, defined as

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

wherein $z$ is the weighted sum of the neuron's inputs and $e$ is the natural constant;

the cross-entropy loss function $L$ in step seven is

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$

wherein $y_i$ is the true class of the $i$-th sample, $\hat{y}_i$ is the prediction for the $i$-th sample, $\log$ is the logarithm function, and $m$ is the number of samples.

The validation results output by the classifier model validated in step eight are evaluated using sensitivity, specificity, precision, accuracy, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC).

For the validation results output by the classifier model,

the sensitivity is

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

the specificity is

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

the precision is

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

the accuracy is

$$\mathrm{Accuracy} = \frac{TP + TN}{N}$$

the Matthews correlation coefficient (MCC) is

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, FN is the number of false negative samples, and N is the total number of samples (N = TP + TN + FP + FN).

With the above design, the invention brings the following beneficial effects: proteins reported in the existing literature and databases as verified by biological experiments in bronchoalveolar lavage fluid serve as the samples for model training, the protein sequence serves as the model input, and a model built from an RNN and LSTM predicts which proteins are secreted into bronchoalveolar lavage fluid. Bronchoalveolar lavage fluid proteins are thereby predicted computationally, and disease-related proteins can be identified among the predicted proteins for pathological analysis, supporting early diagnosis of disease.

Detailed Description

The invention is further described below with reference to a specific embodiment. The method for predicting proteins secreted into bronchoalveolar lavage fluid comprises the following steps.

1. Creation of data sets

(1) Positive sample data set collection

Protein information verified by biological experiments in bronchoalveolar lavage fluid is obtained by searching the relevant biological literature and existing databases, and is entered into the computer as the positive samples for model training.

(2) Negative sample data set collection

The protein families corresponding to the positive samples are deleted from the Pfam protein family database. Among the remaining families, those containing more than 5 proteins are identified, and 5 proteins are randomly selected from those families as the negative samples for model training.
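The negative-sample selection can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary layout, the family and protein identifiers, and the choice to draw 5 proteins from each eligible family are all assumptions, since the text does not say whether the 5 proteins are drawn per family or in total.

```python
import random

def select_negative_samples(families, positive_family_ids,
                            min_size=5, per_family=5, seed=0):
    """Pick negatives from families that contain no positive sample.

    families: dict mapping a family ID to a list of protein IDs
              (hypothetical layout, not a real Pfam file format).
    positive_family_ids: set of family IDs linked to positive samples.
    """
    rng = random.Random(seed)
    negatives = []
    for family_id in sorted(families):
        members = families[family_id]
        if family_id in positive_family_ids:
            continue  # family is linked to a positive sample: drop it
        if len(members) <= min_size:
            continue  # keep only families with more than `min_size` proteins
        # Assumption: draw `per_family` proteins from each eligible family.
        negatives.extend(rng.sample(members, per_family))
    return negatives

# Toy data with made-up identifiers.
families = {
    "FAM_A": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "FAM_B": ["b1", "b2", "b3"],
    "FAM_C": ["c1", "c2", "c3", "c4", "c5", "c6", "c7"],
}
negatives = select_negative_samples(families, positive_family_ids={"FAM_C"})
```

Here only `FAM_A` qualifies: `FAM_B` is too small and `FAM_C` is excluded because it contains a positive sample.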

(3) Model training dataset segmentation

All sample data of the positive and negative samples are split into an 80% training set, a 10% validation set, and a 10% test set.
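A minimal sketch of the 80/10/10 random split is given below. The function name, the fixed seed, and the toy sample list are illustrative assumptions; any remainder from rounding falls into the test set.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the samples and cut them into train / validation / test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

# Toy samples: (protein ID, label) pairs with made-up identifiers.
samples = [(f"protein_{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(samples)
```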

2. Positive and negative sample equalization

Because a difference between the numbers of positive and negative samples can make the predictions inaccurate, random undersampling (RU) is used to balance them: samples are removed at random from the larger set until the positive and negative samples are balanced.
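Random undersampling can be sketched as below, assuming the positive and negative samples are held in two plain lists; the function name and the seed are illustrative.

```python
import random

def random_undersample(positives, negatives, seed=0):
    """Randomly shrink the larger class to the size of the smaller one."""
    rng = random.Random(seed)
    target = min(len(positives), len(negatives))
    if len(positives) > target:
        positives = rng.sample(positives, target)
    if len(negatives) > target:
        negatives = rng.sample(negatives, target)
    return list(positives), list(negatives)

# Toy example: 3 positives against 10 negatives.
pos = ["p1", "p2", "p3"]
neg = [f"n{i}" for i in range(10)]
balanced_pos, balanced_neg = random_undersample(pos, neg)
```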

3. Calculating the PSSM of each protein sequence

The position-specific scoring matrix (PSSM) of each protein sequence in the samples is calculated using position-specific iterated BLAST (PSI-BLAST) and serves as the input to the model.

4. Classifier model based on a recurrent neural network combined with long short-term memory
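As a rough illustration, a PSSM is commonly produced with the `psiblast` program from NCBI BLAST+. The sketch below only assembles the command line. The database name, the number of iterations, and the file names are assumptions, since the text does not specify them.

```python
def build_psiblast_command(fasta_path, pssm_path, db="swissprot", iterations=3):
    """Assemble a psiblast call that writes an ASCII PSSM for one sequence.

    `db` and `iterations` are illustrative defaults, not values from the text.
    """
    return [
        "psiblast",
        "-query", fasta_path,             # single-sequence FASTA file
        "-db", db,                        # name of a formatted BLAST database
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_path,     # PSSM written as plain text
    ]

command = build_psiblast_command("protein.fasta", "protein.pssm")
# Running it requires BLAST+ and a local database, for example:
# import subprocess; subprocess.run(command, check=True)
```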

(1) Neural network model fitting training

The recurrent neural network consists of an input layer, a recurrent layer plus an LSTM layer, and an output layer. The input layer only passes the input data on, and the recurrent layer is defined as follows:

$$Y_{(t)} = \phi\left(X_{(t)} W_x + Y_{(t-1)} W_y + b\right) = \phi\left(\left[X_{(t)} \;\; Y_{(t-1)}\right] W + b\right), \qquad W = \begin{bmatrix} W_x \\ W_y \end{bmatrix}$$

where $Y_{(t)}$ is the output of the current layer at time $t$, $\phi$ is the activation function, $X_{(t)}$ is the input of the current layer at time $t$, $W_x$ is the weight of the current input, $Y_{(t-1)}$ is the output of the current layer at the previous time step, $W_y$ is the weight of the output at the previous time step, $b$ is the bias term of the current layer, and $W$ is the matrix formed by combining $W_x$ and $W_y$.

The LSTM layer is a bidirectional LSTM.
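The recurrence described by these symbols, an activation applied to the current input times $W_x$ plus the previous output times $W_y$ plus the bias, can be sketched in plain Python. The weights and inputs below are made-up toy values, and the helper names are hypothetical. The second function shows the equivalent form that applies the stacked matrix $W$ to the concatenation of the input and the previous output.

```python
import math

def matvec(vec, mat):
    """Row vector (length n) times a matrix given as n rows of k columns."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def rnn_step(x_t, y_prev, w_x, w_y, b, phi=math.tanh):
    """One step: Y(t) = phi(X(t) Wx + Y(t-1) Wy + b)."""
    xw = matvec(x_t, w_x)
    yw = matvec(y_prev, w_y)
    return [phi(a + c + bias) for a, c, bias in zip(xw, yw, b)]

def rnn_step_stacked(x_t, y_prev, w, b, phi=math.tanh):
    """Same step using W = [Wx; Wy] applied to [X(t), Y(t-1)]."""
    zw = matvec(x_t + y_prev, w)
    return [phi(a + bias) for a, bias in zip(zw, b)]

# Toy layer: 3 input features, 2 recurrent units.
w_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_y = [[0.2, 0.0], [-0.1, 0.4]]
b = [0.1, -0.1]
x_t = [1.0, 0.5, -1.0]
y_prev = [0.3, -0.2]

y_t = rnn_step(x_t, y_prev, w_x, w_y, b)
y_t_stacked = rnn_step_stacked(x_t, y_prev, w_x + w_y, b)
```

Stacking the rows of `w_x` above those of `w_y` gives the combined matrix, so both functions return the same output up to floating-point rounding.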

The activation functions Tanh and Sigmoid are defined as follows:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the weighted sum of the neuron's inputs and $e$ is the natural constant.
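For reference, the two activation functions can be written directly from their standard definitions. This is a minimal sketch; the exponential form of Tanh shown here overflows for very large |z|, so production code would normally call `math.tanh` instead.

```python
import math

def tanh(z):
    """Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z), range (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def sigmoid(z):
    """Logistic function: 1 / (1 + e^-z), range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

values = [(z, tanh(z), sigmoid(z)) for z in (-2.0, 0.0, 2.0)]
```

The two are related by tanh(z) = 2·sigmoid(2z) − 1, which is why Tanh is centered on zero while Sigmoid is centered on 0.5.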

The loss function of the model is the binary cross-entropy, defined as follows:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$

where $y_i$ is the true class of the $i$-th sample, $\hat{y}_i$ is the prediction for the $i$-th sample, $\log$ is the logarithm function, and $m$ is the number of samples.
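The binary cross-entropy can be sketched as below, assuming labels of 0 or 1 and predictions given as probabilities. The clipping constant `eps` is an added safeguard against taking the logarithm of zero and is not part of the original definition.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over m samples (natural logarithm)."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / m

loss_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
loss_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

A prediction of 0.5 for every sample gives a loss of log 2; confident correct predictions drive the loss toward zero, and confident wrong ones make it grow.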

5. Model performance assessment

The protein information of the positive and negative samples in the validation set, together with the feature vectors obtained in the preceding steps, is input into the trained classifier model for validation. Sensitivity, specificity, precision, accuracy, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC) serve as the indicators of validation performance. If the AUC is below 90%, the classifier model is refitted until the AUC reaches 90% or above.

Sensitivity, specificity, precision, accuracy, and the Matthews correlation coefficient (MCC) are defined as follows; the AUC is the area under the ROC curve:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{N}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, FN is the number of false negative samples, and N is the total number of samples (N = TP + TN + FP + FN).
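The evaluation indicators can be computed from the four confusion-matrix counts as sketched below, using the standard definitions of each metric. The counts and scores are made-up toy values. The AUC function uses the pairwise-ranking formulation (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half), which is one common way to compute it and is an assumption here, as the text does not describe how the AUC is calculated.

```python
import math

def _ratio(num, den):
    """Safe division: return 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, accuracy and MCC from counts."""
    n = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, n),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }

def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With these toy scores, three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.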

Finally, the classification accuracy of the validated classifier model is verified on the test set. If the classification accuracy is below 90%, classifier fitting and model validation are repeated until the classification accuracy reaches 90% or above, completing the establishment of the model for predicting proteins secreted into bronchoalveolar lavage fluid.
