Vesicle transport protein identification method and identification device based on a position-specific scoring matrix

Document No.: 1939960 Publication date: 2021-12-07

Reading note: the technology "Vesicle transport protein identification method and identification device based on a position-specific scoring matrix" was designed by 汪国华 (Wang Guohua), 宫越 (Gong Yue) and 邹权 (Zou Quan) on 2021-09-10. Abstract: The invention relates to a vesicle transport protein identification method and identification device based on a position-specific scoring matrix. It aims to solve the low efficiency and high cost of existing methods for identifying vesicle transport proteins. The process is as follows: S1, acquiring a protein sequence data file; S2, generating a position-specific scoring matrix based on S1, and extracting feature vectors from the position-specific scoring matrix using the AATP algorithm; S3, obtaining processed feature vectors using an imbalance processing algorithm; S4, obtaining a feature vector set using the MRMD algorithm; S5, adopting XGBoost as the classifier and performing hyper-parameter optimization; S6, obtaining a trained classifier model; S7, inputting the data set to be tested into the trained classifier model to obtain a classification result, thereby completing the identification of vesicle transport proteins. The invention is used in the field of protein identification.

1. A vesicle transport protein identification method based on a position-specific scoring matrix, characterized in that the method comprises the following steps:

S1, acquiring a protein sequence data file;

S2, generating a position-specific scoring matrix from the protein sequence data file acquired in S1, and extracting feature vectors from the position-specific scoring matrix using the AATP algorithm;

S3, processing the feature vectors extracted in S2 with an imbalance processing algorithm to obtain processed feature vectors;

S4, performing feature selection on the processed feature vectors obtained in S3 using the MRMD algorithm, to obtain a feature vector set in which the features correlate strongly with the categories and redundancy among the features is low;

S5, adopting XGBoost as the classifier and performing hyper-parameter optimization;

S6, inputting the feature vector set obtained in S4 into the classifier for classification training, to obtain a trained classifier model;

S7, inputting the data set to be tested into the trained classifier model to obtain a classification result, thereby completing the identification of vesicle transport proteins.

2. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 1, wherein the protein sequence data file is acquired in S1 as follows:

acquiring a protein sequence data file that comprises a positive example data set and a negative example data set;

the positive example data set is a sequence data file of vesicle transport proteins, and the negative example data set is a sequence data file of non-vesicle transport proteins.

3. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 2, wherein, in S2, the position-specific scoring matrix is generated from the protein sequence data file acquired in S1 and the feature vectors are extracted from it using the AATP algorithm as follows:

S21, performing error detection on the format and the content of the protein sequence data file obtained in S1 to obtain a correct protein sequence data file; the specific process is as follows:

S211, performing error detection on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format;

S212, performing error detection on the content of the correctly formatted protein sequence data file obtained in S211 to obtain a protein sequence data file with correct format and content;

S22, using the PSI-BLAST program to compare the correct protein sequence data file obtained in S21 against NCBI's non-redundant database to obtain a position-specific scoring matrix;

and extracting feature vectors from the position-specific scoring matrix using the feature extraction algorithm AATP.

4. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 3, wherein, in S211, error detection is performed on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format as follows:

when a line of the protein sequence data file acquired in S1 does not begin with the character ">", deleting that line as non-conforming data;

when a line of the protein sequence data file acquired in S1 begins with the character ">", the remainder of that line contains the identification number of the sequence and the next line is the sequence text of the protein sequence data file, whereby a protein sequence data file in the correct format is obtained.

5. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 4, wherein, in S212, error detection is performed on the content of the correctly formatted protein sequence data file obtained in S211 to obtain a protein sequence data file with both correct format and correct content as follows:

judging whether a character string of the correctly formatted protein sequence data file obtained in S211 contains "B", "J", "O", "U", "X" or "Z"; if it contains none of them, the protein sequence data file obtained in S211 is correct, and S22 is performed;

if the character string contains "B", "J", "O", "U", "X" or "Z", the protein sequence data file obtained in S211 has an error; the "B", "J", "O", "U", "X" or "Z" characters it contains are deleted, and S22 is then performed.

6. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 5, wherein, in S3, the feature vectors extracted in S2 are processed with an imbalance processing algorithm to obtain processed feature vectors as follows:

there are seven imbalance processing algorithms in total: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek;

processing the feature vectors extracted in S2 with each of the seven imbalance processing algorithms to reduce the imbalance of the data, evaluating the accuracy by cross-validation, and selecting the imbalance processing algorithm with the highest accuracy as the final imbalance processing algorithm;

and processing the feature vectors extracted in S2 with the finally selected imbalance processing algorithm to obtain the processed feature vectors.

7. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 6, wherein, in S4, the MRMD algorithm is used to perform feature selection on the processed feature vectors obtained in S3, to obtain a feature vector set in which the features correlate strongly with the categories and redundancy among the features is low, as follows:

ranking all the processed feature vectors obtained in S3 using the Hits-a, TrustRank, PageRank, LeaderRank and Hits-h ranking modes respectively, to obtain the feature vector sets of five ranking modes;

selecting features in the feature vector set of each of the five ranking modes using the MRMD algorithm, to obtain the feature vector sets of the five ranking modes after feature selection;

and comparing the feature vector sets of the five ranking modes after feature selection by cross-validation, and selecting the feature vector set with the highest accuracy.

8. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 7, wherein the criterion by which the MRMD algorithm selects features in the feature vector sets of the five ranking modes is $\max(MR_i + MD_i)$;

where $MR_i$ denotes the Pearson coefficient between the class of the i-th protein and the feature, and $MD_i$ denotes the Euclidean distance between the class of the i-th protein and the feature;

$\max MR_i$ is calculated as follows:

(equation not reproduced in this text)

$\max MD_i$ is calculated as follows:

(equation not reproduced in this text)

where $PCC(\cdot)$ denotes the Pearson coefficient, $F_i$ denotes the feature vector of the i-th protein, $C_i$ denotes the class of the i-th protein, $M$ denotes the feature dimension of the protein, $S_{F_i C_i}$ denotes the covariance of all elements in $F_i$ and all elements in $C_i$, $S_{F_i}$ denotes the standard deviation of all elements in $F_i$, $S_{C_i}$ denotes the standard deviation of all elements in $C_i$, $f_k$ denotes the k-th element of $F_i$, $c_k$ denotes the k-th element of $C_i$, $N$ is the number of elements in $F_i$ and $C_i$, $\bar{f}$ is the mean of all elements in $F_i$, $\bar{c}$ is the mean of all elements in $C_i$, $ED_i$ denotes the Euclidean distance between the features of the i-th protein, $COS_i$ denotes the cosine distance between the features of the i-th protein, and $TC_i$ denotes the Tanimoto coefficient between the features of the i-th protein.

9. The vesicle transport protein identification method based on a position-specific scoring matrix according to claim 8, wherein, in S5, XGBoost is adopted as the classifier and hyper-parameter optimization is performed as follows:

S51, initializing the XGBoost parameters:

learning rate learning_rate = 0.1; maximum number of iterations n_estimators = 200; maximum depth max_depth = 5; min_child_weight = 1; gamma = 0; [parameter name missing] = 0.8; colsample_bytree = 0.8;

S52, taking one of the initial parameters as the variable, selecting an adjustment range for it, and keeping the other parameters unchanged; using XGBoost's built-in cross-validation to search iteratively for the optimal value of that parameter;

S53, repeating S52 until the optimal value of every parameter has been found, thereby obtaining the optimal parameters and the optimal XGBoost to be used as the classifier.

10. A vesicle transport protein identification device based on a position-specific scoring matrix, comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the vesicle transport protein identification method based on a position-specific scoring matrix according to any one of claims 1 to 9.

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a vesicle transport protein identification method and identification equipment.

Background

In recent years, research on vesicle transport proteins has received increasing attention. Vesicle transport proteins take on the task of transporting macromolecules and particles that cannot cross the cell membrane on their own. To date, many studies have shown that aberrant vesicle transport proteins may cause a variety of diseases that seriously harm human health, such as Hermansky-Pudlak syndrome. Given the importance of vesicle transport proteins in eukaryotic cells, researchers in cell biology have developed experimental techniques that identify them with excellent results, such as morpholino knockdown and dissection. These techniques identify vesicle transport proteins accurately, but they are often inefficient and expensive, so a time-saving and highly accurate identification method is needed.

Disclosure of Invention

The invention aims to solve the problems of low efficiency and high cost of the existing vesicle transporter identification method, and provides a vesicle transporter identification method and identification equipment based on a position specificity score matrix.

The vesicle transport protein identification method based on a position-specific scoring matrix comprises the following steps:

S1, acquiring a protein sequence data file;

S2, generating a position-specific scoring matrix from the protein sequence data file acquired in S1, and extracting feature vectors from the position-specific scoring matrix using the AATP algorithm;

S3, processing the feature vectors extracted in S2 with an imbalance processing algorithm to obtain processed feature vectors;

S4, performing feature selection on the processed feature vectors obtained in S3 using the MRMD algorithm, to obtain a feature vector set in which the features correlate strongly with the categories and redundancy among the features is low;

S5, adopting XGBoost as the classifier and performing hyper-parameter optimization;

S6, inputting the feature vector set obtained in S4 into the classifier for classification training, to obtain a trained classifier model;

S7, inputting the data set to be tested into the trained classifier model to obtain a classification result, thereby completing the identification of vesicle transport proteins.

The vesicle transport protein identification device based on a position-specific scoring matrix comprises a processor and a memory; the memory stores at least one instruction, which is loaded and executed by the processor to implement the vesicle transport protein identification method based on a position-specific scoring matrix.

The invention has the beneficial effects that:

(1) The invention provides a new vesicle transport protein identification method that identifies vesicle transport proteins accurately using features extracted from the position-specific scoring matrix, and provides a theoretical basis for the development of corresponding drugs.

(2) The invention applies several imbalance processing algorithms to reduce the imbalance of the data, compares them, and selects the best-performing algorithm. MRMD is then used to reduce the feature dimension, which effectively improves the identification performance of the model.

(3) XGBoost is used as the learner and its hyper-parameters are optimized, which improves the efficiency with which the model processes vesicle transport proteins and reduces the cost of identification.

Drawings

FIG. 1 is a flow chart of the vesicle transport protein identification method based on a position-specific scoring matrix provided in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the identification performance of different feature extraction methods according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the identification performance of different imbalance processing methods according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the identification performance of different parameters of the dimension reduction algorithm according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the identification performance of different learners according to an embodiment of the present invention.

Detailed Description

The first embodiment: the vesicle transport protein identification method based on a position-specific scoring matrix comprises the following steps:

S1, acquiring a protein sequence data file;

S2, generating a position-specific scoring matrix from the protein sequence data file acquired in S1, and extracting feature vectors from the position-specific scoring matrix using the AATP algorithm;

S3, processing the feature vectors extracted in S2 with an imbalance processing algorithm to obtain processed feature vectors;

from S1, the vesicle transport proteins and the non-vesicle transport proteins are known; the purpose of the imbalance processing algorithm is to balance the unequal numbers of the two classes, deleting data from the more numerous class so that the two numbers are balanced;

the imbalance processing algorithm reduces the imbalance of the feature vector data extracted in S2 (data imbalance means that all data fall into two classes, vesicle transport proteins and everything else, and the two classes differ greatly in number; for example, there may be only 2000 vesicle transport proteins but more than 7000 non-vesicle transport proteins, so the two classes are imbalanced and need to be processed);

S4, performing feature selection on the processed feature vectors obtained in S3 using the MRMD algorithm, to obtain a feature vector set in which the features correlate strongly with the categories and redundancy among the features is low;

S5, adopting XGBoost as the classifier and performing hyper-parameter optimization;

S6, inputting the feature vector set obtained in S4 into the classifier for classification training, to obtain a trained classifier model;

S7, inputting the data set to be tested into the trained classifier model to obtain a classification result, thereby completing the identification of vesicle transport proteins.
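The balancing idea described for S3 (deleting data from the more numerous class) can be illustrated with a minimal random under-sampling sketch. It is not the invention's selected algorithm; the function name `random_undersample`, the `seed` parameter and the list-based data layout are assumptions made only for this illustration.

```python
import random

def random_undersample(features, labels, seed=0):
    """Randomly drop rows of the larger class until all classes have equal size.

    features: list of feature vectors; labels: list of class labels (same length).
    The seed only makes this illustrative run repeatable.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is cut down to the size of the smallest class
    target = min(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for label, rows in by_class.items():
        out_x.extend(rng.sample(rows, target))
        out_y.extend([label] * target)
    return out_x, out_y
```

With 3 positive and 7 negative rows, for example, the result keeps 3 rows of each class.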

The second embodiment: this embodiment differs from the first embodiment in how the protein sequence data file is acquired in S1; the specific process is as follows:

acquiring a protein sequence data file (from known websites such as the UniProt database and the Gene Ontology website; UniProt is a website dedicated to protein-related data), wherein the protein sequence data file comprises a positive example data set and a negative example data set;

the positive example data set is a sequence data file of vesicle transport proteins, and the negative example data set is a sequence data file of non-vesicle transport proteins.

Other steps and parameters are the same as those in the first embodiment.

The third embodiment: this embodiment differs from the first or second embodiment in that, in S2, a position-specific scoring matrix is generated from the protein sequence data file acquired in S1 and feature vectors are extracted from it using the AATP algorithm; the specific process is as follows:

S21, before the position-specific scoring matrix is generated in S2, error detection is performed on the format and the content of the protein sequence data file obtained in S1, because a wrongly formatted file would affect the subsequent steps; a correct protein sequence data file is thereby obtained; the specific process is as follows:

S211, performing error detection on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format;

S212, performing error detection on the content of the correctly formatted protein sequence data file obtained in S211 to obtain a protein sequence data file with correct format and content;

S22, using the PSI-BLAST program to compare the correct protein sequence data file obtained in S21 against NCBI's non-redundant database to obtain a position-specific scoring matrix;

the position-specific scoring matrix contains important evolutionary information about the protein, and extracting features from this matrix effectively improves the performance of the vesicle transport protein identification model.

Feature vectors are then extracted from the position-specific scoring matrix using the feature extraction algorithm AATP.

The feature extraction algorithm AATP consists of two parts, AAC and TPC. AAC, a 20-dimensional feature vector, represents the average score with which each amino acid is substituted by other amino acid types during protein evolution. TPC is a 400-dimensional feature obtained from a transition probability matrix and effectively avoids the loss of sequence information.

The AATP algorithm extracts the most important information from the position-specific scoring matrix, which further improves the efficiency and performance of vesicle transport protein identification.
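As a rough illustration of the 20 + 400 = 420-dimensional AATP feature described above, the sketch below computes AAC as the column means of an L x 20 scoring matrix and TPC as row-normalised products of scores at consecutive positions. The exact TPC normalisation is an assumption based on one common formulation and is not given in this text; the function name `aatp_features` and the zero-denominator guard are likewise assumptions.

```python
def aatp_features(pssm):
    """Return 20 AAC values followed by 400 TPC values for one protein.

    pssm: list of L rows (one per sequence position), each holding 20 scores.
    """
    length = len(pssm)
    n = 20
    # AAC: mean score of each of the 20 columns over all positions
    aac = [sum(row[j] for row in pssm) / length for j in range(n)]
    tpc = []
    for i in range(n):
        # Un-normalised transition scores from column i at position k
        # to column j at position k + 1
        row_scores = [
            sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            for j in range(n)
        ]
        total = sum(row_scores)
        # Guard against a zero denominator (an assumption of this sketch)
        tpc.extend(s / total if total != 0 else 0.0 for s in row_scores)
    return aac + tpc
```

For a real protein, `pssm` would hold the scores produced in S22; the toy input used to check the sketch is a constant matrix.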

Other steps and parameters are the same as those in the first or second embodiment.

The fourth embodiment: this embodiment differs from one of the first to third embodiments in that, in S211, error detection is performed on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format; the specific process is as follows:

when a line of the protein sequence data file acquired in S1 does not begin with the character ">", that line is deleted as non-conforming data;

when a line of the protein sequence data file acquired in S1 begins with the character ">", the remainder of that line contains information such as the identification number and position of the sequence, and the next line is the sequence text of the protein sequence data file; a protein sequence data file in the correct format is thereby obtained;

A protein sequence data file contains many lines, for example:

>Q20300
MMDQILGTNFTYEGAKEVARGLEGFSAKLAVGYIATIFGLKYYMKDRK
>D3ZGS3
MEPRLPIGAQPLACLHMVAGLEMKGPLREPCVLTLARRNGQYELIIQLI
>A2AUC9
MDSQRELAEELRLYQSTLLQDGLKDLLEEKKFIDCTLKAGDKSFPCHRLI
>O18037
MEAANEVVNLFASQATTPSSLDAVTTLETVSTPTFIFPEVSDSQILQLMI
>H2E7T7
MALDLLSSYAPGLVESLLTWKGAAGLAAAVALGYIIISNLPGRQVAKPS
>Q04LE4
MISRFFRHLFEALKSLKRNGWMTVAAVSSVMITLTLVAIFASVIFNTAKI
>G0Y287
MVKLVEVLQHPDEIVPILQMLHKTYRAKRSYKDPGLAFCYGMLQRVSF

">" is followed by the identification number of the protein, such as "Q20300" in the first line, and the line immediately below it is that protein's sequence.

At least an identification number follows ">"; other information is optional, and sometimes two further items, length and type, are present.
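The format check of S211 might be sketched as follows. This is one reading of the rule, under the assumption that each header line is followed by exactly one sequence line (as in the listing above); the function name `parse_records` is hypothetical.

```python
def parse_records(lines):
    """Keep (identifier, sequence) pairs and drop lines that fit neither role."""
    records = []
    pending_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The first token after '>' is taken as the identification number;
            # optional items such as length and type are ignored
            tokens = line[1:].split()
            pending_id = tokens[0] if tokens else None
        elif pending_id is not None:
            # The line right after a header is the sequence text
            records.append((pending_id, line))
            pending_id = None
        # Any other line is non-conforming data and is dropped
    return records
```

Lines that are neither a header nor the sequence immediately following a header are silently discarded, which mirrors the deletion of non-conforming lines.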

Other steps and parameters are the same as those in one of the first to third embodiments.

The fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that, in S212, error detection is performed on the content of the correctly formatted protein sequence data file obtained in S211 to obtain a protein sequence data file with both correct format and correct content; the specific process is as follows:

there are 20 kinds of amino acids, each represented by one of 20 letters, and these 20 letters do not include "B", "J", "O", "U", "X" or "Z";

judging whether a character string of the correctly formatted protein sequence data file obtained in S211 contains "B", "J", "O", "U", "X" or "Z"; if it contains none of them, the protein sequence data file obtained in S211 is reported as correct, and S22 is performed;

if the character string contains "B", "J", "O", "U", "X" or "Z", the protein sequence data file obtained in S211 is reported as containing an error; the "B", "J", "O", "U", "X" or "Z" characters it contains are deleted (as many as are present), and S22 is then performed.

Other steps and parameters are the same as in one of the first to fourth embodiments.
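The content check of S212 in the fifth embodiment reduces to filtering six letters out of each sequence. A minimal sketch, assuming upper-case sequence strings; the names `NON_STANDARD` and `clean_sequence` are hypothetical.

```python
# Letters that do not stand for one of the 20 amino acids
NON_STANDARD = set("BJOUXZ")

def clean_sequence(seq):
    """Return the sequence without non-standard letters and a flag telling
    whether anything had to be removed."""
    cleaned = "".join(ch for ch in seq if ch not in NON_STANDARD)
    return cleaned, cleaned != seq
```

The returned flag corresponds to the "error" prompt in the text: it is true only when at least one of the six letters was present.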

The sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that, in S3, the feature vectors extracted in S2 are processed with an imbalance processing algorithm to obtain processed feature vectors and reduce the imbalance of the data; the specific process is as follows:

a tool called imblearn is used, which provides the seven imbalance processing algorithms: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek;

the feature vectors extracted in S2 are processed with the finally selected imbalance processing algorithm to obtain processed feature vectors, so as to reduce the imbalance of the data;

the selection uses cross-validation, a known technique: the data are divided into 5 parts, 4 of them are used to train the learner, and the remaining part is used for testing to see how many of its samples are identified correctly. Cross-validation yields several metrics such as accuracy, sensitivity and recall; the algorithm with the highest accuracy is generally selected.

In this step all other conditions are kept unchanged and only the imbalance processing algorithm is varied for comparison; the best-performing algorithm is then applied in the subsequent steps.
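The comparison described above can be sketched in plain Python as follows. The sketch is library-independent: `resamplers` maps a name to any callable that returns balanced `(features, labels)`, and `fit_predict` stands in for training and applying the learner; both interfaces, and all function names, are assumptions of this illustration rather than part of imblearn or of the method as filed.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy(features, labels, fit_predict, k=5):
    """Mean accuracy over k folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels.
    """
    scores = []
    for held_out in k_fold_indices(len(labels), k):
        test = set(held_out)
        train_x = [features[i] for i in range(len(labels)) if i not in test]
        train_y = [labels[i] for i in range(len(labels)) if i not in test]
        test_x = [features[i] for i in held_out]
        preds = fit_predict(train_x, train_y, test_x)
        correct = sum(p == labels[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / k

def pick_best(resamplers, features, labels, fit_predict):
    """Resample with each candidate, score it by cross-validated accuracy,
    and return the best candidate's name together with all scores."""
    results = {}
    for name, resample in resamplers.items():
        x, y = resample(features, labels)
        results[name] = cv_accuracy(x, y, fit_predict)
    return max(results, key=results.get), results
```

Here the data are resampled before being split into folds, following the order in which the text describes the steps.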

Other steps and parameters are the same as those in one of the first to fifth embodiments.

The seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that, in S4, the MRMD algorithm is used to perform feature selection on the processed feature vectors obtained in S3, to obtain a feature vector set in which the features correlate strongly with the categories and redundancy among the features is low; the specific process is as follows:

for example, suppose every feature vector extracted in S2 has 5 features; the 5 features are ranked using the Hits-a, TrustRank, PageRank, LeaderRank and Hits-h ranking modes respectively, giving the feature vector sets of five ranking modes (the feature vector set of each ranking mode consists of the 5 features in a different order);

features are then selected in the feature vector set of each of the five ranking modes using the MRMD algorithm (for example, if the 5 features in the feature vector set of a ranking mode are too many, the MRMD algorithm screens them), giving the feature vector sets of the five ranking modes after feature selection;

the MRMD algorithm uses the Pearson correlation coefficient to measure the correlation between the feature subsets of the protein feature set obtained in S3 and the two target classes, vesicle transport protein and non-vesicle transport protein, and uses several distance functions to obtain the redundancy of each feature subset; the feature subset selected by MRMD has low redundancy and a strong correlation with the target class.

For example, the feature vector set of each of the 5 ranking modes consists of the 5 features in a different order; 5 features are too many, and the most useful part of them must be screened out. Starting from the first feature in the ranked order, the MRMD algorithm adds one feature at a time to a feature subset and determines how many features must be added for the feature subset to perform best. For instance:

Feature vector: {name, gender, age, height}

After ranking: {age, name, height, gender}

Name is one of the features, and the feature subsets include {name}, {name, gender}, {gender, age}, and so on; $\max(MR_i + MD_i)$ is first computed for the first feature subset {name}, then for the second feature subset {name, gender}, and so on; the feature subset with the largest $\max(MR_i + MD_i)$ is taken as the feature vector set after feature selection for that ranking mode; this selection is performed for the feature vector set of every ranking mode, giving the feature vector set of each ranking mode after feature selection;

different feature ranking modes give different ranking results, so the finally selected feature subsets differ.

The role of MRMD is to screen the features in the feature vector set.

The distance functions comprise the Euclidean distance function, the cosine distance function and the Tanimoto coefficient function. The three functions are used to calculate the distance between each feature subset and the target class, and the sum of the distances is the redundancy.

The feature vector sets of the five ranking modes after feature selection are compared by cross-validation, and the feature vector set with the highest accuracy is selected.
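The relevance-plus-distance score can be sketched as below. This is an illustration under stated assumptions, not the MRMD implementation: taking the absolute Pearson coefficient as relevance, averaging the three measures (the 1/3 factor), averaging over the other feature columns, and the particular cosine and Tanimoto formulas are all assumptions, as is the function name `mrmd_scores`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

def mrmd_scores(columns, classes):
    """Score each feature column as relevance (MR) plus distance (MD).

    columns: one list of values per feature; classes: class label per sample.
    """
    scores = []
    for i, col in enumerate(columns):
        relevance = abs(pearson(col, classes))  # assumed form of MR_i
        others = [c for j, c in enumerate(columns) if j != i]
        distance = 0.0
        if others:
            # Assumed form of MD_i: mean of the three measures over other features
            distance = sum(
                (euclidean(col, o) + cosine(col, o) + tanimoto(col, o)) / 3
                for o in others
            ) / len(others)
        scores.append(relevance + distance)
    return scores
```

Features with a higher score would be preferred when building the feature subset; the final choice between subsets is still made by cross-validation as described above.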

Other steps and parameters are the same as those in one of the first to sixth embodiments.

The eighth embodiment: this embodiment differs from one of the first to seventh embodiments in that the criterion by which the MRMD algorithm selects features in the feature vector sets of the five ranking modes is $\max(MR_i + MD_i)$;

where $MR_i$ denotes the Pearson coefficient between the class of the i-th protein and the feature, and $MD_i$ denotes the Euclidean distance between the class of the i-th protein and the feature;

$\max MR_i$ is calculated as follows:

(equation not reproduced in this text)

$\max MD_i$ is calculated as follows:

(equation not reproduced in this text)

where $PCC(\cdot)$ denotes the Pearson coefficient, $F_i$ denotes the feature vector of the i-th protein (vesicle transport protein or non-vesicle transport protein), $C_i$ denotes the class of the i-th protein (vesicle transport protein or non-vesicle transport protein), $M$ denotes the feature dimension of the protein, $S_{F_i C_i}$ denotes the covariance of all elements in $F_i$ and all elements in $C_i$, $S_{F_i}$ denotes the standard deviation of all elements in $F_i$, $S_{C_i}$ denotes the standard deviation of all elements in $C_i$, $f_k$ denotes the k-th element of $F_i$, $c_k$ denotes the k-th element of $C_i$, $N$ is the number of elements in $F_i$ and $C_i$, $\bar{f}$ is the mean of all elements in $F_i$, $\bar{c}$ is the mean of all elements in $C_i$, $ED_i$ denotes the Euclidean distance between the features of the i-th protein, $COS_i$ denotes the cosine distance between the features of the i-th protein, and $TC_i$ denotes the Tanimoto coefficient between the features of the i-th protein.
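The two calculations referred to above are not legible in this text. A plausible reconstruction from the symbols defined in this embodiment is given below; the absolute value, the $1/(N-1)$ normalisation and the $1/3$ averaging factor are assumptions (the description elsewhere speaks only of adding the three distances), so the block should be read as a sketch rather than as the formulas as filed.

```latex
\begin{align}
\max MR_i &= \left| PCC(F_i, C_i) \right|, \qquad 1 \le i \le M \\
PCC(F_i, C_i) &= \frac{S_{F_i C_i}}{S_{F_i}\, S_{C_i}} \\
S_{F_i C_i} &= \frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})(c_k - \bar{c}) \\
S_{F_i} &= \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})^2}, \qquad
S_{C_i} = \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (c_k - \bar{c})^2} \\
\max MD_i &= \frac{1}{3}\left( ED_i + COS_i + TC_i \right)
\end{align}
```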

Other steps and parameters are the same as those in one of the first to seventh embodiments.

The ninth embodiment: this embodiment differs from one of the first to eighth embodiments in that, in S5, XGBoost is adopted as the classifier and hyper-parameter optimization is performed; the specific process is as follows:

S51, initializing the XGBoost parameters: learning rate learning_rate = 0.1; maximum number of iterations n_estimators = 200; maximum depth max_depth = 5; min_child_weight = 1; gamma = 0; [parameter name missing] = 0.8; colsample_bytree = 0.8;

S52, taking one of the initial parameters as the variable, selecting an adjustment range for it, and keeping the other parameters unchanged; using XGBoost's built-in cross-validation to search iteratively for the optimal value of that parameter;

S53, repeating S52 until the optimal value of every parameter has been found, thereby obtaining the optimal parameters and the optimal XGBoost to be used as the classifier.
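The one-parameter-at-a-time search of S52 and S53 can be sketched independently of any library. In the sketch, `score` is a hypothetical stand-in for a cross-validated evaluation (for example one built on XGBoost's built-in cross-validation), and the candidate grids are illustrative assumptions; the text does not state the adjustment ranges.

```python
def tune_one_at_a_time(initial, grids, score):
    """Tune each parameter in turn while holding the others fixed.

    initial: dict of starting parameter values (as in S51)
    grids:   dict mapping a parameter name to the candidate values to try
    score:   callable(params) -> cross-validated score, higher is better
    """
    best = dict(initial)
    for name, candidates in grids.items():
        best_value, best_score = best[name], score(best)
        for value in candidates:
            trial = dict(best, **{name: value})  # vary only this parameter
            s = score(trial)
            if s > best_score:
                best_value, best_score = value, s
        best[name] = best_value  # freeze the winner before moving on
    return best
```

A call might look like `tune_one_at_a_time({"max_depth": 5, "learning_rate": 0.1}, {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1, 0.2]}, score)`. Because each parameter is frozen before the next is tuned, the result depends on the order of `grids`; a joint grid search would avoid that at a higher cost.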

Other steps and parameters are the same as those in one to eight of the embodiments.

The tenth embodiment: the vesicle transport protein identification device based on a position-specific scoring matrix of this embodiment comprises a processor and a memory; the memory stores at least one instruction, which is loaded and executed by the processor to implement the vesicle transport protein identification method based on a position-specific scoring matrix according to one of the first to ninth embodiments.

The following examples were used to demonstrate the beneficial effects of the present invention:

the first embodiment is as follows:

exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It is to be understood that the embodiments shown and described in the drawings are merely exemplary and are intended to illustrate the principles and spirit of the invention, not to limit the scope of the invention.

The embodiment of the invention provides a vesicle transport protein recognition method based on a position specificity score matrix, as shown in figure 1, comprising the following steps of S1-S7:

and S1, downloading a protein sequence data file.

The acquired original protein characteristic data set comprises a positive case data set and a negative case data set, wherein the positive case data set is a vesicle transport protein sequence file, and the negative case data set is a non-vesicle transport protein sequence file.

In the present example there are 2 protein sequence data files: a sequence data file of vesicle transport proteins (containing 9086 positive-example vesicle transport protein sequences) and a sequence data file of non-vesicle transport proteins (containing 2533 negative-example non-vesicle transport protein sequences).

S2, generating a position specificity score matrix based on the protein sequence data file acquired at S1, and extracting feature vectors from the position specificity score matrix by adopting an AATP algorithm; the specific process is as follows:

S211, checking the format of the protein sequence data file acquired at S1 for errors to obtain a protein sequence data file with a correct format; the specific method is as follows:

if a line of the protein sequence data file acquired at S1 does not begin with the character ">", that line is deleted as non-conforming data;

there are 20 kinds of amino acids, each represented by one of 20 letters, and these 20 letters do not include "B", "J", "O", "U", "X" or "Z";

judging whether the character strings of the correctly formatted protein sequence data file obtained in S211 contain "B", "J", "O", "U", "X" or "Z"; if they contain none of these letters, reporting that the protein sequence data file obtained in S211 is correct and performing S22;

if they contain "B", "J", "O", "U", "X" or "Z", reporting that the protein sequence data file obtained in S211 contains errors, deleting every "B", "J", "O", "U", "X" or "Z" contained in the file (several deletions may be needed), and then performing S22;

S22, using the PSI-BLAST program to align the correctly formatted protein sequence data file obtained in S21 against the NCBI non-redundant database to obtain the position specificity score matrix;

and extracting feature vectors from the position specificity score matrix by adopting the feature extraction algorithm AATP.
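The format check of step S21 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `clean_fasta` is hypothetical, the input is assumed to be FASTA text, and the rule about lines that do not begin with ">" is interpreted here as dropping sequence data that is not preceded by a ">" header line.

```python
from typing import Dict, List, Optional, Tuple

# Letters that are not among the 20 amino-acid codes (from the description).
NONSTANDARD = set("BJOUXZ")


def clean_fasta(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return ({header: cleaned sequence}, [headers whose sequence had errors]).

    Assumption: data that appears before any ">" header is non-conforming
    and is dropped. Duplicate headers overwrite each other in this sketch.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
        # else: no header seen yet, so the line is skipped

    flagged: List[str] = []
    for name in list(records):
        seq = records[name]
        if NONSTANDARD & set(seq):
            # Report the error, then delete every non-standard letter.
            flagged.append(name)
            records[name] = "".join(ch for ch in seq if ch not in NONSTANDARD)
    return records, flagged


example = "MKV\n>p1\nMKXVB\nLLU\n>p2\nACDE\n"
cleaned, with_errors = clean_fasta(example)
```

The cleaned records would then be written back to a FASTA file and passed to PSI-BLAST in step S22.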

S3, reducing the imbalance of the data by using an imbalance processing algorithm.

In this step, the imbalance processing methods provided by the Python package imbalanced-learn are compared, and the algorithm with the best effect is selected. Seven algorithms are used: ClusterCentroids, NearMiss, ENN (EditedNearestNeighbours), RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek. All other conditions are kept unchanged and only the imbalance processing algorithm is varied; the best-performing algorithm is then applied in the subsequent steps.

S4, sorting all the processed feature vectors obtained in S3 using each of the sorting modes Hits-a, TrustRank, PageRank, LeaderRank and Hits-h to obtain five sorted feature subsets;

performing feature selection on each of the five sorted feature subsets by using the MRMD algorithm (for example, if the 5 features in a feature subset are too many, the MRMD algorithm can screen those 5 features) to obtain five sorted feature subsets after feature selection;

and comparing the five sorted feature subsets after feature selection through cross validation, and selecting the feature vector set with the highest accuracy.

The MRMD algorithm uses the Pearson correlation coefficient to measure the relevance between a feature subset and the target class, and uses several distance functions to measure the redundancy of each feature subset. The redundancy between features is characterized by distance, which here combines the Euclidean distance ED, the Cosine distance COS and the Tanimoto coefficient TC; the larger the distance, the lower the redundancy between features.

Based on this theory, the MRMD algorithm selects features from the feature set according to $\max(MR_i + MD_i)$, where $MR_i$ denotes the Pearson coefficient between the class and the feature of the i-th protein, and $MD_i$ denotes the Euclidean distance between the features of the i-th protein. The value of $\max MR_i$ is calculated from the Pearson coefficient, and the value of $\max MD_i$ is calculated from the distances $ED_i$, $COS_i$ and $TC_i$,

wherein $PCC(\cdot)$ represents the Pearson coefficient; $F_i$ represents the feature vector of the i-th protein; $C_i$ represents the class vector of the i-th protein; $M$ is the feature dimension of the protein; $S_{F_i C_i}$ represents the covariance of all elements in $F_i$ and $C_i$; $S_{F_i}$ represents the standard deviation of all elements in $F_i$; $S_{C_i}$ represents the standard deviation of all elements in $C_i$; $f_k$ represents the k-th element of $F_i$; $c_k$ represents the k-th element of $C_i$; $N$ is the number of elements in $F_i$ and $C_i$; $\bar{f}$ is the average value of all elements in $F_i$; $\bar{c}$ is the average value of all elements in $C_i$; $ED_i$ represents the Euclidean distance between the features of the i-th protein; $COS_i$ represents the Cosine distance between the features of the i-th protein; and $TC_i$ represents the Tanimoto coefficient between the features of the i-th protein.
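The formulas themselves are not reproduced in this text. A reconstruction consistent with the symbol definitions above would read as follows. It assumes the standard sample Pearson coefficient and the commonly published MRMD formulation; in particular, the absolute value, the $1/(N-1)$ normalization, the averaging over the other $M-1$ features and the equal weighting of the three distances are assumptions rather than statements taken from this document.

```latex
\begin{align}
\max MR_i &= \left| PCC(F_i, C_i) \right|, \quad 1 \le i \le M \\
PCC(F_i, C_i) &= \frac{S_{F_i C_i}}{S_{F_i}\, S_{C_i}} \\
S_{F_i C_i} &= \frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})(c_k - \bar{c}) \\
S_{F_i} &= \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})^2}, \qquad
S_{C_i} = \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (c_k - \bar{c})^2} \\
\bar{f} &= \frac{1}{N} \sum_{k=1}^{N} f_k, \qquad
\bar{c} = \frac{1}{N} \sum_{k=1}^{N} c_k \\
ED_i &= \frac{1}{M-1} \sum_{j \ne i} ED(F_i, F_j), \quad
COS_i = \frac{1}{M-1} \sum_{j \ne i} COS(F_i, F_j), \quad
TC_i = \frac{1}{M-1} \sum_{j \ne i} TC(F_i, F_j) \\
\max MD_i &= \frac{1}{3} \left( ED_i + COS_i + TC_i \right)
\end{align}
```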

S5, adopting XGBoost as the learner and carrying out hyper-parameter optimization;

step S5 includes the following sub-steps S51-S54:

S51, initializing the XGBoost parameters:

learning_rate = 0.1; n_estimators = 200; max_depth = 5; min_child_weight = 1; gamma = 0; subsample = 0.8; colsample_bytree = 0.8.

S52, taking one of the initial parameters as the variable and selecting a tuning range for it while keeping the other parameters unchanged, and using the cross validation built into XGBoost to search iteratively for the optimal value of that parameter;

S53, repeating step S52 until the optimal value has been found for every parameter;

S54, obtaining the optimal parameters and putting XGBoost into training.
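The one-parameter-at-a-time search of sub-steps S51 to S53 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: `tune_one_at_a_time` and `cv_score` are hypothetical names, the candidate grids are invented for the example, and `cv_score` stands in for a cross-validated score of XGBoost (for instance a wrapper around `xgboost.cv`), replaced here by a toy function so that the sketch runs without any third-party package.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]

# Initial values from sub-step S51.
INITIAL_PARAMS: Params = {
    "learning_rate": 0.1,
    "n_estimators": 200,
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}


def tune_one_at_a_time(
    initial: Params,
    grids: Dict[str, List[float]],
    cv_score: Callable[[Params], float],
) -> Tuple[Params, float]:
    """Vary one parameter over its grid while the others stay fixed (S52),
    keep the best value, then move on to the next parameter (S53)."""
    best = dict(initial)
    best_score = cv_score(best)
    for name, candidates in grids.items():
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            score = cv_score(trial)
            if score > best_score:  # higher score is better
                best_score = score
                best[name] = value
    return best, best_score


def toy_cv_score(p: Params) -> float:
    # Hypothetical stand-in for a cross-validated accuracy; it peaks at
    # max_depth = 7 and learning_rate = 0.05.
    return -((p["max_depth"] - 7) ** 2) - 10 * (p["learning_rate"] - 0.05) ** 2


# Hypothetical tuning ranges, chosen only for illustration.
GRIDS = {"max_depth": [3, 5, 7, 9], "learning_rate": [0.01, 0.05, 0.1]}
best_params, best_value = tune_one_at_a_time(INITIAL_PARAMS, GRIDS, toy_cv_score)
```

Tuning parameters one at a time costs far fewer evaluations than a full grid, at the price of possibly missing interactions between parameters.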

S6, inputting the feature vector set obtained in the S4 into a classifier for classification training to obtain a trained classifier model;

and S7, inputting the data set to be detected into the trained classifier model to obtain a classification result, and completing the identification of the vesicle transport protein.

The recognition effect of the present invention is further described below with a set of specific experimental examples.

First, we compare the AATP algorithm with other feature extraction methods based on the position specificity score matrix in terms of their recognition effect on vesicle transport proteins, as shown in fig. 2. The evaluation indexes are accuracy (ACC), sensitivity (SN), specificity (SP) and the Matthews correlation coefficient (MCC).
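The formulas for the four indexes are not reproduced in this text. The sketch below computes them from the confusion-matrix counts using the standard definitions, which are assumed here to be the ones intended; the function name `binary_metrics` is hypothetical.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard ACC, SN, SP and MCC from true/false positive/negative counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


scores = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

MCC is the most informative of the four on imbalanced data, because it uses all four counts and stays near 0 for a classifier that simply predicts the larger class.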

As can be seen from fig. 2, the AATP algorithm achieves a better classification effect than the other algorithms. The AATP algorithm can effectively extract information from the position specificity score matrix, thereby improving the recognition effect for vesicle transport proteins.

Next, the different imbalance processing methods are compared. The invention uses seven imbalance processing methods in total: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek; the comparison results are shown in figure 3. As can be seen from fig. 3, ENN performs best. The ENN algorithm cleans the data of whichever class, positive or negative, has the larger number of samples, so as to retain a representative sample set. Subsequent experiments use the ENN algorithm for imbalance processing.
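The cleaning rule behind ENN can be illustrated with a small pure-Python sketch. It is not the imbalanced-learn implementation: the function name is hypothetical, and the settings used here (k = 3 neighbours, Euclidean distance, a majority-class sample is kept only if all of its neighbours share its label) are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def edited_nearest_neighbours(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 3
) -> Tuple[List[Sequence[float]], List[int]]:
    """Remove majority-class samples whose k nearest neighbours disagree."""
    majority = Counter(y).most_common(1)[0][0]
    keep: List[int] = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority samples are never removed
            continue
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        if all(y[j] == yi for j in neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: one majority sample (5.0) sits inside the minority cluster.
X_toy = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = edited_nearest_neighbours(X_toy, y_toy)
```

In the toy data, the majority sample at 5.0 is removed because its three nearest neighbours all belong to the other class, while the remaining samples are kept.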

Then, the results obtained with different settings of the MRMD3.0 algorithm adopted in the invention are compared. MRMD3.0 offers five sorting modes for the user to select: Hits-a, TrustRank, PageRank, LeaderRank and Hits-h. The comparison results of these five modes are shown in FIG. 4. As can be seen from FIG. 4, Hits-h is selected because it performs best on the various indexes.

Finally, we compare different learners: XGBoost, RF, KNN and SVM. The comparison results are shown in fig. 5. As can be seen from fig. 5, XGBoost is the best choice among the compared learners, achieving higher accuracy while remaining highly efficient, so XGBoost is adopted in the present invention.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.
