T cell receptor epitope prediction method based on multimer features

Document No.: 1339745  Publication date: 2020-07-17

Reading note: the technology "A T cell receptor epitope prediction method based on multimer features" (一种基于多连体特征的t细胞受体对应表位预测方法) was designed by 王嘉寅, 童瑶, 杨玲, 郑田, 刘涛, 李敏 and 张选平 on 2020-03-19. Main content: The invention discloses a T cell receptor epitope prediction method based on multimer features. The CDR3 β chain and the corresponding epitope are parsed into fragments of length 3, and the frequency of each triplet is counted as the initial features. An initial feature matrix is built from these features and reduced in dimension by principal component analysis for feature extraction. Given n training samples, a gradient boosting decision tree model is trained which, for input data x, makes a prediction by linearly combining the decision results of the individual decision trees. The feature data are input into the trained model for prediction, and prediction metrics are selected according to the prediction purpose. The invention uses only triplet counts as initial features; combined with the gradient boosting decision tree model, model training completes in a very short time and prediction accuracy is higher.

1. A T cell receptor epitope prediction method based on multimer features, characterized by comprising the following steps:

S1, parsing the CDR3 β chain and the corresponding epitope into fragments of length 3 (triplets), and counting the frequency of each triplet as the initial features;

S2, building an initial feature matrix from the initial features obtained in step S1, and reducing the dimension of the initial feature matrix by principal component analysis to extract features;

S3, given n training samples, training a gradient boosting decision tree model which, for input data x, makes a prediction by linearly combining the decision results of the individual decision trees;

S4, inputting the feature data of step S2 into the model trained in step S3 for prediction, and selecting prediction metrics according to the prediction purpose.

2. The T cell receptor epitope prediction method based on multimer features of claim 1, wherein step S2 specifically comprises:

S201, denoting the initial feature matrix as $X = (x_1, x_2, \ldots, x_n)$ and centering each column of features;

S202, letting the projection of sample point $x_i$ onto the hyperplane in the new space be $W^{T}x_i$; for the projected sample points to be as separable as possible, the variance of the projected points is maximized, which determines the optimization objective;

S203, solving the optimization objective with the Lagrange multiplier method, performing an eigendecomposition of the covariance matrix $XX^{T}$ and sorting the resulting eigenvalues; then taking the eigenvectors corresponding to the first k eigenvalues to form the projection matrix W, the resulting feature matrix $W^{T}X$ being a matrix of k rows and n columns.

3. The T cell receptor epitope prediction method based on multimer features of claim 2, wherein in step S201 the m-dimensional column vector $x_1$ is:

$$x_1 = (x_{11}, x_{21}, \ldots, x_{m1})^{T}$$

wherein n is the number of training samples and m is the feature dimension.

4. The T cell receptor epitope prediction method based on multimer features of claim 2, wherein in step S202 the optimization objective is:

$$\max_{W}\ \operatorname{tr}\bigl(W^{T}XX^{T}W\bigr) \quad \text{s.t.}\ W^{T}W = I$$

where W is the transformation matrix, $W^{T}$ is the transpose of the transformation matrix, X is the initial feature matrix, and $X^{T}$ is the transpose of the initial feature matrix.

5. The T cell receptor epitope prediction method based on multimer features of claim 2, wherein in step S203 the optimization objective is solved to obtain

$$XX^{T}W = \lambda W$$

The projection matrix W is:

$$W = (w_1, w_2, \ldots, w_k)$$

wherein λ is an eigenvalue, $w_i$ is a column vector of the projection matrix, 1 ≤ i ≤ k, and the eigenvalues are ordered as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.

6. The T cell receptor epitope prediction method based on multimer features of claim 1, wherein step S3 specifically comprises:

S301, initializing the iteration counter m = 0, setting the maximum number of iterations to M, and initializing the model $f_0(x)$;

S302, in each iteration adding a decision tree to the current model, and estimating the parameter $\Theta_m$ from the residual $L(y, f_{m-1}(x))$;

S303, setting m = m + 1; if m is less than the maximum number of iterations, returning to step S302; otherwise stopping training and returning all trained decision trees, which completes the training of the epitope prediction model.

7. The T cell receptor epitope prediction method based on multimer features of claim 6, wherein in step S301 the model $f_0(x)$ is initialized as:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function, defined as:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log p_{ij}$$

wherein Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator with $y_{ij} = 1$ if class j is the true class of input instance $x_i$ and $y_{ij} = 0$ otherwise, and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.

8. The T cell receptor epitope prediction method based on multimer features of claim 6, wherein in step S302 the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

wherein $f_{m-1}(x)$ is the decision model of the (m-1)-th iteration, and the set of all residuals $\{R_{mi}\}_{i \in [1..n]}$ is used to fit a classification and regression tree.

9. The T cell receptor epitope prediction method based on multimer features of claim 8, wherein the residual $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\bigl(y_i,\ f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\bigr)$$

the negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$

where i is the index of the training sample.

10. The T cell receptor epitope prediction method based on multimer features of claim 6, wherein, in step S303, the final model is:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$

wherein $f_M(x)$ is the final ensemble model composed of M decision trees, M is the number of epitope classes, $\beta_m$ is the weight of the m-th decision tree, T is the decision tree, x is the input of the decision tree T, and $\Theta_m$ are the parameters of the decision tree.

Technical Field

The invention belongs to the field of data science applied to precision medicine, and in particular relates to a T cell receptor epitope prediction method based on multimer features.

Background

Specific binding of the T cell receptor (TCR) to an epitope presented by the major histocompatibility complex (MHC), i.e. the peptide-MHC complex (pMHC), activates the immune system and triggers a series of specific immune responses. Immunotherapy builds on this specificity: by developing suitable agents, the immune system is artificially activated so that the body's immune system works again to eliminate invaders or cancer cells. Predicting the epitope corresponding to a TCR can therefore provide an important theoretical basis for research on disease mechanisms, cancer immunotherapy, drug development, vaccine manufacture and related fields.

Although next-generation sequencing (NGS) provides a huge number of nucleotide and amino acid sequences, labeling is costly and time-consuming, and labeled data are still scarce. If a reasonably reliable prediction model can be trained from the small amount of labeled data currently available, it can be applied to the TCR epitope labeling problem and save a great deal of time and money. In addition, because the gene segments of the TCR are produced by a series of non-homologous recombination events that combine variable (V), diversity (D) and joining (J) gene segments at the TCR loci, with random nucleotide insertions and/or deletions, a very large number of different TCRs can be generated, on a scale of 10¹⁵ to 10⁶¹. Furthermore, owing to cross-reactivity, one TCR can recognize multiple epitopes, and one epitope can be recognized by multiple TCRs. It is difficult to find the matching patterns of TCR and pMHC in such data manually or by simple statistics; studying the specific binding mechanism of TCR and pMHC with machine learning algorithms would be of great significance for immunotherapy.

The TCR contains the complementarity-determining regions CDR1, CDR2, CDR2.5 and CDR3, and antigen-specific recognition depends on these CDR regions. The CDR3 region has the highest diversity and mainly binds the epitope peptide chain; CDR1, CDR2 and CDR2.5 mainly bind the MHC molecule but can also bind the peptide chain.

At present, researchers at home and abroad have tried to study the relationship between CDR3 and epitope data. The methods fall roughly into two types. The first type defines a similarity measure between TCR or CDR3 sequences and, once the pairwise similarities are obtained, classifies with a simple classifier such as the K-nearest neighbor (K-NN) algorithm. The second type extracts physicochemical features of the amino acids from the TCR or CDR3 sequence, or encodes the amino acid sequence with a BLOSUM matrix, and then trains a machine learning model to obtain a prediction model.

However, the prediction performance of both types of method is unsatisfactory, and the main problems are as follows. First, the first type requires the similarity between every pair of TCR sequences, so the time complexity of the similarity computation is O(n²) and training is time consuming. Second, the second type is essentially based on amino acid encoding; since different CDR3 sequences do not necessarily have the same length, alignment is required to ensure that the feature vectors of all TCR sequences have the same dimension. Third, the first type mainly considers the overall similarity of two TCR sequences and the second mainly considers the information of each individual amino acid in the sequence; neither takes into account the role that information from adjacent amino acids in the TCR sequence plays in the specific recognition between TCR and epitope.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the above shortcomings of the prior art, a T cell receptor epitope prediction method based on multimer features, which avoids cumbersome feature extraction, overcomes time-consuming model training so that training completes in a short time, and can perform multi-class prediction directly.

The invention adopts the following technical scheme:

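A T cell receptor epitope prediction method based on multimer features comprises the following steps:

S1, parsing the CDR3 β chain and the corresponding epitope into fragments of length 3 (triplets), and counting the frequency of each triplet as the initial features;

S2, building an initial feature matrix from the initial features obtained in step S1, and reducing the dimension of the initial feature matrix by principal component analysis to extract features;

S3, given n training samples, training a gradient boosting decision tree model which, for input data x, makes a prediction by linearly combining the decision results of the individual decision trees;

S4, inputting the feature data of step S2 into the model trained in step S3 for prediction, and selecting prediction metrics according to the prediction purpose.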

Specifically, step S2 comprises:

S201, denoting the initial feature matrix as $X = (x_1, x_2, \ldots, x_n)$ and centering each column of features;

S202, letting the projection of sample point $x_i$ onto the hyperplane in the new space be $W^{T}x_i$; for the projected sample points to be as separable as possible, the variance of the projected points is maximized, which determines the optimization objective;

S203, solving the optimization objective with the Lagrange multiplier method, performing an eigendecomposition of the covariance matrix $XX^{T}$ and sorting the resulting eigenvalues; then taking the eigenvectors corresponding to the first k eigenvalues to form the projection matrix W, the resulting feature matrix $W^{T}X$ being a matrix of k rows and n columns.

Further, in step S201, the m-dimensional column vector $x_1$ is:

$$x_1 = (x_{11}, x_{21}, \ldots, x_{m1})^{T}$$

wherein n is the number of training samples and m is the feature dimension.

Further, in step S202, the optimization objective is:

$$\max_{W}\ \operatorname{tr}\bigl(W^{T}XX^{T}W\bigr) \quad \text{s.t.}\ W^{T}W = I$$

where W is the transformation matrix, $W^{T}$ is the transpose of the transformation matrix, X is the initial feature matrix, and $X^{T}$ is the transpose of the initial feature matrix.

Further, in step S203, the optimization objective is solved to obtain

$$XX^{T}W = \lambda W$$

The projection matrix W is:

$$W = (w_1, w_2, \ldots, w_k)$$

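Specifically, step S3 comprises:

S301, initializing the iteration counter m = 0, setting the maximum number of iterations to M, and initializing the model $f_0(x)$;

S302, in each iteration adding a decision tree to the current model, and estimating the parameter $\Theta_m$ from the residual $L(y, f_{m-1}(x))$;

S303, setting m = m + 1; if m is less than the maximum number of iterations, returning to step S302; otherwise stopping training and returning all trained decision trees, which completes the training of the epitope prediction model.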

Further, in step S301, the model $f_0(x)$ is initialized as:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function, defined as:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log p_{ij}$$

Further, in step S302, the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

wherein $f_{m-1}(x)$ is the decision model of the (m-1)-th iteration, and the set of all residuals $\{R_{mi}\}_{i \in [1..n]}$ is used to fit a classification and regression tree.

Further, the residual $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\bigl(y_i,\ f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\bigr)$$

The negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$

where i is the index of the training sample.

Further, in step S303, the final model is:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$

wherein $f_M(x)$ is the final ensemble model composed of M decision trees, M is the number of epitope classes, $\beta_m$ is the weight of the m-th decision tree, T is the decision tree, x is the input of the decision tree T, and $\Theta_m$ are the parameters of the decision tree.

Compared with the prior art, the invention has at least the following beneficial effects:

The invention is a method for predicting TCR epitopes based on combined multimer features of the TCR sequence. The CDR3 β sequences are scanned one by one and each polypeptide chain is parsed into consecutive short peptide chains of length 3; the number of occurrences of each triplet is counted, the counts form the initial feature matrix, and the epitope corresponding to each CDR3 β sequence serves as the class label.

Furthermore, principal component analysis is used for feature transformation, reducing the dimensionality of the features.

Further, the feature matrix is input into a gradient boosting decision tree (GBDT) for training, the optimal model parameters are obtained through grid search, and multiple decision trees are finally obtained.

Furthermore, the test data are encoded by the same method, the test feature matrix is input into the model, and the sum of the prediction results of all decision trees is taken as the final prediction.

In summary, the method uses only triplet counts as initial features and, combined with the gradient boosting decision tree model, completes model training in a very short time with higher prediction accuracy.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 shows the feature matrix obtained after feature selection on the TCR data of Dash et al.;

FIG. 2 is a schematic flow chart of the present invention;

FIG. 3 shows comparison results of different models in a Dash dataset;

FIG. 4 shows the results of a multi-classification ROC curve performed in the Dash dataset.

Detailed Description

The invention provides a TCR epitope prediction method based on adjacent amino acid information in the TCR sequence, named SETE (Sequence-based Ensemble learning approach for TCR Epitope binding prediction). The training data consist of CDR3 β sequences and the corresponding polypeptide chains they specifically recognize; the test data consist of CDR3 β sequences.

The method is based on the following general consensus in academia:

1. the CDR3 region of the TCR sequence interacts clearly with MHC-presented polypeptide chains, and the β chain of this region contributes significantly to peptide recognition;

2. the number of amino acids constituting human proteins is 20.

Referring to FIG. 2, the T cell receptor epitope prediction method based on multimer features of the invention comprises the following steps:

S1, initial feature extraction

Because the input amino acid sequence cannot be used directly as features, it is parsed into fragments of length 3, and the frequency of each triplet is counted as the initial features. After feature selection on these initial features, a certain similarity can be observed among the TCR sequence features corresponding to different epitope classes; the feature matrix obtained after feature selection on the TCR data of Dash et al. is shown in FIG. 1.
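The triplet counting of step S1 can be sketched as follows. This is a minimal illustration, not the inventors' implementation: it assumes that triplets are taken with a sliding window that advances one residue at a time, that only the 20 standard amino acids occur, and the example sequence is hypothetical. With 20 amino acids there are 20³ = 8000 possible triplets, which fixes the length of the feature vector.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every possible triplet in a fixed order, so all sequences share one feature layout.
ALL_TRIPLETS = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def triplet_counts(sequence):
    """Count triplets using a window of width 3 that slides one residue at a time.

    The overlapping window is an assumption made for this sketch.
    """
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))


def triplet_feature_vector(sequence):
    """Return the 8000-dimensional count vector of one sequence."""
    counts = triplet_counts(sequence)
    return [counts.get(t, 0) for t in ALL_TRIPLETS]


# Hypothetical CDR3 beta sequence, used only for illustration.
example = "CASSLGQAYEQYF"
vector = triplet_feature_vector(example)
```

Because the vector layout is fixed by `ALL_TRIPLETS`, sequences of different lengths map to vectors of the same dimension without any alignment step.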

S2, feature extraction

On the one hand, since there are 20 amino acids in total, a short chain of 3 amino acids has at most 20³ = 8000 combinations, so the features can reach up to 8000 dimensions and feature screening is needed to reduce the dimension. On the other hand, because similar TCR sequences resemble each other, there may be redundant information among the triplets of TCR sequences of the same class. Principal component analysis is therefore used to reduce the dimension of the data, as follows:

S201, denote the initial feature matrix as $X = (x_1, x_2, \ldots, x_n)$ and center each column of features;

The m-dimensional column vector $x_1$ is:

$$x_1 = (x_{11}, x_{21}, \ldots, x_{m1})^{T}$$

where n is the number of training samples and m is the feature dimension;

S202, let the projection of sample point $x_i$ onto the hyperplane in the new space be $W^{T}x_i$; for the projected sample points to be as separable as possible, the variance of the projected points is maximized, which gives the optimization objective:

$$\max_{W}\ \operatorname{tr}\bigl(W^{T}XX^{T}W\bigr) \quad \text{s.t.}\ W^{T}W = I$$

where W is the transformation matrix, $W^{T}$ is the transpose of the transformation matrix, X is the initial feature matrix, and $X^{T}$ is the transpose of the initial feature matrix.

S203, solve the optimization objective with the Lagrange multiplier method to obtain $XX^{T}W = \lambda W$; perform an eigendecomposition of the covariance matrix $XX^{T}$ and sort the resulting eigenvalues: $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$; then take the eigenvectors corresponding to the first k eigenvalues to form the projection matrix W; the resulting feature matrix $W^{T}X$ is a matrix with k rows and n columns;

The projection matrix W is:

$$W = (w_1, w_2, \ldots, w_k)$$

where λ is an eigenvalue and $w_i$ is a column vector of the projection matrix, 1 ≤ i ≤ k.
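The step from variance maximization to the eigenvalue equation can be spelled out as follows. This is the standard PCA derivation, written under the assumptions that X has already been centered, that the constraint is $W^{T}W = I$, and that the multiplier matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$ is diagonal; the symbol $\Lambda$ is introduced here for illustration only.

```latex
\begin{aligned}
&\max_{W}\ \operatorname{tr}\!\left(W^{T} X X^{T} W\right)
  \quad \text{s.t.}\quad W^{T} W = I \\[4pt]
&\mathcal{L}(W, \Lambda) = \operatorname{tr}\!\left(W^{T} X X^{T} W\right)
  - \operatorname{tr}\!\left(\Lambda \left(W^{T} W - I\right)\right) \\[4pt]
&\frac{\partial \mathcal{L}}{\partial W} = 2\, X X^{T} W - 2\, W \Lambda = 0
  \ \Longrightarrow\ X X^{T} w_i = \lambda_i w_i, \quad i = 1, \ldots, k \\[4pt]
&\operatorname{tr}\!\left(W^{T} X X^{T} W\right) = \sum_{i=1}^{k} \lambda_i
\end{aligned}
```

The last line shows why the eigenvectors of the k largest eigenvalues are selected: the retained variance equals the sum of the chosen eigenvalues.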

S3, epitope prediction model training

A new prediction model based on a gradient boosting decision tree is provided. Given n training samples, for input data x the gradient boosting decision tree model makes a prediction by linearly combining the decision results of all decision trees, as follows:

The n training samples are:

$$\{(x_1, y_1), \ldots, (x_n, y_n)\}$$

where $x_i$ is the feature vector of the i-th sample and $y_i$ is its epitope class label, i = 1, 2, ..., n;

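S301, model initialization

Initialize the iteration counter m = 0, set the maximum number of iterations to M, and initialize the model $f_0(x)$ as:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function, defined as:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log p_{ij}$$

where Y is the output variable, X is the input variable, M is the number of epitope classes, $y_{ij}$ is a binary indicator equal to 1 if class j is the true class of input instance $x_i$ and 0 otherwise, and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.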
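To make the loss and the constant initial model concrete, the sketch below computes the multi-class log-likelihood loss and the constant prediction that minimizes it. It is an illustration under stated assumptions: class labels are integer indices, and the constant is expressed as a probability vector (the class frequencies); how an actual implementation parameterizes the constant c, for example as probabilities or as log-odds, is not specified in the text.

```python
import math


def log_loss(y_true, probs):
    """Mean multi-class log-likelihood loss.

    y_true: list of class indices, one per sample.
    probs:  probs[i][j] is the predicted probability that sample i has class j.
    Because y_ij is one-hot, only the probability of the true class contributes.
    """
    eps = 1e-15  # guards against log(0)
    total = 0.0
    for label, p in zip(y_true, probs):
        total += -math.log(max(p[label], eps))
    return total / len(y_true)


def constant_prior(y_true, n_classes):
    """Class frequencies: the constant probability vector with the lowest log loss."""
    n = len(y_true)
    return [sum(1 for y in y_true if y == j) / n for j in range(n_classes)]
```

For example, with labels `[0, 0, 1, 2]` the prior is `[0.5, 0.25, 0.25]`, and predicting it for every sample gives a lower loss than predicting the uniform vector `[1/3, 1/3, 1/3]`.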

S302, model iteration

Each iteration adds a decision tree to the current model; the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

where $f_{m-1}(x)$ is the decision model of the (m-1)-th iteration, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\bigl(y_i,\ f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\bigr)$$

Since the basis functions are linearly additive, the goal is to estimate the parameter $\Theta_m$ from the residual $L(y, f_{m-1}(x))$.

For this purpose, the negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$

where i is the index of the training sample.

The set of all residuals $\{R_{mi}\}_{i \in [1..n]}$ is used to fit a classification and regression tree (CART) and solve for the parameter $\Theta_m$.

S303, set m = m + 1; if m is less than the maximum number of iterations, return to step S302; otherwise stop training and return all trained decision trees.
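The loop of steps S301 to S303 can be sketched in a few lines. The version below is a deliberately reduced illustration, not the model described above: it handles binary labels instead of multiple epitope classes, uses a single scalar feature, fits depth-1 regression trees (stumps), and replaces the per-tree weight β_m with one fixed value `beta`. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold) to the residuals by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every value except the largest is a candidate split
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all inputs identical: fall back to the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def train_gbdt(xs, ys, max_iter=20, beta=0.5):
    """Gradient boosting with log loss for labels in {0, 1}; both classes must occur."""
    pos = sum(ys) / len(ys)
    f0 = math.log(pos / (1.0 - pos))  # S301: constant score minimizing the log loss
    scores = [f0] * len(xs)           # current values f_{m-1}(x_i)
    trees = []
    m = 0
    while m < max_iter:
        # S302: negative gradient of the log loss w.r.t. the score is y_i - p_i
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        scores = [s + beta * tree(x) for s, x in zip(scores, xs)]
        m += 1                        # S303
    return f0, beta, trees


def predict_proba(model, x):
    """Probability of the positive class from the linear combination of all trees."""
    f0, beta, trees = model
    return sigmoid(f0 + beta * sum(tree(x) for tree in trees))
```

A multi-class version would keep one score per epitope class and replace the sigmoid with a softmax; that extension is omitted here to keep the sketch short.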

S4, epitope prediction

The initial features are computed and feature extraction is applied to the test data by the same method; the resulting data are input into the trained model for prediction, and prediction metrics are selected according to the prediction purpose.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention discloses a T cell receptor epitope prediction method based on multimer features, which addresses the long training time and unsatisfactory prediction results of existing algorithms.

Because no existing model can directly handle multi-class TCR epitope prediction, the binary classification performance is tested first in order to verify the effectiveness of the invention. Since the existing method TCRGP uses the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) as model evaluation metrics, AUC is also used to evaluate the prediction performance of the invention.

In addition, the running times of the two models on the same dataset are compared. A multi-class prediction test is then carried out; because the sample sizes of the classes are unbalanced and ROC is less affected by class imbalance, ROC and AUC are again used to measure model performance. The metrics are defined in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).

The false positive rate is defined as FPR = FP/(FP + TN).

The true positive rate is defined as TPR = TP/(TP + FN).

The ROC curve is plotted from the FPR and TPR values at different thresholds, and AUC is the area under the ROC curve.
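The definitions above translate directly into code. The following sketch sweeps the decision threshold over the distinct scores, computes one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule; the function names are hypothetical, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strictest to loosest.

    labels: 1 for positive, 0 for negative; scores: higher means more positive.
    """
    pos = sum(labels)             # TP + FN
    neg = len(labels) - pos       # FP + TN
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, the curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an AUC of 0.75.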

Tests were performed on the public dataset VDJdb. After screening, data for 22 epitope classes were selected from VDJdb. Because existing models can only handle binary classification, binary tests are performed first for comparison with other models. In each binary task, all positive examples are used, and an equal number of TCR sequences is randomly sampled from the other classes as negative examples. The binary classification results are shown in Table 1.
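The construction of one balanced binary task can be sketched as follows. The record layout `(cdr3_sequence, epitope)`, the function name and the fixed seed are assumptions made for illustration; sampling without replacement is also an assumption, since the text only says that negatives are sampled randomly in equal number.

```python
import random


def build_binary_task(records, target_epitope, seed=0):
    """Return (sequence, label) pairs for one epitope-versus-rest task.

    records: list of (cdr3_sequence, epitope) pairs.
    Every sequence of target_epitope becomes a positive (label 1); an equal
    number of sequences from the other epitopes is sampled as negatives (label 0).
    """
    positives = [seq for seq, ep in records if ep == target_epitope]
    others = [seq for seq, ep in records if ep != target_epitope]
    rng = random.Random(seed)                # fixed seed for reproducibility
    k = min(len(positives), len(others))     # cannot sample more than is available
    negatives = rng.sample(others, k)
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]
```

Balancing the two classes this way keeps the positive rate at one half in every task, so results remain comparable across epitopes with very different sample counts.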

Table 1: Binary classification results of SETE vs. TCRGP (*: FRDYVDRFYKTLRAEQASQE)

As the table shows, in the binary task the prediction performance of the invention is comparable to that of the existing method TCRGP, while training time is greatly reduced, which is especially evident on datasets with large data volume.

In the multi-classification task, a series of experiments was also carried out to verify effectiveness. The ROC curve is used as the evaluation metric, and a OneVsRest strategy is used to draw the multi-class ROC curves: one classifier is trained per class, each classifier treats the TCR sequences of its own class as positive examples and the TCR sequences of all other classes as negative examples, and finally the outputs of the ten classifiers are voted to obtain the final classification result. The results obtained using five-fold cross-validation are shown in Table 2.
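The OneVsRest strategy can be illustrated with two small helpers. The first builds the binary targets seen by the classifier of one class; the second combines the per-class outputs. Choosing the class with the highest score is an assumption made here as one common way to combine one-vs-rest classifiers; the text only states that the outputs are voted on. Function names and class names are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Binary targets for one classifier: 1 for its own class, 0 for all others."""
    return [1 if y == positive_class else 0 for y in labels]


def combine_one_vs_rest(class_scores):
    """Pick the class whose classifier gives the highest score.

    class_scores: dict mapping class name to the score of that class's classifier.
    """
    return max(class_scores, key=class_scores.get)
```

The binary targets from `one_vs_rest_labels` are also what a per-class ROC curve is computed from, which is how a multi-class model can be evaluated with ROC and AUC.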

Table 2: multi-classification prediction results of SETE on VDJdb dataset

To further validate the ability of the invention to predict TCR epitopes, tests were performed on the dataset published by Dash et al., which contains epitope data for 3 human and 7 mouse epitope classes.

Since the model is more suitable for multi-classification tasks, multi-classification tests are first performed on the data set, the ROC curve and the AUC result are used to evaluate the model effect, and the multi-classification results in the Dash data set are shown in Table 3.

Table 3: SETE multiple classification results in Dash dataset

As the table shows, SETE performs well overall on the multi-class problem, while the prediction for individual epitopes, such as pp65, is poor, which may be related to the small amount of data for those epitopes. The comparison of the SETE multi-class results with TCRGP and TCRdist is shown in FIG. 3, where the x-axis represents the prediction models and the y-axis the area under the ROC curve of each model. In addition, multi-class ROC curves were plotted separately for the human and mouse data, as shown in FIG. 4, where the x-axis represents the false positive rate and the y-axis the true positive rate.

Binary classification tests were also performed on the Dash dataset; the results are shown in Table 4.

Table 4: Comparison of binary classification results of SETE and TCRGP on the Dash dataset

As with the previous results, SETE completes training in a very short time and its prediction accuracy is better than that of the TCRGP model.

In conclusion, compared with the existing method TCRGP, the method completes model training in a shorter time, and its performance in the binary task is better than that of the existing method. In addition, the method can be applied directly to multi-classification tasks with high prediction accuracy.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.
