Characteristic mRNA expression profile combination and lung squamous cell carcinoma early prediction method

Document No.: 1083468. Publication date: 2020-10-20.

Abstract: This technology, "Characteristic mRNA expression profile combination and lung squamous cell carcinoma early prediction method", was designed by 高跃东 and 李文兴 on 2020-08-04. The invention discloses a characteristic mRNA expression profile combination and an early prediction method for lung squamous cell carcinoma, wherein the mRNA nucleotide probe sequences are shown in SEQ ID NO. 1-20. Assessing the early risk of lung squamous cell carcinoma from the combined mRNA expression profile features achieves high precision and accuracy (area under the ROC curve, AUC = 0.994). Only the relative expression levels of the 20 mRNAs need to be obtained; a support vector machine model then calculates the probability of early lung squamous cell carcinoma, which can serve as a reference for early prediction.

1. A combination of characteristic mRNA expression profiles comprising A2M, ALDH2, ASAH1, CAT, CD55, CD74, CENPH, CTSH, CTSO, ECT2, HSD17B11, KANK2, MYL9, PECAM1, PERP, RFC4, SERPING1, SLC2A1, TGFBR2 and VIM, the nucleotide probe sequences of which are shown in SEQ ID No. 1-20.

2. A method for early prediction of squamous cell lung carcinoma based on the combination of characteristic mRNA expression profiles according to claim 1, comprising the steps of:

step 1, obtaining characteristic mRNA stably and differentially expressed by patients with early lung squamous carcinoma;

step 2, selecting characteristic mRNA expression data, and carrying out data standardization on each sample;

step 3, constructing an early prediction model for the standardized data by using a support vector machine;

step 4, performing early prediction according to the expression levels of the characteristic mRNAs of the patient;

the method is used for purposes other than the diagnosis and treatment of disease.

3. The method for predicting early lung squamous carcinoma according to claim 2, wherein the characteristic mRNA stably and differentially expressed by the patient with early lung squamous carcinoma obtained in the step 1 comprises:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of lung squamous cell carcinoma patients from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read counts) of the tumor tissue gene expression profiles of the patients, and performing logarithmic transformation;

step 1.2, selecting mRNAs with sufficient expression abundance, namely mRNAs whose read counts are greater than or equal to 10 in all samples; taking the logarithm of the read counts of all mRNAs, letting n be the total number of samples, m the total number of screened mRNAs, v the read counts of an mRNA, and u the expression value after taking the logarithm, then:

$$ u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \tag{1} $$

wherein i is the sample number, j is the mRNA number, $u_{ij}$ is the log-transformed expression value of the j-th mRNA in the i-th sample, and $v_{ij}$ is the read counts of the j-th mRNA in the i-th sample;

step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting mRNAs stably expressed in the tumor samples and the normal samples, namely mRNAs whose coefficient of variation is less than 0.1 in the tumor samples and the normal samples; letting μ be the mean expression of an mRNA over all samples and σ its standard deviation, the coefficient of variation is calculated as:

$$ cv_j = \frac{\sigma_j}{\mu_j} \tag{2} $$

wherein j is the mRNA number, cv is the coefficient of variation, $cv_j$ is the coefficient of variation of the j-th mRNA, $\sigma_j$ is the standard deviation of the j-th mRNA, and $\mu_j$ is the mean expression of the j-th mRNA; letting $m_1$ be the total number of stably expressed mRNAs, then:

$$ m_1 = m\{cv_j < 0.1\}, \quad j \in (1, m) \tag{3} $$

step 1.5, selecting mRNAs differentially expressed between the tumor samples and the normal samples; the log-transformed expression values are used to calculate the log2 fold change f of each mRNA between the tumor and normal samples:

$$ f_j = \mu_{1j} - \mu_{2j} \tag{4} $$

wherein j is the mRNA number, $f_j$ is the fold change of the j-th mRNA, $\mu_{1j}$ is the mean expression of the j-th mRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th mRNA in the normal samples;

the difference in mRNA expression between the tumor and normal samples is then compared using an independent-samples t-test:

[Equation (5): t statistic of the independent-samples t-test; formula image not recovered]

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean mRNA expression of the tumor samples, $\mu_2$ is the mean mRNA expression of the normal samples, $\sigma_1^2$ is the mRNA variance of the tumor samples, and $\sigma_2^2$ is the mRNA variance of the normal samples;

the p values obtained from all t-tests are corrected using the false discovery rate (FDR); letting q be the FDR-corrected value and r the rank (sorted position) of a p value among the $m_1$ mRNAs, then:

$$ q_j = \frac{p_j \, m_1}{r_j} \tag{6} $$

wherein j is the mRNA number, $q_j$ is the FDR-corrected value of the j-th mRNA, $p_j$ is the t-test p value of the j-th mRNA, and $r_j$ is the rank of the p value of the j-th mRNA among the $m_1$ mRNAs;

finally, mRNAs with an absolute fold change |f| of at least 1 and an FDR-corrected q value less than or equal to 0.05 are selected and denoted as characteristic mRNAs; letting $m_2$ be the total number of characteristic mRNAs, then:

$$ m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \tag{7} $$

4. The method for early prediction of lung squamous cell carcinoma according to claim 2, wherein in step 2 the characteristic mRNA expression data are selected and each sample is standardized by the formula:

$$ u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8} $$

wherein i is the sample number and j is the characteristic mRNA number; $\mu_i$ is the mean expression of all characteristic mRNAs of the i-th sample, $\sigma_i$ is the standard deviation of all characteristic mRNAs of the i-th sample, $u_{ij}$ is the log-transformed characteristic mRNA expression value, and $u'_{ij}$ is the standardized mRNA value.

5. The method for early predicting squamous cell lung carcinoma according to claim 2, wherein the step 3 uses a support vector machine to construct an early prediction model for the normalized data, specifically:

step 3.1, grouping all samples: 80% of all samples are assigned to the training set + validation set, and the remaining 20% to the test set; the training set + validation set is used for 5-fold cross-validation, namely it is divided into 5 equal groups, each group serving in turn as the validation set while the other 4 groups serve as the training set; for given parameters, the training set is used to construct the model and the validation set is used to check the accuracy of the model;

step 3.2, screening the optimal parameters: the parameter gamma of the SVM controls the width of the Gaussian kernel, and C is the regularization parameter, which limits the importance of each point; the parameter grid is set as:

gamma = [0.001, 0.01, 0.1, 1, 10, 100] (9)

C = [0.001, 0.01, 0.1, 1, 10, 100] (10)

in the cross-validation, a model is constructed for each pairwise combination of gamma and C in turn, and the accuracy of the model is then checked with the validation set; for each parameter combination, each validation round of the 5-fold cross-validation yields 1 accuracy, so the 5 rounds yield 5 accuracies; the parameter combination with the highest mean accuracy over the 5 rounds is selected as the optimal parameters;

step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model with the test set, the evaluation indices comprising accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic (ROC) curve (AUC); in the test set, the number of tumor samples predicted as tumor is defined as true positive (TP), the number of normal samples predicted as tumor as false positive (FP), the number of tumor samples predicted as normal as false negative (FN), and the number of normal samples predicted as normal as true negative (TN); the evaluation indices are calculated as:

$$ \text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} $$

$$ \text{precision} = \frac{TP}{TP + FP} $$

$$ \text{recall} = \frac{TP}{TP + FN} $$

$$ \text{specificity} = \frac{TN}{TN + FP} $$

$$ F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$

$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

among the above evaluation indices, accuracy, precision, recall, specificity, F1 score and AUC take values between 0 and 1. The higher the accuracy, the better the overall prediction performance of the model; higher precision indicates fewer type I errors; higher recall indicates fewer type II errors; high specificity indicates that few negative samples are misclassified as positive; the F1 score is a composite index, the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and takes values between -1 and 1, where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates a higher probability that the classifier ranks a positive instance above a negative one. Therefore, the closer the above indices are to 1, the better the overall prediction performance of the model;

step 3.4, if the evaluation indexes are all larger than 0.9, the model has a better prediction effect; the final prediction model is constructed with the optimal parameter combinations using all the data.

6. The method for early prediction of lung squamous cell carcinoma according to claim 2, wherein the early prediction in step 4 according to the expression levels of the characteristic mRNAs of the patient specifically comprises:

step 4.1, standardizing the characteristic mRNA expression data of the sample to be predicted; letting u be the characteristic mRNA expression value of the sample, μ the mean of its characteristic mRNA expression values and σ their standard deviation, then:

$$ u'_j = \frac{u_j - \mu}{\sigma} $$

wherein j is the characteristic mRNA number and $u'_j$ is the standardized mRNA value;

step 4.2, substituting the standardized mRNA values of the sample to be predicted into the final prediction model for prediction; a prediction result of 1 indicates lung squamous cell carcinoma, and a prediction result of 0 indicates normal.

Technical Field

The invention belongs to the field of biotechnology and medicine, and particularly relates to a characteristic mRNA expression profile combination and an early lung squamous cell carcinoma prediction method.

Background

Lung squamous cell carcinoma accounts for 40%-51% of primary lung cancers. It is most common in elderly men and is closely related to smoking. It usually presents as central lung cancer and tends to grow in the chest cavity, and early lung squamous cell carcinoma often causes bronchial stenosis or obstructive pneumonia. Global Burden of Disease (GBD) data show that more than 3.3 million people worldwide had tracheal, bronchial or lung cancer in 2017, of whom as many as 1.27 million were in China. In 2016 these cancers caused 1.88 million deaths worldwide, accounting for 3.37% of all deaths; in China they caused 690,000 deaths, accounting for 6.62% of all deaths. Statistics show a continuous increase in the prevalence and mortality of tracheal, bronchial and lung cancer worldwide from 1990 to 2017. In China, prevalence and mortality have risen year by year over the last decade, at a rate higher than the global average.

A support vector machine (SVM) is a generalized linear classifier that performs binary classification in a supervised learning manner; its decision boundary is the maximum-margin hyperplane fitted to the training samples. The SVM model represents instances as points in space, mapped so that instances of the different classes are separated by as wide a margin as possible. New instances are then mapped into the same space, and their class is predicted according to which side of the margin they fall on. When the training data are linearly separable, the SVM learns the classifier by hard-margin maximization. When the training data are not linearly separable, the SVM uses the kernel trick and soft-margin maximization. SVMs perform well on medium-sized data sets whose features have similar meaning, and are also suitable for small data sets. In general, an SVM predicts well on data sets with fewer than 10,000 samples. SVMs are widely applied in disease diagnosis, tumor classification, tumor gene identification and the like.
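For reference, the soft-margin problem with a Gaussian (RBF) kernel that this description corresponds to can be written as below. This is the standard textbook form, given as an illustration rather than quoted from the source; the symbols $w$, $b$, $\xi_i$, $\phi$, the labels $y_i \in \{-1, +1\}$ and the training-set size $N$ are notation introduced only for this sketch.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,

K(x, x') = \phi(x)^{\top}\phi(x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr)
```

In this form a larger C penalizes margin violations more heavily, and a larger gamma gives a narrower Gaussian kernel.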

Early diagnosis of tumors has been a difficult problem in the medical community. The existing early diagnosis methods mostly observe the expression level of a certain marker or a class of markers, and the ideal diagnosis effect is difficult to achieve. Since the expression profiles of these markers in tumor patients and normal populations partially overlap, it is difficult to define a cut-off for the markers that better separates tumor patients from normal populations. Therefore, the use of multiple marker expression signature combinations may be an effective method for early diagnosis of tumors. Messenger RNA (mRNA) is a single-stranded ribonucleic acid that is transcribed from a single strand of DNA as a template and carries genetic information that directs protein synthesis. Tumor tissues often show a large number of mRNA disorders compared to normal tissues, and studies have shown that these mRNA disorders are closely related to tumor occurrence, pathological mechanisms and prognosis status. However, it is difficult to define the critical value for early prediction due to the overlapping distribution of single mRNA molecules expressed in tumor and normal human populations.

Therefore, there is a need to establish a more stable predictive model of multiple differential mRNA expression signature combinations that facilitates early prediction of squamous cell lung carcinoma.

Disclosure of Invention

In view of the above, the present invention provides a combination of characteristic mRNA expression profiles and a method for early stage lung squamous cell carcinoma prediction, which can accurately predict stage I/II lung squamous cell carcinoma.

In order to solve the technical problems, the invention discloses a characteristic mRNA expression profile combination, which comprises A2M, ALDH2, ASAH1, CAT, CD55, CD74, CENPH, CTSH, CTSO, ECT2, HSD17B11, KANK2, MYL9, PECAM1, PERP, RFC4, SERPING1, SLC2A1, TGFBR2 and VIM, wherein the nucleotide probe sequences are shown in SEQ ID NO. 1-20.

The invention also discloses a lung squamous carcinoma early stage prediction method based on the characteristic mRNA expression profile combination, which comprises the following steps:

step 1, obtaining characteristic mRNA stably and differentially expressed by patients with early lung squamous carcinoma;

step 2, selecting characteristic mRNA expression data, and carrying out data standardization on each sample;

step 3, constructing an early prediction model for the standardized data by using a support vector machine;

step 4, performing early prediction according to the expression levels of the characteristic mRNAs of the patient;

the method is used for purposes other than the diagnosis and treatment of disease.

Optionally, obtaining the characteristic mRNAs stably and differentially expressed in patients with early lung squamous cell carcinoma in step 1 specifically comprises:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of lung squamous cell carcinoma patients from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read counts) of the tumor tissue gene expression profiles of the patients, and performing logarithmic transformation;

step 1.2, selecting mRNAs with sufficient expression abundance, namely mRNAs whose read counts are greater than or equal to 10 in all samples; taking the logarithm of the read counts of all mRNAs, letting n be the total number of samples, m the total number of screened mRNAs, v the read counts of an mRNA, and u the expression value after taking the logarithm, then:

$$ u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \tag{1} $$

wherein i is the sample number, j is the mRNA number, $u_{ij}$ is the log-transformed expression value of the j-th mRNA in the i-th sample, and $v_{ij}$ is the read counts of the j-th mRNA in the i-th sample;

step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting mRNAs stably expressed in the tumor samples and the normal samples, namely mRNAs whose coefficient of variation is less than 0.1 in the tumor samples and the normal samples; letting μ be the mean expression of an mRNA over all samples and σ its standard deviation, the coefficient of variation is calculated as:

$$ cv_j = \frac{\sigma_j}{\mu_j} \tag{2} $$

wherein j is the mRNA number, cv is the coefficient of variation, $cv_j$ is the coefficient of variation of the j-th mRNA, $\sigma_j$ is the standard deviation of the j-th mRNA, and $\mu_j$ is the mean expression of the j-th mRNA; letting $m_1$ be the total number of stably expressed mRNAs, then:

$$ m_1 = m\{cv_j < 0.1\}, \quad j \in (1, m) \tag{3} $$

step 1.5, selecting mRNAs differentially expressed between the tumor samples and the normal samples; the log-transformed expression values are used to calculate the log2 fold change f of each mRNA between the tumor and normal samples:

$$ f_j = \mu_{1j} - \mu_{2j} \tag{4} $$

wherein j is the mRNA number, $f_j$ is the fold change of the j-th mRNA, $\mu_{1j}$ is the mean expression of the j-th mRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th mRNA in the normal samples;

the difference in mRNA expression between the tumor and normal samples is then compared using an independent-samples t-test:

[Equation (5): t statistic of the independent-samples t-test; formula image not recovered]

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean mRNA expression of the tumor samples, $\mu_2$ is the mean mRNA expression of the normal samples, $\sigma_1^2$ is the mRNA variance of the tumor samples, and $\sigma_2^2$ is the mRNA variance of the normal samples;

the p values obtained from all t-tests are corrected using the false discovery rate (FDR); letting q be the FDR-corrected value and r the rank (sorted position) of a p value among the $m_1$ mRNAs, then:

$$ q_j = \frac{p_j \, m_1}{r_j} \tag{6} $$

wherein j is the mRNA number, $q_j$ is the FDR-corrected value of the j-th mRNA, $p_j$ is the t-test p value of the j-th mRNA, and $r_j$ is the rank of the p value of the j-th mRNA among the $m_1$ mRNAs;

finally, mRNAs with an absolute fold change |f| of at least 1 and an FDR-corrected q value less than or equal to 0.05 are selected and denoted as characteristic mRNAs; letting $m_2$ be the total number of characteristic mRNAs, then:

$$ m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \tag{7} $$

Optionally, in step 2 the characteristic mRNA expression data are selected and each sample is standardized by the formula:

$$ u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8} $$

wherein i is the sample number and j is the characteristic mRNA number; $\mu_i$ is the mean expression of all characteristic mRNAs of the i-th sample, $\sigma_i$ is the standard deviation of all characteristic mRNAs of the i-th sample, $u_{ij}$ is the log-transformed characteristic mRNA expression value, and $u'_{ij}$ is the standardized mRNA value.

Optionally, the step 3 of constructing an early prediction model for the normalized data by using a support vector machine specifically includes:

step 3.1, grouping all samples: 80% of all samples are assigned to the training set + validation set, and the remaining 20% to the test set; the training set + validation set is used for 5-fold cross-validation, namely it is divided into 5 equal groups, each group serving in turn as the validation set while the other 4 groups serve as the training set; for given parameters, the training set is used to construct the model and the validation set is used to check the accuracy of the model;

step 3.2, screening the optimal parameters: the parameter gamma of the SVM controls the width of the Gaussian kernel, and C is the regularization parameter, which limits the importance of each point; the parameter grid is set as:

gamma = [0.001, 0.01, 0.1, 1, 10, 100] (9)

C = [0.001, 0.01, 0.1, 1, 10, 100] (10)

in the cross-validation, a model is constructed for each pairwise combination of gamma and C in turn, and the accuracy of the model is then checked with the validation set; for each parameter combination, each validation round of the 5-fold cross-validation yields 1 accuracy, so the 5 rounds yield 5 accuracies; the parameter combination with the highest mean accuracy over the 5 rounds is selected as the optimal parameters;

step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model with the test set, the evaluation indices comprising accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic (ROC) curve (AUC); in the test set, the number of tumor samples predicted as tumor is defined as true positive (TP), the number of normal samples predicted as tumor as false positive (FP), the number of tumor samples predicted as normal as false negative (FN), and the number of normal samples predicted as normal as true negative (TN); the evaluation indices are calculated as:

$$ \text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} $$

$$ \text{precision} = \frac{TP}{TP + FP} $$

$$ \text{recall} = \frac{TP}{TP + FN} $$

$$ \text{specificity} = \frac{TN}{TN + FP} $$

$$ F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$

$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

among the above evaluation indices, accuracy, precision, recall, specificity, F1 score and AUC take values between 0 and 1. The higher the accuracy, the better the overall prediction performance of the model; higher precision indicates fewer type I errors; higher recall indicates fewer type II errors; high specificity indicates that few negative samples are misclassified as positive; the F1 score is a composite index, the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and takes values between -1 and 1, where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates a higher probability that the classifier ranks a positive instance above a negative one. Therefore, the closer the above indices are to 1, the better the overall prediction performance of the model;

step 3.4, if the evaluation indexes are all larger than 0.9, the model has a better prediction effect; the final prediction model is constructed with the optimal parameter combinations using all the data.

Optionally, the early prediction in step 4 according to the expression levels of the characteristic mRNAs of the patient specifically comprises:

step 4.1, standardizing the characteristic mRNA expression data of the sample to be predicted; letting u be the characteristic mRNA expression value of the sample, μ the mean of its characteristic mRNA expression values and σ their standard deviation, then:

$$ u'_j = \frac{u_j - \mu}{\sigma} $$

wherein j is the characteristic mRNA number and $u'_j$ is the standardized mRNA value;

step 4.2, substituting the standardized mRNA values of the sample to be predicted into the final prediction model for prediction; a prediction result of 1 indicates lung squamous cell carcinoma, and a prediction result of 0 indicates normal.

Compared with the prior art, the invention can obtain the following technical effects:

1) the prediction speed is high: the prediction model constructed by the invention can be used for rapidly predicting large-scale samples, and the prediction time of 100 samples only needs a few seconds.

2) The accuracy is high: the prediction model constructed by the invention achieves accuracy and precision both above 90%, and the area under the ROC curve (AUC) reaches 0.994.

3) The impact of platform heterogeneity is small: mRNA expression values measured on different analysis platforms can differ substantially, but the invention predicts from standardized characteristic mRNA expression values and is therefore less affected by platform heterogeneity.
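One way to see why per-sample standardization reduces platform effects: suppose a second platform reports values that differ from the first, within a sample, by a positive scale factor $a$ and an offset $b$. This affine relationship is an idealized assumption made only for illustration; real platform differences need not take this form. With $u_j$, $\mu$ and $\sigma$ as in the standardization step, the standardized values then coincide:

```latex
\tilde u_j = a\,u_j + b,\quad a > 0
\;\Longrightarrow\;
\tilde\mu = a\,\mu + b,\quad \tilde\sigma = a\,\sigma
\;\Longrightarrow\;
\frac{\tilde u_j - \tilde\mu}{\tilde\sigma}
  = \frac{a\,(u_j - \mu)}{a\,\sigma}
  = \frac{u_j - \mu}{\sigma}
```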

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of data screening and model building according to the present invention;

FIG. 2 is a cross-validation parameter optimization process for a support vector machine model according to the present invention;

FIG. 3 is a diagram of a test set evaluation index for a support vector machine model according to the present invention;

FIG. 4 is a support vector machine model test set ROC curve of the present invention.

Detailed Description

The following embodiments are described in detail with reference to the accompanying drawings, so that how to implement the technical features of the present invention to solve the technical problems and achieve the technical effects can be fully understood and implemented.

The invention discloses a lung squamous carcinoma early stage prediction method based on characteristic mRNA expression profile combination, which can accurately predict the I/II stage of lung squamous carcinoma and comprises the following steps:

step 1, obtaining mRNA (characteristic mRNA) stably and differentially expressed by a patient with early lung squamous carcinoma, specifically:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of lung squamous cell carcinoma patients from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read counts) of the tumor tissue gene expression profiles of the patients, and performing logarithmic transformation;

step 1.2, selecting mRNAs with sufficient expression abundance, namely mRNAs whose read counts are greater than or equal to 10 in all samples. The logarithm of the read counts of all mRNAs is taken; letting n be the total number of samples, m the total number of screened mRNAs, v the read counts of an mRNA, and u the expression value after taking the logarithm, then:

$$ u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \tag{1} $$

wherein i is the sample number, j is the mRNA number, $u_{ij}$ is the log-transformed expression value of the j-th mRNA in the i-th sample, and $v_{ij}$ is the read counts of the j-th mRNA in the i-th sample.
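The abundance filter and log transformation of step 1.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `filter_and_log2`, the dictionary layout and the toy counts are hypothetical.

```python
import math

def filter_and_log2(counts, min_count=10):
    """counts maps an mRNA name to its read counts, one value per sample.

    Keeps only mRNAs whose read count is >= min_count in every sample,
    then applies u_ij = log2(v_ij) to the retained values.
    """
    kept = {}
    for gene, values in counts.items():
        if all(v >= min_count for v in values):
            kept[gene] = [math.log2(v) for v in values]
    return kept

# Hypothetical toy counts for three samples (not real data).
toy_counts = {
    "GENE_A": [16, 32, 64],
    "GENE_B": [8, 128, 256],  # dropped: the first sample is below 10
}
log_expr = filter_and_log2(toy_counts)
```

Filtering before taking the logarithm also guarantees that no zero counts reach `log2`.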

Step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting mRNAs stably expressed in the tumor samples and the normal samples, namely mRNAs whose coefficient of variation is less than 0.1 in the tumor samples and the normal samples; letting μ be the mean expression of an mRNA over all samples and σ its standard deviation, the coefficient of variation is calculated as:

$$ cv_j = \frac{\sigma_j}{\mu_j} \tag{2} $$

wherein j is the mRNA number, cv is the coefficient of variation, $cv_j$ is the coefficient of variation of the j-th mRNA, $\sigma_j$ is the standard deviation of the j-th mRNA, and $\mu_j$ is the mean expression of the j-th mRNA; letting $m_1$ be the total number of stably expressed mRNAs, then:

$$ m_1 = m\{cv_j < 0.1\}, \quad j \in (1, m) \tag{3} $$
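A minimal sketch of the stability filter of step 1.4 follows. It assumes that the coefficient of variation is computed separately within the tumor group and the normal group and that the sample standard deviation is used; the text does not settle either point, and the function names are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """cv = sigma / mu, using the sample standard deviation (an assumption)."""
    return statistics.stdev(values) / statistics.mean(values)

def is_stable(tumor_values, normal_values, threshold=0.1):
    """True when cv is below the threshold in both groups (an assumption)."""
    return (coefficient_of_variation(tumor_values) < threshold
            and coefficient_of_variation(normal_values) < threshold)
```

Because the expression values are already log-transformed and the abundance filter keeps read counts of at least 10, the means are well above zero and the ratio is well defined.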

step 1.5, selecting mRNAs differentially expressed between the tumor samples and the normal samples. The log-transformed expression values are used to calculate the log2 fold change f of each mRNA between the tumor and normal samples:

$$ f_j = \mu_{1j} - \mu_{2j} \tag{4} $$

wherein j is the mRNA number, $f_j$ is the fold change of the j-th mRNA, $\mu_{1j}$ is the mean expression of the j-th mRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th mRNA in the normal samples.
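Since the expression values are already on a log2 scale, the log fold change can be sketched as a difference of group means. The function name and the toy values below are hypothetical, and the difference-of-means form is an assumption consistent with the symbol definitions above.

```python
import statistics

def log2_fold_change(tumor_log_values, normal_log_values):
    """Difference of mean log2 expression: tumor minus normal."""
    return statistics.mean(tumor_log_values) - statistics.mean(normal_log_values)

# Hypothetical log2 expression values of one mRNA in three tumor
# and three normal samples (not real data).
f = log2_fold_change([6.0, 7.0, 8.0], [4.0, 5.0, 6.0])
```

A threshold of 1 on the absolute log2 fold change corresponds to a two-fold difference on the original scale.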

The difference in mRNA expression between the tumor and normal samples is then compared using an independent-samples t-test:

[Equation (5): t statistic of the independent-samples t-test; formula image not recovered]

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean mRNA expression of the tumor samples, $\mu_2$ is the mean mRNA expression of the normal samples, $\sigma_1^2$ is the mRNA variance of the tumor samples, and $\sigma_2^2$ is the mRNA variance of the normal samples.
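The exact form of the t statistic is not recoverable from this text, so the following is only a sketch of the two common independent-samples variants built from the quantities named above: the pooled-variance (Student) form and the unequal-variance (Welch) form. Which one the authors used is an assumption left open here; the function name and the `pooled` flag are hypothetical.

```python
import math
import statistics

def t_statistic(tumor, normal, pooled=True):
    """t statistic for two independent samples.

    pooled=True  -> Student's form with a pooled variance estimate.
    pooled=False -> Welch's form with separate variances.
    """
    n1, n2 = len(tumor), len(normal)
    mu1, mu2 = statistics.mean(tumor), statistics.mean(normal)
    var1, var2 = statistics.variance(tumor), statistics.variance(normal)
    if pooled:
        sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
        se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    else:
        se = math.sqrt(var1 / n1 + var2 / n2)
    return (mu1 - mu2) / se
```

The two variants give the same statistic when both groups have the same size and variance; they differ in the standard error and degrees of freedom otherwise. Converting the statistic to a p value requires the t distribution, which is not part of the Python standard library.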

The p values obtained from all t-tests are corrected using the false discovery rate (FDR); letting q be the FDR-corrected value and r the rank (sorted position) of a p value among the $m_1$ mRNAs, then:

$$ q_j = \frac{p_j \, m_1}{r_j} \tag{6} $$

wherein j is the mRNA number, $q_j$ is the FDR-corrected value of the j-th mRNA, $p_j$ is the t-test p value of the j-th mRNA, and $r_j$ is the rank of the p value of the j-th mRNA among the $m_1$ mRNAs.
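The correction described here, the p value multiplied by the number of tests and divided by its rank, can be sketched as below. The sketch assumes ranks are assigned in ascending order of p value; the function name is hypothetical. The usual Benjamini-Hochberg procedure additionally enforces that q values are monotone in rank and caps them at 1, which this minimal version does not do.

```python
def fdr_adjust(p_values):
    """Return q_j = p_j * m / r_j, where r_j is the ascending rank of p_j."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    q = [0.0] * m
    for rank, j in enumerate(order, start=1):
        q[j] = p_values[j] * m / rank
    return q

# Hypothetical p values for four mRNAs (not real data).
q_values = fdr_adjust([0.01, 0.04, 0.02, 0.5])
```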

Finally, mRNAs with an absolute fold change |f| of at least 1 and an FDR-corrected q value less than or equal to 0.05 are selected and denoted as characteristic mRNAs; letting $m_2$ be the total number of characteristic mRNAs, then:

$$ m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \tag{7} $$
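The selection rule of formula (7) amounts to a joint filter on fold change and q value. A minimal sketch, with a hypothetical function name and hypothetical toy values:

```python
def select_features(fold_changes, q_values, min_abs_fc=1.0, max_q=0.05):
    """Keep mRNAs with |f| >= min_abs_fc and q <= max_q.

    Both arguments map an mRNA name to its value.
    """
    return [gene for gene in fold_changes
            if abs(fold_changes[gene]) >= min_abs_fc and q_values[gene] <= max_q]

# Hypothetical values (not real data).
toy_fc = {"GENE_A": 2.3, "GENE_B": -1.4, "GENE_C": 0.6, "GENE_D": 1.8}
toy_q = {"GENE_A": 0.001, "GENE_B": 0.03, "GENE_C": 0.0001, "GENE_D": 0.2}
selected = select_features(toy_fc, toy_q)
```

Here `GENE_C` fails the fold-change condition and `GENE_D` fails the q-value condition, so only the first two are kept.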

step 2, selecting the characteristic mRNA expression data and standardizing each sample by the formula:

$$ u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8} $$

wherein i is the sample number and j is the characteristic mRNA number; $\mu_i$ is the mean expression of all characteristic mRNAs of the i-th sample, $\sigma_i$ is the standard deviation of all characteristic mRNAs of the i-th sample, $u_{ij}$ is the log-transformed characteristic mRNA expression value, and $u'_{ij}$ is the standardized mRNA value.
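Note that the standardization runs across the characteristic mRNAs within one sample, not across samples for one mRNA. A minimal sketch, assuming the sample standard deviation (the text does not say which estimator is used) and a hypothetical function name:

```python
import statistics

def standardize_sample(values):
    """Z-score the characteristic mRNA values of a single sample."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation: an assumption
    return [(u - mu) / sigma for u in values]

# Hypothetical log2 expression values of three characteristic mRNAs
# in one sample (not real data).
z = standardize_sample([4.0, 5.0, 6.0])
```

Because only the values of the sample itself are needed, the same function applies unchanged to a new sample at prediction time.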

Step 3, constructing an early prediction model for the standardized data by using a support vector machine, specifically:

Step 3.1, grouping all samples. 80% of all samples are assigned to the training set + validation set, and the remaining 20% to the test set. The training set + validation set is used for 5-fold cross-validation, namely it is divided into 5 equal groups, each group serving in turn as the validation set while the other 4 groups serve as the training set. For given parameters, the training set is used to construct the model, and the validation set is used to check the accuracy of the model.
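The grouping can be sketched as follows. The text does not say whether the split is random or stratified by class, so a plain seeded shuffle is assumed here; the function name, the seed and the sample identifiers are hypothetical.

```python
import random

def split_and_fold(sample_ids, test_fraction=0.2, k=5, seed=0):
    """Hold out a test set, then build k (training, validation) splits
    from the remaining samples."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test, dev = ids[:n_test], ids[n_test:]
    folds = [dev[i::k] for i in range(k)]  # k groups of near-equal size
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((training, validation))
    return test, splits

test_ids, cv_splits = split_and_fold(range(100))
```

With 100 samples this yields 20 test samples and five splits of 64 training and 16 validation samples each; the test samples never appear in any split.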

Step 3.2, screening the optimal parameters. The SVM parameter gamma controls the width of the Gaussian kernel, and C is a regularization parameter that limits the importance of each point. The parameter grid is set as:

gamma = [0.001, 0.01, 0.1, 1, 10, 100]    (9)

C = [0.001, 0.01, 0.1, 1, 10, 100]    (10)

During cross-validation, a model is constructed for each pairwise combination of gamma and C in turn, and the validation set is then used to verify the accuracy of the model. For each parameter combination, each of the 5 folds yields one accuracy, giving 5 accuracies in total. The parameter combination with the highest mean accuracy over the 5 validations is selected as the optimal parameter combination.
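A minimal sketch of this parameter search, assuming scikit-learn (the source does not name a library) and an RBF-kernel `SVC`. `GridSearchCV` scores every (gamma, C) pair by mean accuracy over the 5 folds; the shuffled `KFold` and the function name `search_best_params` are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Parameter grids (9) and (10) from the text.
PARAM_GRID = {
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
}

def search_best_params(X_trainval, y_trainval, seed=0):
    """Return (best_params, best_mean_accuracy) over 5-fold cross-validation."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(SVC(kernel="rbf"), PARAM_GRID, scoring="accuracy", cv=cv)
    search.fit(X_trainval, y_trainval)
    return search.best_params_, search.best_score_
```

By default `GridSearchCV` also refits a model on the whole training + validation set with the best parameters, which corresponds to the model that is then evaluated on the test set.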

Step 3.3, constructing a model using the optimal parameters and the data of the training + validation set, and finally evaluating the model on the test set. The evaluation indices include accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve (AUC). In the test set, the number of tumor samples predicted as tumor is defined as true positives (TP), the number of normal samples predicted as tumor as false positives (FP), the number of tumor samples predicted as normal as false negatives (FN), and the number of normal samples predicted as normal as true negatives (TN). The evaluation indices are calculated as follows:

accuracy = (TP + TN) / (TP + FP + FN + TN)

precision = TP / (TP + FP)

recall = TP / (TP + FN)

specificity = TN / (TN + FP)

F1 = 2 × precision × recall / (precision + recall)

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Among the above evaluation indices, accuracy, precision, recall, specificity, F1 score, and AUC take values between 0 and 1. Higher accuracy indicates better overall predictive performance of the model; higher precision indicates fewer Type I errors; higher recall indicates fewer Type II errors; higher specificity indicates that fewer negative examples are wrongly predicted as positive; the F1 score is a composite index, the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and takes values between -1 and 1, where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates that the classifier is more likely to rank a randomly chosen positive example above a randomly chosen negative one. Therefore, the closer these indices are to 1, the better the predictive performance of the model.
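The count-based indices can be computed from the four confusion-matrix counts as in the sketch below, which uses the standard definitions of these metrics. The function name and the example counts are hypothetical. AUC is not included because it requires the model's continuous scores rather than the counts alone.

```python
import math

def evaluation_indices(tp, fp, fn, tn):
    """Compute count-based evaluation indices from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical test set: 50 tumor and 50 normal samples, 5 errors in each class.
scores = evaluation_indices(tp=45, fp=5, fn=5, tn=45)
```

In this example every index except MCC equals 0.9, while MCC is 0.8, which illustrates that MCC is the stricter of the indices when deciding whether all of them exceed a threshold.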

Step 3.4, if all evaluation indices are greater than 0.9, the model is considered to have good predictive performance. The final prediction model is then constructed with the optimal parameter combination using all the data.

Step 4, carrying out early prediction according to the characteristic mRNA expression levels of the patient, specifically:

Step 4.1, standardizing the characteristic mRNA expression data of the sample to be predicted. Let u be the characteristic mRNA expression values of the sample, μ the mean characteristic mRNA expression of the sample, and σ the standard deviation of the characteristic mRNA expression of the sample; the formula is:

u_j' = (u_j - μ) / σ

where j is the characteristic mRNA index and u_j' is the standardized mRNA value.

Step 4.2, substituting the standardized mRNA values of the sample into the final prediction model for prediction. A prediction result of 1 indicates lung squamous cell carcinoma, and a prediction result of 0 indicates normal.
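Building the final model and predicting a new sample can be sketched as follows, assuming scikit-learn's `SVC` with an RBF kernel (the source does not name a library) and the population standard deviation for standardization. The function names are hypothetical, and `gamma` and `C` stand for whatever optimal values the parameter search returned.

```python
import numpy as np
from sklearn.svm import SVC

def standardize_sample(u):
    """Z-score one sample across its characteristic mRNA values."""
    u = np.asarray(u, dtype=float)
    return (u - u.mean()) / u.std()

def build_final_model(X_all, y_all, gamma, C):
    """Fit the final SVM on all samples (rows standardized individually)."""
    Z = np.vstack([standardize_sample(row) for row in X_all])
    return SVC(kernel="rbf", gamma=gamma, C=C).fit(Z, y_all)

def predict_sample(model, u):
    """Return 1 (lung squamous cell carcinoma) or 0 (normal) for one sample."""
    z = standardize_sample(u).reshape(1, -1)
    return int(model.predict(z)[0])
```

The input to `predict_sample` would be the log-transformed expression values of the 20 characteristic mRNAs, in the same order as the columns used for training.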
