Characteristic miRNA expression profile combination and lung squamous carcinoma early prediction method

Document No.: 1083469; publication date: 2020-10-20

Abstract: This technology, "Characteristic miRNA expression profile combination and lung squamous carcinoma early prediction method", was designed by 高跃东 and 李文兴 on 2020-08-04. The invention discloses a characteristic miRNA expression profile combination and a method for early prediction of lung squamous carcinoma. The nucleotide sequences of the characteristic miRNA expression profile combination are shown in SEQ ID NO. 1-30. The method comprises the following steps: obtaining miRNAs that are stably and differentially expressed in patients with early lung squamous carcinoma; selecting characteristic miRNA expression data and standardizing the data for each sample; constructing an early prediction model from the standardized data using a support vector machine; and performing early prediction based on the expression levels of the patient's characteristic miRNAs. The characteristic miRNA expression profile combination assesses early lung squamous carcinoma risk with high precision and accuracy (area under the ROC curve, AUC = 0.994). Only the relative expression levels of the 30 miRNAs are required; the support vector machine model then computes the probability of early lung squamous carcinoma, which can serve as a reference for early prediction.

1. A characteristic miRNA expression profile combination for predicting early squamous cell lung carcinoma is characterized by comprising hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3, hsa-let-7b, hsa-let-7i, hsa-mir-101-1, hsa-mir-101-2, hsa-mir-103a-1, hsa-mir-103a-2, hsa-mir-10a, hsa-mir-126, hsa-mir-143, hsa-mir-146b, hsa-mir-181a-2, hsa-mir-182, hsa-mir-183, hsa-mir-22, hsa-mir-23a, hsa-mir-23b, hsa-mir-26a-1, hsa-mir-26a-2, hsa-mir-26b, hsa-mir-27a, hsa-mir-27b, hsa-mir-29a, hsa-mir-30a, hsa-mir-30d, hsa-mir-30e, hsa-mir-374a, and hsa-mir-99b, wherein the nucleotide sequences are shown in SEQ ID NO. 1-30.

2. A lung squamous carcinoma early stage prediction method based on miRNA expression profile combination characteristics is characterized by comprising the following steps:

step 1, miRNA stably and differentially expressed by patients with early lung squamous carcinoma is obtained;

step 2, selecting characteristic miRNA expression data, and carrying out data standardization on each sample;

step 3, constructing an early prediction model for the standardized data by using a support vector machine;

step 4, carrying out early prediction according to the expression level of the patient characteristic miRNA;

the method is used for purposes other than the diagnosis and treatment of disease.

3. The prediction method according to claim 2, wherein the miRNA for obtaining stable differential expression of patient with early lung squamous carcinoma in step 1 is specifically:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of patients with squamous cell lung cancer from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read values) of the tumor tissue gene expression profiles of the patients, and carrying out logarithmic conversion;

step 1.2, selecting miRNAs whose read counts are greater than or equal to 10 in all samples, then taking the logarithm of the read counts of all selected miRNAs; let n be the total number of samples, m the total number of screened miRNAs, v the read counts of a miRNA, and u the expression value after taking the logarithm; then:

$$u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \qquad (1)$$

wherein i is the sample number, j is the miRNA number, $u_{ij}$ is the log-transformed expression value of the j-th miRNA in the i-th sample, and $v_{ij}$ is the read count of the j-th miRNA in the i-th sample;

step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting miRNAs whose coefficient of variation across the tumor and normal samples is smaller than 0.1; let μ be the mean expression of a miRNA across all samples and σ its standard deviation; the coefficient of variation is calculated as:

$$c_{vj} = \frac{\sigma_j}{\mu_j}, \quad j \in (1, m) \qquad (2)$$

wherein j is the miRNA number, $c_v$ is the coefficient of variation, $c_{vj}$ is the coefficient of variation of the j-th miRNA, $\sigma_j$ is the standard deviation of the j-th miRNA, and $\mu_j$ is the mean expression of the j-th miRNA; let $m_1$ be the total number of stably expressed miRNAs; then:

$$m_1 = m\{c_{vj} < 0.1\}, \quad j \in (1, m) \qquad (3)$$

step 1.5, selecting miRNAs that are differentially expressed between the tumor and normal samples: using the log-transformed expression values, the log fold change f of each miRNA between the tumor and normal samples is calculated as follows:

$$f_j = \mu_{1j} - \mu_{2j}, \quad j \in (1, m_1) \qquad (4)$$

wherein j is the miRNA number, $f_j$ is the fold change of the j-th miRNA, $\mu_{1j}$ is the mean expression of the j-th miRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th miRNA in the normal samples;

then comparing the expression of each miRNA between the tumor samples and the normal samples using an independent-samples t test, wherein the independent-samples t test formula is as follows:

[Equation (5): independent-samples t statistic; shown as an image in the original]

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean expression of the miRNA in the tumor samples, $\mu_2$ is the mean expression of the miRNA in the normal samples, and the two remaining symbols denote the variance of the miRNA in the tumor samples and in the normal samples, respectively;

correcting the p values obtained from all t tests using the false discovery rate (FDR), wherein q is the FDR-corrected value and r is the rank of the p value among the $m_1$ miRNAs, as follows:

$$q_j = \frac{p_j \cdot m_1}{r_j}, \quad j \in (1, m_1) \qquad (6)$$

wherein j is the miRNA number, $q_j$ is the FDR-corrected value of the j-th miRNA, $p_j$ is the p value from the t test of the j-th miRNA, and $r_j$ is the rank of the p value of the j-th miRNA among the $m_1$ miRNAs;

finally, selecting miRNAs whose fold change f has an absolute value of at least 1 and whose FDR-corrected q value is less than or equal to 0.05, recording them as characteristic miRNAs, and letting the total number of characteristic miRNAs be $m_2$; then:

$$m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \qquad (7)$$

4. The prediction method according to claim 2, wherein the miRNAs are respectively: hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3, hsa-let-7b, hsa-let-7i, hsa-mir-101-1, hsa-mir-101-2, hsa-mir-103a-1, hsa-mir-103a-2, hsa-mir-10a, hsa-mir-126, hsa-mir-143, hsa-mir-146b, hsa-mir-181a-2, hsa-mir-182, hsa-mir-183, hsa-mir-22, hsa-mir-23a, hsa-mir-23b, hsa-mir-26a-1, hsa-mir-26a-2, hsa-mir-26b, hsa-mir-27a, hsa-mir-27b, hsa-mir-29a, hsa-mir-30a, hsa-mir-30d, hsa-mir-30e, hsa-mir-374a and hsa-mir-99b, the nucleotide sequences of which are respectively shown in SEQ ID NO.1-SEQ ID NO. 30.

5. The prediction method according to claim 2, wherein the selecting of the characteristic miRNA expression data and the standardizing of the data for each sample in step 2 is specifically:

$$u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \qquad (8)$$

wherein i is the sample number, j is the characteristic miRNA number, $\mu_i$ is the mean expression of all characteristic miRNAs in the i-th sample, $\sigma_i$ is the standard deviation of all characteristic miRNAs in the i-th sample, $u_{ij}$ is the log-transformed expression value of the characteristic miRNA, and $u'_{ij}$ is the standardized miRNA value.

6. The prediction method according to claim 2, wherein the constructing of the early prediction model for the normalized data by using the support vector machine in the step 3 is specifically:

step 3.1, grouping all samples: 80% of all samples form the training and validation sets, and the remaining 20% form the test set; the training and validation sets are used for 5-fold cross-validation, that is, they are divided into 5 equal groups, each group in turn serves as the validation set, and the other 4 groups serve as the training set; for given parameters, the training set is used to construct a model and the validation set is used to check the accuracy of the model;

step 3.2, optimal parameter screening: in the SVM, the parameter gamma controls the width of the Gaussian kernel, and C is a regularization parameter that limits the importance of each point; the parameter grid is set as:

gamma = [0.001, 0.01, 0.1, 1, 10, 100] (9)

C = [0.001, 0.01, 0.1, 1, 10, 100] (10)

in the cross-validation, a model is constructed for each pairwise combination of gamma and C in turn, and its accuracy is then checked on the validation set; for each parameter combination, each round of the 5-fold cross-validation yields one accuracy, giving 5 accuracies over the 5 rounds; the parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters;

step 3.3, constructing a model using the optimal parameters and the data of the training set plus the validation set, and finally evaluating the model using the test set, wherein the evaluation indexes comprise accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic (ROC) curve (AUC); in the test set, the number of samples that are actually tumor and predicted as tumor is defined as true positive (TP), the number actually normal and predicted as tumor as false positive (FP), the number actually tumor and predicted as normal as false negative (FN), and the number actually normal and predicted as normal as true negative (TN); the evaluation indexes are calculated as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

accuracy, precision, recall, specificity, F1 score and AUC all take values between 0 and 1; a higher accuracy indicates a higher overall prediction performance of the model; a higher precision indicates fewer type I errors; a higher recall indicates fewer type II errors; a higher specificity indicates that fewer negative examples are mixed into the samples predicted as positive; the F1 score is a composite index, namely the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and takes values between -1 and 1, where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates that the classifier is more likely to rank a positive example above a negative one; therefore, the closer these indexes are to 1, the better the overall prediction performance of the model;

step 3.4, if the evaluation indexes are all greater than 0.9, the model is considered to have good prediction performance; the final prediction model is then constructed with the optimal parameter combination using all the data.

7. The prediction method according to claim 2, wherein the early prediction according to the expression level of the patient characteristic miRNA in step 4 is specifically:

step 4.1, standardizing the characteristic miRNA expression data of the prediction sample: let u be the characteristic miRNA expression values of the prediction sample, μ their mean, and σ their standard deviation; the formula is as follows:

$$u'_j = \frac{u_j - \mu}{\sigma}$$

wherein j is the characteristic miRNA number, and $u'_j$ is the standardized miRNA value;

step 4.2, substituting the standardized miRNA values of the prediction sample into the final prediction model for prediction; a prediction result of 1 indicates squamous cell lung carcinoma, and a prediction result of 0 indicates normal.

Technical Field

The invention belongs to the technical field of biotechnology and medicine, and particularly relates to a characteristic miRNA expression profile combination and an early lung squamous cell carcinoma prediction method.

Background

Squamous cell carcinoma of the lung (lung squamous cell carcinoma) accounts for 40%-51% of primary lung cancers. It is most common in elderly men and is closely related to smoking. It usually presents as central lung cancer and tends to grow into the chest cavity; early squamous cell lung cancer often causes bronchial narrowing or obstructive pneumonitis. Global Burden of Disease (GBD) data show that in 2017 more than 3.3 million people worldwide had tracheal, bronchial or lung cancer, including as many as 1.27 million in China. In 2016, these cancers caused 1.88 million deaths worldwide, accounting for 3.37% of all deaths; in China they caused 690,000 deaths, accounting for 6.62% of all deaths. Statistics show a continuous increase in the prevalence and mortality of tracheal, bronchial and lung cancer worldwide from 1990 to 2017. In China, prevalence and mortality have risen year by year over the last decade, at a rate higher than the global average.

A support vector machine (SVM) is a generalized linear classifier that performs binary classification in a supervised learning manner; its decision boundary is the maximum-margin hyperplane fitted to the training samples. The SVM model represents instances as points in space, mapped so that instances of the separate classes are divided by as wide a margin as possible. New instances are then mapped into the same space, and their class is predicted according to which side of the margin they fall on. When the training data are linearly separable, the SVM learns a classifier by hard-margin maximization. When the training data are not linearly separable, the SVM uses the kernel trick and soft-margin maximization. SVMs perform well on medium-sized data sets whose features have similar meanings, and are also suitable for small data sets. In general, SVMs predict well on data sets with fewer than 10,000 samples. SVMs are widely applied in disease diagnosis, tumor classification, tumor gene identification, and the like.
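For reference, the soft-margin formulation and Gaussian (RBF) kernel mentioned above can be written in their standard textbook form. This is a general sketch and is not taken from the patent text; the symbols $w$, $b$, $\xi_i$, $\phi$, $\alpha_i$ and the labels $y_i \in \{-1, +1\}$ are the usual SVM notation and are assumptions here.

```latex
% Soft-margin SVM (primal form): C trades margin width against training errors
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i \bigl( w^\top \phi(x_i) + b \bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% Gaussian (RBF) kernel: gamma controls the kernel width
K(x, x') = \exp\bigl( -\gamma \lVert x - x' \rVert^2 \bigr)

% Decision function for a new instance x
\hat{y}(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b \Bigr)
```

A larger C penalizes margin violations more strongly, and a larger gamma makes each training point's influence more local.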

Early diagnosis of tumors has long been a difficult problem in medicine. Most existing early diagnosis methods observe the expression level of a single marker or a single class of markers, and rarely achieve the desired diagnostic performance. Because the expression distributions of these markers in tumor patients and in the normal population partially overlap, it is difficult to define a cut-off value that cleanly separates the two groups. Combining the expression signatures of multiple markers may therefore be an effective approach to early tumor diagnosis. MicroRNAs (miRNAs) are non-coding single-stranded RNA molecules, about 21-25 nucleotides long, that are encoded by endogenous genes and regulate gene expression in a variety of ways. miRNA expression is relatively stable in the human body and easy to detect. However, since the expression distributions of individual miRNAs in tumor and normal populations overlap, it is difficult to define a critical value for early diagnosis from any single miRNA.

Therefore, there is a need to provide a more stable diagnostic model of multiple differential miRNA expression signature combinations that will aid in the early prediction of squamous cell lung cancer.

Disclosure of Invention

In view of the above, the invention provides a characteristic miRNA expression profile combination and an early lung squamous cell carcinoma prediction method, which can accurately perform lung squamous cell carcinoma stage I/II prediction.

In order to solve the technical problem, the invention discloses a characteristic miRNA expression profile combination for predicting early squamous cell lung carcinoma, which comprises hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3, hsa-let-7b, hsa-let-7i, hsa-mir-101-1, hsa-mir-101-2, hsa-mir-103a-1, hsa-mir-103a-2, hsa-mir-10a, hsa-mir-126, hsa-mir-143, hsa-mir-146b, hsa-mir-181a-2, hsa-mir-182, hsa-mir-183, hsa-mir-22, hsa-mir-23a, hsa-mir-23b, hsa-mir-26a-1, hsa-mir-26a-2, hsa-mir-26b, hsa-mir-27a, hsa-mir-27b, hsa-mir-29a, hsa-mir-30a, hsa-mir-30d, hsa-mir-30e, hsa-mir-374a and hsa-mir-99b, and the nucleotide sequences thereof are shown in SEQ ID NO. 1-30.

The invention also discloses a lung squamous carcinoma early stage prediction method based on the miRNA expression profile combination characteristics, which comprises the following steps:

step 1, miRNA stably and differentially expressed by patients with early lung squamous carcinoma is obtained;

step 2, selecting characteristic miRNA expression data, and carrying out data standardization on each sample;

step 3, constructing an early prediction model for the standardized data by using a support vector machine;

step 4, carrying out early prediction according to the expression level of the patient's characteristic miRNAs.

Optionally, the miRNA stably and differentially expressed by the patient at the early stage of squamous cell lung carcinoma obtained in step 1 specifically is:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of patients with squamous cell lung cancer from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read values) of the tumor tissue gene expression profiles of the patients, and carrying out logarithmic conversion;

step 1.2, selecting miRNAs whose read counts are greater than or equal to 10 in all samples, then taking the logarithm of the read counts of all selected miRNAs; let n be the total number of samples, m the total number of screened miRNAs, v the read counts of a miRNA, and u the expression value after taking the logarithm; then:

$$u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \qquad (1)$$

wherein i is the sample number, j is the miRNA number, $u_{ij}$ is the log-transformed expression value of the j-th miRNA in the i-th sample, and $v_{ij}$ is the read count of the j-th miRNA in the i-th sample;

step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting miRNAs whose coefficient of variation across the tumor and normal samples is smaller than 0.1; let μ be the mean expression of a miRNA across all samples and σ its standard deviation; the coefficient of variation is calculated as:

$$c_{vj} = \frac{\sigma_j}{\mu_j}, \quad j \in (1, m) \qquad (2)$$

wherein j is the miRNA number, $c_v$ is the coefficient of variation, $c_{vj}$ is the coefficient of variation of the j-th miRNA, $\sigma_j$ is the standard deviation of the j-th miRNA, and $\mu_j$ is the mean expression of the j-th miRNA; let $m_1$ be the total number of stably expressed miRNAs; then:

$$m_1 = m\{c_{vj} < 0.1\}, \quad j \in (1, m) \qquad (3)$$

step 1.5, selecting miRNAs that are differentially expressed between the tumor and normal samples: using the log-transformed expression values, the log fold change f of each miRNA between the tumor and normal samples is calculated as follows:

$$f_j = \mu_{1j} - \mu_{2j}, \quad j \in (1, m_1) \qquad (4)$$

wherein j is the miRNA number, $f_j$ is the fold change of the j-th miRNA, $\mu_{1j}$ is the mean expression of the j-th miRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th miRNA in the normal samples;

then comparing the expression of each miRNA between the tumor samples and the normal samples using an independent-samples t test, wherein the independent-samples t test formula is as follows:

[Equation (5): independent-samples t statistic; shown as an image in the original]

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean expression of the miRNA in the tumor samples, $\mu_2$ is the mean expression of the miRNA in the normal samples, and the two remaining symbols denote the variance of the miRNA in the tumor samples and in the normal samples, respectively;

correcting the p values obtained from all t tests using the false discovery rate (FDR), wherein q is the FDR-corrected value and r is the rank of the p value among the $m_1$ miRNAs, as follows:

$$q_j = \frac{p_j \cdot m_1}{r_j}, \quad j \in (1, m_1) \qquad (6)$$

wherein j is the miRNA number, $q_j$ is the FDR-corrected value of the j-th miRNA, $p_j$ is the p value from the t test of the j-th miRNA, and $r_j$ is the rank of the p value of the j-th miRNA among the $m_1$ miRNAs;

finally, selecting miRNAs whose fold change f has an absolute value of at least 1 and whose FDR-corrected q value is less than or equal to 0.05, recording them as characteristic miRNAs, and letting the total number of characteristic miRNAs be $m_2$; then:

$$m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \qquad (7)$$
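The FDR correction and final selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the inventors' implementation: the q value is taken to be the rank-based form q = p · m / r implied by the definitions of q, p and r, the fold change is assumed to be the difference of log2 means, and the function names and toy numbers are hypothetical.

```python
def fdr_q_values(p_values):
    """Rank-based FDR correction: q_j = p_j * m / r_j.

    r_j is the 1-based rank of p_j in ascending order (assumed form).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    q = [0.0] * m
    for rank, j in enumerate(order, start=1):
        q[j] = p_values[j] * m / rank
    return q


def select_characteristic(names, tumor_means, normal_means, p_values,
                          min_abs_fc=1.0, max_q=0.05):
    """Keep miRNAs with |f| >= min_abs_fc and q <= max_q."""
    q = fdr_q_values(p_values)
    selected = []
    for name, mu1, mu2, qj in zip(names, tumor_means, normal_means, q):
        f = mu1 - mu2  # assumed: log fold change = difference of log2 means
        if abs(f) >= min_abs_fc and qj <= max_q:
            selected.append(name)
    return selected


# Toy values, invented purely for illustration
names = ["mir-A", "mir-B", "mir-C", "mir-D"]
tumor_means = [12.0, 9.5, 8.0, 10.0]
normal_means = [10.0, 9.0, 10.5, 9.0]
p_values = [0.001, 0.0005, 0.04, 0.5]

q_values = fdr_q_values(p_values)
selected = select_characteristic(names, tumor_means, normal_means, p_values)
```

Common statistical packages additionally enforce monotonicity of the corrected values across ranks; that refinement is omitted here to stay close to the rank-based definition.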

Optionally, the miRNAs are respectively: hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3, hsa-let-7b, hsa-let-7i, hsa-mir-101-1, hsa-mir-101-2, hsa-mir-103a-1, hsa-mir-103a-2, hsa-mir-10a, hsa-mir-126, hsa-mir-143, hsa-mir-146b, hsa-mir-181a-2, hsa-mir-182, hsa-mir-183, hsa-mir-22, hsa-mir-23a, hsa-mir-23b, hsa-mir-26a-1, hsa-mir-26a-2, hsa-mir-26b, hsa-mir-27a, hsa-mir-27b, hsa-mir-29a, hsa-mir-30a, hsa-mir-30d, hsa-mir-30e, hsa-mir-374a and hsa-mir-99b, the nucleotide sequences of which are respectively shown in SEQ ID NO.1-SEQ ID NO. 30.

Optionally, the selecting of the characteristic miRNA expression data and the standardizing of the data for each sample in step 2 specifically includes:

$$u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \qquad (8)$$

wherein i is the sample number, j is the characteristic miRNA number, $\mu_i$ is the mean expression of all characteristic miRNAs in the i-th sample, $\sigma_i$ is the standard deviation of all characteristic miRNAs in the i-th sample, $u_{ij}$ is the log-transformed expression value of the characteristic miRNA, and $u'_{ij}$ is the standardized miRNA value.
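The per-sample standardization can be illustrated as follows. This sketch assumes the usual z-score implied by the definitions of the per-sample mean and standard deviation; the text does not say whether the population or the sample standard deviation is used, so the population form (`statistics.pstdev`) is an assumption, and the function name and toy values are hypothetical.

```python
import statistics


def standardize_sample(log_values):
    """Z-score one sample across its characteristic miRNAs.

    Each value becomes (u - mean) / sd, with mean and sd computed over the
    characteristic miRNAs of that single sample.
    """
    mu = statistics.mean(log_values)
    sigma = statistics.pstdev(log_values)  # assumption: population SD
    return [(u - mu) / sigma for u in log_values]


# Toy log2 expression values for one sample (invented for illustration)
sample = [8.0, 10.0, 12.0, 10.0]
z = standardize_sample(sample)
```

Because each sample is scaled by its own mean and standard deviation, the standardized profile depends only on the relative expression of the characteristic miRNAs within that sample.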

Optionally, the constructing an early prediction model for the normalized data by using the support vector machine in step 3 specifically includes:

step 3.1, grouping all samples: 80% of all samples form the training and validation sets, and the remaining 20% form the test set; the training and validation sets are used for 5-fold cross-validation, that is, they are divided into 5 equal groups, each group in turn serves as the validation set, and the other 4 groups serve as the training set; for given parameters, the training set is used to construct a model and the validation set is used to check the accuracy of the model;

step 3.2, optimal parameter screening: in the SVM, the parameter gamma controls the width of the Gaussian kernel, and C is a regularization parameter that limits the importance of each point; the parameter grid is set as:

gamma = [0.001, 0.01, 0.1, 1, 10, 100] (9)

C = [0.001, 0.01, 0.1, 1, 10, 100] (10)

in the cross-validation, a model is constructed for each pairwise combination of gamma and C in turn, and its accuracy is then checked on the validation set; for each parameter combination, each round of the 5-fold cross-validation yields one accuracy, giving 5 accuracies over the 5 rounds; the parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters;
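The grid search with 5-fold cross-validation can be sketched in plain Python. The sketch only shows the search loop: `fit_and_score` is a hypothetical callable standing in for "train an RBF-kernel SVM on the training indices and return its accuracy on the validation indices", and `toy_score` is an invented stand-in so that the example runs without any machine-learning library.

```python
import math
from itertools import product

GAMMAS = [0.001, 0.01, 0.1, 1, 10, 100]
CS = [0.001, 0.01, 0.1, 1, 10, 100]


def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k nearly equal folds (no shuffling)."""
    indices = list(range(n_samples))
    return [indices[i::k] for i in range(k)]


def grid_search(n_samples, fit_and_score, k=5):
    """Return the (gamma, C) pair with the highest mean validation accuracy."""
    folds = k_fold_indices(n_samples, k)
    best_params, best_mean = None, -1.0
    for gamma, c in product(GAMMAS, CS):
        scores = []
        for i, valid in enumerate(folds):
            # The other k-1 folds together form the training set
            train = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
            scores.append(fit_and_score(gamma, c, train, valid))
        mean_acc = sum(scores) / k
        if mean_acc > best_mean:
            best_params, best_mean = (gamma, c), mean_acc
    return best_params, best_mean


def toy_score(gamma, c, train_idx, valid_idx):
    """Hypothetical scorer that peaks at gamma=0.01, C=10 (not a real SVM)."""
    return 1.0 / (1.0 + abs(math.log10(gamma) + 2) + abs(math.log10(c) - 1))


folds = k_fold_indices(50)
best_params, best_mean = grid_search(50, toy_score)
```

In practice the same search is usually delegated to a library; for example, scikit-learn provides `GridSearchCV` and `SVC(kernel="rbf")`, although the patent text does not state which software was used.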

step 3.3, constructing a model using the optimal parameters and the data of the training set plus the validation set, and finally evaluating the model using the test set, wherein the evaluation indexes comprise accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic (ROC) curve (AUC); in the test set, the number of samples that are actually tumor and predicted as tumor is defined as true positive (TP), the number actually normal and predicted as tumor as false positive (FP), the number actually tumor and predicted as normal as false negative (FN), and the number actually normal and predicted as normal as true negative (TN); the evaluation indexes are calculated as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

accuracy, precision, recall, specificity, F1 score and AUC all take values between 0 and 1; a higher accuracy indicates a higher overall prediction performance of the model; a higher precision indicates fewer type I errors; a higher recall indicates fewer type II errors; a higher specificity indicates that fewer negative examples are mixed into the samples predicted as positive; the F1 score is a composite index, namely the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and takes values between -1 and 1, where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates that the classifier is more likely to rank a positive example above a negative one; therefore, the closer these indexes are to 1, the better the overall prediction performance of the model;

step 3.4, if the evaluation indexes are all greater than 0.9, the model is considered to have good prediction performance; the final prediction model is then constructed with the optimal parameter combination using all the data.
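The confusion-matrix indexes named in step 3.3 can be computed with the standard definitions, as in the sketch below. The function name and the toy counts are hypothetical, tumor is assumed to be the positive class, and zero denominators are not handled.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (tumor = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
    }


# Toy counts, invented for illustration: 100 test samples, 5 errors each way
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
# Threshold from step 3.4 (AUC would be checked separately)
passes_threshold = all(value > 0.9 for value in metrics.values())
```

With these toy counts the MCC is 0.8 while the other indexes are 0.9, which illustrates that MCC is the stricter index when both error types occur.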

Optionally, the early prediction according to the expression level of the miRNA characteristic of the patient in step 4 specifically comprises:

step 4.1, standardizing the characteristic miRNA expression data of the prediction sample: let u be the characteristic miRNA expression values of the prediction sample, μ their mean, and σ their standard deviation; the formula is as follows:

$$u'_j = \frac{u_j - \mu}{\sigma}$$

wherein j is the characteristic miRNA number, and $u'_j$ is the standardized miRNA value;

step 4.2, substituting the standardized miRNA values of the prediction sample into the final prediction model for prediction; a prediction result of 1 indicates squamous cell lung carcinoma, and a prediction result of 0 indicates normal.

Compared with the prior art, the invention can obtain the following technical effects:

1) Fast prediction: the prediction model constructed by the invention can rapidly predict large numbers of samples; predicting 100 samples takes only a few seconds.

2) High accuracy: the prediction model constructed by the invention has high prediction accuracy and precision, and the area under the ROC curve (AUC) reaches 0.994.

3) Low sensitivity to platform heterogeneity: miRNA expression values measured on different analysis platforms differ considerably; because the invention uses standardized characteristic miRNA expression values for prediction, the influence of platform heterogeneity is small.

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of data screening and model building according to the present invention;

FIG. 2 is a cross-validation parameter optimization process for a support vector machine model according to the present invention;

FIG. 3 is a diagram of a test set evaluation index for a support vector machine model according to the present invention;

FIG. 4 is a support vector machine model test set ROC curve of the present invention.

Detailed Description

The embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the way the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.

The invention discloses a lung squamous carcinoma early stage prediction method based on miRNA expression profile combination characteristics, which comprises the following steps:

step 1, obtaining miRNA (characteristic miRNA) stably and differentially expressed by a patient with early lung squamous carcinoma, wherein the detailed flow is shown in figure 1, and the method is implemented according to the following steps:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of patients with squamous cell lung cancer from the Genomic Data Commons Data Portal database, obtaining the sequencing read counts of the tumor tissue gene expression profiles of the patients, and carrying out logarithmic conversion;

step 1.2, selecting miRNAs with a certain expression abundance, namely miRNAs whose read counts are greater than or equal to 10 in all samples. Taking the logarithm of the read counts of all selected miRNAs, let n be the total number of samples, m the total number of screened miRNAs, v the read counts of a miRNA, and u the expression value after taking the logarithm; then:

$$u_{ij} = \log_2 v_{ij}, \quad i \in (1, n),\ j \in (1, m) \qquad (1)$$

wherein i is the sample number, j is the miRNA number, $u_{ij}$ is the log-transformed expression value of the j-th miRNA in the i-th sample, and $v_{ij}$ is the read count of the j-th miRNA in the i-th sample.

Step 1.3, selecting squamous cell lung carcinoma patients with disease stages of I and II, recording the patients as squamous cell lung carcinoma early-stage patients, and recording the total number of squamous cell lung carcinoma early-stage patients as n';

step 1.4, selecting miRNAs stably expressed in the tumor and normal samples, namely miRNAs whose coefficient of variation across the tumor and normal samples is less than 0.1; let μ be the mean expression of any miRNA across all samples and σ its standard deviation; the coefficient of variation is calculated as:

$$c_{vj} = \frac{\sigma_j}{\mu_j}, \quad j \in (1, m) \qquad (2)$$

wherein j is the miRNA number, $c_v$ is the coefficient of variation, $c_{vj}$ is the coefficient of variation of the j-th miRNA, $\sigma_j$ is the standard deviation of the j-th miRNA, and $\mu_j$ is the mean expression of the j-th miRNA;

let $m_1$ be the total number of stably expressed miRNAs; then:

$$m_1 = m\{c_{vj} < 0.1\}, \quad j \in (1, m) \qquad (3)$$

step 1.5, miRNA which are differentially expressed in tumor and normal samples are selected. Calculating the logarithm fold change f of the miRNA of the tumor sample and the normal sample by using the expression value after logarithm taking, wherein the formula is as follows:

$$f_j = \mu_{1j} - \mu_{2j}, \quad j \in (1, m_1) \tag{4}$$

wherein j is the miRNA number, $f_j$ is the fold change of the j-th miRNA, $\mu_{1j}$ is the mean expression of the j-th miRNA in the tumor samples, and $\mu_{2j}$ is the mean expression of the j-th miRNA in the normal samples.
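As a hedged illustration, a logarithmic fold change on log2-transformed values is conventionally the difference between the two group means. The sketch below assumes that convention; the function name and toy values are introduced here.

```python
import statistics

def log_fold_change(tumor_values, normal_values):
    # Assumption: difference of the mean log2 expression values
    # (tumor minus normal), the usual form of a log2 fold change.
    return statistics.mean(tumor_values) - statistics.mean(normal_values)

# Toy log2 expression values, not data from the source.
f = log_fold_change([6.0, 7.0, 8.0], [4.0, 5.0, 6.0])
print(f)
```

Under this convention, f = 1 corresponds to a two-fold higher geometric-mean expression in tumor samples, and f = -1 to a two-fold lower one.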

Then comparing the expression difference of miRNA in the tumor sample and the normal sample by using independent sample t test, wherein the independent sample t test formula is as follows:

wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean expression of the miRNA in the tumor samples, $\mu_2$ is the mean expression of the miRNA in the normal samples, $\sigma_1^2$ is the variance of the miRNA in the tumor samples, and $\sigma_2^2$ is the variance of the miRNA in the normal samples.
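The text does not state whether the pooled-variance (Student) or the unequal-variance (Welch) form of the independent-samples t test is used. The following sketch computes the Welch statistic from the quantities listed above, purely as an assumption; the function name and toy values are introduced here.

```python
import math
import statistics

def welch_t(tumor_values, normal_values):
    """t statistic for two independent samples, unequal-variance (Welch) form.

    Assumption: the source may instead use the pooled-variance form.
    """
    n1, n2 = len(tumor_values), len(normal_values)
    mean1 = statistics.mean(tumor_values)
    mean2 = statistics.mean(normal_values)
    var1 = statistics.variance(tumor_values)   # sample variance
    var2 = statistics.variance(normal_values)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

# Toy log2 expression values, not data from the source.
t = welch_t([6.0, 7.0, 8.0], [4.0, 5.0, 6.0])
print(t)
```

Converting the statistic to a p value additionally requires the t distribution, which the Python standard library does not provide.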

The p values obtained from all t tests are corrected using the false discovery rate (FDR). Let q be the FDR-corrected value and r the rank of a p value among the $m_1$ miRNAs; then:

$$q_j = \frac{p_j \times m_1}{r_j}, \quad j \in (1, m_1) \tag{6}$$

wherein j is the miRNA number, $q_j$ is the FDR-corrected value of the j-th miRNA, $p_j$ is the p value of the j-th miRNA obtained from the t test, and $r_j$ is the rank of the p value of the j-th miRNA among the $m_1$ miRNAs.
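A minimal sketch of the rank-based correction described above, assuming ranks are assigned in ascending order of p value (rank 1 for the smallest) as in the Benjamini-Hochberg procedure. Capping q at 1 is an added assumption, and the monotonicity adjustment used by many Benjamini-Hochberg implementations is deliberately omitted because the text does not mention it.

```python
def fdr_adjust(p_values):
    """Return q values computed as p * m / rank, capped at 1."""
    m = len(p_values)
    # Indices sorted by ascending p value; position in this list gives the rank.
    order = sorted(range(m), key=lambda j: p_values[j])
    q = [0.0] * m
    for rank, j in enumerate(order, start=1):
        q[j] = min(1.0, p_values[j] * m / rank)
    return q

# Toy p values, not results from the source.
q_values = fdr_adjust([0.01, 0.04, 0.03, 0.5])
print(q_values)
```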

Finally, miRNAs with an absolute fold change f greater than 1 and an FDR-corrected q value less than or equal to 0.05 are selected and recorded as characteristic miRNAs. Let $m_2$ be the total number of characteristic miRNAs; then:

$$m_2 = m_1\{|f_j| \ge 1,\ q_j \le 0.05\}, \quad j \in (1, m_1) \tag{7}$$
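The final selection can be sketched as a simple filter over the fold changes and q values. The sketch follows the thresholds as written in formula (7); the function name, dictionary layout, and toy values are assumptions introduced here.

```python
def select_characteristic(fold_changes, q_values, min_abs_fc=1.0, max_q=0.05):
    """Return miRNAs meeting |f| >= min_abs_fc and q <= max_q.

    fold_changes, q_values: dicts keyed by miRNA name (hypothetical layout).
    """
    return [
        mirna for mirna in fold_changes
        if abs(fold_changes[mirna]) >= min_abs_fc and q_values[mirna] <= max_q
    ]

# Toy values, not results from the source.
toy_fc = {"a": 1.5, "b": -2.0, "c": 0.4, "d": 3.0}
toy_q = {"a": 0.01, "b": 0.05, "c": 0.001, "d": 0.2}
selected = select_characteristic(toy_fc, toy_q)
print(selected)
```

In this toy input, `c` fails the fold-change threshold and `d` fails the q-value threshold.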

Through the above screening, 30 lung squamous carcinoma characteristic miRNAs are finally obtained, as shown in Table 1. The nucleotide probe sequences of the 30 characteristic miRNAs are shown in Table 2.

TABLE 1 Lung squamous carcinoma characteristic miRNA

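hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3, hsa-let-7b, hsa-let-7i, hsa-mir-101-1, hsa-mir-101-2, hsa-mir-103a-1, hsa-mir-103a-2, hsa-mir-10a, hsa-mir-126, hsa-mir-143, hsa-mir-146b, hsa-mir-181a-2, hsa-mir-182, hsa-mir-183, hsa-mir-22, hsa-mir-23a, hsa-mir-23b, hsa-mir-26a-1, hsa-mir-26a-2, hsa-mir-26b, hsa-mir-27a, hsa-mir-27b, hsa-mir-29a, hsa-mir-30a, hsa-mir-30d, hsa-mir-30e, hsa-mir-374a, hsa-mir-99b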

TABLE 2 nucleotide probe sequence of lung squamous carcinoma characteristic miRNA

The nucleotide sequences of the 30 characteristic miRNAs are given as SEQ ID NO. 1-30 in the sequence listing.

Step 2, selecting characteristic miRNA expression data, and carrying out data standardization on each sample, wherein the method specifically comprises the following steps:

$$u'_{ij} = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8}$$

wherein i is the sample number and j is the characteristic miRNA number; $\mu_i$ is the mean expression of all characteristic miRNAs of the i-th sample, $\sigma_i$ is the standard deviation of all characteristic miRNAs of the i-th sample, $u_{ij}$ is the log-transformed characteristic miRNA expression value, and $u'_{ij}$ is the standardized miRNA value.
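Per-sample standardization as described above can be sketched as a z-score computed across the characteristic miRNAs of one sample. Whether the population or the sample standard deviation is intended is not stated; the population form is assumed here, and the function name and toy values are illustrative.

```python
import statistics

def standardize_sample(values):
    """Z-score the characteristic miRNA values of a single sample.

    Assumption: population standard deviation (statistics.pstdev).
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Toy log2 expression values of one sample, not data from the source.
z = standardize_sample([2.0, 4.0, 6.0])
print(z)
```

Because the mean and standard deviation are taken within each sample, a new sample can be standardized without reference to the training cohort.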

Step 3, constructing an early prediction model for the standardized data by using a support vector machine, specifically:

Step 3.1, grouping all samples. 80% of all samples are assigned to the training set + validation set, and the remaining 20% are assigned to the test set. The training set + validation set is used for 5-fold cross-validation: it is divided into 5 equal groups, each group serves in turn as the validation set, and the other 4 groups serve as the training set. For given parameters, the training set is used to construct the model and the validation set is used to verify the accuracy of the model, as detailed in FIG. 1.
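The grouping described above can be sketched with the standard library alone. The shuffling seed, the function names, and the absence of class stratification are assumptions introduced here; the text does not say how samples are assigned to groups.

```python
import random

def split_samples(sample_ids, test_fraction=0.2, n_folds=5, seed=0):
    """Hold out a test set, then split the remainder into n_folds groups."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test, rest = ids[:n_test], ids[n_test:]
    # Take every n_folds-th sample so the groups differ in size by at most one.
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return test, folds

def cv_rounds(folds):
    """Yield (training, validation) pairs, using each fold once for validation."""
    for k, validation in enumerate(folds):
        training = [s for i, fold in enumerate(folds) if i != k for s in fold]
        yield training, validation

# Toy sample identifiers, not data from the source.
test_ids, folds = split_samples(range(100))
for training, validation in cv_rounds(folds):
    print(len(training), len(validation))
```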

Step 3.2, screening the optimal parameters. In the SVM, the parameter gamma controls the width of the Gaussian kernel, and C is the regularization parameter, which limits the importance of each point. The parameter grid is set as:

$$\text{gamma} = [0.001, 0.01, 0.1, 1, 10, 100] \tag{9}$$

$$C = [0.001, 0.01, 0.1, 1, 10, 100] \tag{10}$$
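To illustrate how gamma controls the kernel width, the sketch below evaluates a Gaussian (RBF) kernel in the common parameterization exp(-gamma * ||x - y||^2) and enumerates the parameter grid. The kernel form is an assumption, since the text does not write it out, and the names are illustrative.

```python
import itertools
import math

GAMMAS = [0.001, 0.01, 0.1, 1, 10, 100]
CS = [0.001, 0.01, 0.1, 1, 10, 100]

def rbf_kernel(x, y, gamma):
    # Assumed parameterization: exp(-gamma * squared Euclidean distance).
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Every (gamma, C) pair that cross-validation will try.
grid = list(itertools.product(GAMMAS, CS))
print(len(grid))

# The same pair of points looks similar under a small gamma (wide kernel)
# and dissimilar under a large gamma (narrow kernel).
print(rbf_kernel([0.0], [1.0], 0.001), rbf_kernel([0.0], [1.0], 100))
```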

In cross-validation, the model is constructed with each combination of the parameters gamma and C in turn, and the validation set is then used to verify the model accuracy. For each parameter combination, each round of the 5-fold cross-validation yields 1 accuracy, so the 5 rounds yield 5 accuracies. The parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters. FIG. 2 shows the cross-validation parameter optimization process: the cross-validation accuracy of the model is highest, at 0.988, when gamma is 1 and C is 1. The optimal parameters of the model are therefore gamma = 1 and C = 1.
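The selection rule (highest mean accuracy over the 5 validation rounds) can be sketched independently of any SVM library. The callback `fold_accuracy` is hypothetical and stands in for training on 4 groups and scoring on the fifth; `toy_accuracy` is a made-up function chosen only so the example has a unique maximum, and it is not derived from FIG. 2.

```python
import itertools
import math
import statistics

def grid_search(gammas, cs, fold_accuracy, n_folds=5):
    """Return the (gamma, C) pair with the highest mean validation accuracy.

    fold_accuracy(gamma, c, k) is a hypothetical callback returning the
    accuracy obtained when fold k is used as the validation set.
    """
    best_params, best_mean = None, float("-inf")
    for gamma, c in itertools.product(gammas, cs):
        accuracies = [fold_accuracy(gamma, c, k) for k in range(n_folds)]
        mean_accuracy = statistics.mean(accuracies)
        if mean_accuracy > best_mean:
            best_params, best_mean = (gamma, c), mean_accuracy
    return best_params, best_mean

def toy_accuracy(gamma, c, k):
    # Made-up stand-in for model training and validation.
    return 1.0 / (1.0 + abs(math.log10(gamma)) + abs(math.log10(c)))

grid_values = [0.001, 0.01, 0.1, 1, 10, 100]
best_params, best_mean = grid_search(grid_values, grid_values, toy_accuracy)
print(best_params, best_mean)
```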

Step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model using the test set. The evaluation indexes include accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve (AUC). In the test set, the number of tumor samples predicted as tumor is defined as true positive (TP), the number of normal samples predicted as tumor as false positive (FP), the number of tumor samples predicted as normal as false negative (FN), and the number of normal samples predicted as normal as true negative (TN). The evaluation indexes are calculated as:

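$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

$$\text{specificity} = \frac{TN}{TN + FP}$$

$$\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$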

The accuracy, precision, recall, specificity, F1 score and AUC return values in the range (0, 1). Higher accuracy indicates higher overall prediction performance of the model; higher precision indicates fewer type I errors; higher recall indicates fewer type II errors; higher specificity indicates that fewer negative examples are predicted as positive examples; the F1 score is a comprehensive index, being the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and returns a value in the range (-1, 1), where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates a higher probability that the classifier ranks a positive example above a negative example. Therefore, the closer these indexes are to 1, the better the prediction performance of the model.
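The count-based evaluation indexes can be computed from TP, FP, FN and TN using their standard definitions, as in the sketch below. The function name and the toy counts are assumptions introduced here; AUC is not included because it requires ranked prediction scores rather than counts, and degenerate cases with a zero denominator are not handled.

```python
import math

def evaluate(tp, fp, fn, tn):
    """Compute evaluation indexes from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Toy counts, not results from the source.
scores = evaluate(tp=45, fp=5, fn=5, tn=45)
print(scores)
```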

Step 3.4, if all the evaluation indexes are greater than 0.9, the model is considered to have good prediction performance, and the final prediction model is constructed with the optimal parameter combination using all the data.

FIG. 3 shows the accuracy, precision, recall, specificity, F1 score and MCC among the above evaluation indexes; all 6 indexes are greater than 0.94. FIG. 4 shows the ROC curve and the AUC; the AUC in the test set is 0.994. These evaluation indexes show that the model has good prediction performance. The final prediction model is therefore constructed with the optimal parameter combination using all the data.

Step 4, carrying out early prediction according to the expression level of the patient characteristic miRNA, specifically comprising the following steps:

Step 4.1, standardizing the characteristic miRNA expression data of the prediction sample. Let u be the log-transformed expression value of a characteristic miRNA of the prediction sample, μ the mean expression of the characteristic miRNAs of the prediction sample, and σ the standard deviation of the characteristic miRNAs of the prediction sample; then:

$$u'_j = \frac{u_j - \mu}{\sigma}$$

wherein j is the characteristic miRNA number and $u'_j$ is the standardized expression value of the j-th characteristic miRNA.

The method randomly selects 10 samples for prediction, and these 10 samples are excluded when the final prediction model is constructed. The numbers of the 10 selected samples and their standardized characteristic miRNA values are shown in Table 3.

TABLE 3 Numbers of the 10 samples and their standardized characteristic miRNA values

Step 4.2, substituting the standardized miRNA values of the prediction sample into the final prediction model for prediction. A prediction result of 1 indicates lung squamous carcinoma, and a prediction result of 0 indicates normal.

The numbers of the 10 samples, the corresponding TCGA numbers, the actual states and the prediction results are shown in Table 4. The prediction results of all 10 samples agree with the actual states, which shows that the invention can accurately predict early-stage lung squamous carcinoma.

TABLE 4 Numbers of the 10 samples, corresponding TCGA numbers, actual states and prediction results

In conclusion, the characteristic miRNA expression profile combination has high prediction accuracy, and can effectively perform early diagnosis of squamous cell lung carcinoma. In addition, the method has no platform dependency, and can predict data from various sources.

While the foregoing description shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein. These forms are not to be construed as excluding other embodiments; the invention is capable of use in various other combinations, modifications, and environments, and is capable of changes within the scope of the inventive concept expressed herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

SEQUENCE LISTING

<110> Kunming Institute of Zoology, Chinese Academy of Sciences

<120> characteristic miRNA expression profile combination and lung squamous carcinoma early prediction method

<130>2019

<160>30

<170>PatentIn version 3.3

<210>1

<211>16

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>1

tgggatgagg tagtag 16

<210>2

<211>15

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>2

aggttgaggt agtag 15

<210>3

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>3

gggtgaggta gtaggtt 17

<210>4

<211>16

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>4

gggaaggcag taggtt 16

<210>5

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>5

agcaaggcag tagcttg 17

<210>6

<211>15

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>6

tgccctggct cagtt 15

<210>7

<211>16

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>7

actgtccttt ttcggt 16

<210>8

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>8

tactgccctc ggcttct 17

<210>9

<211>16

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>9

caaggcagca ctgtaa 16

<210>10

<211>18

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>10

tattccccta gatacgaa 18

<210>11

<211>18

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>11

cgcattatta ctcacggt 18

<210>12

<211>15

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>12

gagctacagt gcttc 15

<210>13

<211>16

<212>DNA

<213> Artificial sequence (artificial sequence)

<400>13

ccagaactga gtccac 16

<210>14

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>14

ggtacagtca acggtca 17

<210>15

<211>20

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>15

tagttggcaa gtctagaacc 20

<210>16

<211>15

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>16

ttatggccct tcggt 15

<210>17

<211>19

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>17

acagttcttc aactggcag 19

<210>18

<211>18

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>18

ggaaatccct ggcaatgt 18

<210>19

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>19

ggtaatccct ggcaatg 17

<210>20

<211>21

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>20

cgtgcaagta accaagaata g 21

<210>21

<211>22

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>21

gaaacaagta atcaagaata gg 22

<210>22

<211>20

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>22

gagccaagta atggagaaca 20

<210>23

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>23

gcggaactta gccactg 17

<210>24

<211>18

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>24

gcagaactta gccactgt 18

<210>25

<211>20

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>25

taaccgattt cagatggtgc 20

<210>26

<211>17

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>26

gctgcaaaca tccgact 17

<210>27

<211>19

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>27

gcagcaaaca tctgactga 19

<210>28

<211>18

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>28

gctgtaaaca tccgactg 18

<210>29

<211>22

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>29

aattacaata caatctgata ag 22

<210>30

<211>13

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>30

cggacccaca gac 13
