Plasma-based protein profiling for early lung cancer prognosis

Document No.: 1549556    Publication date: 2020-01-17

Reading note: This technology, "Plasma-based protein profiling for early lung cancer prognosis", was created by C·戈贝尔, C·劳登, and T·C·龙 on 2018-04-04. Summary: The present invention provides biomarkers and combinations of biomarkers that can be used for the diagnosis of non-small cell lung cancer. Measurements of these biomarkers are input into a classification system, such as a random forest, to assist in determining the likelihood that an individual has non-small cell lung cancer. Also provided are kits for detecting the biomarkers and biomarker combinations, and systems that assist in the diagnosis of non-small cell lung cancer.

1. A method of classifying test data, the test data comprising a plurality of biomarker metrics, each biomarker metric being a metric for a respective biomarker in a set of biomarkers, the method comprising:

receiving, on at least one processor, test data comprising a biomarker metric for each biomarker in the set of biomarkers in a physiological sample from a human test subject;

evaluating the test data using at least one processor, the evaluation being performed using classifiers that are electronic representations of a classification system, each of the classifiers having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising a biomarker metric for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and

outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or development of NSCLC in the subject based on the evaluating step,

wherein the set of biomarkers comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.
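The panel requirement at the end of claim 1 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the applicants' implementation: the names `CLAIMED_BIOMARKERS`, `claimed_markers_in`, and `meets_panel_requirement` are hypothetical, the test data is assumed to be a mapping from biomarker name to metric, and the spellings PDGF-AB/BB and sEGFR are assumed to be the intended forms of two entries in the list.

```python
# The 33 biomarker names listed in claim 1.
# Assumption: "PDGF-AB/BB" and "sEGFR" are the intended spellings.
CLAIMED_BIOMARKERS = frozenset({
    "IL-8", "MMP-9", "sTNFRII", "TNFRI", "MMP7", "IL-5", "resistin", "IL-10",
    "MPO", "NSE", "MCP-1", "GRO", "CEA", "leptin", "CXCL9", "HGF", "sCD40L",
    "CYFRA-21-1", "sFasL", "RANTES", "IL-7", "MIF", "sICAM-1", "IL-2", "SAA",
    "IL-16", "IL-9", "PDGF-AB/BB", "sEGFR", "LIF", "IL-12p70", "CA125", "IL-4",
})

MIN_PANEL_SIZE = 9  # "at least nine (9) biomarkers"


def claimed_markers_in(test_data):
    """Return the sorted biomarker names in test_data that are in the claimed group."""
    return sorted(name for name in test_data if name in CLAIMED_BIOMARKERS)


def meets_panel_requirement(test_data, minimum=MIN_PANEL_SIZE):
    """True if test_data carries metrics for at least `minimum` claimed biomarkers."""
    return len(claimed_markers_in(test_data)) >= minimum


# Made-up metric values, for illustration only.
example = {name: 1.0 for name in [
    "IL-8", "MMP-9", "sTNFRII", "TNFRI", "MMP7",
    "IL-5", "resistin", "IL-10", "MPO",
]}
print(meets_panel_requirement(example))  # True: nine claimed biomarkers present
```

A check of this kind would run before the evaluating step, so that a vector with too few claimed biomarkers is rejected rather than classified.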

2. A method of classifying test data, the test data comprising a plurality of biomarker metrics, each biomarker metric being a metric for a respective biomarker in a set of biomarkers, the method comprising:

accessing, using at least one processor, a set of electronically stored training data vectors, each training data vector representing an individual human and comprising a biomarker metric for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC;

training an electronic representation of a classification system using the electronically stored set of training data vectors;

receiving, on at least one processor, test data comprising a plurality of biomarker metrics for the set of biomarkers in a human test subject;

evaluating the test data using at least one processor, the evaluation being made using the electronic representation of the classification system; and

outputting a classification of the human test subject, the classification being a classification as to the likelihood of the presence or progression of non-small cell lung cancer in the subject based on the evaluating step,

wherein the set of biomarkers comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.
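The access, train, receive, evaluate, and output sequence of claim 2 can be sketched in Python using k-nearest neighbors, one of the classification systems named in the dependent claims. This is a hedged sketch, not the claimed system itself: the function names `train` and `classify` are hypothetical, the training values are made up, and each vector holds only two metrics for brevity, whereas the claim calls for at least nine biomarkers.

```python
import math


def train(training_vectors):
    """'Train' a k-NN model: store each (metrics, has_nsclc) pair as given."""
    return [(tuple(metrics), bool(label)) for metrics, label in training_vectors]


def classify(model, test_metrics, k=3):
    """Return the fraction of the k nearest training vectors labeled NSCLC.

    The fraction serves as a crude likelihood score for the test subject.
    """
    ranked = sorted(model, key=lambda pair: math.dist(pair[0], test_metrics))
    nearest = ranked[:k]
    positives = sum(1 for _, label in nearest if label)
    return positives / len(nearest)


# Made-up training data: (biomarker metrics, diagnosed with NSCLC?)
training_vectors = [
    ((1.0, 1.0), False), ((1.2, 0.9), False), ((0.8, 1.1), False),
    ((5.0, 5.2), True), ((4.8, 5.1), True), ((5.3, 4.9), True),
]

model = train(training_vectors)          # accessing + training steps
score = classify(model, (5.0, 5.0))      # receiving + evaluating steps
print(score)                             # outputting step: 1.0
```

Any of the other listed classification systems could be substituted for the `train` and `classify` pair without changing the overall flow.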

3. The method of claim 1 or 2, wherein the classification system is selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machines, LASSO, ridge regression, neural networks, genetic algorithms, elastic nets, gradient-boosted trees, Bayesian neural networks, k-nearest neighbors, or an ensemble thereof.

4. A method as claimed in any one of claims 1 to 3, wherein the classification system comprises a random forest.

5. A method according to any one of claims 1-3, wherein the classification system comprises AdaBoost.

6. The method of any one of claims 1-3, wherein the classification system comprises naive Bayes.

7. The method of any of claims 1-3, wherein the classification system comprises a support vector machine.

8. The method of any one of claims 1-3, wherein the classification system comprises LASSO.

9. The method of any one of claims 1-3, wherein the classification system comprises ridge regression.

10. The method of any one of claims 1-3, wherein the classification system comprises a neural network.

11. The method of any one of claims 1-3, wherein the classification system comprises a genetic algorithm.

12. The method of any of claims 1-3, wherein the classification system comprises an elastic net.

13. The method of any of claims 1-3, wherein the classification system comprises a gradient boosted tree.

14. The method of any one of claims 1-3, wherein the classification system comprises a Bayesian neural network.

15. The method of any one of claims 1-3, wherein the classification system comprises k-nearest neighbors.

16. The method of any one of claims 1-15, wherein the test data and each training data vector further comprise at least one other feature selected from the group consisting of: sex, age, and smoking status of the individual.
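One way to append the additional features of claim 16 to a biomarker vector is sketched below in Python. The function name `build_feature_vector` and the numeric encoding (female as 1.0, smoker as 1.0, age in years) are assumptions made for illustration; the claim does not specify an encoding.

```python
def build_feature_vector(biomarker_metrics, panel, sex, age, smoker):
    """Concatenate biomarker metrics (in panel order) with sex, age, smoking status.

    Assumed encoding: sex -> 1.0 for "female", else 0.0; smoker -> 1.0 or 0.0.
    """
    values = [float(biomarker_metrics[name]) for name in panel]
    values.append(1.0 if sex == "female" else 0.0)
    values.append(float(age))
    values.append(1.0 if smoker else 0.0)
    return values


# Made-up values for a two-marker panel, for illustration only.
vector = build_feature_vector(
    {"IL-8": 2.5, "MPO": 1.0}, ["IL-8", "MPO"], sex="female", age=61, smoker=True
)
print(vector)  # [2.5, 1.0, 1.0, 61.0, 1.0]
```

The same function would be applied to both the training data vectors and the test data so that feature positions line up.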

17. The method of any one of claims 1-16, wherein the test data comprises two or more replicate data vectors, each comprising a separate determination of the biomarker metrics for the plurality of biomarkers in the physiological sample from the human subject.

18. The method of claim 17, wherein the sample is classified as indicating the likely presence or development of NSCLC if any of the replicate data vectors is classified as positive for NSCLC according to any classifier in the classification system.
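The decision rule of claim 18 is a logical OR over every pairing of replicate (repeated) data vector and classifier. A minimal Python sketch follows; the two threshold functions are hypothetical stand-ins for trained classifiers, and the measurement values are made up.

```python
def sample_is_positive(replicates, classifiers):
    """True if any classifier calls any replicate data vector NSCLC-positive."""
    return any(clf(vector) for vector in replicates for clf in classifiers)


# Hypothetical stand-in classifiers: each thresholds one metric of the vector.
def high_first_metric(vector):
    return vector[0] > 10.0


def high_second_metric(vector):
    return vector[1] > 3.0


replicates = [(4.0, 1.0), (4.5, 3.5)]  # made-up replicate measurements
classifiers = [high_first_metric, high_second_metric]

print(sample_is_positive(replicates, classifiers))  # True: second replicate trips one classifier
```

Because a single positive call is enough, this rule favors sensitivity over specificity as the number of replicates or classifiers grows.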

19. The method of any one of claims 1-18, wherein the biomarker set comprises 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, or 33 biomarkers.

20. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, CYFRA-21-1, MIF, sICAM-1, SAA, or combinations thereof, and the physiological sample is a biological fluid.

21. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, IL-10, and NSE.

22. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, and IL-10.

23. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, resistin, MPO, NSE, GRO, CEA, CXCL9, MIF, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, or a combination thereof, and the physiological sample is a biological fluid.

24. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, and IL-2.

25. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, and leptin.

26. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO, CEA, CXCL9, IL-2, SAA, PDGF-AB/BB, or a combination thereof, and the physiological sample is a biological fluid.

27. The method of any one of claims 1-19, wherein the biomarker metric is proportional to a respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, and MMP-7.

28. The method of any one of claims 1-27, wherein the biomarker is a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

29. The method of claim 28, wherein the physiological sample is a biological fluid.

30. The method of claim 29, wherein the biological fluid is blood, serum, plasma, or a mixture thereof.

31. A method as claimed in any one of claims 1 to 30, wherein the classification system is a random forest classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75 or 100 individual trees.

32. The method of any one of claims 1-30, wherein the classifier is an AdaBoost classifier comprising 50, 100, 150, 200, 250, 300, 400, 500, 750, or 1,000 iterations.

33. The method of any one of claims 1-30, wherein the classifier is a support vector machine classifier that includes a kernel that is a polynomial, Gaussian radial basis, hyperbolic tangent, or trigonometric function.

34. The method of any one of claims 1-30, wherein the classifier is a LASSO classifier that includes a constraint of 0.1, 0.5, 1, 2, 10, or 100.

35. The method of any one of claims 1-30, wherein the classifier is a ridge regression classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

36. The method of any one of claims 1-30, wherein the classifier is a neural network classifier comprising 1, 2, 4, or 5 hidden layers.

37. The method of any one of claims 1-30, wherein the classifier is a neural network classifier comprising a convolutional neural network and a recurrent neural network.

38. The method of any of claims 1-30, wherein the classifier is an elastic net classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

39. The method of any one of claims 1-30, wherein the classifier is a gradient boosted tree classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75, or 100 individual trees.

40. The method of any one of claims 1-30, wherein the classifier is a Bayesian neural network classifier comprising 1, 2, 4, or 5 hidden layers.

41. The method of any one of claims 1-30, wherein the classifier is a k-nearest neighbor classifier that includes 1, 2, 4, 5, 8, or 10 neighbors.
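Claims 31 to 41 enumerate discrete settings for each classifier type. As a rough illustration, those values can be collected into a lookup table in Python. The dictionary keys and the helper `is_claimed_setting` are hypothetical names chosen for this sketch; only the numeric values and kernel types come from the claims.

```python
# Settings enumerated in claims 31-41; key names are illustrative only.
CLAIMED_SETTINGS = {
    "random_forest_trees": (5, 10, 15, 20, 25, 30, 40, 50, 75, 100),
    "adaboost_iterations": (50, 100, 150, 200, 250, 300, 400, 500, 750, 1000),
    "svm_kernels": ("polynomial", "gaussian_rbf", "hyperbolic_tangent", "trigonometric"),
    "lasso_constraints": (0.1, 0.5, 1, 2, 10, 100),
    "ridge_constraints": (0.1, 0.5, 1, 2, 10, 100),
    "elastic_net_constraints": (0.1, 0.5, 1, 2, 10, 100),
    "hidden_layers": (1, 2, 4, 5),
    "gradient_boosted_trees": (5, 10, 15, 20, 25, 30, 40, 50, 75, 100),
    "knn_neighbors": (1, 2, 4, 5, 8, 10),
}


def is_claimed_setting(name, value):
    """True if `value` is one of the settings enumerated for `name`."""
    return value in CLAIMED_SETTINGS.get(name, ())


print(is_claimed_setting("knn_neighbors", 8))   # True
print(is_claimed_setting("hidden_layers", 3))   # False: 3 is not listed
```

A table like this could also drive a simple grid search, trying each listed value in turn and keeping the best-performing one.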

42. The method of any one of claims 1-41, wherein the method further comprises determining biomarker levels in a physiological sample from the subject.

43. The method of any one of claims 1-42, wherein the subject is 45 years old or older, is a long-term smoker, has been diagnosed with an indeterminate nodule in the lung, or a combination thereof.

44. The method of any one of claims 1-43, wherein the method further comprises determining each biomarker metric in a physiological sample obtained from the subject.

45. The method of any one of claims 1-44, wherein the subject exhibits at least one lung nodule detectable by a computed tomography scan.

46. The method of any one of claims 1-45, wherein the method further comprises testing lung nodules by low dose computed tomography.

47. The method of any one of the preceding claims, wherein the subject is at risk for having NSCLC.

48. The method of any one of the preceding claims, further comprising the step of treating NSCLC in the subject.

49. The method of any one of the preceding claims, wherein the subject is a human.

50. The method of any one of the preceding claims, wherein the subject is a female.

51. The method of any one of the preceding claims, wherein the subject is a male.

52. The method of any one of the preceding claims, wherein the subject is 45 years old or older, is a long-term smoker, has been diagnosed with an indeterminate nodule in the lung, or a combination thereof.

53. The method of any one of claims 1-52, wherein the method further comprises:

(a) obtaining a physiological sample from a subject; and

(b) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

54. The method of claim 53, wherein the method comprises measuring a set of at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 21 biomarkers in the sample.

55. The method of any one of claims 1-54, wherein the biomarker metric is indicative of non-small cell lung cancer.

56. The method of any one of claims 1-55, wherein the biomarker metric is indicative of early stage non-small cell lung cancer, preferably stage I.

57. The method of any one of claims 1-56, wherein the subject is at risk for non-small cell lung cancer.

58. The method of any one of claims 1-57, wherein the biomarker metric is measured by: radioimmunoassay, enzyme-linked immunosorbent assay (ELISA), Q-Plex™ multiplex assay, liquid chromatography-mass spectrometry (LCMS), flow cytometry multiplex immunoassay, high pressure liquid chromatography with radiometric or spectrometric detection by absorbance of visible or ultraviolet light, mass spectrometric qualitative and quantitative analysis, western blotting, one- or two-dimensional gel electrophoresis with quantitative visualization by detection of radioactive, fluorescent, or chemiluminescent probes or nuclei, antibody-based detection with absorptive or fluorescent photometry, quantitation by luminescence of any of a number of chemiluminescent reporter systems, enzymatic assays, immunoprecipitation or immunocapture assays, solid and liquid phase immunoassays, quantitative multiplex immunoassays, protein arrays or chips, plate assays, printed array immunoassays, or combinations thereof.

59. The method of any one of claims 1-58, wherein the biomarker metric is measured by immunoassay.

60. A method for diagnosing stage I non-small cell lung cancer comprising:

(a) obtaining a physiological sample from a subject;

(b) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4;

(c) receiving, on at least one processor, test data comprising a biomarker metric for each biomarker in the set of biomarkers in a physiological sample from a human test subject;

(d) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising a biomarker metric for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and

(e) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or progression of NSCLC in the subject based on the evaluating step.

61. The method of claim 60, wherein the classification system is selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machines, LASSO, ridge regression, neural networks, genetic algorithms, elastic nets, gradient-boosted trees, Bayesian neural networks, k-nearest neighbors, or an ensemble thereof.

62. A method as claimed in claim 60, wherein the classification system comprises a random forest.

63. The method of claim 60, wherein the classification system comprises AdaBoost.

64. The method of claim 60, wherein the classification system comprises naive Bayes.

65. The method of claim 60, wherein the classification system comprises a support vector machine.

66. The method of claim 60, wherein the classification system comprises LASSO.

67. The method of claim 60, wherein the classification system comprises ridge regression.

68. The method of claim 60, wherein the classification system comprises a neural network.

69. The method of claim 60, wherein the classification system comprises a genetic algorithm.

70. The method of claim 60, wherein the classification system comprises an elastic net.

71. The method of claim 60, wherein the classification system comprises a gradient boosted tree.

72. The method of claim 60, wherein the classification system comprises a Bayesian neural network.

73. The method of claim 60, wherein the classification system comprises k-nearest neighbors.

74. A method as claimed in any of claims 60 to 73, wherein the classifier is a random forest classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75 or 100 individual trees.

75. The method of any one of claims 60-73, wherein the classifier is an AdaBoost classifier comprising 50, 100, 150, 200, 250, 300, 400, 500, 750, or 1,000 iterations.

76. The method of any one of claims 60-73, wherein the classifier is a support vector machine classifier comprising a kernel, the kernel being a polynomial, Gaussian radial basis, hyperbolic tangent, or trigonometric function.

77. The method of any of claims 60-73, wherein the classifier is a LASSO classifier that includes a constraint of 0.1, 0.5, 1, 2, 10, or 100.

78. The method of any one of claims 60-73, wherein the classifier is a ridge regression classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

79. The method of any one of claims 60-73, wherein the classifier is a neural network classifier comprising 1, 2, 4, or 5 hidden layers.

80. The method of any one of claims 60-73, wherein the classifier is a neural network classifier comprising a convolutional neural network and a recurrent neural network.

81. The method of any of claims 60-73, wherein the classifier is an elastic net classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

82. The method of any one of claims 60-73, wherein the classifier is a gradient boosted tree classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75, or 100 individual trees.

83. The method of any one of claims 60-73, wherein the classifier is a Bayesian neural network classifier comprising 1, 2, 4, or 5 hidden layers.

84. The method of any one of claims 60-73, wherein the classifier is a k-nearest neighbor classifier that includes 1, 2, 4, 5, 8, or 10 neighbors.

85. The method of any one of claims 60-84, wherein the biomarker is a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

86. The method of any one of claims 60-85, wherein the physiological sample is a biological fluid.

87. The method of claim 86, wherein the biological fluid is whole blood, plasma, serum, or a combination thereof.

88. A method of detecting a plurality of biomarkers, comprising:

(a) obtaining a physiological sample from a subject; and

(b) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

89. The method of claim 88, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, SAA, IL-2, and PDGF-AB/BB.

90. The method of claim 88 or 89, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, and SAA.

91. The method of any one of claims 88-90, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO-Pan, CEA, CXCL9/MIG, SAA, IL-2, and PDGF-AB/BB.

92. The method of any one of claims 88-91, wherein the set comprises at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 21 biomarkers.

93. The method of any of claims 88-92, wherein the subject is at risk for non-small cell lung cancer.

94. The method of any of claims 88-93, wherein the biomarker metric is indicative of non-small cell lung cancer.

95. The method of any of claims 88-94, wherein the biomarker metric is indicative of early stage non-small cell lung cancer, optionally, stage I non-small cell lung cancer.

96. The method of any one of claims 88-95, wherein the biomarker is a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

97. The method of any one of claims 88-96, wherein the physiological sample is whole blood, plasma, serum, or a combination thereof.

98. The method of any one of claims 1-97, wherein the biomarker metric does not indicate asthma, breast cancer, prostate cancer, colorectal cancer, pancreatic cancer, or a combination thereof.

99. The method of any one of claims 88-98, wherein the method further comprises:

(a) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject;

(b) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising a biomarker metric for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and

(c) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or progression of NSCLC in the subject based on the evaluating step.

100. The method of claim 99, wherein the classification system is selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machines, LASSO, ridge regression, neural networks, genetic algorithms, elastic nets, gradient-boosted trees, Bayesian neural networks, k-nearest neighbors, or an ensemble thereof.

101. A method as claimed in claim 99 or 100, wherein the classification system comprises a random forest.

102. A method according to claim 99 or 100, wherein the classification system comprises AdaBoost.

103. The method of claim 99 or 100, wherein the classification system comprises naive Bayes.

104. The method of claim 99 or 100, wherein the classification system comprises a support vector machine.

105. The method of claim 99 or 100, wherein the classification system comprises LASSO.

106. The method of claim 99 or 100, wherein the classification system comprises ridge regression.

107. The method of claim 99 or 100, wherein the classification system comprises a neural network.

108. The method of claim 99 or 100, wherein the classification system comprises a genetic algorithm.

109. The method of claim 99 or 100, wherein the classification system comprises an elastic net.

110. The method of claim 99 or 100, wherein the classification system comprises a gradient boosted tree.

111. The method of claim 99 or 100, wherein the classification system comprises a Bayesian neural network.

112. The method of claim 99 or 100, wherein the classification system comprises k-nearest neighbors.

113. A method as claimed in any one of claims 99 to 112, wherein the classifier is a random forest classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75 or 100 individual trees.

114. The method of any one of claims 99-113, wherein the classifier is an AdaBoost classifier comprising 50, 100, 150, 200, 250, 300, 400, 500, 750, or 1,000 iterations.

115. The method of any of claims 99-113, wherein the classifier is a support vector machine classifier comprising a kernel, the kernel being a polynomial, Gaussian radial basis, hyperbolic tangent, or trigonometric function.

116. The method of any of claims 99-113, wherein the classifier is a LASSO classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

117. The method of any of claims 99-113, wherein the classifier is a ridge regression classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

118. The method of any one of claims 99-113, wherein the classifier is a neural network classifier comprising 1, 2, 4, or 5 hidden layers.

119. The method of any one of claims 99-113, wherein the classifier is a neural network classifier that includes a convolutional neural network and a recurrent neural network.

120. The method of any of claims 99-113, wherein the classifier is an elastic net classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

121. The method of any one of claims 99-113, wherein the classifier is a gradient boosted tree classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75, or 100 individual trees.

122. The method of any one of claims 99-113, wherein the classifier is a Bayesian neural network classifier comprising 1, 2, 4, or 5 hidden layers.

123. The method of any one of claims 99-113, wherein the classifier is a k-nearest neighbor classifier that includes 1, 2, 4, 5, 8, or 10 neighbors.

124. A method of determining the presence of non-small cell lung cancer early in the development of a disease by measuring the expression levels of a set of biomarkers in a subject, comprising:

determining a biomarker metric for a biomarker set in a physiological sample by immunoassay, wherein the biomarker set comprises at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4; and

classifying said sample for the presence or development of non-small cell lung cancer in said subject using said biomarker metric in a classification system.

125. The method of claim 124, wherein the biomarker is a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

126. The method of claim 124 or 125, wherein the physiological sample is whole blood, plasma, serum, or a combination thereof.

127. The method of any one of claims 124-126, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, SAA, IL-2, and PDGF-AB/BB.

128. The method of any one of claims 124-126, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, and SAA.

129. The method of any one of claims 124-126, wherein the set of at least four biomarkers is selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO-Pan, CEA, CXCL9/MIG, SAA, IL-2, and PDGF-AB/BB.

130. The method of any one of claims 124-126, wherein the set comprises at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 19 biomarkers.

131. The method of any of claims 52-58, wherein the classification system is selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machines, LASSO, ridge regression, neural networks, genetic algorithms, elastic nets, gradient-boosted trees, Bayesian neural networks, k-nearest neighbors, or an ensemble thereof.

132. A method as claimed in claim 1 or 2, wherein the classification system comprises a random forest.

133. A method as claimed in claim 1 or 2, wherein the classification system comprises AdaBoost.

134. The method of claim 1 or 2, wherein the classification system comprises naive Bayes.

135. The method of claim 1 or 2, wherein the classification system comprises a support vector machine.

136. The method of claim 1 or 2, wherein the classification system comprises LASSO.

137. The method of claim 1 or 2, wherein the classification system comprises ridge regression.

138. The method of claim 1 or 2, wherein the classification system comprises a neural network.

139. The method of claim 1 or 2, wherein the classification system comprises a genetic algorithm.

140. The method of claim 1 or 2, wherein the classification system comprises an elastic net.

141. The method of claim 1 or 2, wherein the classification system comprises a gradient boosted tree.

142. The method of claim 1 or 2, wherein the classification system comprises a Bayesian neural network.

143. The method of claim 1 or 2, wherein the classification system comprises k-nearest neighbors.

144. A method as claimed in any one of claims 99 to 113, wherein the classifier is a random forest classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75 or 100 individual trees.

145. The method of any one of claims 99-113, wherein the classifier is an AdaBoost classifier comprising 50, 100, 150, 200, 250, 300, 400, 500, 750, or 1,000 iterations.

146. The method of any of claims 99-113, wherein the classifier is a support vector machine classifier comprising a kernel, the kernel being a polynomial, Gaussian radial basis, hyperbolic tangent, or trigonometric function.

147. The method of any of claims 99-113, wherein the classifier is a LASSO classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

148. The method of any of claims 99-113, wherein the classifier is a ridge regression classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

149. The method of any one of claims 99-113, wherein the classifier is a neural network classifier comprising 1, 2, 4, or 5 hidden layers.

150. The method of any one of claims 99-113, wherein the classifier is a neural network classifier that includes a convolutional neural network and a recurrent neural network.

151. The method of any of claims 99-113, wherein the classifier is an elastic net classifier comprising a constraint of 0.1, 0.5, 1, 2, 10, or 100.

152. The method of any one of claims 99-113, wherein the classifier is a gradient boosted tree classifier comprising 5, 10, 15, 20, 25, 30, 40, 50, 75, or 100 individual trees.

153. The method of any one of claims 99-113, wherein the classifier is a Bayesian neural network classifier comprising 1, 2, 4, or 5 hidden layers.

154. The method of any one of claims 99-113, wherein the classifier is a k-nearest neighbor classifier that includes 1, 2, 4, 5, 8, or 10 neighbors.

155. A method of classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for respective items in a set of biomarkers, the method comprising:

receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject;

evaluating the test data using at least one processor, the evaluation being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising biomarker metrics for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and

outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or development of NSCLC in the subject based on the evaluating step,

wherein the biomarker set comprises at least eight (8) biomarkers selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, IL-10, and NSE.

156. A system for classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for respective items in a set of biomarkers, the system comprising:

at least one processor coupled to the electronic storage device, comprising an electronic representation of a classifier trained with a set of electronically stored training data vectors, the processor being arranged to receive test data comprising a plurality of biomarker metrics of a set of biomarkers of a human test subject, according to any of the preceding claims, the at least one processor being further arranged to evaluate the test data using the electronic representation of the one or more classifiers and to output a classification of the human test subject based on the evaluation result,

wherein the biomarker set comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

157. A non-transitory computer readable storage medium having an executable program stored thereon, wherein the program instructs a microprocessor to perform the steps of:

receiving a biomarker metric for a plurality of biomarkers in a physiological sample of a subject; and

classifying a sample based on the biomarker metric, the classifying being performed using a classification system and at least one processor, wherein the classification of the sample is indicative of a likelihood of presence or progression of non-small cell lung cancer (NSCLC) in the subject,

wherein the biomarker set comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

Technical Field

The present invention relates to the use of biomarkers, and kits thereof, for the detection, identification, and diagnosis of pulmonary disease, and to biomarker-based systems that assist in determining the likelihood of the presence or absence of pulmonary disease. More specifically, the present invention relates to the diagnosis of non-small cell lung cancer (NSCLC) by measuring the expression levels of specific biomarkers and inputting these measurements into a classification system such as a random forest.

Description of the Related Art

Lesions of human lung tissue

The American Cancer Society, Inc. estimated that there would be 229,400 new cases of respiratory cancer and 164,840 deaths from respiratory cancer in 2007 alone. While the five-year survival rate for all cancers detected while still localized is 46%, the five-year survival rate for lung cancer patients is only 13%. Correspondingly, only 16% of lung cancers are found before the disease has spread. Lung cancer is generally classified into two major types according to the pathology of the cancer cells. Each type is named for the type of cell that has been transformed into cancer. Small cell lung cancer is derived from small cells in human lung tissue, whereas non-small cell lung cancer generally encompasses all lung cancers that are not of the small cell type. Non-small cell lung cancers are grouped together because treatment is generally the same for all non-small cell types. Non-small cell lung cancer (NSCLC) accounts for approximately 75% of all lung cancers.

A major factor in the low survival rate of lung cancer patients is that lung cancer is difficult to diagnose early. Current methods of diagnosing lung cancer or identifying its presence in humans are limited to X-rays, computed tomography (CT) scans, and similar examinations of the lungs to physically determine the presence or absence of a tumor. Consequently, lung cancer is often diagnosed only in response to symptoms that are evident or have been present for a long time, and after the disease has been present in the body long enough to produce a physically detectable mass.

Diagnosis of lung cancer

Neither sputum cytology nor chest X-ray examination has been found effective as a screening tool for the early detection of lung cancer. On the other hand, low-dose computed tomography shows promise when applied to high-risk populations (e.g., smokers). Aberle et al., N. Engl. J. Med. (2011) 365:395-409. However, it remains difficult to establish criteria for defining the high-risk populations that may benefit from such screening, and the utility of this technique for screening more general populations is unclear. Although large lung nodules found by CT scanning are clearly associated with a likelihood of malignancy, the vast majority of small nodules (<7 mm) appear to be benign. MacMahon et al., Radiology (2005) 237:395-400. Therefore, there is a need for additional screening methods to aid in the early detection and diagnosis of lung cancer.

Multivariate medical data analysis

Logistic regression began to be used in the medical field in the late 1980s and early 1990s. An example of the use of logistic regression in medicine is the Trauma and Injury Severity Score (TRISS). See Boyd, CR, Tolson, MA, and Copes, WS, "Evaluating Trauma Care: The TRISS Method," Journal of Trauma, 1987, Vol. 27, pp. 370-378. TRISS is used in hospitals in the United States of America to predict in-hospital mortality after trauma and to compare the quality of trauma care. TRISS is based on a logistic regression model of mortality after a traumatic event, with the Injury Severity Score, the Revised Trauma Score, and age as covariates.

Logistic regression models the logarithm of the odds of an event (also called the log-odds of the event), defined as

$$\log\left(\frac{p}{1-p}\right)$$

where p is the probability of the event occurring. Let

$$y = \log\left(\frac{p}{1-p}\right)$$

The logistic regression model may then be expressed as y = β'x, where x is the vector of covariates and β is the vector of the effects of each covariate. Maximizing the likelihood function of the model yields an estimate of β. A logistic discrimination model is a logistic regression model in which the predicted probabilities are converted into group labels.
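The relationship between the log-odds, the linear predictor y = β'x, and the group label can be sketched in a few lines of Python. This is a minimal illustration only: the coefficient values, the covariate vector, the 0.5 cutoff, and all function names are hypothetical and are not taken from the invention.

```python
import math

def log_odds(p):
    """Return log(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def predicted_probability(beta, x):
    """Invert the log-odds: p = 1 / (1 + exp(-y)) with y = beta'x."""
    y = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-y))

def logistic_discriminant(beta, x, cutoff=0.5):
    """Convert the predicted probability into a group label (1 = event)."""
    return 1 if predicted_probability(beta, x) >= cutoff else 0

# Hypothetical coefficients: an intercept followed by two covariate effects.
beta = [-1.0, 0.8, 0.5]
x = [1.0, 2.0, 1.0]  # the leading 1.0 pairs with the intercept
p = predicted_probability(beta, x)
label = logistic_discriminant(beta, x)
```

Here y = -1.0 + 1.6 + 0.5 = 1.1, so applying `log_odds` to the predicted probability recovers 1.1, and the label is 1 because the probability exceeds the cutoff.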

The logistic regression model is based on the assumption that the effect of each covariate is linear with respect to the log-odds of the event. Harrell, Frank, Regression Modeling Strategies, New York: Springer, 2001, p. 217. From a classification perspective, this linearity may be sufficient to achieve high accuracy, even on a test set; however, violating the assumption may cause the model to misestimate the effects severely and thus perform poorly.

Stable estimates and reliable, accurate classification require a large number of events per variable (EPV). See Courvoisier, DS et al., "Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure," Journal of Clinical Epidemiology, 2011, Vol. 64, p. 993 et seq. The required EPV varies with the number of variables and with the degree to which the odds ratios (estimated via e^β) agree. For example, Courvoisier et al. (cited above, p. 997) show that when the number of variables equals 25, an EPV of 25 may not provide sufficient power, depending on the relationship between the covariates and the event probability, and they conclude that no single EPV-based rule can guarantee accurate estimation of logistic regression parameters (cited above, p. 1000).
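As a small worked example of the EPV quantity discussed above, the sketch below computes it for a hypothetical study. It assumes the common convention that the "events" are the less frequent of the two outcome groups; the counts and the function name are illustrative only.

```python
def events_per_variable(n_events, n_nonevents, n_variables):
    """Return the EPV: the smaller outcome group divided by the covariate count."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return min(n_events, n_nonevents) / n_variables

# Hypothetical study: 60 cases, 140 controls, 12 covariates.
epv = events_per_variable(60, 140, 12)
print(epv)  # 5.0
```

As the passage above notes, a computed EPV alone cannot certify that the fitted coefficients are reliable.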

Classification system

Various classification systems, such as machine learning methods for data analysis and data mining, have been explored for recognizing patterns and extracting important information contained in large databases in the presence of other information that may be nothing more than irrelevant data. A learning machine includes an algorithm that can be trained to generalize using data with known classifications. The trained learning machine algorithm can then be applied to predict the outcome in cases where the outcome is unknown, i.e., to classify data according to the learned patterns. Machine learning methods, including neural networks, hidden Markov models, belief networks, and kernel-based classifiers such as support vector machines, can be used for problems characterized by large amounts of data, noisy patterns, and the absence of a general theory.

Many successful approaches to pattern classification, regression, and clustering problems rely on kernels for determining the similarity of a pair of patterns. These kernels are typically defined for patterns that can be represented as vectors of real numbers. For example, the linear kernel, the radial basis kernel, and the polynomial kernel all measure the similarity of a pair of real vectors. Such kernels are suitable when the data are best represented in this way, as a sequence of real numbers. The choice of kernel corresponds to the choice of a representation of the data in a feature space. In many applications, the patterns have a greater degree of structure. This structure can be exploited to improve the performance of the learning algorithm. Examples of structured data types common in machine learning applications include strings, documents, trees, graphs (such as websites or chemical molecules), signals (such as microarray expression profiles), spectra, images, spatiotemporal data, relational data, and biochemical concentrations.
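The three kernels named above can be written down directly for real vectors. The following sketch uses their textbook forms; the parameter defaults (`degree`, `coef0`, `gamma`) and the example vectors are arbitrary choices made for illustration.

```python
import math

def linear_kernel(u, v):
    """Inner product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """(u.v + coef0) ** degree."""
    return (linear_kernel(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian radial basis kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

u = [1.0, 2.0]
v = [2.0, 0.0]
print(linear_kernel(u, v))      # 2.0
print(polynomial_kernel(u, v))  # 9.0
print(rbf_kernel(u, u))         # 1.0
```

Each function returns a larger value for more similar vectors; the radial basis kernel reaches its maximum of 1.0 when the two vectors coincide.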

Classification systems have been used in the medical field. For example, methods have been proposed for diagnosing and predicting the occurrence of medical conditions using various computer systems and classification systems, such as support vector machines. See, for example, U.S. Patent Nos. 7,321,881; 7,467,119; 7,505,948; 7,617,163; 7,676,442; 7,702,598; 7,707,134; and 7,747,547. The methods described in these patents have not been shown to provide consistently high levels of accuracy in diagnosing and/or predicting lung diseases such as non-small cell lung cancer. It would be desirable to develop methods for determining the presence of lung cancer early in the progression of the disease, as well as methods for diagnosing non-small cell lung cancer before the earliest onset of clinically significant symptoms.

Summary of the preferred embodiments of the invention

The present invention provides a classification system that uses a robust method of evaluating a set of biomarkers of a subject using various classifiers, such as a random forest. Based in part on this classification, the inventors have developed a method of physiological characterization of a subject, the method comprising first obtaining a physiological sample from the subject; then determining a biomarker metric for each of a plurality of biomarkers in the sample; and finally classifying the sample based on the biomarker metrics using a classification system, wherein the classification of the sample is associated with a physiological state or condition, or a change in the disease state, of the subject. Typically, the classification system comprises a machine learning system, such as a classification system based on classification and regression trees. The inventors' method of physiological characterization provides a diagnostic indication of the presence or absence of non-small cell lung cancer, or of the stage of development of non-small cell lung cancer (e.g., early stage (stage I)), in a subject.

For each subject for whom biomarker metrics are obtained, the biomarker metrics are typically arranged in a vector. In addition to the specific biomarker metrics, each vector may include other information associated with the subject, including gender, age, smoking history, metrics of other biomarkers, other characteristics of the subject's health history, and the like. The set of training vectors may comprise at least 30 vectors, at least 50 vectors, or at least 100 vectors.

In a preferred mode of any of the embodiments described herein, a human subject is considered NSCLC positive if any replicate sample from the subject is classified as positive by any one, any two, any three, any four, any five, any six, any seven, or any eight classifiers (up to all classifiers). In a preferred mode of any of the embodiments described herein, a subject may be considered positive if multiple replicates are classified as positive by a single classifier (e.g., all replicates for each classifier, two or more replicates for a single classifier, or three replicates for a single classifier), or if multiple replicates are classified as positive across all of the classifiers used (e.g., two, three, or four replicates across the classifiers in a set of classifiers). In a preferred mode of any of the embodiments described herein, accuracy, sensitivity, specificity, and positive and negative predictive values are determined on the test dataset for each possible positive count (i.e., from zero to the number of classifiers multiplied by the number of replicates). In a preferred mode of any of the embodiments described herein, the number of positive replicates and/or classifiers required to return a positive result can then be determined based on the accuracy, sensitivity, specificity, and positive and negative predictive values so examined. In a preferred mode of any of the embodiments shown herein, the accuracy, sensitivity, specificity, positive predictive value and/or negative predictive value is higher than 0.7. In a preferred mode of any of the embodiments shown herein, the accuracy, sensitivity, specificity, positive predictive value and/or negative predictive value is higher than 0.8.
In a preferred mode of any of the embodiments shown herein, at least one, more preferably two or more of accuracy, sensitivity, specificity, positive predictive value and/or negative predictive value is higher than 0.9. In a preferred mode of any of the embodiments shown herein, at least one, more preferably two or more of accuracy, sensitivity, specificity, positive predictive value and negative predictive value is above 0.95. In a preferred mode of any of the embodiments shown herein, at least one, more preferably two or more of accuracy, sensitivity, specificity, positive predictive value and negative predictive value is higher than 0.98.
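The positive-count rule and the performance measures described above can be sketched as follows. This is an illustrative reading, not the inventors' implementation: the nested-list layout (`calls[c][r]` for classifier c and replicate r), the function names, and the four synthetic subjects are all assumptions made for the example.

```python
def positive_count(calls):
    """Count positive calls, where calls[c][r] is True when classifier c
    labels replicate r as NSCLC positive."""
    return sum(1 for per_classifier in calls for call in per_classifier if call)

def subject_is_positive(calls, min_positive=1):
    """Call the subject positive when enough classifier/replicate calls are positive."""
    return positive_count(calls) >= min_positive

def performance(predicted, actual):
    """Accuracy, sensitivity, specificity, PPV, and NPV from paired booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

def sweep_thresholds(all_calls, actual):
    """Evaluate every positive-count threshold from zero up to
    (number of classifiers) x (number of replicates)."""
    max_count = len(all_calls[0]) * len(all_calls[0][0])
    return {
        k: performance([subject_is_positive(c, k) for c in all_calls], actual)
        for k in range(max_count + 1)
    }

# Synthetic data: four subjects, two classifiers, two replicates each.
all_calls = [
    [[True, True], [True, False]],     # diagnosed NSCLC, 3 positive calls
    [[True, False], [False, False]],   # diagnosed NSCLC, 1 positive call
    [[False, True], [False, False]],   # no NSCLC, 1 positive call
    [[False, False], [False, False]],  # no NSCLC, 0 positive calls
]
actual = [True, True, False, False]
results = sweep_thresholds(all_calls, actual)
```

With these synthetic calls, a threshold of one positive call gives sensitivity 1.0 and specificity 0.5, while a threshold of two gives sensitivity 0.5 and specificity 1.0, which is the kind of trade-off the threshold selection step has to weigh.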

Embodiments of the present invention may be used in an enhanced method of screening a human subject to determine whether the subject is likely to have NSCLC, the enhancement comprising classifying test data from the human subject using a method according to any of the embodiments of the present invention, wherein the human subject exhibits at least one lung nodule detectable by computed tomography scanning. Another use of an embodiment of the present invention is to provide an enhanced method of screening a human subject to determine whether the subject is likely to have NSCLC, wherein human subjects classified as positive for NSCLC using a method of the present invention are further tested for lung nodules by low-dose computed tomography.

In one mode, the present invention provides a method of classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for respective items in a set of biomarkers, the method comprising: (a) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject; (b) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising biomarker metrics for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and (c) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or development of NSCLC in the subject based on the evaluating step, wherein the biomarker set comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

In another mode, the invention provides a method of classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for each item in a set of biomarkers, the method comprising: (i) accessing, using at least one processor, a set of electronically stored training data vectors, each training data vector representing an individual human and comprising a biomarker metric for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; (ii) training an electronic representation of a classification system using the electronically stored set of training data vectors; (iii) receiving, on at least one processor, test data comprising a plurality of biomarker metrics for a set of biomarkers of a human test subject; (iv) evaluating, using at least one processor, the test data, the evaluating being performed using the electronic representation of the classification system; and (v) outputting a classification of the human test subject, the classification being a classification as to the likelihood of the presence or development of non-small cell lung cancer in the subject based on the evaluating step, wherein the biomarker set comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.
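Steps (i) through (v) above follow the usual train-then-classify pattern. The sketch below walks through them with a k-nearest neighbors classifier, chosen only because it is one of the classification systems named in this document and fits in a few lines of plain Python. The class name, the three-value vectors, and every number are synthetic placeholders rather than real biomarker data, and the sketch assumes the metrics have already been brought to comparable scales.

```python
import math

class KNearestNeighbors:
    """Minimal k-NN classifier over real-valued biomarker vectors."""

    def __init__(self, k=3):
        self.k = k
        self.vectors = []
        self.labels = []

    def fit(self, vectors, labels):
        """Steps (i)-(ii): store the labeled training data vectors."""
        self.vectors = [list(v) for v in vectors]
        self.labels = list(labels)
        return self

    def predict_proba(self, x):
        """Step (iv): fraction of the k nearest training vectors labeled 1."""
        nearest = sorted(
            zip(self.vectors, self.labels),
            key=lambda pair: math.dist(pair[0], x),
        )[: self.k]
        return sum(1 for _, label in nearest if label == 1) / len(nearest)

    def predict(self, x):
        """Step (v): output 1 (likely NSCLC) or 0."""
        return 1 if self.predict_proba(x) >= 0.5 else 0

# Synthetic training vectors; label 1 = diagnosed NSCLC, 0 = not diagnosed.
training_vectors = [
    [1.0, 1.2, 0.9], [0.8, 1.0, 1.1], [1.1, 0.9, 1.0],
    [3.0, 2.8, 3.2], [3.1, 3.3, 2.9], [2.9, 3.0, 3.1],
]
training_labels = [0, 0, 0, 1, 1, 1]

model = KNearestNeighbors(k=3).fit(training_vectors, training_labels)

# Step (iii): receive test data for one hypothetical subject.
test_vector = [2.8, 3.1, 3.0]
print(model.predict(test_vector))  # 1
```

Any of the other listed classification systems could take the place of `KNearestNeighbors` here, provided it exposes the same fit and predict steps.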

In a preferred embodiment, the test data comprise two or more replicate data vectors, each comprising an individual determination of the biomarker metrics for a plurality of biomarkers in a physiological sample from a human subject, in which case the sample may be classified as likely to have or develop NSCLC if any of the replicate data vectors is classified as positive for NSCLC by any of the classifiers in the classification system. Optionally, the test data and each training data vector further comprise at least one other feature selected from the group consisting of: the sex, race, ethnicity and/or nationality, age, and smoking status of the individual.

The biomarker sets of the various modes of the invention may comprise 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 or 33 biomarkers.
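A panel's composition can be checked against the candidate group before any classification is attempted. The sketch below counts how many members of a proposed panel come from the 33-marker group recited in this document and compares that count with a minimum such as nine. The function name and the example panel are hypothetical, and the spellings PDGF-AB/BB and sEGFR are the standard abbreviations assumed for the 28th and 29th entries of that group.

```python
CANDIDATE_BIOMARKERS = frozenset({
    "IL-8", "MMP-9", "sTNFRII", "TNFRI", "MMP7", "IL-5", "resistin",
    "IL-10", "MPO", "NSE", "MCP-1", "GRO", "CEA", "leptin", "CXCL9",
    "HGF", "sCD40L", "CYFRA-21-1", "sFasL", "RANTES", "IL-7", "MIF",
    "sICAM-1", "IL-2", "SAA", "IL-16", "IL-9", "PDGF-AB/BB", "sEGFR",
    "LIF", "IL-12p70", "CA125", "IL-4",
})

def panel_meets_minimum(panel, minimum=9):
    """Return True when at least `minimum` distinct members of the panel
    are drawn from the candidate group. Extra markers are ignored, since
    a set that "comprises" the required markers may include others."""
    from_group = set(panel) & CANDIDATE_BIOMARKERS
    return len(from_group) >= minimum

# Hypothetical nine-marker panel.
example_panel = ["IL-8", "MMP-9", "sTNFRII", "TNFRI", "MMP7",
                 "resistin", "MPO", "NSE", "CEA"]
print(panel_meets_minimum(example_panel))  # True
```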

The biomarker metric may be proportional to the respective concentration levels, in the physiological sample, of biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, CYFRA-21-1, MIF, sICAM-1, SAA, or combinations thereof, the physiological sample being a biological fluid. Alternatively, the biomarker metric may be proportional to the respective concentration levels of biomarkers selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, IL-10, and NSE. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, and IL-10. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, resistin, MPO, NSE, GRO, CEA, CXCL9, MIF, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, or a combination thereof, and the physiological sample is a biological fluid. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, and IL-2. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, and leptin. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO, CEA, CXCL9, IL-2, SAA, PDGF-AB/BB, or a combination thereof, and the physiological sample is a biological fluid. In another further embodiment, the biomarker metric is proportional to the respective concentration level of a biomarker selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, and MMP-7.

The method of the invention may further comprise determining the biomarker metrics in a physiological sample from the subject. Typically, the various biomarkers are peptides, proteins, peptides and proteins carrying post-translational modifications, or combinations thereof, and the biological fluid is blood, serum, plasma, or a mixture thereof. In a preferred form of any mode of the invention, the classification system is a random forest, and preferably the random forest classifier comprises 5, 10, 15, 20, 25, 30, 40, 50, 75, or 100 individual trees.
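To give a feel for how an ensemble of individual trees votes, the sketch below trains bagged decision stumps (one-split trees), each fitted to a bootstrap sample and a single randomly chosen feature. This is a deliberately simplified stand-in and not a full random forest, which grows deeper trees and samples candidate features at every split. All names, the seed, and the two-feature synthetic data are assumptions made for illustration.

```python
import random

def train_stump(rows, labels, feature):
    """Pick the threshold and direction on one feature that misclassify
    the fewest rows; returns (feature, threshold, sign)."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        for sign in (1, -1):
            errors = sum(
                1 for r, y in zip(rows, labels)
                if (1 if sign * (r[feature] - t) >= 0 else 0) != y
            )
            if best is None or errors < best[0]:
                best = (errors, feature, t, sign)
    return best[1:]

def stump_predict(stump, row):
    feature, t, sign = stump
    return 1 if sign * (row[feature] - t) >= 0 else 0

def train_ensemble(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(len(rows[0]))       # one random feature per stump
        ensemble.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return ensemble

def ensemble_vote(ensemble, row):
    """Fraction of stumps voting for class 1."""
    return sum(stump_predict(s, row) for s in ensemble) / len(ensemble)

# Synthetic, well-separated data: label 1 = diagnosed NSCLC.
rows = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0], [0.9, 0.8],
        [3.0, 3.1], [3.2, 2.9], [2.8, 3.0], [3.1, 3.2]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = train_ensemble(rows, labels, n_trees=25)
high_vote = ensemble_vote(ensemble, [3.5, 3.5])
low_vote = ensemble_vote(ensemble, [0.5, 0.5])
```

The vote fraction can be read as a rough likelihood score: on this synthetic data the clearly elevated vector draws a majority of positive votes and the clearly low vector does not. Changing `n_trees` corresponds to choosing among the tree counts listed above.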

Typically, in the methods of the invention, the subject is a human, either female or male. In a preferred embodiment of the invention, the subject exhibits at least one lung nodule detectable by a computed tomography scan. For example, the method may further comprise testing the lung nodule by low dose computed tomography. In other embodiments, the subject is at risk for having NSCLC, and/or the method may further comprise the step of treating NSCLC in said subject. In a particularly preferred embodiment of the invention, the subject (or patient) is a long-term smoker, is 45 years old or older, has been diagnosed with indeterminate nodules in the lungs, or a combination thereof.

In a particularly preferred mode, the present invention provides a method of classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for each item in a set of biomarkers, the method comprising: (a) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject; (b) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each of the classifiers having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising biomarker metrics for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and (c) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or development of NSCLC in the subject based on the evaluating step, wherein the biomarker set comprises at least eight (8) biomarkers selected from the group consisting of: IL-8, sTNFRII, MMP-9, TNFRI, CXCL9-MIG, resistin, SAA, MPO, PDGF-AB-BB, MMP-7, GRO, MIF, MCP-1, CEA, CYFRA-21-1, leptin, IL-2, IL-10, and NSE.

In other modes, the invention provides a system for classifying test data, the test data comprising a plurality of biomarker metrics, the biomarker metrics being biomarker metrics for respective items in a set of biomarkers, the system comprising: at least one processor coupled to an electronic storage means comprising an electronic representation of a classifier that has been trained using an electronically stored set of training data vectors, the processor being configured to receive test data comprising a plurality of biomarker metrics for a set of biomarkers in a human test subject according to any one of the preceding claims, the at least one processor being further configured to evaluate the test data using the electronic representation of the one or more classifiers and to output a classification of the human test subject based on the evaluation result, wherein the set of biomarkers comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4. Alternatively, the present invention provides a non-transitory computer readable storage medium having an executable program stored thereon, wherein the program instructs a microprocessor to perform the steps of: (i) receiving a biomarker metric for a plurality of biomarkers in a physiological sample of a subject; and (ii) classifying, using a classification system and at least one processor, the sample based on the biomarker metric, wherein the classification of the sample is indicative of a likelihood of the presence or progression of non-small cell lung cancer (NSCLC) in the subject, wherein the biomarker set comprises at least nine (9) biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4.

The method of the present invention may further comprise: (a) obtaining a physiological sample from a subject; and (b) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4. The method may comprise measuring a set of at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 21 biomarkers in the sample. The biomarker metric may be indicative of non-small cell lung cancer. The biomarker metric may be indicative of early stage non-small cell lung cancer, preferably stage I. In various embodiments, the subject may be at risk for non-small cell lung cancer.

The method of the invention may further comprise measuring, in a physiological sample obtained from the subject, a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4, to produce biomarker metrics. The method may comprise measuring a set of at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 21 biomarkers in the sample. The biomarker metric may be indicative of non-small cell lung cancer. The biomarker metric may be indicative of early stage non-small cell lung cancer, preferably stage I. In various embodiments, the subject may be at risk for non-small cell lung cancer.

In various embodiments, the biomarker metric may be measured by: radioimmunoassay, enzyme-linked immunosorbent assay (ELISA), Q-Plex™ multiplex assays, liquid chromatography-mass spectrometry (LCMS), flow cytometry multiplex immunoassays, high-pressure liquid chromatography with radiometric or spectroscopic detection by absorbance of visible or ultraviolet light, mass spectroscopic and quantitative analysis, western blotting, one- or two-dimensional gel electrophoresis with quantitative visualization by detection of radioactive, fluorescent, or chemiluminescent probes or nuclei, antibody-based detection with absorbance or fluorescence photometry, quantification by luminescence of any of a variety of chemiluminescent reporter systems, enzymatic assays, immunoprecipitation or immunocapture assays, solid- and liquid-phase immunoassays, quantitative multiplex immunoassays, protein arrays or chips, plate assays, printed array immunoassays, or combinations thereof. In a preferred embodiment, the biomarker metric may be measured by immunoassay.

The invention also provides a method for diagnosing stage I non-small cell lung cancer, comprising: (a) obtaining a physiological sample from a subject; (b) measuring a set of four to thirty-three biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4; (c) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject; (d) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual human and comprising biomarker metrics for each biomarker in the set of biomarkers for the corresponding human, each training data vector further comprising a classification as to whether the corresponding human has been diagnosed with NSCLC; and (e) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or progression of NSCLC in the subject based on the evaluating step. In some embodiments, the classification system may be selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machine, LASSO, ridge regression, neural network, genetic algorithm, elastic net, gradient boosting tree, Bayesian neural network, k-nearest neighbors, or a combination thereof. The biomarker may be a peptide, a protein, a peptide and a protein carrying a post-translational modification, or a combination thereof. The physiological sample may be whole blood, plasma, serum, or a combination thereof.

The present invention also provides a method for diagnosing stage I non-small cell lung cancer, comprising: (a) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4; (b) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject; (c) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual and comprising biomarker metrics for each biomarker in the corresponding set of biomarkers, each training data vector further comprising a classification as to whether the corresponding individual has been diagnosed with NSCLC; and (d) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or progression of NSCLC in the subject based on the evaluating step. In some embodiments, the classification system may be selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machine, LASSO, ridge regression, neural network, genetic algorithm, elastic net, gradient boosting tree, Bayesian neural network, k-nearest neighbor, or a collection thereof. The biomarker may be a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof. The physiological sample may be whole blood, plasma, serum or a combination thereof.

In many embodiments, a method for detecting a plurality of biomarkers can comprise: (a) obtaining a physiological sample from a subject; and (b) measuring a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4. The biomarker metric may be indicative of non-small cell lung cancer. The biomarker metric may be indicative of early stage non-small cell lung cancer, optionally stage I non-small cell lung cancer. The biomarker metric may not be indicative of asthma, breast cancer, prostate cancer, pancreatic cancer, or a combination thereof. In many embodiments, the subject may be at risk for non-small cell lung cancer.

In many embodiments, a method for detecting a plurality of biomarkers can include measuring, in a physiological sample obtained from a subject, a set of at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4. The biomarker metric may be indicative of non-small cell lung cancer. The biomarker metric may be indicative of early stage non-small cell lung cancer, optionally stage I non-small cell lung cancer. The biomarker metric may not be indicative of asthma, breast cancer, prostate cancer, pancreatic cancer, or a combination thereof. In many embodiments, the subject may be at risk for non-small cell lung cancer.

The set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, SAA, IL-2, and PDGF-AB/BB. The set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, and SAA. The set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO-Pan, CEA, CXCL9/MIG, SAA, IL-2, and PDGF-AB/BB.

In various embodiments, the set can include at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 21 biomarkers.

In various embodiments, the biomarker can be a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

In various embodiments, the physiological sample can be whole blood, plasma, serum, or a combination thereof.

In various embodiments, the method may further comprise: (a) receiving, on at least one processor, test data comprising biomarker metrics for each biomarker in a set of biomarkers in a physiological sample from a human test subject; (b) evaluating, using at least one processor, the test data, the evaluating being performed using classifiers that are electronic representations of a classification system, each classifier having been trained using a set of electronically stored training data vectors, each training data vector representing an individual and comprising biomarker metrics for each biomarker in the corresponding set of biomarkers, each training data vector further comprising a classification as to whether the corresponding individual has been diagnosed with NSCLC; and (c) outputting, using at least one processor, a classification of the sample from the human test subject, the classification being a classification as to the likelihood of the presence or progression of NSCLC in the subject based on the evaluating step.

In many preferred embodiments, the classification system may be selected from one or more of: random forest, AdaBoost, naive Bayes, support vector machine, LASSO, ridge regression, neural network, genetic algorithm, elastic net, gradient boosting tree, Bayesian neural network, k-nearest neighbor, or a collection thereof.

The present invention also provides a method for determining the presence of non-small cell lung cancer early in the development of the disease by measuring the expression levels of a set of biomarkers in a subject, comprising: determining a biomarker metric for a biomarker set in a physiological sample by immunoassay, wherein the biomarker set comprises at least four biomarkers selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO, CEA, leptin, CXCL9, HGF, sCD40L, CYFRA-21-1, sFasL, RANTES, IL-7, MIF, sICAM-1, IL-2, SAA, IL-16, IL-9, PDGF-AB/BB, sEGFR, LIF, IL-12p70, CA125, and IL-4; and classifying said sample for the presence or development of non-small cell lung cancer in said subject using said biomarker metric in a classification system.

In many embodiments, the set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, SAA, IL-2, and PDGF-AB/BB.

In many embodiments, the set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, and SAA.

In many embodiments, the set of at least four biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO-Pan, CEA, CXCL9/MIG, SAA, IL-2, and PDGF-AB/BB.

In any of the preceding embodiments, the set may comprise at least 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 19 biomarkers.

In any of the preceding embodiments, the classification system may be selected from the group consisting of: random forest, AdaBoost, naive Bayes, support vector machine, LASSO, ridge regression, neural network, genetic algorithm, elastic net, gradient boosting tree, Bayesian neural network, k-nearest neighbor, or a collection thereof.

In any of the preceding embodiments of the invention, the biomarker may be a peptide, a protein, a peptide carrying a post-translational modification, a protein carrying a post-translational modification, or a combination thereof.

In any of the preceding embodiments of the invention, the physiological sample may be whole blood, plasma, serum or a combination thereof.

In any of the preceding embodiments of the invention, the biological fluid may be whole blood, plasma, serum, sputum, urine, sweat, lymph and alveolar lavage fluid.

The methods and systems provided herein are generally capable of diagnosing and prognosing lung lesions (e.g., cancerous) with an accuracy of over 90% (e.g., total correct rate over total test volume). These results provide significant advances over currently available methods for diagnosing and prognosing non-small cell lung cancer.

Background

Brief description of the drawings

FIGS. 1A-B depict the ROC curves for 33, 19 and 13 biomarkers. This indicates that these two models have a good ability to discriminate between NSCLC (fig. 1A) and non-NSCLC cancers (fig. 1B).

Detailed Description

The present invention relates to various methods for detecting, identifying and diagnosing lung disease using biomarkers. The methods involve determining biomarker metrics for specific biomarkers and using these biomarker metrics in a classification system to determine the likelihood that an individual has non-small cell lung cancer. The invention also provides kits comprising detection agents for detecting these biomarkers, or means for determining biomarker metrics for these biomarkers, as components of a system for aiding in the determination of the likelihood of non-small cell lung cancer. Exemplary biomarkers were identified by measuring the expression levels of eighty-two selected biomarkers in the plasma of patients from a population that has demonstrated diagnostic potential for early stage lung cancer. This process is detailed in Example 1.

Described herein is an in vitro diagnostic multivariate index assay (IVDMIA) employing an algorithm that uses multiple protein biomarkers and patient demographics to generate a qualitative single-score classification of "yes" or "no" for the presence of early stage non-small cell lung cancer. The IVDMIA test described in this example can be used as a secondary risk stratification model for patients who are found during a preliminary diagnostic test (e.g., a CT scan) to have lung nodules, when it is unclear whether the nodules are cancerous. This test may assist the physician in selecting an appropriate subsequent non-small cell lung cancer (NSCLC) diagnostic procedure. For example, this test can be used to screen individuals at high risk for developing non-small cell lung cancer, such as smokers over the age of forty-five.

Definitions

As used herein, "biomarker" or "marker" refers broadly to a biological molecule that can be objectively measured as a characteristic indicator of a physiological state of a biological system. For the purposes of this disclosure, biomolecules include ions, small molecules, peptides, proteins, peptides and proteins that carry post-translational modifications, nucleosides, nucleotides and polynucleotides, including RNA and DNA, glycoproteins, lipoproteins, and various covalent and non-covalent modifications of these types of molecules. Biomolecules include any of these entities that are native, characteristic, and/or essential to the function of a biological system. Most of the biomarkers are polypeptides, although they may also be mRNAs or modified mRNAs, which represent the pre-translational form of the gene product expressed as a polypeptide, or they may include post-translational modifications of the polypeptide.

The term "biomarker metric" as used herein generally refers to information relating to a biomarker that can be used to characterize the presence or absence of a disease. Such information may include measured values that are proportional to concentration or otherwise provide a qualitative or quantitative indication of biomarker expression in a tissue or biological fluid. Each biomarker may be represented as a dimension in a vector space, where each vector is a multi-dimensional vector in the vector space and includes a plurality of biomarker metrics associated with a particular subject.

"classifier" as used herein refers broadly to machine learning algorithms such as support vector machines, AdaBoost classifiers, penalized logistic regression, elastic nets, regression tree systems, gradient tree boosting systems, naïve Bayes, neural nets, Bayesian neural networks, k-nearest neighbor classifiers, and random forests. The present invention contemplates methods using any of the listed classifiers and uses of combinations of more than one classifier.

As used herein, "classification system" broadly refers to a machine learning system that executes at least one classifier.

As used herein, a "subset" is a proper subset, and a "superset" is a proper superset.

The term "subject" as used herein broadly refers to any animal, but preferably a mammal, e.g., a human. In many embodiments, the subject is a human patient having or at risk of having a lung disease.

As used herein, "physiological sample" broadly refers to samples from biological fluids and tissues. Biological fluids include whole blood, plasma, serum, sputum, urine, sweat, lymph and alveolar lavage. Tissue samples include biopsies from solid lung tissue or other solid tissues, lymph node biopsies, biopsies of metastases. Methods of obtaining physiological samples are described in the art.

As used herein, "detection agent" refers broadly to reagents and systems that specifically detect the biomarkers described herein. Detection agents include reagents, such as antibodies, nucleic acid probes, aptamers, lectins, or other reagents, having a particular affinity for a particular marker or markers sufficient to distinguish the particular marker from other markers that may be present in a sample of interest, and systems, such as sensors, including sensors that utilize bound or otherwise immobilized reagents as described above.

"classification and regression tree (CART)" as used herein generally refers to a method of creating decision trees based on recursively partitioning a data space in order to optimize certain metrics, typically model performance.

As used herein, "AdaBoost" generally refers to a boosting method that iteratively fits CARTs, reweighting observations according to the errors made at the previous iteration.

As used herein, "False Positive (FP)" broadly refers to an error in which an algorithmic test result indicates the presence of disease when it is in fact absent.

As used herein, "False Negative (FN)" broadly refers to an error in which the result of an algorithmic test indicates the absence of disease when it is in fact present.

"genetic algorithm" as used herein broadly refers to an algorithm that mimics genetic mutation in order to optimize a function (e.g., model performance).

As used herein, "intra-assay precision" reflects the reproducibility of an assay using in-plate measurements of individual plasma samples. The intra-assay %CV is calculated by dividing the standard deviation (SD) of all replicates of an individual plasma sample by the mean (M) MFI of all replicates and multiplying by 100; %CV = 100 × (SD/M). Lower concentrations may result in poorer precision.

As used herein, "inter-assay precision" reflects the reproducibility of the assay using measurements of individual plasma samples from different plates, days, and operators. The inter-assay %CV is calculated by dividing the standard deviation (SD) of all replicates from all runs of an individual plasma sample by the mean (M) MFI of all replicates and multiplying by 100; %CV = 100 × (SD/M). Lower concentrations may result in poorer precision.

As used herein, the "L1 norm" is the sum of the absolute values of the elements of a vector.

As used herein, the "L2 Norm (Norm)" is the square root of the sum of the squares of the vector elements.
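The two norm definitions above can be illustrated with a minimal Python sketch. The function names and the example coefficient vector are hypothetical and chosen only for illustration.

```python
import math

def l1_norm(v):
    """Sum of the absolute values of the vector elements."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Square root of the sum of the squares of the vector elements."""
    return math.sqrt(sum(x * x for x in v))

coefficients = [3.0, -4.0, 0.0]  # hypothetical regression coefficient vector
print(l1_norm(coefficients))  # 7.0
print(l2_norm(coefficients))  # 5.0
```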

As used herein, "limit of detection" (LOD) is calculated as the mean (M) of the median measurements of the blank plus two standard deviations (SD); LOD = M + 2SD. This value is less than or equal to LLOQ and is not necessarily quantifiable.

As used herein, "lower limit of quantitation (LLOQ)" is the lowest concentration of an analyte in a sample that can be quantitatively determined with suitable precision and accuracy. In most cases, LLOQ exceeds LOD, but the two values may be equal. The parameters determined for LLOQ were within 20% CV and ± 20% (80-120%) recovery.

As used herein, "percent coefficient of variation (% CV)" is calculated as follows: standard Deviation (SD) is divided by mean (M) and expressed as a percentage.
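As an illustration of the %CV and LOD calculations defined above, the following Python sketch may be helpful. The replicate and blank values are hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption, since the text does not specify which is intended.

```python
import statistics

def percent_cv(replicates):
    """%CV = 100 * SD / M for a group of replicate measurements."""
    mean = statistics.mean(replicates)
    sd = statistics.stdev(replicates)  # sample SD (assumption; not specified in the text)
    return 100.0 * sd / mean

def limit_of_detection(blank_measurements):
    """LOD = M + 2 * SD of the blank measurements."""
    mean = statistics.mean(blank_measurements)
    sd = statistics.stdev(blank_measurements)  # sample SD (assumption)
    return mean + 2.0 * sd

replicate_mfi = [98.0, 100.0, 102.0]  # hypothetical replicate MFI values
blank_mfi = [10.0, 12.0, 14.0]        # hypothetical blank MFI values
cv = percent_cv(replicate_mfi)
lod = limit_of_detection(blank_mfi)
```

With these hypothetical inputs the replicate group has a mean of 100 and an SD of 2, so the %CV is 2; the blank has a mean of 12 and an SD of 2, so the LOD is 16.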

As used herein, "Negative Predictive Value (NPV)" is the number of True Negatives (TN) divided by the number of True Negatives (TN) plus the number of False Negatives (FN); NPV = TN/(TN + FN).

As used herein, "Positive Predictive Value (PPV)" is the number of True Positives (TP) divided by the number of True Positives (TP) plus the number of False Positives (FP); PPV = TP/(TP + FP).

As used herein, "precision" is used to indicate the degree of dispersion (spread) between a series of measurements and includes repeatability (intra-assay) and reproducibility (inter-assay).

As used herein, "Perceptron" refers to a method of separating a set of observations based on the dot product of a vector of observations and a set of weights.
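A minimal Python sketch of the perceptron decision rule described above follows. The bias term, the zero threshold, and the example weights are assumptions introduced for illustration and are not taken from the disclosure.

```python
def perceptron_predict(observation, weights, bias=0.0):
    """Return 1 if the dot product of observation and weights, plus bias, is positive; else 0."""
    activation = sum(x * w for x, w in zip(observation, weights)) + bias
    return 1 if activation > 0 else 0

weights = [0.5, -0.5]  # hypothetical weights
print(perceptron_predict([2.0, 1.0], weights))  # 1
print(perceptron_predict([1.0, 2.0], weights))  # 0
```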

As used herein, "Neural Net" refers to a classification method that links together perceptron-like objects to create a classifier.

"LASSO" as used herein generally refers to a method for performing linear regression with L1 norm constraints on the regression coefficient vector.

As used herein, "random forest" broadly refers to a bagging method that fits CARTs to bootstrap samples drawn from the dataset on which the model is trained.

As used herein, "ridge regression" broadly refers to a method for performing linear regression with constraints on the L2 norm of the regression coefficient vector.

As used herein, "elastic net" broadly refers to a method for performing linear regression with a constraint comprising a linear combination of the L2 norm and the L1 norm of the regression coefficient vector.
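The LASSO, ridge regression, and elastic net definitions differ only in the norm that is constrained. One way to write this is the constrained form sketched below. The response vector $y$, design matrix $X$, coefficient vector $\beta$, bound $t$, and mixing weight $\alpha \in [0, 1]$ are notation assumed here for illustration. The elastic net is written with the squared L2 norm, which is a common convention rather than something stated in the text.

```latex
\begin{aligned}
\text{LASSO:} \quad & \min_{\beta}\ \lVert y - X\beta \rVert_2^2
  \quad \text{subject to} \quad \lVert \beta \rVert_1 \le t \\
\text{Ridge regression:} \quad & \min_{\beta}\ \lVert y - X\beta \rVert_2^2
  \quad \text{subject to} \quad \lVert \beta \rVert_2 \le t \\
\text{Elastic net:} \quad & \min_{\beta}\ \lVert y - X\beta \rVert_2^2
  \quad \text{subject to} \quad \alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \le t
\end{aligned}
```

Under this notation, setting $\alpha = 1$ in the elastic net recovers the LASSO constraint, and setting $\alpha = 0$ gives a ridge-type constraint on the squared L2 norm.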

As used herein, "sensitivity" is the probability of a positive result in a patient who has NSCLC. Sensitivity is calculated by dividing the number of True Positives (TP) by the total number of actual NSCLC patients, that is, the number of True Positives (TP) plus the number of False Negatives (FN); sensitivity = TP/(TP + FN).

As used herein, "specificity" is the probability of a negative result in a patient who does not have NSCLC. Specificity is calculated by dividing the number of True Negatives (TN) by the total number of actual non-NSCLC patients, that is, the number of True Negatives (TN) plus the number of False Positives (FP); specificity = TN/(TN + FP).

As used herein, "Standard Deviation (SD)" is the degree of dispersion (spread) in individual data points (i.e., in the replicate group) to reflect the uncertainty of a single measurement.

As used herein, a "training set" is a set of samples used to train and develop a machine learning system, such as the algorithm of the present invention.

As used herein, "True Negative (TN)" is the result of an algorithmic test that indicates absence of disease when it is in fact absent.

As used herein, "True Positive (TP)" is the result of an algorithmic test that indicates the presence of disease when it is actually present.
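The four performance measures defined in this section (sensitivity, specificity, positive predictive value, and negative predictive value) can all be derived from the TP, FP, TN, and FN counts. The Python sketch below shows one way to do so; the function name and the counts are hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the four performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TP / (TP + FN)
        "specificity": tn / (tn + fp),  # TN / (TN + FP)
        "ppv": tp / (tp + fp),          # TP / (TP + FP)
        "npv": tn / (tn + fn),          # TN / (TN + FN)
    }

# Hypothetical counts: 50 NSCLC patients and 100 non-NSCLC subjects.
metrics = diagnostic_metrics(tp=45, fp=10, tn=90, fn=5)
```

With these hypothetical counts, sensitivity and specificity are both 0.9, while the PPV (45/55) and NPV (90/95) differ because they depend on the proportion of diseased subjects in the tested group.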

As used herein, "upper limit of quantitation (ULOQ)" is the highest concentration of an analyte in a sample that can be quantitatively determined with suitable precision and accuracy. The parameters determined for ULOQ were within 20% CV and ± 20% (80-120%) recovery.

As used herein, a "validation set" is a sample set that is a blind study and is used to confirm the functionality of an algorithm developed according to the present invention. This is also referred to as the blind test set.

Determining biomarker metrics

Biomarker metrics are information that is typically related to a quantitative measure of an expression product (typically a protein or polypeptide). The present invention contemplates determining biomarker metrics at the protein level (which may include post-translational modifications). In particular, the invention contemplates determining changes in biomarker concentration, reflected in an increase or decrease in the level of transcription, translation, post-transcriptional modification, or degree or extent of protein degradation, wherein such changes are associated with a particular disease state or disease progression.

Many proteins expressed by normal subjects are expressed to varying degrees (greater or lesser) in subjects with lung disease such as non-small cell lung cancer. It will be appreciated by those skilled in the art that most diseases exhibit variations in a number of different biomarkers. Thus, a disease can be characterized by the expression pattern of various markers. Determining the expression level of multiple biomarkers facilitates observing expression patterns, and such patterns provide a more sensitive and more accurate diagnosis than detecting individual biomarkers. The pattern may include abnormal elevations for some specific biomarkers, while also including abnormal reductions for other specific biomarkers.

According to the present invention, a physiological sample is collected from a subject in a manner that ensures that the amount of a biomarker in the sample is proportional to the concentration of that biomarker in the subject from which the sample was collected. The measurement is performed such that the measured value is proportional to the concentration of the biomarker in the sample. It is within the ordinary skill in the art to select sampling and measurement techniques that meet these requirements.

Those skilled in the art will appreciate that for individual biomarkers, various methods for determining biomarker metrics are known in the art. See Instrumental Methods of Analysis, seventh edition, 1988. Such assays may be performed in multiplex or matrix-based formats, such as multiplex immunoassays.

Many methods of determining biomarker metrics are known in the art. Means for such assays include, but are not limited to, radio-immunoassays, enzyme-linked immunosorbent assays (ELISAs), Q-Plex™ Multiplex assays, liquid chromatography-mass spectrometry (LCMS), flow cytometry multiplex immunoassays, high pressure liquid chromatography with radiation or spectroscopic detection by visible or ultraviolet light absorbance, mass spectrometric qualitative and quantitative analysis, western blotting, one- or two-dimensional gel electrophoresis with quantitative visualization by detecting radioactive, fluorescent or chemiluminescent probes or nuclei, antibody-based detection with absorbance or fluorometry, quantitation by luminescence of any of a variety of chemiluminescent reporter systems, enzymatic assays, immunoprecipitation or immunocapture assays, solid and liquid phase immunoassays, protein arrays or chips, plate assays, assays using molecules whose binding affinity permits discrimination, such as aptamers and molecularly imprinted polymers, any other quantitative analytical determination of biomarker concentration by any other suitable technique, and instrumental actuation of any such detection technique or instrumentation. A particularly preferred method for determining biomarker metrics comprises a printed array immunoassay.

The step of determining a biomarker metric may be performed by any means known in the art, particularly those means discussed herein. In a preferred embodiment, the step of determining a biomarker metric comprises performing an immunoassay with an antibody. One skilled in the art will be readily able to select suitable antibodies for use in the present invention. The selected antibody is preferably selective for the antigen of interest (i.e., selective for a particular biomarker), has high binding specificity for the antigen, and has minimal cross-reactivity with other antigens. The ability of an antibody to bind to an antigen of interest can be determined, for example, by known methods such as enzyme-linked immunosorbent assay (ELISA), flow cytometry and immunohistochemistry. In addition, the antibody should have a relatively high binding specificity for the antigen of interest. The binding specificity of an antibody can be determined by known methods, such as immunoprecipitation or by in vitro binding assays, such as Radioimmunoassay (RIA) or ELISA. Disclosed methods are provided for selecting antibodies capable of binding an antigen of interest with high binding specificity and minimal cross-reactivity, for example, in U.S. patent No. 7,288,249.

In a preferred embodiment, a single molecule array format may be used. In this method, single protein molecules are captured and labeled on beads using standard immunosorbent assay reagents. Thousands of beads (with or without an immunoconjugate) are mixed with enzyme substrate, loaded into individual femtoliter-sized wells, and sealed with oil. The fluorophore concentration at each bead is measured to determine whether the bead has bound the target analyte. A disclosure of such a process is provided, for example, in U.S. patent No. 8,236,574.

Biomarker metrics for biomarkers indicative of lung disease may be used as input to a classification system comprising the classifiers described herein, alone or in combination. Each biomarker may be represented as a dimension in a vector space, where each vector consists of multiple biomarker metrics associated with a particular subject. Thus, the dimension of the vector space corresponds to the size of the biomarker set. The pattern of biomarker metrics for a plurality of biomarkers can be used in various diagnostic and prognostic methods. The present invention provides such methods. Exemplary methods include using classifiers such as support vector machines, AdaBoost, penalized logistic regression, regression tree systems, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, or any combination thereof.
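The vector representation described above can be sketched in Python as follows. The four biomarkers are taken from the lists in this disclosure as an example subset; the function name and the measured values are hypothetical.

```python
# Example biomarker set; each biomarker is one dimension of the vector space.
BIOMARKER_SET = ["IL-8", "MMP-9", "sTNFRII", "TNFRI"]

def to_vector(metrics_by_name, biomarker_set=BIOMARKER_SET):
    """Arrange one subject's biomarker metrics into a fixed-order vector."""
    missing = [name for name in biomarker_set if name not in metrics_by_name]
    if missing:
        raise ValueError(f"missing biomarker metrics: {missing}")
    return [float(metrics_by_name[name]) for name in biomarker_set]

# Hypothetical biomarker metrics for one subject, in arbitrary order.
subject = {"MMP-9": 310.0, "IL-8": 12.5, "TNFRI": 1.9, "sTNFRII": 4.2}
vector = to_vector(subject)
```

Fixing the order of the biomarker set ensures that the same dimension always refers to the same biomarker across training vectors and test vectors.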

Classification system

The invention relates in particular to the prediction of lung lesions as cancerous based on a plurality of continuously distributed biomarkers. For some classification systems that use classifiers (e.g., support vector machines, AdaBoost, penalized logistic regression, regression tree systems, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, or any combination thereof), the prediction may be a multi-step process (e.g., a two-step process, a three-step process, etc.).

As used herein, the classification system may include computer-executable software, firmware, hardware, or various combinations thereof. For example, the classification system may include reference to a processor and supporting data storage. Further, the classification system may be implemented on multiple devices or other components, local or remote to one another. The classification system may be implemented in a centralized system, or as a distributed system for additional scalability. Moreover, any reference to software may include a non-transitory computer readable medium that, when executed on a computer, causes the computer to perform a series of steps.

The classification system described herein may include data storage, such as network accessible storage, local storage, remote storage, or a combination thereof. Data storage may utilize a redundant array of inexpensive disks ("RAID"), tape, disk, a storage area network ("SAN"), an internet small computer system interface ("iSCSI") SAN, a Fibre Channel SAN, a common internet file system ("CIFS"), network attached storage ("NAS"), a network file system ("NFS"), or other computer accessible storage. In one or more embodiments, the data store may be a database, such as an Oracle database, a Microsoft SQL Server database, a DB2 database, a MySQL database, a Sybase database, an object-oriented database, a hierarchical database, or other database. The data store may utilize a flat file structure to store data.

In a first step, a predetermined data set is described using a classifier. This is the "learning step" and is done on the "training" data.

The training database is a computer-implemented data store reflecting a plurality of biomarker metrics for a plurality of persons, associated with a classification regarding each respective person's disease state. The format of the stored data may be a flat file, a database, a table, or any other retrievable data storage format known in the art. In an exemplary embodiment, the training data is stored as a plurality of vectors, each vector corresponding to a person and comprising a plurality of biomarker metrics for a plurality of biomarkers and a classification regarding the disease state of the person. Typically, each vector comprises an entry for each biomarker metric of the plurality of biomarker metrics. The training database may be linked to a network, such as the internet, so that its contents can be remotely retrieved by an authorized entity (e.g., a human user or a computer program). Alternatively, the training database may be located in a network-isolated computer.

In a second step (which is an optional step), the classifier is applied to a "validation" database and various measures of accuracy are observed, including sensitivity and specificity. In an exemplary embodiment, only a portion of the training database is used for the learning step, while the remainder of the training database is used as the validation database. In a third step, the biomarker metrics from the subject are submitted to a classification system that outputs a classification (e.g., disease state) calculated for the subject.
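The learning, validation, and classification steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a 1-nearest-neighbor rule stands in for whichever classifier is actually used, and the two-dimensional vectors, labels, and split of the database are hypothetical.

```python
import math

def classify_1nn(training, vector):
    """Return the class of the training vector nearest to `vector`."""
    _, label = min(training, key=lambda row: math.dist(row[0], vector))
    return label

# Hypothetical training database: (biomarker metric vector, disease classification).
database = [
    ([1.0, 1.2], "non-NSCLC"), ([0.8, 1.0], "non-NSCLC"), ([1.1, 0.9], "non-NSCLC"),
    ([3.0, 3.2], "NSCLC"), ([2.9, 3.5], "NSCLC"), ([3.3, 3.0], "NSCLC"),
]

# Step 1 (learning): a portion of the database is used to train the classifier.
training = database[0:2] + database[3:5]

# Step 2 (optional validation): the remainder is used to observe accuracy measures.
validation = [database[2], database[5]]
tp = fn = tn = fp = 0
for vector, truth in validation:
    predicted = classify_1nn(training, vector)
    if truth == "NSCLC":
        tp += predicted == "NSCLC"
        fn += predicted != "NSCLC"
    else:
        tn += predicted != "NSCLC"
        fp += predicted == "NSCLC"
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Step 3: biomarker metrics from a new subject are submitted for classification.
new_subject = [2.8, 3.1]
classification = classify_1nn(training, new_subject)
```

Holding out part of the database, as in step 2, gives sensitivity and specificity estimates from samples the classifier has not seen during the learning step.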

Several methods for classification are known in the art, including the use of classifiers such as support vector machines, AdaBoost, decision trees, Bayesian classifiers, Bayesian belief networks, naïve Bayesian classifiers, k-nearest neighbor classifiers, case-based reasoning, penalized logistic regression, neural nets, random forests or any combination thereof (see, e.g., Han J and Kamber M, 2006, Chapter 6, Data Mining: Concepts and Techniques, 2nd edition, Elsevier: Amsterdam). Any classifier or combination of classifiers may be used in the classification system, as described herein.

Classifier

There are many possible classifiers that can be applied to the data. As non-limiting examples and as described below, classifiers such as support vector machines, genetic algorithms, penalized logistic regression, LASSO, ridge regression, naive Bayes classifiers, classification trees, k-nearest neighbor classifiers, neural nets, elastic nets, Bayesian neural networks, random forests, gradient boosting trees, and/or AdaBoost may be used to classify data. As discussed herein, the data may be used to train a classifier.

Classification tree

Classification trees are easy-to-interpret classifiers with built-in feature selection. A classification tree recursively partitions the data space in a manner that maximizes the proportion of observations from one class in each subspace.

The process of recursively partitioning the data space produces a binary tree, with a test performed at each vertex. A new observation is classified by following the branches of the tree until a leaf is reached. At each leaf, a probability is assigned to the observation belonging to each class. The class with the highest probability is the class to which the new observation is assigned.
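The traversal described above can be sketched as follows. The tree structure, the thresholds, and the leaf probabilities are hypothetical values chosen only to illustrate following branches to a leaf; they are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """A vertex of a binary classification tree."""
    biomarker: Optional[str] = None      # biomarker tested at an internal vertex
    threshold: float = 0.0               # go left when value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    class_probabilities: Optional[Dict[str, float]] = None  # set only at leaves


def classify(node, observation):
    """Follow the branches until a leaf is reached, then pick the most probable class."""
    while node.class_probabilities is None:
        if observation[node.biomarker] <= node.threshold:
            node = node.left
        else:
            node = node.right
    probabilities = node.class_probabilities
    return max(probabilities, key=probabilities.get), probabilities


# Hypothetical two-level tree; thresholds and probabilities are illustrative only.
tree = Node(
    biomarker="IL-8", threshold=10.0,
    left=Node(class_probabilities={"normal": 0.9, "NSCLC": 0.1}),
    right=Node(
        biomarker="CEA", threshold=3.0,
        left=Node(class_probabilities={"normal": 0.6, "NSCLC": 0.4}),
        right=Node(class_probabilities={"normal": 0.2, "NSCLC": 0.8}),
    ),
)

label, leaf_probabilities = classify(tree, {"IL-8": 25.0, "CEA": 4.5})
```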

A classification tree is essentially a decision tree whose attributes are expressed in statistical language. Classification trees are highly flexible but very noisy (the variance of their error is large compared with other methods).

Tools for implementing the classification trees discussed herein are available for the statistical computing language and environment R. For example, version 1.0-28 of the R package "tree" includes tools for creating, processing, and using classification trees.

Random forest

Classification trees are typically noisy. Random forests attempt to reduce this noise by taking the average of many trees. The result is that the error of the classifier has a reduced variance compared to the classification tree.

To plant a forest, the following algorithm is used:

1. For b = 1 to B, where B is the number of trees to be planted in the forest:

a. Draw a bootstrap sample.

b. Plant a classification tree $T_b$ on the bootstrap sample.

2. Output the ensemble $\{T_b\}_{b=1}^{B}$. This ensemble is the random forest.

To classify a new observation using a random forest, the new observation is classified by each classification tree in the random forest. The class to which the classification trees most often assign the new observation is the class to which the random forest assigns it.
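The planting and voting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: each "tree" is a one-split stump rather than a fully grown classification tree, only the bootstrap step named in the text is implemented (library implementations typically also sample candidate features at each split), and the data values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Grow a one-split tree (a stump) that minimises misclassifications."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            left_class = Counter(left).most_common(1)[0][0]
            right_class = Counter(right).most_common(1)[0][0] if right else left_class
            errors = sum(lab != left_class for lab in left)
            errors += sum(lab != right_class for lab in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_class, right_class)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_class, right_class = stump
    return left_class if row[feature] <= threshold else right_class


def plant_forest(rows, labels, n_trees, seed=0):
    """Step 1: for b = 1..B, draw a bootstrap sample and plant a tree on it."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in indices],
                                [labels[i] for i in indices]))
    return forest  # Step 2: the ensemble of trees is the forest


def classify_with_forest(forest, row):
    """Each tree votes; the most frequent class is the forest's classification."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical measurements of two biomarkers for six individuals.
rows = [[1.0, 5.0], [1.2, 4.0], [0.9, 6.0], [3.0, 5.5], [3.2, 4.5], [2.9, 5.0]]
labels = ["normal", "normal", "normal", "NSCLC", "NSCLC", "NSCLC"]
forest = plant_forest(rows, labels, n_trees=25)
```

Calling `classify_with_forest(forest, [3.5, 5.0])` then returns the majority vote of the 25 stumps for a new observation.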

Random forests reduce many of the problems present in classification trees, but at the expense of interpretability.

Tools for implementing the random forests discussed herein are available for the statistical computing language and environment R. For example, version 4.6-2 of the R package "randomForest" includes tools for creating, processing, and using random forests.

AdaBoost (adaptive boosting)

AdaBoost provides a way to classify each of n subjects into two or more disease categories based on a k-dimensional vector (referred to as a k-tuple) of measurements for each subject. AdaBoost takes a series of "weak" classifiers, which perform poorly but better than random guessing, and combines them to create a superior classifier. The weak classifiers used by AdaBoost are classification and regression trees (CART). CART recursively divides the data space into regions, and all new observations that lie within a region are assigned a particular class label. AdaBoost constructs a series of CARTs from weighted versions of the dataset, with the weights depending on the performance of the classifier in the previous iteration (Han J and Kamber M, 2006, Data Mining: Concepts and Techniques, 2nd edition, Elsevier: Amsterdam).
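The reweighting idea can be sketched as follows. This is a simplified illustration, not the method of the disclosure: it handles two classes only (coded +1 and -1), uses one-split stumps in place of full CART trees, and follows the standard discrete AdaBoost weight update. The data are hypothetical.

```python
import math


def stump_predict(feature, threshold, sign, row):
    """Predict `sign` above the threshold and `-sign` at or below it."""
    return sign if row[feature] > threshold else -sign


def fit_weighted_stump(rows, labels, weights):
    """Find the stump with the smallest weighted misclassification error."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for sign in (1, -1):
                error = sum(w for row, y, w in zip(rows, labels, weights)
                            if stump_predict(feature, threshold, sign, row) != y)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best


def adaboost(rows, labels, n_rounds):
    """Build a weighted series of stumps; labels must be +1 or -1."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, sign = fit_weighted_stump(rows, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified observations for the next round.
        weights = [w * math.exp(-alpha * y * stump_predict(feature, threshold, sign, row))
                   for row, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Combine the weak classifiers by a weighted vote."""
    score = sum(alpha * stump_predict(feature, threshold, sign, row)
                for alpha, feature, threshold, sign in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical one-biomarker data: -1 = normal, +1 = NSCLC.
rows = [[1.0], [2.0], [3.0], [4.0]]
labels = [-1, -1, 1, 1]
ensemble = adaboost(rows, labels, n_rounds=3)
```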


Method for classifying data using classification system

The present invention provides a method of classifying data (test data, i.e., biomarker metrics) obtained from an individual. These methods involve preparing or obtaining training data and evaluating test data (as compared to the training data) obtained from an individual using one of the classification systems that includes at least one of the above-described classifiers. Preferred classification systems use classifiers, such as learning machines, including, for example, Support Vector Machines (SVMs), AdaBoost, penalized logistic regression, naive bayes classifiers, classification trees, k-nearest neighbor classifiers, neural networks, random forests, and/or combinations thereof. The classification system outputs a classification of the individual based on the test data.

Particularly preferred for the present invention is an ensemble (integrated) method for a classification system that incorporates multiple classifiers. For example, the ensemble method may include SVM, AdaBoost, penalized logistic regression, naive Bayes classifiers, classification trees, k-nearest neighbor classifiers, neural nets, random forests, or any combination thereof, in order to make predictions about disease pathology (e.g., NSCLC or normal). The ensemble approach was developed to take advantage of the benefits provided by each classifier, as well as the replicate measurements of each plasma sample.

For a plurality of samples, a biomarker metric for each biomarker in each subject's plasma is obtained. Typically, plasma samples are collected and a complete set of biomarker metrics is obtained for each sample. Each subject is predicted as having a disease state (e.g., NSCLC or normal) by a classification system including at least one classifier, based on each replicate measurement (e.g., duplicate, triplicate), to produce a plurality of predictions (e.g., four predictions, six predictions). In a preferred mode of the invention, the ensemble (integrated) method predicts that the subject has NSCLC if at least one prediction is NSCLC, even if all other predictions are that the subject is normal. The decision to predict NSCLC when only one prediction is positive was made so that the ensemble method is as conservative as possible. In other words, the test is intentionally biased toward identifying subjects as having NSCLC, thereby minimizing the number of false negatives, which are more serious errors than false positives. Alternatively, an ensemble method may predict that a subject has NSCLC only if, for example, at least two, at least three, at least four, at least five, or up to all predictions are positive for NSCLC.
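The decision rule described above can be sketched as follows. The function name, the `min_positive` parameter, and the label strings are assumptions introduced for illustration; the default of one positive prediction corresponds to the preferred, most conservative mode.

```python
def ensemble_call(predictions, min_positive=1, positive="NSCLC", negative="normal"):
    """Combine per-classifier, per-replicate predictions into one call.

    The subject is called positive when at least `min_positive` of the
    individual predictions are positive.
    """
    n_positive = sum(1 for call in predictions if call == positive)
    return positive if n_positive >= min_positive else negative


# Hypothetical case: two classifiers applied to duplicate measurements
# give four predictions, only one of which is NSCLC.
four_predictions = ["normal", "NSCLC", "normal", "normal"]
conservative_call = ensemble_call(four_predictions)               # one positive suffices
stricter_call = ensemble_call(four_predictions, min_positive=2)   # needs two positives
```

Raising `min_positive` trades fewer false positives for more false negatives, which is the trade-off the text describes.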

The test data may be any biomarker metric, such as plasma concentration measurements of multiple biomarkers. In one embodiment, the present invention provides a method of classifying test data, the test data comprising biomarker metrics that are a plurality of plasma concentration metrics for each of a set of biomarkers, comprising: (a) accessing a set of electronically stored training data vectors, each training data vector or k-tuple representing an individual and comprising, for each replicate, a biomarker metric (e.g., a plasma concentration metric) for each biomarker in the set for the corresponding individual, the training data vectors further comprising a classification regarding the disease state of each corresponding individual; (b) training an electronic representation of a classifier or set of classifiers described herein using the electronically stored set of training data vectors; (c) receiving test data comprising a plurality of plasma concentration metrics for a human test subject; (d) evaluating the test data using the electronic representation of the classifier and/or set of classifiers described herein; and (e) outputting a classification of the human test subject based on the evaluating step.

In another embodiment, the invention provides a method of classifying test data, the test data comprising biomarker metrics that are a plurality of plasma concentration metrics for each of a set of biomarkers, comprising: (a) accessing a set of electronically stored training data vectors, each training data vector or k-tuple representing an individual and comprising, for each replicate, a biomarker metric (e.g., a plasma concentration metric) for each biomarker in the set for the corresponding individual, the training data vectors further comprising a classification regarding the disease state of each corresponding individual; (b) building a classifier or set of classifiers using the electronically stored set of training data vectors; (c) receiving test data comprising a plurality of plasma concentration metrics for a human test subject; (d) evaluating the test data using the classifier; and (e) outputting a classification of the human test subject based on the evaluating step. Alternatively, all replicates (or any combination thereof) may be averaged to produce a single value for each biomarker for each subject. The outputting according to the present invention comprises displaying information about the classification of the human test subject in human-readable form on an electronic display.
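The replicate-averaging alternative mentioned above can be sketched as follows. The data layout (one dictionary of biomarker name to concentration per replicate) and the concentration values are assumptions made for the example.

```python
from statistics import mean


def average_replicates(replicates):
    """Collapse replicate measurements into a single value per biomarker.

    `replicates` is a list of dicts, one per replicate, each mapping a
    biomarker name to its measured plasma concentration.
    """
    biomarkers = replicates[0].keys()
    return {name: mean(rep[name] for rep in replicates) for name in biomarkers}


# Hypothetical duplicate measurements for one subject.
duplicate = [
    {"IL-8": 10.0, "CEA": 2.0},
    {"IL-8": 14.0, "CEA": 4.0},
]
averaged = average_replicates(duplicate)
```

The averaged vector can then be classified once, instead of producing one prediction per replicate.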

The classification of a disease state may be the presence or absence of a disease state. The disease state according to the invention may be a lung disease, such as non-small cell lung cancer.

The set of training vectors may include at least 20, 25, 30, 35, 50, 75, 100, 125, 150 or more vectors.

It should be understood that the method of classifying data may be used in any of the methods described herein. In particular, the methods of classifying data described herein can be used in physiological characterization methods (based in part on the classification of the invention) as well as in methods for diagnosing lung diseases such as non-small cell lung cancer.

Classifying data using a reduced number of biomarkers

The invention also provides methods of classifying data, such as test data obtained from an individual, that relate to a reduced set of biomarkers. That is, the training data may be thinned to exclude all data except the biomarker metrics for a selected subset of biomarkers. Likewise, the test data may be limited to the biomarker metrics for the same selected subset of biomarkers.

The biomarker may be selected from the group consisting of: bNGF, CA-125, CEA, CYFRA21-1, EGFR/HER1/ErbB1, GM-CSF, granzyme B, Gro-alpha, ErbB2/HER2, HGF, IFN-a2, IFN-b, IFN-g, IL-10, IL-12p40, IL-12p70, IL-13, IL-15, IL-16, IL-17A, IL-17F, IL-1a, IL-1b, IL-1ra, IL-2, IL-20, IL-21, IL-22, IL-23p19, IL-27, IL-2ra, IL-3, IL-31, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IP-10, I-TAC, leptin, LIF, MCP-1, MCP-3, M-CSF, MIF, MIG, MIP-1a, MIP-1b, MIP-3a, MMP-7, MMP-9, MPO, NSE, OPG, PAI-1, PDGF-AB/BB, PDGF, RANTES, resistin, SAA, sCD40-ligand, SCF, SDF-1, sE-selectin, sFas ligand, sICAM-1, RANKL, TNFRI, TNFRII, sVCAM-1, TGF-α, TGF-β, TNF-α, TNF-β, TPO, TRAIL, TSP1, TSP2, VEGF-A, VEGF-C, and combinations thereof.

The biomarker may be selected from the group consisting of: IL-4, sEGFR, leptin, NSE, MCP-1, GRO-pan, IL-10, IL-12P70, sCD40L, IL-7, IL-9, IL-2, IL-5, IL-8, IL-16, LIF, CXCL9/MIG, HGF, MIF, MMP-7, MMP-9, sFasL, CYFRA21-1, CA125, CEA, sICAM-1, MPO, RANTES, PDGF-AB/BB, resistin, SAA, TNFRI, sTNFRII, and combinations thereof.

The biomarker may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, MMP-7, IL-5, resistin, IL-10, MPO, NSE, MCP-1, GRO-Pan, CEA, leptin, CXCL9/MIG, CYFRA21-1, MIF, sICAM-1, SAA, and combinations thereof.

The set of biomarkers may be selected from the group consisting of: IL-8, MMP-9, sTNFRII, TNFRI, resistin, MPO, NSE, GRO-Pan, CEA, CXCL9/MIG, IL-2, SAA, PDGF-AB/BB, and combinations thereof.

In one embodiment, the present invention provides a method of classifying test data, the test data including biomarker metrics that are a plurality of plasma concentration metrics for each of a set of biomarkers, comprising: (a) accessing a set of electronically stored training data vectors, each training data vector representing an individual and comprising a biomarker metric for each biomarker in a corresponding person's biomarker set, each training data vector further comprising a classification regarding the corresponding person's disease state; (b) selecting a subset of biomarkers from the set of biomarkers; (c) training an electronic representation of a learning machine (a classifier or a set of classifiers as described herein) using data from a biomarker subset of an electronically stored training data vector set; (d) receiving test data comprising a plurality of plasma concentration metrics of the human test subject associated with the biomarker set in step (a); (e) evaluating the test data using an electronic representation of the learning machine; and (f) outputting a classification of the human test subject based on the evaluating step.
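Steps (b) through (d) above, in which the training vectors and the test data are restricted to the same biomarker subset, can be sketched as follows. The helper name, the data layout, and the concentration values are hypothetical; the biomarker names are taken from the lists above purely as labels.

```python
def select_biomarkers(vector, subset):
    """Keep only the metrics for the selected biomarker subset, in subset order."""
    missing = [name for name in subset if name not in vector]
    if missing:
        raise KeyError(f"metrics missing for: {missing}")
    return {name: vector[name] for name in subset}


# Hypothetical full training vectors (biomarker metrics plus disease-state label).
training_vectors = [
    ({"IL-8": 12.0, "MMP-9": 300.0, "CEA": 2.1, "SAA": 40.0}, "NSCLC"),
    ({"IL-8": 4.0, "MMP-9": 150.0, "CEA": 1.0, "SAA": 10.0}, "normal"),
]
test_vector = {"IL-8": 9.0, "MMP-9": 280.0, "CEA": 1.8, "SAA": 35.0}

subset = ["IL-8", "MMP-9", "CEA"]   # step (b): the selected subset

# Step (c) trains on the thinned vectors; step (e) evaluates the thinned test data.
thinned_training = [(select_biomarkers(metrics, subset), label)
                    for metrics, label in training_vectors]
thinned_test = select_biomarkers(test_vector, subset)
```

Using one helper for both training and test data keeps the two restricted to exactly the same biomarkers.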

The methods, kits, and systems described herein may involve determining a biomarker metric for a selected plurality of biomarkers. In a preferred mode, the method comprises determining biomarker metrics for a specific subset of the biomarkers described in the Examples. Alternatively, the method comprises determining a biomarker metric for a subset of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, or 33 specific biomarkers described in the Examples. Alternatively, the method comprises determining a biomarker metric for a subset of at least 8, 9, 10, 11, 12, or 13 specific biomarkers described in the Examples. Alternatively, the method comprises determining a biomarker metric for a subset of at least 14, 15, 16, 17, 18, 19, 20, or more (e.g., 33) specific biomarkers described in the Examples. Alternatively, the methods, kits, and systems described herein can use a particular subset of biomarkers (e.g., at least 13, 15, 19, or 33 biomarkers), as well as one or more biomarkers from another subset of biomarkers (e.g., 13, 15, 19, or 33 biomarkers).

It is within the contemplation of the invention that biomarker metrics of other biomarkers, whether or not associated with the disease of interest, may be determined simultaneously. Determining these other biomarker metrics does not prevent classification of subjects according to the present invention. However, the maximum number of biomarkers whose metrics are included in the training data and the test data of any of the methods of the present invention can be, for example, 6 different biomarkers, 10 different biomarkers, 13 different biomarkers, 15 different biomarkers, 18 different biomarkers, 20 different biomarkers, or 33 different biomarkers. The skilled person will understand that the number of biomarkers should be limited to avoid inaccurate predictions due to overfitting. The subset of biomarkers can be determined by using the reduction methods described herein. A simplified model for a particular subset of biomarkers is described in the Examples.

In a preferred mode, the biomarkers are selected from a computed subset comprising the biomarkers that contribute most to the measure of model fit. As long as these biomarkers are included, the invention does not exclude the inclusion of some other biomarkers that do not necessarily contribute. Inclusion of such other biomarker metrics in the classification model does not preclude classification of the test data as long as the model is designed as described herein. In other embodiments, biomarker metrics for no more than 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 35, 40, or 50 biomarkers are determined for the subject, and the same number of biomarkers is used in the training phase.

In another mode, the selected biomarkers are taken from a computed subset from which the biomarkers that contribute least to the measure of model fit have been removed. As long as these selected biomarkers are included, the invention does not exclude the inclusion of other biomarkers that are not necessary. Inclusion of such other biomarker metrics in the classification model does not preclude classification of the test data as long as the model is designed as described herein. In other embodiments, biomarker metrics for no more than 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 31, 32, 33, 34, 35, 40, or 50 biomarkers are determined for the subject, and the same number of biomarkers is used in the training phase.
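One way to remove the biomarkers that contribute least to model fit is backward elimination, sketched below. This is an assumption-laden illustration rather than the reduction method of the disclosure: the fit measure is the training accuracy of a nearest-centroid classifier, which stands in for whatever measure of model fit is actually used, and the marker names and values are hypothetical.

```python
def fit_measure(data, labels, features):
    """Training accuracy of a nearest-centroid classifier on the given features."""
    classes = sorted(set(labels))
    centroids = {}
    for cls in classes:
        members = [d for d, lab in zip(data, labels) if lab == cls]
        centroids[cls] = [sum(m[f] for m in members) / len(members) for f in features]
    correct = 0
    for d, lab in zip(data, labels):
        distance = {cls: sum((d[f] - c) ** 2 for f, c in zip(features, centroids[cls]))
                    for cls in classes}
        correct += min(classes, key=distance.get) == lab
    return correct / len(data)


def backward_eliminate(data, labels, features, keep):
    """Repeatedly drop the biomarker whose removal hurts the fit measure least."""
    features = list(features)
    while len(features) > keep:
        candidates = [
            (fit_measure(data, labels, [f for f in features if f != drop]), drop)
            for drop in features
        ]
        _, to_drop = max(candidates, key=lambda pair: pair[0])
        features.remove(to_drop)
    return features


# Hypothetical data: marker_A separates the classes, marker_B and marker_C do not.
data = [
    {"marker_A": 1.0, "marker_B": 10.0, "marker_C": 0.0},
    {"marker_A": 2.0, "marker_B": -10.0, "marker_C": 0.0},
    {"marker_A": 5.0, "marker_B": -10.0, "marker_C": 0.0},
    {"marker_A": 6.0, "marker_B": 10.0, "marker_C": 0.0},
]
labels = ["normal", "normal", "NSCLC", "NSCLC"]
reduced = backward_eliminate(data, labels, ["marker_A", "marker_B", "marker_C"], keep=1)
```

In practice the fit measure would be computed on held-out data rather than on the training data, to limit the overfitting risk noted above.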

It is to be understood that the method of classifying data using a reduced set or subset of biomarkers can be used in any of the methods described herein. In particular, the methods described herein for classifying data using a reduced number of biomarkers can be used in physiological characterization methods (based in part on the classification of the invention) as well as in methods for diagnosing lung diseases such as non-small cell lung cancer. Other biomarkers may also be added to the reduced set of biomarkers. These other biomarkers may or may not contribute to or enhance diagnosis.

Pulmonary diseases

The present invention provides methods for diagnosing non-small cell lung cancer. These methods comprise determining a biomarker metric for a plurality of biomarkers described herein, wherein the biomarkers indicate the presence or progression of non-small cell lung cancer. For example, biomarker metrics for the biomarkers described herein can be used to assist in determining the extent of non-small cell lung cancer progression, the presence of pre-cancerous lesions, or the stage of non-small cell lung cancer. For example, methods using the biomarker metrics described herein can be used to diagnose early stage (stage I) non-small cell lung cancer. Moreover, the biomarker metrics may not be indicative of asthma, breast cancer, prostate cancer, pancreatic cancer, or a combination thereof.

In particular embodiments, the subject is selected from those individuals exhibiting one or more symptoms of non-small cell lung cancer. Symptoms may include cough, shortness of breath, wheezing, chest pain, and hemoptysis; shoulder pain that radiates down the outside of the arm, or vocal cord paralysis leading to hoarseness; and invasion of the esophagus, which can lead to difficulty swallowing. If a large airway is blocked, partial collapse of the lung may occur and cause infection leading to abscess or pneumonia. Metastasis to bone can produce extreme pain. Metastasis to the brain may result in neurological symptoms including blurred vision, headache, seizures, or symptoms commonly associated with stroke, such as weakness or loss of sensation in parts of the body. Lung cancer often produces symptoms caused by the production of hormone-like substances by tumor cells. A common paraneoplastic syndrome in NSCLC is the production of parathyroid hormone-like substances, which leads to elevated blood calcium.

Method for diagnosing non-small cell lung cancer

The present invention relates to methods for diagnosing non-small cell lung cancer in individuals in various populations as described below. In general, these methods rely on determining a biomarker metric for a particular biomarker described herein and classifying the biomarker metric using a classification system that includes a classifier or set of classifiers described herein.

A. Determination in the general population

The present invention provides a method of diagnosing non-small cell lung cancer in a subject, comprising: (a) obtaining a physiological sample from the subject; (b) determining a biomarker metric for a plurality of biomarkers in the sample (as described herein); and (c) classifying the sample based on the biomarker metrics using a classification system, wherein the classification of the sample is indicative of the presence or development of non-small cell lung cancer in the subject.

In a preferred embodiment, the present invention provides a method of diagnosing non-small cell lung cancer in a subject comprising determining a biomarker metric for a plurality of biomarkers in a physiological sample of the subject, wherein an expression pattern of the plurality of markers is indicative of non-small cell lung cancer or is associated with a change in a non-small cell lung cancer disease state (i.e., clinical or diagnostic stage). Preferably, the plurality of biomarkers is selected via a machine learning algorithm, such as a classifier or set of classifiers as described herein, based on the analysis of the training data. The training data will comprise a plurality of biomarker metrics for a number of subjects, a classification of the disease for each individual subject, and, optionally, other characteristics of the subject, such as gender, race, ethnicity, nationality, age, smoking history, and/or employment history. In another preferred embodiment, the expression pattern is associated with an increased likelihood that the subject has or may have non-small cell lung cancer. The expression pattern may be characterized by any technique known in the art for pattern recognition, such as those described herein as classifiers and/or sets of classifiers. The plurality of biomarkers can include any combination of the biomarkers described in the Examples.

In one embodiment, the subject is at risk for non-small cell lung cancer. In another embodiment, the subject is selected from those individuals exhibiting one or more symptoms of non-small cell lung cancer.

B. Determination in the male population

The present invention provides methods for diagnosing non-small cell lung cancer in a male subject. The methods for these embodiments are similar to those described above, except that for the training data and samples, the subject is male.

C. Determination in the female population

The present invention provides methods for diagnosing non-small cell lung cancer in a female subject. The methods for these embodiments are similar to those described above, except that for the training data and samples, the subject is a female.

D. Complementary analysis and treatment method for pulmonary nodules

In a preferred mode, the classification method of the present invention can be used in conjunction with computed tomography (CT) to provide an enhanced procedure for screening and early detection of NSCLC. In some embodiments, one of the classification methods described herein is applied to biomarker metrics for a plurality of biomarkers in one or more physiological samples from a subject having at least one lung nodule detected by a CT scan. In a particular embodiment, the subject has at least one lung nodule with a diameter of 6-20 mm. Classifying a sample as NSCLC or normal may aid in the final diagnostic characterization of such patients. In other embodiments, after applying the classification method to a sample, those subjects whose samples were classified as NSCLC are selected for further testing by CT scanning, and any nodules detected in such patients are treated according to a regimen intended for "high risk" rather than "low risk" patients. A preferred classification scheme for enhanced screening uses an ensemble classification system with replicate samples (e.g., duplicate, triplicate) and considers as "high risk" any patient for whom at least one replicate sample is classified as "NSCLC" by the classifier or set of classifiers described herein.

In other embodiments, the invention provides methods of treatment based on the output of any of the classification methods described herein. For example, in one embodiment, the present invention provides a method of treating NSCLC in a subject following classification of "NSCLC" using any of the classification methods described herein. Furthermore, as described in the preceding paragraphs, the invention includes diagnostic-based therapeutic methods that are developed using the classification methods described herein in conjunction with other analytical methods (e.g., CT scans).

Method for designing a characterization system

E. General population

The present invention also provides a method for designing a system for diagnosing non-small cell lung cancer, comprising: (a) selecting a plurality of biomarkers; (b) selecting a method for determining a biomarker metric for the plurality of biomarkers; and (c) designing a system comprising a method for determining a biomarker metric and a method for analyzing the biomarker metric to determine the likelihood that the subject has non-small cell lung cancer. Furthermore, the biomarker metrics described herein may not be indicative of asthma, breast cancer, prostate cancer, pancreatic cancer, or a combination thereof.

The present invention also provides a method for designing a system for diagnosing non-small cell lung cancer in a subject, comprising: (a) selecting a plurality of biomarkers; (b) selecting a method for determining a biomarker metric for the plurality of biomarkers; and (c) designing a system comprising a method for determining a biomarker metric and a method for analyzing the biomarker metric to determine the likelihood that the subject has non-small cell lung cancer.

In the above method, steps (b) and (c) may alternatively be performed by: (b) selecting detection agents for detecting the plurality of biomarkers, and (c) designing a system comprising the detection agents for detecting the plurality of biomarkers.

F. Male population

The invention also provides methods for designing a system for assisting in diagnosing lung disease in a male subject. The methods used in these embodiments are similar to those described above.

G. Female population

The invention also provides methods for designing a system for assisting in diagnosing lung disease in a female subject. The methods used in these embodiments are similar to those described above.

Classification system

The present invention provides a system that facilitates performing the methods of the present invention. An exemplary classification system includes a storage device for storing a training data set and/or a test data set, and a computer for executing a learning machine (e.g., a classifier or set of classifiers as described herein). The computer is also operable to collect the training data set from a database, pre-process the training data set, train the learning machine using the pre-processed training data set and, in response to receiving a test output of the trained learning machine, post-process the test output to determine whether the test output is an optimal solution. Such pre-processing may include, for example, visual inspection of the data to detect and remove apparently erroneous entries, normalizing the data by dividing by an appropriate standard amount, and ensuring that the data is in the appropriate form for the corresponding algorithm. The exemplary system may also include a communications device for receiving the test data set and the training data set from a remote source. In such cases, the computer may be operable to store the training data set in the storage device prior to pre-processing the training data set, and to store the test data set in the storage device prior to pre-processing the test data set. The exemplary system may also include a display device for displaying the post-processed test data. The computer of the exemplary system may be further configured to perform the various additional functions described above.
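The pre-processing steps mentioned above can be sketched as follows. Two choices here are assumptions made for illustration and are not stated in the disclosure: an automated filter for negative or non-numeric entries stands in for visual inspection, and the training-set median of each biomarker is used as the "standard amount". The values are hypothetical.

```python
from statistics import median


def remove_invalid(records):
    """Drop records containing non-numeric, NaN, or negative concentrations."""
    return [rec for rec in records
            if all(isinstance(v, (int, float)) and v >= 0 for v in rec.values())]


def reference_amounts(training_records):
    """Per-biomarker standard amount: here, the training-set median (an assumption)."""
    biomarkers = training_records[0].keys()
    return {name: median(rec[name] for rec in training_records) for name in biomarkers}


def normalise(records, reference):
    """Divide each metric by the standard amount for its biomarker.

    A reference of zero would raise ZeroDivisionError and needs separate handling.
    """
    return [{name: value / reference[name] for name, value in rec.items()}
            for rec in records]


# Hypothetical raw training records; the last one has an impossible negative value.
raw_training = [
    {"IL-8": 2.0, "CEA": 1.0},
    {"IL-8": 4.0, "CEA": 2.0},
    {"IL-8": 6.0, "CEA": 3.0},
    {"IL-8": -5.0, "CEA": 2.0},
]
clean_training = remove_invalid(raw_training)
reference = reference_amounts(clean_training)
normalised_training = normalise(clean_training, reference)
```

The same `reference` values would be applied to the test data, so that training and test metrics are on the same scale.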

The term "computer" as used herein should be understood to include at least one hardware processor that uses at least one memory. The at least one memory may store a set of instructions. The instructions may be stored permanently or temporarily in one or more memories of the computer. The processor executes instructions stored in the one or more memories to process data. The set of instructions may include various instructions to perform one or more particular tasks, such as those described herein. Such a set of instructions for performing a particular task may be characterized as a program, a software program, or simply software.

As described above, a computer executes instructions stored in one or more memories to process data. For example, the data processing may be in response to a command by a user of one or more computers, in response to a previous processing, in response to a request by another computer, and/or any other input.

The computer used to implement at least some embodiments may be a general purpose computer. However, a computer may also utilize any of a variety of other technologies, including a special purpose computer, a computer system including a microcomputer, minicomputer, or mainframe, e.g., a programmed microprocessor, microcontroller, peripheral integrated circuit elements, CSIC (customer specific integrated circuit) or ASIC (application specific integrated circuit) or other integrated circuit, logic circuitry, digital signal processor, programmable logic device such as an FPGA, PLD, PLA or PAL, or any other device or arrangement of devices capable of implementing at least some of the steps in the processes of the invention.

It will be appreciated that the processors and/or memories of the computers need not be physically located in the same geographic location in order to practice the method of the present invention. That is, the various processors and memories used by the computer may be located in geographically distinct locations and connected so as to communicate in any suitable manner. Further, it is to be understood that the various processors and/or memories may be comprised of different physical components of the device. Thus, it is not necessary that the processor be a single device component in one location and the memory be another single device component in another location. That is, for example, it is contemplated that the processor may be two or more device components in two different physical locations. Two or more different pieces of equipment may be connected in any suitable manner, such as a network. Further, the memory may include two or more portions of memory in two or more physical locations.

Various techniques may be used to provide communications between the various computers, processors, and/or memories, as well as to allow the processors and/or memories of the present invention to communicate with any other entity, for example, to obtain further instructions or to access and use remote memory. Technologies for providing such communications may include networks, the Internet, intranets, extranets, LANs, Ethernet, or any client-server system that provides communication. Such communication techniques may use any suitable protocol, such as TCP/IP, UDP, or OSI.

Further, it should be understood that the computer instructions or sets of instructions for implementing and operating the invention are in a suitable form such that the instructions may be read by a computer.

In some implementations, various user interfaces may be utilized to allow a human user to interact with a computer or machine used to implement the implementations at least in part. The user interface may be in the form of a dialog screen. The user interface may also include a mouse, touch screen, keyboard, voice reader, voice recognizer, dialog screen, menu box, list, check box, toggle switch, button, or other device that allows a user to receive information about the operation of the computer and/or provide information to the computer as the computer processes a set of instructions. Thus, a user interface is any device that provides communication between a user and a computer. For example, the information provided to the computer by the user through the user interface may be a command, a data selection, or some other form of input.

It is also contemplated that the user interface of the present invention may interact with another computer rather than a human user, for example, to transmit and receive information. Thus, the other computer may be characterized as a user. Further, it is contemplated that a user interface utilized in the systems and methods of the present invention may interact partially with one or more other computers while also interacting partially with a human user.

The following examples are provided to illustrate various modes of the invention disclosed herein, but are not intended to limit the invention in any way.

Examples
