Detecting cancer, cancer-derived tissue and/or cancer cell types

Document No.: 327894. Publication date: 2021-11-30.

Summary: The invention "Detecting cancer, cancer-derived tissue and/or cancer cell types" was created by 奥利弗·克劳德·维恩, 亚历山大·P·菲尔兹, 萨缪尔·S·格罗斯, 刘勤文, 简·施伦伯格 and 約格 on 2020-01-24. The present description provides a cancer assay detection combination for targeted detection of cancer-specific methylation patterns. Further provided herein are methods of designing, making, and using the cancer assay detection combination to detect cancer and specific types of cancer.

1. A composition characterized by: the composition comprises: a plurality of different decoy oligonucleotides,

wherein the plurality of different decoy oligonucleotides are configured to collectively hybridize to a plurality of DNA molecules derived from at least 200 genomic regions of interest,

wherein in at least one cancer type each genomic region of the at least 200 genomic regions of interest is differentially methylated compared to that in another cancer type or a non-cancer type, and

wherein for at least 80% of all possible pairs of cancer types selected from a group comprising at least 10 cancer types, the at least 200 genomic regions of interest comprise at least one genomic region of interest that is differentially methylated between the two cancer types of the pair.
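The pairwise condition in claim 1 can be illustrated with a small check. The following is a minimal sketch, not the applicants' procedure: the function name `pair_coverage`, the region names, and the toy data are all hypothetical, and it assumes each region is already annotated with the cancer-type pairs it separates.

```python
from itertools import combinations


def pair_coverage(regions, cancer_types):
    """Fraction of cancer-type pairs separated by at least one region.

    `regions` maps a (hypothetical) region name to the set of
    (type_a, type_b) pairs between which that region is assumed
    to be differentially methylated.
    """
    all_pairs = [frozenset(p) for p in combinations(cancer_types, 2)]
    covered = set()
    for separated_pairs in regions.values():
        covered.update(frozenset(p) for p in separated_pairs)
    return sum(1 for p in all_pairs if p in covered) / len(all_pairs)


# Toy data: four cancer types give six possible pairs.
types = ["breast", "lung", "colorectal", "ovarian"]
regions = {
    "region_1": {("breast", "lung"), ("breast", "colorectal")},
    "region_2": {("lung", "colorectal"), ("lung", "ovarian")},
    "region_3": {("breast", "ovarian")},
}
coverage = pair_coverage(regions, types)
meets_claim_threshold = coverage >= 0.80
```

In this toy panel only the colorectal/ovarian pair has no distinguishing region, so five of the six pairs are covered.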

2. The composition of claim 1, wherein: the at least 10 cancer types include at least 2, 3, 4, 5, 10, 12, 14, 16, 18, or 20 cancer types.

3. The composition according to any one of claims 1 to 2, characterized in that: the several cancer types are selected from uterine cancer, upper gastrointestinal squamous cancer, all other upper gastrointestinal cancer, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer.

4. The composition according to any one of claims 1 to 2, characterized in that: the several cancer types are selected from anal, bladder, colorectal, esophageal, head and neck, liver/bile duct, lung, lymphoma, ovarian, pancreatic, plasma cell tumor, and gastric cancer.

5. The composition according to any one of claims 1 to 2, characterized in that: the several cancer types are selected from thyroid cancer, melanoma, sarcoma, myeloid tumors, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer.

6. The composition of any one of claims 1 to 5, wherein: the at least 200 genomic regions of interest are selected from any one of lists 1 to 16.

7. The composition of any one of claims 1 to 6, wherein: the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 1-16.

8. The composition according to any one of claims 1 to 7, characterized in that: the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 1 to 16.

9. The composition of any one of claims 1 to 5, wherein: the at least 200 genomic regions of interest are selected from any one of lists 1 to 3.

10. The composition of any one of claims 1 to 5 and 9, wherein: the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 95% of the number of target genomic regions in any one of lists 1-3.

11. The composition of any one of claims 1 to 5 and 9 to 10, wherein: the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 1 to 3.

12. The composition of any one of claims 1 to 5, wherein: the at least 200 genomic regions of interest are selected from any one of lists 13 to 16.

13. The composition of any one of claims 1 to 5 and 12, wherein: the at least 200 target genomic regions comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 13-16.

14. The composition of any one of claims 1 to 5 and 12 to 13, wherein: the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 13 to 16.

15. The composition of any one of claims 1 to 5, wherein: the at least 200 genomic regions of interest are selected from list 12.

16. The composition of any one of claims 1 to 5 and 15, wherein: the at least 200 target genomic regions comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in list 12.

17. The composition of any one of claims 1 to 5 and 15 to 16, wherein: the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in the list 12.

18. The composition of any one of claims 1 to 5, wherein: the at least 200 genomic regions of interest are selected from any one of lists 8 to 11.

19. The composition of any one of claims 1 to 5 and 18, wherein: the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 8-11.

20. The composition of any one of claims 1 to 5 and 18 to 19, wherein: the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 8 to 11.

21. The composition of any one of claims 1 to 5, wherein: the at least 200 target genomic regions comprise at least 40%, 50%, 60%, or 70% of the number of target genomic regions in list 4.

22. The composition of any one of claims 1 to 21, wherein: for at least 90% or 100% of all possible pairs of cancer types selected from a group comprising at least 10 cancer types, the at least 200 genomic regions of interest comprise at least one genomic region of interest that is differentially methylated between the two cancer types of the pair.

23. The composition of any one of claims 1 to 22, wherein: the plurality of decoy oligonucleotides are configured to hybridize to at least 15 nucleotides or at least 30 nucleotides of the plurality of DNA molecules derived from the at least 200 genomic regions of interest.

24. The composition of any one of claims 1 to 23, wherein: the plurality of DNA molecules derived from the at least 200 genomic regions of interest are converted cfDNA fragments.

25. The composition of claim 24, wherein: the cfDNA fragments are converted by a process comprising: treatment with bisulfite.
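As background for the "converted" fragments of claims 24 and 25: bisulfite treatment deaminates unmethylated cytosine to uracil, which is read as thymine after amplification, while methylated cytosine is left unchanged. The sketch below models an idealized, 100% efficient conversion of a single strand; the function name `bisulfite_convert` and the example fragment are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as read after idealized bisulfite conversion.

    Unmethylated C is reported as T; C at an index listed in
    `methylated_positions` (0-based) is kept as C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


fragment = "ACGTCCGA"  # CpG cytosines sit at indices 1 and 5
methylated_read = bisulfite_convert(fragment, {1, 5})
unmethylated_read = bisulfite_convert(fragment, set())
```

The two reads differ only at the CpG cytosines, which is what lets a decoy oligonucleotide be designed against one methylation state of a region.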

26. The composition of claim 24, wherein: the cfDNA fragments are converted by an enzymatic conversion reaction.

27. The composition of claim 24, wherein: the cfDNA fragments are converted by a cytosine deaminase.

28. The composition of any one of claims 1 to 27, wherein: each decoy oligonucleotide is conjugated to an affinity moiety.

29. The composition of claim 28, wherein: the affinity moiety is biotin.

30. The composition of any one of claims 1 to 29, wherein: each decoy oligonucleotide is between 50 and 300 bases in length, between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length.

31. A composition characterized by: the composition comprises: a plurality of different decoy oligonucleotides configured to hybridize to a plurality of DNA molecules derived from at least 100 genomic regions of interest selected from any one of lists 1 to 16.

32. The composition of claim 31, wherein: the at least 100 genomic regions of interest include at least 200 genomic regions of interest.

33. The composition of claim 31 or 32, wherein: the at least 100 genomic regions of interest are selected from any one of lists 1 to 16.

34. The composition of any one of claims 31 to 33, wherein: the at least 100 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 95% of the number of target genomic regions in any one of lists 1-16.

35. The composition of any one of claims 31 to 34, wherein: the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 1 to 16.

36. The composition of any one of claims 31 to 32, wherein: the at least 100 genomic regions of interest are selected from any one of lists 1 to 3.

37. The composition of any one of claims 31 to 32 and 36, wherein: the at least 100 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 95% of the number of target genomic regions in any one of lists 1-3.

38. The composition of any one of claims 31-32 and 36-37, wherein: the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in any of lists 1 to 3.

39. The composition of claim 31 or 32, wherein: the at least 100 genomic regions of interest are selected from list 12.

40. The composition of any one of claims 31 to 32 and 39, wherein: the at least 100 target genomic regions comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in list 12.

41. The composition of any one of claims 31-32 and 39-40, wherein: the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in the list 12.

42. The composition of any one of claims 31 to 32, wherein: the at least 100 genomic regions of interest are selected from list 8.

43. The composition according to one of claims 31-32 and 42, wherein: the at least 100 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in list 8.

44. The composition according to one of claims 31-32 and 42-43, wherein: the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000 or 50000 target genomic regions in list 8.

45. The composition of any one of claims 31 to 32, wherein: the at least 100 target genomic regions comprise at least 40%, 50%, 60%, or 70% of the number of target genomic regions in list 4.

46. The composition of any one of claims 31 to 45, wherein: the plurality of DNA molecules derived from the at least 100 genomic regions of interest are converted cfDNA fragments.

47. The composition of claim 46, wherein: the plurality of cfDNA fragments are converted by a process comprising: treatment with bisulfite.

48. The composition of any one of claims 1 to 47, wherein: the composition further includes a plurality of cfDNA fragments from a test subject.

49. The composition of claim 48, wherein: the plurality of cfDNA fragments from the test subject are converted cfDNA fragments.

50. The composition of claim 49, wherein: the plurality of cfDNA fragments from the test subject are converted by a process comprising: treatment with bisulfite.

51. The composition of any one of claims 1 to 50, wherein: each target genomic region includes at least 5 CpG dinucleotides.

52. The composition of any one of claims 1 to 51, wherein: each decoy oligonucleotide is between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length.

53. The composition of any one of claims 1 to 52, wherein: the plurality of different decoy oligonucleotides comprises: a plurality of sets of two or more decoy oligonucleotides, wherein each decoy oligonucleotide in a set is configured to bind to the converted DNA molecules from the same target genomic region.

54. The composition of any one of claims 1 to 53, wherein: the ratio of the plurality of decoy oligonucleotides configured to be hybridized to the hypermethylated target region to the plurality of decoy oligonucleotides configured to be hybridized to the hypomethylated target region is between 0.5 and 1.0.

55. The composition of claim 54, wherein:

each set of decoy oligonucleotides comprises one or more pairs of a first decoy oligonucleotide and a second decoy oligonucleotide,

each decoy oligonucleotide comprising a 5' end and a 3' end,

a sequence of at least X nucleotide bases located at the 3' end of the first decoy oligonucleotide is identical to a sequence of X nucleotide bases located at the 5' end of the second decoy oligonucleotide, and wherein X is at least 20, at least 25, or at least 30.

56. The composition of claim 55, wherein: X is 30.
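The overlap condition of claims 55 and 56 can be sketched as a simple tiling. This is an illustration only, assuming 120-base decoys (one of the lengths recited in claim 30) and X = 30; the function names `tile_probes` and `shares_overlap` and the random target sequence are hypothetical.

```python
import random


def tile_probes(target, probe_len=120, overlap=30):
    """Cut `target` into probes so consecutive probes share `overlap` bases."""
    step = probe_len - overlap
    return [
        target[start:start + probe_len]
        for start in range(0, len(target) - probe_len + 1, step)
    ]


def shares_overlap(first, second, x=30):
    """True if the last x bases (3' end) of `first` equal the first x
    bases (5' end) of `second`."""
    return first[-x:] == second[:x]


# Hypothetical 300-base target region, generated reproducibly.
rng = random.Random(0)
target = "".join(rng.choices("ACGT", k=300))
probes = tile_probes(target)
all_pairs_overlap = all(
    shares_overlap(a, b) for a, b in zip(probes, probes[1:])
)
```

With a 90-base step, the 300-base target yields three probes starting at positions 0, 90 and 180, and each neighbouring pair shares a 30-base sequence.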

57. A method for enriching a cfDNA sample, comprising the following steps:

contacting a converted or unconverted cfDNA sample with the composition of any one of claims 1 to 56; and

enriching the cfDNA sample for cfDNA corresponding to a first set of genomic regions by hybridization capture.

58. The method of claim 57, wherein: the sample of cfDNA is a converted cfDNA sample.

59. A method for obtaining sequence information providing information on the presence or absence of a cancer or a type of cancer, characterized by: the method comprises the following steps: sequencing the enriched converted cfDNA prepared via the method of claim 57 or 58.

60. A method for determining the presence or absence of cancer in a subject, comprising the following steps:

a) capturing a plurality of cfDNA fragments from the subject with the composition of any one of claims 1 to 56,

b) sequencing the captured plurality of cfDNA fragments, and

c) applying a trained classifier to the plurality of cfDNA sequences to determine the presence or absence of cancer.

61. The method of claim 60, wherein: a false positive determination of the presence or absence of cancer is less than 1% likely, and an accurate determination of the presence or absence of cancer is at least 40% likely.

62. The method of claim 60, wherein: the cancer is a first stage cancer, the likelihood of a false positive determination of the presence or absence of cancer is less than 1%, and the likelihood of an accurate determination of the presence or absence of cancer is at least 10%.
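Claims 61 and 62 pair a bound on false positives with a likelihood of accurate detection. One common way to evaluate such figures, offered here as an assumption rather than as the applicants' method, is to fix a score cutoff from non-cancer samples and then measure the fraction of cancer samples above it. The function names and the scores below are hypothetical.

```python
import math


def cutoff_for_specificity(non_cancer_scores, specificity):
    """Score cutoff such that the fraction of non-cancer scores strictly
    above it is at most 1 - specificity."""
    ordered = sorted(non_cancer_scores)
    allowed_fp = math.floor(len(ordered) * (1 - specificity) + 1e-9)
    return ordered[len(ordered) - allowed_fp - 1]


def positive_rate(scores, cutoff):
    """Fraction of samples called positive (score strictly above cutoff)."""
    return sum(s > cutoff for s in scores) / len(scores)


# Hypothetical classifier scores for 200 non-cancer and 5 cancer samples.
non_cancer = [i / 1000 for i in range(200)]
cancer = [0.05, 0.15, 0.25, 0.6, 0.9]

cutoff = cutoff_for_specificity(non_cancer, specificity=0.995)
false_positive_rate = positive_rate(non_cancer, cutoff)
sensitivity = positive_rate(cancer, cutoff)
```

On these made-up scores the false positive rate is 0.5% and three of the five cancer samples are detected.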

63. The method of any one of claims 60 to 62, wherein: the plurality of cfDNA fragments are converted cfDNA fragments.

64. A method for detecting a type of cancer, comprising the following steps:

a) capturing a plurality of cfDNA fragments from a subject with a composition comprising a plurality of different oligonucleotide decoys,

b) sequencing the captured plurality of cfDNA fragments, and

c) applying a trained classifier to the plurality of cfDNA sequences to determine a cancer type; wherein the plurality of oligonucleotide decoys are configured to hybridize to a plurality of cfDNA fragments derived from a plurality of genomic regions of interest,

wherein the plurality of genomic regions of interest are differentially methylated in one or more cancer types as compared to in a different cancer type or a non-cancer type,

wherein the likelihood of a false positive determination of cancer is less than 1%, and

wherein a likelihood of an accurate assignment of a cancer type is at least 75%, at least 80%, at least 85%, at least 89%, or at least 90%.

65. The method of claim 64, wherein: the method further comprises: applying the trained classifier to the plurality of cfDNA sequences to determine the presence or absence of cancer prior to determining the cancer type.

66. The method of claim 64 or 65, wherein: the cancer type is a first stage cancer, and the likelihood of an accurate assignment is at least 75%.

67. The method of claim 64 or 65, wherein: the cancer type is a second stage cancer, and the likelihood of an accurate assignment is at least 85%.

68. The method of any one of claims 60 to 67, wherein: the cancer type is prostate cancer and a likelihood of an accurate assignment of prostate cancer is at least 85% or at least 95%.

69. The method of any one of claims 64 to 67, wherein: the cancer type is breast cancer, and the likelihood of an accurate assignment of breast cancer is at least 90% or at least 95%.

70. The method of any one of claims 64 to 67, wherein: the cancer type is uterine cancer, and the likelihood of an accurate assignment of uterine cancer is at least 90% or at least 95%.

71. The method of any one of claims 64 to 67, wherein: the cancer type is ovarian cancer, and the likelihood of an accurate assignment of ovarian cancer is at least 85% or at least 90%.

72. The method of any one of claims 64 to 67, wherein: the cancer types are bladder cancer and urothelial cancer, and the likelihood of an accurate assignment of bladder cancer and urothelial cancer is at least 90% or at least 95%.

73. The method of any one of claims 64 to 67, wherein: the cancer type is colorectal cancer, and the likelihood of an accurate assignment of colorectal cancer is at least 65% or at least 70%.

74. The method of any one of claims 64 to 67, wherein: the cancer types are liver cancer and cholangiocarcinoma, and the likelihood of an accurate assignment of liver cancer and cholangiocarcinoma is at least 90% or at least 95%.

75. The method of any one of claims 64 to 67, wherein: the cancer types are pancreatic cancer and gallbladder cancer, and the likelihood of an accurate assignment of pancreatic cancer and gallbladder cancer is at least 85% or at least 90%.

76. The method of any one of claims 63 to 67, wherein: the plurality of cfDNA fragments are converted cfDNA fragments.

77. The method of any one of claims 61 to 76, wherein: the cancer type is selected from uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancers, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer and anorectal cancer.

78. The method of any one of claims 63 to 76, wherein: the cancer type is selected from anal cancer, bladder cancer, colorectal cancer, esophageal cancer, head and neck cancer, liver/bile duct cancer, lung cancer, lymphoma, ovarian cancer, pancreatic cancer, plasma cell tumor, and gastric cancer.

79. The method of any one of claims 63 to 76, wherein: the cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid neoplasm, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer.

80. The method of any one of claims 63 to 79, wherein: the likelihood of detecting a sarcoma is at least 35% or at least 40%.

81. The method of any one of claims 63 to 76, wherein: the likelihood of detecting stage three or stage four renal cancer is at least 50% or at least 70%.

82. The method of any one of claims 63 to 76, wherein: the likelihood of detecting stage three or stage four breast cancer is at least 70% or at least 85%.

83. The method of any one of claims 63 to 76, wherein: the likelihood of detecting stage three or stage four uterine cancer is at least 50%.

84. The method of any one of claims 63 to 76, wherein: the likelihood of detecting ovarian cancer is at least 60% or at least 80%.

85. The method of any one of claims 63 to 76, wherein: the likelihood of detecting bladder cancer is at least 35% or at least 40%.

86. The method of any one of claims 63 to 76, wherein: the likelihood of detecting anorectal cancer is at least 60% or at least 70%.

87. The method of any one of claims 63 to 76, wherein: the likelihood of detecting head and neck cancer is at least 75% or at least 80%.

88. The method of any one of claims 63 to 76, wherein: the likelihood of detecting first stage head and neck cancer is at least 80%.

89. The method of any one of claims 63 to 76, wherein: the likelihood of detecting colorectal cancer is at least 50% or at least 59%.

90. The method of any one of claims 63 to 76, wherein: the likelihood of detecting liver cancer is at least 75% or at least 80%.

91. The method of any one of claims 63 to 76, wherein: the likelihood of detecting pancreatic cancer and gallbladder cancer is at least 64% or at least 70%.

92. The method of any one of claims 63 to 76, wherein: the likelihood of detecting upper digestive tract cancer is at least 60% or at least 68%.

93. The method of any one of claims 63 to 76, wherein: the likelihood of detecting multiple myeloma is at least 65% or at least 75%.

94. The method of any one of claims 63 to 76, wherein: the likelihood of detecting stage I multiple myeloma is at least 60%.

95. The method of any one of claims 63 to 76, wherein: the likelihood of detecting lymphoma is at least 65% or at least 69%.

96. The method of any one of claims 63 to 76, wherein: the likelihood of detecting lung cancer is at least 50% or at least 58%.

97. The method of any one of claims 63 to 96, wherein: the composition comprising several oligonucleotide decoys is the composition of any one of claims 1 to 56.

98. The method of any one of claims 63 to 97, wherein: the plurality of genomic regions comprises: no more than 95000 genomic regions, no more than 60000 genomic regions, no more than 40000 genomic regions, no more than 35000 genomic regions, no more than 20000 genomic regions, no more than 15000 genomic regions, no more than 8000 genomic regions, no more than 4000 genomic regions, no more than 2000 genomic regions, or no more than 1400 genomic regions.

99. The method of any one of claims 61 to 98, wherein: the total size of the plurality of genomic regions is less than 4 MB, less than 2 MB, less than 1 MB, less than 0.7 MB, or less than 0.4 MB.
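The size limits in claims 98 and 99 can be checked by counting regions and summing the bases they cover. The sketch below assumes that "MB" denotes megabases, that regions are given as half-open (chromosome, start, end) intervals, and that overlapping regions should be counted once; the function name `total_panel_size` and the coordinates are hypothetical.

```python
def total_panel_size(regions):
    """Total bases covered by (chrom, start, end) half-open intervals,
    counting overlapping stretches only once."""
    total = 0
    current = None
    for chrom, start, end in sorted(regions):
        if current and current[0] == chrom and start <= current[2]:
            # Overlapping or adjacent on the same chromosome: extend.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total


# Hypothetical target regions.
panel = [("chr1", 100, 400), ("chr1", 350, 600), ("chr2", 0, 250)]
size_bases = total_panel_size(panel)
size_megabases = size_bases / 1_000_000
within_limit = size_megabases < 0.4 and len(panel) <= 1400
```

Here the two chr1 intervals merge into one 500-base stretch, so the toy panel covers 750 bases in total.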

100. The method of any one of claims 61 to 99, wherein: the subject has a high risk of one or more cancer types.

101. The method of any one of claims 61 to 100, wherein: the subject exhibits symptoms associated with one or more cancer types.

102. The method of any one of claims 61 to 101, wherein: the subject has not been diagnosed with cancer.

103. The method of any one of claims 61 to 102, wherein: the classifier is trained on a number of converted DNA sequences derived from at least 100 subjects having a first cancer type, at least 100 subjects having a second type of cancer, and at least 100 subjects not having cancer.

104. The method of claim 103, wherein: the first cancer type is ovarian cancer.

105. The method of claim 103, wherein: the first cancer type is liver cancer.

106. The method of claim 103, wherein: the first cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid tumors, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, pancreatic cancer, gallbladder cancer, esophageal cancer, gastric cancer, multiple myeloma, lymphoma, lung cancer, and leukemia.

107. The method of any one of claims 61 to 106, wherein: the classifier is trained on several converted DNA sequences derived from at least 1000, at least 2000, or at least 4000 target genomic regions selected from any one of lists 1 to 16.

108. The method of claim 107, wherein: the trained classifier determines the presence or absence of cancer or a type of cancer by:

a) generating a set of features for a sample, wherein each feature in the set of features comprises a numerical value;

b) inputting the set of features into the classifier, wherein the classifier comprises a multinomial classifier;

c) based on the set of features, determining a set of probability scores at the classifier, wherein the set of probability scores comprises one probability score for each cancer type class and each non-cancer type class; and

d) thresholding the set of probability scores based on one or more values determined during training of the classifier to determine a final cancer classification for the sample.

109. The method of claim 108, wherein: the set of features includes a set of binarized features.

110. The method of any one of claims 108 to 109, wherein: the numerical value comprises a single binary value.

111. The method of any one of claims 108 to 110, wherein: the multinomial classifier includes a multinomial logistic regression ensemble trained to predict a tissue of origin for the cancer.

112. The method of any one of claims 108 to 111, wherein: the method further comprises the step of: determining the final cancer classification by comparing the difference between the two highest probability scores with a minimum value, wherein the minimum value corresponds to a predefined proportion of training cancer samples that were assigned the correct cancer type as the highest score during training of the classifier.

113. The method of claim 112, wherein:

a) upon determining that the highest two probability scores differ by more than the minimum value, assigning a cancer label as the final cancer classification, the cancer label corresponding to the highest probability score determined by the classifier; and

b) upon determining that the highest two probability scores differ by no more than the minimum value, assigning an uncertain cancer label as the final cancer classification.
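The decision rule of claims 108 to 113 can be sketched as follows. This is a minimal illustration under stated assumptions: a single softmax (multinomial logistic) model stands in for the ensemble, and the weights, biases, class labels, and margin value are all invented for the example rather than taken from the disclosure.

```python
import math


def softmax(logits):
    """Convert per-class logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases, labels, min_margin):
    """Score binarized features per class, then apply the top-two margin rule."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    ranked = sorted(zip(probs, labels), reverse=True)
    (top_p, top_label), (second_p, _) = ranked[0], ranked[1]
    if top_p - second_p > min_margin:
        return top_label, probs
    return "uncertain", probs


# Hypothetical model: three classes, four binarized features.
labels = ["non-cancer", "lung", "breast"]
weights = [[-1, -1, -1, -1], [2, 0, 1, 0], [0, 2, 0, 1]]
biases = [0.5, 0.0, 0.0]

clear_label, clear_probs = classify([1, 0, 1, 1], weights, biases, labels, 0.2)
tied_label, _ = classify([1, 1, 1, 1], weights, biases, labels, 0.2)
```

The first sample scores clearly highest for one class and receives that label; the second produces equal scores for two cancer types, so the margin rule returns the uncertain label.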

114. A method of treating a cancer type in a subject in need thereof, comprising the following steps:

a) detecting the type of cancer by the method of any one of claims 61 to 113; and

b) administering an anti-cancer therapeutic to the subject.

115. The method of claim 114, wherein: the anti-cancer therapeutic is a chemotherapeutic selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, antitumor antibiotics, cytoskeletal disruptors, topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents.

116. A cancer assay detection combination, characterized in that: the cancer assay detection combination comprises:

at least 500 pairs of probes, wherein each of the at least 500 pairs of probes comprises: two probes configured to overlap each other by an overlapping sequence,

wherein the overlapping sequence comprises a 30-nucleotide sequence, and

wherein the 30-nucleotide sequence is configured to hybridize to a converted cfDNA molecule corresponding to, or derived from, one or more genomic regions, wherein each of the genomic regions comprises at least five methylation sites, and wherein the at least five methylation sites have an aberrant methylation pattern in a plurality of cancer samples.

117. A cancer assay detection combination as claimed in claim 116 wherein: each of the at least 500 pairs of probes is conjugated to a non-nucleotide affinity moiety.

118. A cancer assay detection combination as claimed in claim 117 wherein: the non-nucleotide affinity moiety is a biotin moiety.

119. A cancer assay detection combination according to any one of claims 116 to 118, wherein: the plurality of cancer samples are from a plurality of subjects having a cancer selected from the group consisting of breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of the renal pelvis, renal cancer other than urothelial cancer, prostate cancer, anorectal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, pancreatic cancer, upper gastrointestinal squamous cell carcinoma, upper gastrointestinal cancer other than squamous cell carcinoma, head and neck cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer, lung cancer other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, multiple myeloma, lymphoma, and leukemia.

120. A cancer assay detection combination according to any one of claims 116 to 119, wherein: the aberrant methylation pattern has a p-value rarity in the plurality of cancer samples that meets at least a threshold value.

121. A cancer assay detection combination according to any one of claims 116 to 120, wherein: each of the several probes is designed to have less than 20 off-target genomic regions.

122. A cancer assay detection combination as claimed in claim 121 wherein: the fewer than 20 off-target genomic regions are identified using a k-mer seeding strategy.

123. A cancer assay detection combination as claimed in claim 122 wherein: the fewer than 20 off-target genomic regions are identified using a k-mer seeding strategy in combination with local alignment at the seed sites.
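The off-target screen of claims 121 to 123 can be sketched as a seed-and-verify search. The code below is a simplified stand-in, not the applicants' pipeline: exact k-mer matches nominate candidate sites, and an ungapped mismatch count replaces a true local alignment. All function names, the toy genome, and the parameter values are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in `genome` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def candidate_sites(probe, index, k):
    """Genome start positions implied by any exact k-mer seed hit."""
    sites = set()
    for j in range(len(probe) - k + 1):
        for pos in index.get(probe[j:j + k], ()):
            sites.add(pos - j)
    return sites


def count_off_targets(probe, genome, index, k, on_target_start, max_mismatches):
    """Count seeded sites, other than the intended one, where the probe
    matches with at most `max_mismatches` substitutions (no gaps)."""
    count = 0
    for start in candidate_sites(probe, index, k):
        if start == on_target_start or start < 0:
            continue
        if start + len(probe) > len(genome):
            continue
        window = genome[start:start + len(probe)]
        if sum(a != b for a, b in zip(probe, window)) <= max_mismatches:
            count += 1
    return count


# Toy genome: the probe's target at position 4 and a near-copy at 18.
genome = "TTTT" + "ACGTACGGAC" + "CCCC" + "ACGTACGGTC" + "GGGG"
probe = "ACGTACGGAC"
index = build_kmer_index(genome, k=4)
off_targets = count_off_targets(probe, genome, index, 4, 4, max_mismatches=2)
passes_screen = off_targets < 20
```

A probe would be kept only if its off-target count stays below the chosen limit; here the near-copy at position 18 is the single off-target site found.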

124. A cancer assay detection combination according to any one of claims 116 to 123, wherein: the cancer assay detection combination comprises: at least 10000, 50000, 100000, 200000, 300000, 400000, 500000, 600000, 700000 or 800000 probes.

125. A cancer assay detection combination according to any one of claims 116 to 124, wherein: the at least 500 pairs of probes collectively comprise at least 2 million, 3 million, 4 million, 5 million, 6 million, 8 million, 12 million, 14 million, or 15 million nucleotides.

126. A cancer assay detection combination according to any one of claims 116 to 125, wherein: each of the several probes comprises at least 50, 75, 100, or 120 nucleotides.

127. A cancer assay detection combination according to any one of claims 116 to 126, wherein: each of the number of probes includes less than 300, 250, 200, or 150 nucleotides.

128. A cancer assay detection combination according to any one of claims 116 to 127, wherein: each of the several probes comprises 100 to 150 nucleotides.

129. A cancer assay detection combination according to any one of claims 116 to 128, wherein: each of the several probes includes less than 20, 15, 10, 8, or 6 methylation sites.

130. A cancer assay detection combination according to any one of claims 116 to 129, wherein: at least 80, 85, 90, 92, 95, or 98% of the at least five methylation sites are methylated or unmethylated in the plurality of cancer samples.

131. A cancer assay detection combination according to any one of claims 116 to 130, wherein: at least 3%, 5%, 10%, 15%, or 20% of the plurality of probes do not include guanine (G).
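One plausible reading of claim 131, stated here as an assumption, is that a probe complementary to a fully converted, unmethylated fragment contains no guanine, because that fragment has no cytosine left to pair with. The sketch below computes the guanine-free fraction of a probe set; the function names and the two example fragments are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def fraction_without_g(probes):
    """Fraction of probes that contain no guanine."""
    return sum("G" not in p.upper() for p in probes) / len(probes)


# Converted reads of one fragment in two methylation states:
# with unmethylated CpGs every C reads as T; with methylated CpGs
# the CpG cytosines remain C.
unmethylated_converted = "ATGTTTGA"
methylated_converted = "ACGTTCGA"

probes = [
    reverse_complement(unmethylated_converted),
    reverse_complement(methylated_converted),
]
g_free_fraction = fraction_without_g(probes)
```

Only the probe against the unmethylated state is guanine-free, so the fraction for this two-probe set is one half.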

132. A cancer assay detection combination according to any one of claims 116 to 131, wherein: each of the number of probes comprises a plurality of binding sites to the number of methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the plurality of binding sites comprise only CpG or CpA.

133. A cancer assay detection combination according to any one of claims 116 to 132, wherein: each of the number of probes is configured to have less than 15, 10, or 8 off-target genomic regions.

134. A cancer assay detection combination according to any one of claims 116 to 133, wherein: at least 30% of the several genomic regions are in exons or introns.

135. A cancer assay detection combination according to any one of claims 116 to 134, wherein: at least 15% of the several genomic regions are in exons.

136. A cancer assay detection combination according to any one of claims 116 to 135, wherein: at least 20% of the several genomic regions are in exons.

137. A cancer assay detection combination according to any one of claims 116 to 136, wherein: less than 10% of the several genomic regions are in intergenic regions.

138. A cancer assay detection combination according to any one of claims 116 to 137, wherein: the number of genomic regions is selected from any one of lists 1-3 or lists 4-16.

139. A cancer assay detection combination according to any one of claims 116 to 138, wherein: the number of genomic regions comprises at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the number of genomic regions in any of lists 1-3 or lists 4-16.

140. A cancer assay detection combination according to any one of claims 116 to 139, wherein: the number of genomic regions comprises at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, 50000, 60000, or 70000 genomic regions in any one of lists 1 to 3 or lists 4 to 16.

141. A cancer assay detection combination comprising a plurality of probes, wherein: each of the number of probes is configured to hybridize to a converted cfDNA molecule corresponding to one or more of the number of genomic regions of any of lists 1-3 or 4-16.

142. A cancer assay detection combination as claimed in claim 141 wherein: the number of probes are configured together to hybridize to a number of converted cfDNA molecules corresponding to at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the number of genomic regions of any of lists 1-3 or lists 4-16.

143. A cancer assay detection combination as claimed in claim 141 wherein: the number of probes are configured together to hybridize to a number of converted cfDNA molecules corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 genomic regions of any of lists 1-3 or lists 4-16.

144. A cancer assay detection combination according to any one of claims 141 to 143, wherein: at least 3%, 5%, 10%, 15%, or 20% of the plurality of probes do not include guanine (G).

145. A cancer assay detection combination according to any one of claims 141 to 144, wherein: each of the number of probes comprises a plurality of binding sites to a number of methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the plurality of binding sites comprise only CpG or CpA.

146. A cancer assay detection combination according to any one of claims 141 to 145, wherein: each of the plurality of probes is conjugated to a non-nucleotide affinity moiety.

147. A cancer assay detection combination as claimed in claim 146 wherein: the non-nucleotide affinity moiety is a biotin moiety.

148. A method of determining a source tissue of a cancer, comprising: the method comprises the following steps:

a) receiving a sample, the sample comprising a plurality of cfDNA molecules;

b) processing the plurality of cfDNA molecules to convert unmethylated cytosine (C) to uracil (U), thereby obtaining a plurality of converted cfDNA molecules;

c) applying the cancer assay detection combination of any one of claims 116-147 to the number of converted cfDNA molecules, thereby enriching a subset of the number of converted cfDNA molecules; and

d) sequencing the enriched subset of the converted cfDNA molecules, thereby providing a set of sequence reads.
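As an illustration of step b), the following minimal Python sketch simulates the conversion in silico on a single strand. The function name `convert_strand` and the representation of methylated positions as a set of zero-based indices are assumptions made for this example only; they are not part of the claimed method.

```python
def convert_strand(seq, methylated_positions):
    """Simulate conversion of one strand.

    Unmethylated cytosines become uracil (U); cytosines whose index is in
    `methylated_positions` are protected and stay as C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("U")
        else:
            out.append(base)
    return "".join(out)


# Hypothetical fragment: only the cytosine at index 1 is methylated.
fragment = "ACGTCCGA"
converted = convert_strand(fragment, methylated_positions={1})
```

After amplification the uracils are read as thymine, so a sequencer would report the converted fragment with T at each unmethylated cytosine position.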

149. The method of claim 148, wherein: the method further comprises the steps of: determining a health condition by evaluating the set of sequence reads, wherein the health condition is

a) the presence or absence of cancer;

b) the presence or absence of a cancer of a source tissue;

c) the presence or absence of a cancer cell type; or

d) the presence or absence of at least 5, 10, 15 or 20 different types of cancer.

150. The method of any one of claims 148 to 149, wherein: the sample includes a number of cfDNA molecules obtained from a human subject.

151. A method for detecting a cancer, comprising: the method comprises the following steps:

a) obtaining a set of sequence reads by sequencing a set of nucleic acid fragments from a subject, wherein the plurality of nucleic acid fragments correspond to or are derived from a plurality of genomic regions selected from any one of lists 1-3 or lists 4-16;

b) determining, for each of the plurality of nucleic acid fragments, a methylation status at a plurality of CpG sites; and

c) detecting a health status of the subject by evaluating methylation status of the plurality of sequence reads, wherein the health status is: (i) the presence or absence of a cancer; (ii) the presence or absence of a cancer of a source tissue; (iii) the presence or absence of a cancer cell type; or (iv) the presence or absence of at least 5, 10, 15 or 20 different types of cancer.
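Step b) of the method above can be sketched as a comparison between a converted read and the reference sequence. This is a simplified illustration that assumes the read is already aligned to the top strand of the reference with no insertions or deletions; the function name `cpg_methylation_states` and the `M`/`U`/`?` labels are hypothetical.

```python
def cpg_methylation_states(reference, converted_read):
    """Report the methylation state at every CpG site of the reference.

    'M' = read retains C (methylated), 'U' = read shows T (unmethylated),
    '?' = any other base (ambiguous, for example a sequencing error).
    """
    states = []
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            base = converted_read[i]
            if base == "C":
                states.append((i, "M"))
            elif base == "T":
                states.append((i, "U"))
            else:
                states.append((i, "?"))
    return states


# Hypothetical reference with CpG sites at indices 1, 5, and 8.
reference = "ACGTTCGACG"
read = "ACGTTTGACG"
states = cpg_methylation_states(reference, read)
```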

152. The method of claim 151, wherein: the number of genomic regions comprises at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the number of genomic regions in any of lists 1-3 or lists 4-16.

153. The method of claim 151, wherein: the number of genomic regions comprises at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, 50000, 60000, 70000, or 80000 genomic regions of the number of genomic regions in any one of lists 1-3 or lists 4-16.

154. A method of designing a cancer assay detection combination for diagnosing cancer in a source tissue, comprising: the method comprises the following steps:

a) identifying a plurality of genomic regions, wherein each of the plurality of genomic regions (i) comprises at least 30 nucleotides and (ii) comprises at least five methylation sites,

b) selecting a subset of the genomic regions, wherein the selecting is performed when a number of cfDNA molecules corresponding to or derived from each of the genomic regions in a number of cancer samples have an aberrant methylation pattern, wherein the aberrant methylation pattern comprises at least five hypomethylated or hypermethylated methylation sites, and

c) designing a cancer assay detection combination comprising a plurality of probes, wherein each of the plurality of probes is configured to hybridize to a converted cfDNA molecule corresponding to or derived from one or more of the subsets of the plurality of genomic regions.

155. A bait set for hybrid capture, comprising: the bait set comprises a plurality of different oligonucleotide-containing probes, wherein each of the plurality of oligonucleotide-containing probes comprises a sequence of at least 30 bases in length that is complementary to any one of:

(1) a sequence of a genomic region; or

(2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region; and

wherein the plurality of different oligonucleotide-containing probes are complementary to a sequence corresponding to a CpG site that is differentially methylated in samples from subjects from a first cancer type as compared to samples from subjects from a second cancer type or a non-cancer type.
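The relationship between options (1) and (2) can be illustrated with a small Python sketch that applies C-to-T transitions to a region and takes the reverse complement as the probe sequence. The helper names (`converted_target`, `design_probe`) and the choice to model only the top strand with fully methylated or fully unmethylated CpG sites are assumptions for illustration; the claim itself does not prescribe a design procedure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def converted_target(region, cpg_methylated):
    """Apply C->T transitions to the top strand of a genomic region.

    Non-CpG cytosines are always converted. CpG cytosines are kept as C
    only when `cpg_methylated` is True.
    """
    region = region.upper()
    out = []
    for i, base in enumerate(region):
        is_cpg = base == "C" and i + 1 < len(region) and region[i + 1] == "G"
        if base == "C" and not (is_cpg and cpg_methylated):
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def design_probe(region, cpg_methylated):
    """Probe complementary to the converted target sequence."""
    return reverse_complement(converted_target(region, cpg_methylated))


# Hypothetical 10-base region (real probes are at least 30 bases long).
region = "ACCGTACGCA"
hyper_probe = design_probe(region, cpg_methylated=True)
hypo_probe = design_probe(region, cpg_methylated=False)
```

In this toy case the probe for the unmethylated target contains no guanine, and each CpG detection site reads as CpG in the probe for the methylated target and as CpA in the probe for the unmethylated target.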

156. The bait set of claim 155, wherein: the first cancer type and the second cancer type are selected from uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer.

157. The bait set of any one of claims 155 to 156, wherein: the bait set comprises at least 500, 1000, 2000, 2500, 5000, 6000, 7500, 10000, 15000, 20000, 25000, 50000, 100000, 200000, 300000, 500000 or 800000 different oligonucleotide-containing probes.

158. The bait set of any one of claims 155 to 157, wherein: for each of the several different oligonucleotide-containing probes, the sequence that is at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of lists 1 to 16; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

159. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of lists 1 to 3; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

160. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of list 5 or 7; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

161. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of lists 4, 6, or 8 to 12; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

162. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of lists 13 to 16; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

163. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of lists 13 to 16; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

164. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the plurality of genomic region groups of any one of list 4 or 6; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

165. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 4; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

166. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 8; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

167. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 9; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

168. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 10; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

169. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 11; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

170. The bait set of claim 158, wherein: the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from said plurality of genomic region groups in list 12; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.

171. The bait set of any one of claims 155 to 170, wherein: each of the plurality of different oligonucleotide-containing probes is conjugated to an affinity moiety.

172. The bait set of claim 171, wherein: the affinity moiety is biotin.

173. The bait set of any one of claims 155 to 172, wherein: at least 80%, 90%, or 95% of the plurality of oligonucleotide-containing probes in the bait set do not comprise: a sequence of at least 30, at least 40, or at least 45 bases in the genome having 20 or more off-target regions.

174. The bait set of any one of claims 155 to 173, wherein: the plurality of oligonucleotide-containing probes in the bait set does not comprise: a sequence of at least 30, at least 40, or at least 45 bases in the genome having 20 or more off-target regions.

175. The bait set of any one of claims 155 to 174, wherein: the sequence of at least 30 bases of each of the number of probes is at least 40, at least 45, at least 50, at least 60, at least 75, or at least 100 bases in length.

176. The bait set of any one of claims 155 to 175, wherein: each of the plurality of oligonucleotide-containing probes has a nucleic acid sequence of at least 45, 50, 75, 100, or 120 bases in length.

177. The bait set of any one of claims 155 to 176, wherein: each of the plurality of oligonucleotide-containing probes has a nucleic acid sequence of no more than 300, 250, 200, or 150 bases in length.

178. The bait set of any one of claims 155 to 177, wherein: each of the plurality of different oligonucleotide-containing probes is between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length.

179. The bait set of any one of claims 155 to 178, wherein: the plurality of different oligonucleotide-containing probes comprises at least 500, at least 1000, at least 2000, at least 2500, at least 5000, at least 6000, at least 7500, at least 10000, at least 15000, at least 20000, or at least 25000 different probe pairs, wherein each probe pair comprises a first probe and a second probe, wherein the second probe is different from the first probe and overlaps the first probe by an overlapping sequence, the overlapping sequence being at least 30, at least 40, at least 50, or at least 60 nucleotides in length.

180. The bait set of any one of claims 155 to 179, wherein: the bait set comprises oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in any of lists 1-16.

181. The bait set of claim 180, wherein: the bait set comprises oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in any of lists 1-3.

182. The bait set of claim 180, wherein: the bait set comprises oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in any of lists 4-12.

183. The bait set of claim 180, wherein: the bait set comprises oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in any of lists 4, 6, or 8-12.

184. The bait set of claim 180, wherein: the bait set comprises oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in list 8.

185. The bait set of any one of claims 155 to 184, wherein: the oligonucleotide probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 1-16.

186. The bait set of claim 185, wherein: the oligonucleotide probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 1-3.

187. The bait set of claim 185, wherein: the oligonucleotide probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 4-12.

188. The bait set of claim 185, wherein: the oligonucleotide probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any of lists 4, 6, or 8 to 12.

189. The bait set of claim 185, wherein: the oligonucleotide probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in list 8.

190. The bait set of any one of claims 155 to 189, wherein: the oligonucleotide-containing probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 50000, or 80000 genomic regions of any of lists 1 to 16.

191. The bait set of claim 190, wherein: the oligonucleotide-containing probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 50000, or 80000 genomic regions of any of lists 1 to 3.

192. The bait set of claim 190, wherein: the oligonucleotide-containing probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 50000, or 80000 genomic regions of any of lists 4 to 12.

193. The bait set of claim 190, wherein: the oligonucleotide-containing probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 50000, or 80000 genomic regions of any of lists 4, 6, or 8 to 12.

194. The bait set of claim 190, wherein: the oligonucleotide-containing probes in the bait set are collectively configured to hybridize to fragments obtained from cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 50000, or 80000 genomic regions in list 8.

195. The bait set of any one of claims 155 to 194, wherein: the plurality of oligonucleotide-containing probes comprises at least 500, 1000, 5000, or 10000 distinct probe subsets, wherein each probe subset comprises a plurality of probes collectively extending across a genomic region selected from the genomic regions of any one of lists 1 to 16 in a 2× tiled manner.

196. The bait set of any one of claims 155 to 195, wherein: the plurality of oligonucleotide-containing probes comprises at least 500, 1000, 5000, or 10000 distinct probe subsets, wherein each probe subset comprises a plurality of probes collectively extending across a genomic region selected from the genomic regions of any one of lists 1-4, 6, or 8-12 in a 2× tiled manner.

197. The bait set of any one of claims 195 to 196, wherein: the plurality of probes collectively extending across the genomic region in a 2× tiled manner includes at least one pair of probes that overlap by a sequence of at least 30 bases, at least 40 bases, at least 50 bases, or at least 60 bases in length.
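The tiling arrangement recited in the preceding claims can be sketched in Python as follows. The sketch assumes a probe length of 120 bases and a step of half the probe length, so that adjacent probes overlap by 60 bases; the function name `tile_2x` and the half-open coordinate convention are choices made for this illustration, not requirements of the claims.

```python
def tile_2x(region_start, region_end, probe_len=120):
    """Return half-open (start, end) probe intervals tiled with 50% overlap.

    Probes are placed every probe_len // 2 bases until the region end is
    covered, so interior bases are covered by two probes.
    """
    step = probe_len // 2
    probes = []
    pos = region_start
    while True:
        probes.append((pos, pos + probe_len))
        if pos + probe_len >= region_end:
            break
        pos += step
    return probes


# Hypothetical 300-base region.
probes = tile_2x(1000, 1300)
# Overlap between each probe and the next one along the region.
overlaps = [a_end - b_start
            for (_, a_end), (b_start, _) in zip(probes, probes[1:])]
```

With these assumed parameters the region is covered by four probes, and every adjacent pair overlaps by 60 bases. The first and last 60 bases of the region are covered by only one probe in this simple version.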

198. The bait set of any one of claims 155 to 196, wherein: the plurality of probes collectively extend across a plurality of portions of the genome, a combined size of the plurality of portions being less than 4MB, less than 2MB, less than 1MB, less than 0.7MB, or less than 0.4MB.

199. The bait set of any one of claims 155 to 196, wherein: the plurality of probes collectively extend across a plurality of portions of the genome, a combined size of the plurality of portions being between 0.2MB and 30MB, between 0.5MB and 30MB, between 1MB and 30MB, between 3MB and 25MB, between 3MB and 15MB, between 5MB and 20MB, or between 7MB and 12MB.
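One way to compute the combined size of the targeted portions is to merge overlapping intervals and sum their lengths, as in the Python sketch below. It assumes half-open intervals on a single chromosome, measures size in bases, and reads "MB" as megabases; the function name `combined_size` is hypothetical.

```python
def combined_size(intervals):
    """Total number of bases covered by a list of (start, end) intervals.

    Overlapping or touching intervals are merged first so that shared
    bases are counted only once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval, if any, and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical portions: the first two overlap and merge into (100, 280).
portions = [(100, 220), (160, 280), (1000, 1120)]
size_bases = combined_size(portions)
size_megabases = size_bases / 1_000_000
```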

200. The bait set of any one of claims 155 to 199, wherein: each of the plurality of different oligonucleotide-containing probes comprises less than 20, 15, 10, 8, or 6 CpG detection sites.

201. The bait set of any one of claims 155 to 200, wherein: at least 80%, 85%, 90%, 92%, 95%, or 98% of the plurality of oligonucleotide-containing probes have only CpG or CpA at all CpG detection sites.

202. A mixture, characterized by: the mixture comprises:

converted cfDNA; and

a bait set according to any one of claims 155 to 201.

203. A mixture as set forth in claim 202, wherein: the converted cfDNA comprises bisulfite converted cfDNA.

204. A mixture as set forth in claim 202, wherein: the converted cfDNA comprises cfDNA converted via a cytosine deaminase.

205. A method for enriching a converted cfDNA sample, the method comprising:

contacting the converted cfDNA sample with the bait set of any one of claims 155 to 201; and

enriching the sample for a first set of genomic regions by hybrid capture.

206. A method for providing sequence information, characterized by: the sequence information may provide information on the presence or absence of a cancer or a type of cancer, the method comprising the steps of:

a) treating cfDNA from a biological sample with a deaminating agent to produce a cfDNA sample comprising a plurality of deaminated nucleotides;

b) enriching the cfDNA sample for informative cfDNA molecules; and

c) sequencing the enriched cfDNA molecules, thereby obtaining a set of sequence reads for indicating the presence or absence of a cancer or a type of cancer.

207. The method of claim 206, wherein: enriching the cfDNA comprises: amplifying portions of the cfDNA fragments by PCR using primers configured to hybridize to genomic regions selected from any one of lists 1-16.

208. The method of claim 206, wherein: enriching the cfDNA comprises: contacting the cfDNA with probes configured to hybridize to converted fragments obtained from the cfDNA molecules, the converted fragments corresponding to or derived from the genomic regions of any one of lists 1-16.

209. The method of claim 206, wherein: enriching the cfDNA comprises: contacting the cfDNA with probes configured to hybridize to converted fragments obtained from the cfDNA molecules, the converted fragments corresponding to or derived from at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions of any one of lists 1-16.

210. The method of any one of claims 206 to 209, wherein: the number of genomic regions is selected from any one of lists 1-3.

211. The method of any one of claims 206 to 209, wherein: the number of genomic regions is selected from any one of lists 4 to 12.

212. The method of any one of claims 206 to 209, wherein: the number of genomic regions is selected from any one of lists 4, 6, or 8 to 12.

213. The method of any one of claims 206 to 209, wherein: the several genomic regions are selected from list 8.

214. The method of any one of claims 206 to 213, wherein: enriching the cfDNA sample by the method of claim 25.

215. The method of any one of claims 206 to 214, wherein: the method further comprises: determining a cancer classification by evaluating the set of sequence reads, wherein the cancer classification is

a) the presence or absence of cancer; or

b) the presence or absence of a type of cancer.

216. The method of claim 215, wherein: the step of determining a cancer classification comprises the steps of:

a) generating a detection feature vector based on the set of sequence reads; and

b) applying the detection feature vector to a classifier.
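The two steps above can be sketched in Python as follows. The claim does not specify how features are encoded or which model family is used, so everything here is an assumption for illustration: one binary feature per genomic region (set when any read in that region is flagged as anomalously methylated), a logistic model, and made-up weights and bias.

```python
import math


def feature_vector(reads, regions):
    """One binary feature per region: 1 if any read there is flagged anomalous."""
    flagged = {r["region"] for r in reads if r["anomalous"]}
    return [1 if region in flagged else 0 for region in regions]


def logistic_score(features, weights, bias):
    """Apply a logistic model (hypothetical weights) to the feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical regions and reads; the 'anomalous' flag is assumed to come
# from an upstream methylation analysis that is not shown here.
regions = ["region_1", "region_2", "region_3"]
reads = [
    {"region": "region_1", "anomalous": True},
    {"region": "region_2", "anomalous": False},
    {"region": "region_3", "anomalous": True},
]
features = feature_vector(reads, regions)
score = logistic_score(features, weights=[1.5, 0.5, 2.0], bias=-2.0)
```

In practice the weights would come from a training process such as the one described in the following claim, rather than being set by hand.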

217. The method of claim 216, wherein: the classifier includes a model trained by a training process with a first set of cancer fragments from one or more training subjects having a first cancer type and a second set of cancer fragments from one or more training subjects having a second cancer type, wherein the first set of cancer fragments and the second set of cancer fragments include training fragments.

218. The method of any one of claims 206 to 217, wherein: the cancer classification is the presence or absence of cancer.

219. The method of claim 218, wherein: the classifier has an area under a receiver operating characteristic curve of at least 0.80.
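For reference, the area under a receiver operating characteristic curve can be computed as the fraction of cancer and non-cancer score pairs in which the cancer sample scores higher, with ties counted as one half. The Python sketch below uses this pairwise formulation on made-up scores; the function name `roc_auc` and the data are illustrative assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative.

    `labels` holds 1 for cancer and 0 for non-cancer; ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three cancer and three non-cancer samples.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
meets_threshold = auc >= 0.80
```

Here eight of the nine cancer and non-cancer pairs are ranked correctly, so the toy classifier would satisfy the 0.80 limit.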

220. The method of any one of claims 206 to 217, wherein: the cancer classification is a type of cancer.

221. The method of claim 220, wherein: the type of cancer is selected from at least 12, 14, 16, 18 or 20 cancer types.

222. The method of claim 220, wherein: the several cancer types are selected from uterine cancer, upper gastrointestinal squamous cancer, all other upper gastrointestinal cancer, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer.

223. The method of claim 220, wherein: the several cancer types are selected from anal, bladder, colorectal, esophageal, head and neck, liver/bile duct, lung, lymphoma, ovarian, pancreatic, plasma cell tumor, and gastric cancer.

224. The method of claim 220, wherein: the several cancer types are selected from thyroid cancer, melanoma, sarcoma, myeloid tumors, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer.

225. The method of any one of claims 220 to 224, wherein:

wherein at a specificity of 99%, the method has a sensitivity to head and neck cancer of at least 79% or at least 84%;

wherein the sensitivity of the method to liver cancer is at least 82% or at least 85% at a specificity of 99%;

wherein at 99% specificity, the method has a sensitivity to upper digestive tract cancer of at least 62% or at least 68%;

wherein at a specificity of 99%, the method has a sensitivity to pancreatic cancer or gallbladder cancer of at least 62% or at least 68%;

wherein the sensitivity of the method to colorectal cancer is at least 60% or at least 65% at a specificity of 99%;

wherein at a specificity of 99%, the method has a sensitivity to ovarian cancer of at least 75% or at least 80%;

wherein the sensitivity of the method to lung cancer is at least 60% or at least 65% at a specificity of 99%;

wherein the sensitivity of the method to multiple myeloma is at least 68% or at least 75% at a specificity of 99%;

wherein the sensitivity of the method to lymphoma is at least 65% or at least 70% at a specificity of 99%;

wherein the sensitivity of the method to anorectal cancer is at least 60% or at least 65% at a specificity of 99%; and

wherein the sensitivity of the method to bladder cancer is at least 40% or at least 44% at a specificity of 99%.
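Sensitivity at a fixed specificity, as used in the clauses above, can be estimated by choosing a score threshold from non-cancer samples and then counting the cancer samples that exceed it. The following Python sketch shows one way to do this; the function name, the rule "call positive when the score is strictly above the threshold", and the scores are all assumptions for illustration.

```python
import math


def sensitivity_at_specificity(cancer_scores, noncancer_scores, specificity=0.99):
    """Return (sensitivity, threshold) at the requested specificity.

    The threshold is the smallest non-cancer score such that at least
    `specificity` of non-cancer samples fall at or below it.
    """
    ranked = sorted(noncancer_scores)
    # Number of non-cancer samples that must be called negative; the small
    # epsilon guards against floating-point error in the product.
    n_negative = math.ceil(specificity * len(ranked) - 1e-9)
    threshold = ranked[n_negative - 1]
    detected = sum(1 for s in cancer_scores if s > threshold)
    return detected / len(cancer_scores), threshold


# Hypothetical scores: 100 non-cancer samples spread evenly from 0.00 to 0.99.
noncancer_scores = [i / 100 for i in range(100)]
cancer_scores = [0.995, 0.99, 0.985, 0.5, 0.999]
sensitivity, threshold = sensitivity_at_specificity(cancer_scores, noncancer_scores)
```

With these made-up scores, exactly one of the 100 non-cancer samples lies above the threshold (99% specificity), and four of the five cancer samples are detected.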

226. The method of claim 215, wherein: the cancer classification is the presence or absence of a type of cancer.

227. The method of claim 226, wherein: the step of determining a cancer classification comprises the steps of:

a) generating a detection feature vector based on the set of sequence reads; and

b) applying the detection feature vector to a classifier.

228. The method of claim 227, wherein: the classifier includes a model trained by a training process with a first set of converted DNA sequences from one or more training subjects having a first cancer type and a second set of converted DNA sequences from one or more training subjects having a second cancer type, wherein the first set of converted DNA sequences and the second set of converted DNA sequences comprise training converted DNA sequences.

229. The method of any one of claims 226 to 228, wherein: the cancer type is selected from the group consisting of head and neck cancer, liver/bile duct cancer, upper digestive tract cancer, pancreas/gall bladder cancer, colorectal cancer, ovarian cancer, lung cancer, multiple myeloma, lymphoma, melanoma, sarcoma, breast cancer, and uterine cancer.

230. The method of any one of claims 227 to 229, wherein: the type of cancer is head and neck cancer, and the method has a sensitivity of at least 79% or at least 84% with a specificity of 99%.

231. The method of any one of claims 227 to 229, wherein: the type of cancer is liver cancer, and the method has a sensitivity of at least 82% or at least 85% at a specificity of 99%.

232. The method of any one of claims 227 to 229, wherein: the type of cancer is upper digestive tract cancer and the method has a sensitivity of at least 62% or at least 68% at a specificity of 99%.

233. The method of any one of claims 227 to 229, wherein: the type of cancer is pancreatic cancer or gallbladder cancer, and the method has a sensitivity of at least 62% or at least 68% at a specificity of 99%.

234. The method of any one of claims 227 to 229, wherein: the type of cancer is colorectal cancer, and the method has a sensitivity of at least 60% or at least 65% at a specificity of 99%.

235. The method of any one of claims 227 to 229, wherein: the type of cancer is ovarian cancer, and the method has a sensitivity of at least 75% or at least 80% at a specificity of 99%.

236. The method of any one of claims 227 to 229, wherein: the type of cancer is lung cancer, and the method has a sensitivity of at least 60% or at least 65% with a specificity of 99%.

237. The method of any one of claims 227 to 229, wherein: the type of cancer is multiple myeloma and the method has a sensitivity of at least 68% or at least 75% at a specificity of 99%.

238. The method of any one of claims 227 to 229, wherein: the type of cancer is lymphoma, and the method has a sensitivity of at least 65% or at least 70% at a specificity of 99%.

239. The method of any one of claims 227 to 229, wherein: the type of cancer is anorectal cancer and the method has a sensitivity of at least 60% or at least 65% at a specificity of 99%.

240. The method of any one of claims 227 to 229, wherein: the type of cancer is bladder cancer and the method has a sensitivity of at least 40% or at least 44% at a specificity of 99%.

241. The method of any one of claims 206 to 240, wherein: the total size of the several genomic regions of interest is less than 4MB, less than 2MB, less than 1MB, less than 0.7MB, or less than 0.4 MB.

242. The method of any one of claims 206 to 240, wherein: the step of determining a cancer classification comprises the steps of:

a) generating a detection feature vector based on the set of sequence reads; and

b) applying the detection feature vector to a model obtained by a training process using a set of cancer fragments from one or more training subjects with cancer and a set of non-cancer fragments from one or more training subjects without cancer, wherein both the set of cancer fragments and the set of non-cancer fragments comprise a plurality of training fragments.

243. The method of claim 242, wherein: the training process comprises the steps of:

a) obtaining sequence information from a plurality of training fragments of a plurality of training subjects;

b) for each training fragment, determining whether the training fragment is hypomethylated or hypermethylated, wherein each of the hypomethylated training fragments and hypermethylated training fragments comprises: at least a threshold number of CpG sites, wherein at least a threshold percentage of CpG sites are unmethylated or methylated, respectively;

c) for each training subject, generating a training feature vector based on the hypomethylated training fragments and the hypermethylated training fragments; and

d) training the model using the training feature vectors from the one or more training subjects without cancer and training feature vectors from the one or more training subjects with cancer.
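Step b) above can be illustrated with a minimal sketch. The function name `classify_fragment` and the default thresholds (5 CpG sites, 80%) are assumptions chosen for illustration; the claim only requires "a threshold number" and "a threshold percentage". The sketch assumes the fraction threshold is above 0.5, so that a fragment cannot satisfy both conditions.

```python
from typing import List, Optional

def classify_fragment(cpg_states: List[bool],
                      min_cpgs: int = 5,          # assumed threshold number of CpG sites
                      min_fraction: float = 0.8   # assumed threshold percentage (as a fraction)
                      ) -> Optional[str]:
    """Label a fragment from its per-CpG methylation calls (True = methylated).

    Returns "hyper", "hypo", or None when the fragment meets neither definition.
    """
    n = len(cpg_states)
    if n < min_cpgs:
        return None  # too few CpG sites to make a call
    methylated = sum(cpg_states)
    if methylated / n >= min_fraction:
        return "hyper"
    if (n - methylated) / n >= min_fraction:
        return "hypo"
    return None  # mixed methylation: neither hypo- nor hypermethylated
```

Fragments that return `None` would simply be left out of the hypomethylated and hypermethylated sets used in step c).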

244. The method of claim 242, wherein: the training process comprises the steps of:

a) obtaining sequence information from a plurality of training fragments of a plurality of training subjects;

b) for each training fragment, determining whether the training fragment is hypomethylated or hypermethylated, wherein each of the hypomethylated training fragments and hypermethylated training fragments comprises: at least a threshold number of CpG sites, wherein at least a threshold percentage of CpG sites are unmethylated or methylated, respectively;

c) for each of several CpG sites in a reference genome:

quantifying a count of hypomethylated training fragments that overlap said CpG site and a count of hypermethylated training fragments that overlap said CpG site; and

generating a hypomethylation score and a hypermethylation score based on the counts of the hypomethylated training fragments and the hypermethylated training fragments;

d) for each training fragment, generating a total hypomethylation score based on the hypomethylation scores of the CpG sites in the training fragment, and generating a total hypermethylation score based on the hypermethylation scores of the CpG sites in the training fragment;

e) for each training subject:

ranking the plurality of training fragments based on the total hypomethylation score, ranking the plurality of training fragments based on the total hypermethylation score; and

generating a feature vector based on the ranking of the training fragments;

f) obtaining a number of training feature vectors for one or more training subjects that do not have cancer and a number of training feature vectors for one or more training subjects that have cancer; and

g) training the model using the training feature vectors of the one or more training subjects that do not have cancer and the training feature vectors of the one or more training subjects that have cancer.
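Steps c) to e) above can be sketched as follows. The claim does not fix how the scores are computed, so every formula here is an assumption made for illustration: the per-site score is the share of overlapping fragments with that label, the per-fragment total is the mean over its CpG sites, and the feature vector holds the top-ranked totals. The names `site_scores`, `fragment_totals`, and `feature_vector` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A fragment is (label, cpg_sites): label is "hypo" or "hyper", and
# cpg_sites lists the reference CpG indices the fragment overlaps (non-empty).
Fragment = Tuple[str, List[int]]
Scores = Dict[int, Tuple[float, float]]

def site_scores(fragments: List[Fragment]) -> Scores:
    """Step c): per-CpG (hypo_score, hyper_score) from fragment counts (assumed formula)."""
    hypo, hyper = Counter(), Counter()
    for label, sites in fragments:
        (hypo if label == "hypo" else hyper).update(sites)
    scores = {}
    for site in set(hypo) | set(hyper):
        total = hypo[site] + hyper[site]
        scores[site] = (hypo[site] / total, hyper[site] / total)
    return scores

def fragment_totals(frag: Fragment, scores: Scores) -> Tuple[float, float]:
    """Step d): total (hypo, hyper) score of one fragment, here the mean over its sites."""
    _, sites = frag
    per_site = [scores.get(s, (0.0, 0.0)) for s in sites]
    return (sum(p[0] for p in per_site) / len(sites),
            sum(p[1] for p in per_site) / len(sites))

def feature_vector(fragments: List[Fragment], scores: Scores, top_k: int = 2) -> List[float]:
    """Step e): rank fragments by each total score and keep the top_k values of each ranking."""
    totals = [fragment_totals(f, scores) for f in fragments]
    hypo_ranked = sorted((t[0] for t in totals), reverse=True)[:top_k]
    hyper_ranked = sorted((t[1] for t in totals), reverse=True)[:top_k]
    pad = lambda xs: xs + [0.0] * (top_k - len(xs))  # fixed length for every subject
    return pad(hypo_ranked) + pad(hyper_ranked)
```

In this sketch the site scores would be computed once from the pooled training fragments and then reused to build one feature vector per training subject.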

245. The method of any one of claims 242 to 244, wherein: the model includes one of a kernel logistic regression classifier, a random forest classifier, a mixture model, a convolutional neural network, and an autoencoder model.

246. The method of any one of claims 242 to 245, wherein: the method further comprises the steps of:

a) obtaining a probability of cancer for the test sample based on the model; and

b) comparing the cancer probability to a threshold probability to determine whether the test sample is from a subject with cancer or without cancer.

247. The method of claim 246, wherein: the method further comprises the steps of:

a) obtaining a probability of a cancer type for the test sample based on the model; and

b) comparing the cancer type probability to a threshold probability to determine whether the test sample is from a subject with the cancer type, with another cancer type, or without cancer.

248. The method of any one of claims 246 to 247, wherein: the method further comprises: administering an anti-cancer agent to the subject.

249. A method for treating a cancer patient, the method comprising:

administering an anti-cancer agent to a subject identified as a cancer patient by the method of claim 246.

250. A method of treating a cancer patient, comprising:

administering an anti-cancer agent to a subject identified as a cancer patient by the method of claim 247.

251. The method of any one of claims 249 to 250, wherein: the anti-cancer agent is a chemotherapeutic agent selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, antitumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents.

252. A method for assessing whether a subject has a cancer, the method comprising:

obtaining cfDNA from the subject;

isolating a portion of the cfDNA from the subject by hybridization capture;

obtaining a number of sequence reads derived from the captured cfDNA to determine a number of methylation states of a number of cfDNA fragments;

applying a classifier to the sequence reads; and

determining whether the subject has cancer based on the application of the classifier;

wherein the classifier has an area under the receiver operating characteristic curve of at least 0.80.
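The area under the receiver operating characteristic curve recited above can be computed directly from classifier scores. The sketch below uses the standard equivalence between the AUC and the probability that a randomly chosen cancer sample scores higher than a randomly chosen non-cancer sample, with ties counted as one half. The function name `roc_auc` and the example scores are illustrative assumptions, not data from this disclosure.

```python
from typing import Sequence

def roc_auc(cancer_scores: Sequence[float], non_cancer_scores: Sequence[float]) -> float:
    """AUC as the fraction of (cancer, non-cancer) pairs ranked correctly; ties count half."""
    wins = 0.0
    for c in cancer_scores:
        for n in non_cancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(non_cancer_scores))

# Hypothetical scores: 8 of the 9 pairs are ranked correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.1, 0.3, 0.5])
meets_claim = auc >= 0.80
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation gives the same value for large cohorts.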

253. The method of claim 252, wherein: the method further comprises: determining the type of the cancer to be detected,

wherein the sensitivity of the method to head and neck cancer is at least 79% or at least 84%;

wherein the sensitivity of the method to liver cancer is at least 82% or at least 85%;

wherein the sensitivity of the method to upper digestive tract cancer is at least 62% or at least 68%;

wherein the sensitivity of the method to pancreatic cancer or gallbladder cancer is at least 62% or at least 68%;

wherein the sensitivity of the method to colorectal cancer is at least 60% or at least 65%;

wherein the sensitivity of the method to ovarian cancer is at least 75% or at least 80%;

wherein the method has a sensitivity to lung cancer of at least 60% or at least 65%;

wherein the sensitivity of the method to multiple myeloma is at least 68% or at least 75%;

wherein the sensitivity of the method to lymphoma is at least 65% or at least 70%;

wherein the sensitivity of the method to anorectal cancer is at least 60% or at least 65%; and

wherein the sensitivity of the method to bladder cancer is at least 40% or at least 44%.

254. The method of any one of claims 252 to 253, wherein: the total size of the several genomic regions of interest is less than 4MB, less than 2MB, less than 1MB, less than 0.7MB, or less than 0.4 MB.

255. The method of any one of claims 252 to 254, wherein: the method further comprises: converting unmethylated cytosines in the cfDNA to uracil prior to isolating the portion of the cfDNA from the subject by hybridization capture.

256. The method of any one of claims 252 to 255, wherein: the method further comprises: converting unmethylated cytosines in the cfDNA to uracil prior to isolating the portion of the cfDNA from the subject by hybridization capture.

257. The method of any one of claims 252 to 256, wherein: the classifier is a binary classifier.

258. The method of any one of claims 252 to 256, wherein: the classifier is a mixture model classifier.

259. The method of any one of claims 252 to 258, wherein: isolating the portion of the cfDNA from the subject by hybridization capture comprises: contacting the cfDNA with a bait set comprising a plurality of different oligonucleotide-containing probes.

260. The method of any one of claims 252 to 259, wherein: the bait set is the bait set of any one of claims 155 to 201.

261. A method, characterized in that the method comprises the steps of:

a) obtaining a set of sequence reads of a plurality of modified test fragments, wherein the modified test fragments are or have been obtained by processing a set of nucleic acid fragments from a test subject, wherein each of the nucleic acid fragments corresponds to or is derived from a plurality of genomic regions selected from any one of lists 1 to 16; and

b) applying the set of sequence reads or a detection feature obtained based on the set of sequence reads to a model obtained by a training process having a first set of fragments from training subjects having a first cancer type and a second set of fragments from training subjects having a second cancer type, wherein the first set of fragments and the second set of fragments comprise training fragments.

262. The method of claim 261, wherein: the model includes one of a kernel logistic regression classifier, a random forest classifier, a mixture model, a convolutional neural network, and an autoencoder model.

263. The method of any one of claims 261 to 262, wherein: the set of sequence reads is obtained by using the assay detection combination of any one of claims 155 to 201.

Background

DNA methylation plays an important role in regulating gene expression. Aberrant DNA methylation is associated with many disease processes, including cancer. DNA methylation profiling using methylation sequencing (e.g., Whole Genome Bisulfite Sequencing (WGBS)) is increasingly recognized as a valuable diagnostic tool for detecting, diagnosing, and/or monitoring cancer. For example, specific patterns of differentially methylated regions can be used as molecular markers for various diseases.

However, WGBS is not ideally suited for a product assay. Most of the genome is either not differentially methylated in cancer or has a local CpG density too low to provide a reliable signal. Only a few percent of the genome is likely to be useful for classification.

In addition, identifying differentially methylated regions in various diseases poses several challenges. First, a differentially methylated region in a disease group is meaningful only in comparison with a group of control subjects, so if the control group is small, the determination loses confidence. Second, methylation status can vary among control subjects, which is difficult to account for when determining differentially methylated regions in a disease group. Third, methylation of a cytosine at a CpG site is strongly correlated with methylation at the subsequent CpG site, and capturing this dependency is a challenge in itself.

Thus, a cost-effective method for accurately detecting a disease by detecting differentially methylated regions is not yet available.

Disclosure of Invention

Provided herein are compositions comprising a plurality of different decoy oligonucleotides, wherein the plurality of different decoy oligonucleotides are configured to collectively hybridize to DNA molecules derived from at least 200 target genomic regions, wherein each genomic region of the at least 200 target genomic regions is differentially methylated in at least one cancer type compared to another cancer type or a non-cancer type, and wherein, for at least 80% of all possible pairs of cancer types selected from a group comprising at least 10 cancer types, the at least 200 target genomic regions comprise at least one target genomic region that is differentially methylated between the cancer types of the pair.
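The pairwise criterion above (at least 80% of all possible pairs of cancer types are distinguished by at least one target region) can be checked with a short sketch. The data layout is an assumption: each region is mapped to the set of cancer-type pairs between which it is differentially methylated. The function name `pair_coverage` and the region identifiers in the usage note are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set

def pair_coverage(cancer_types: Iterable[str],
                  region_pairs: Dict[str, Set[FrozenSet[str]]]) -> float:
    """Fraction of cancer-type pairs separated by at least one region in the panel.

    Assumes at least two distinct cancer types are supplied.
    """
    all_pairs = {frozenset(p) for p in combinations(sorted(set(cancer_types)), 2)}
    covered = set()
    for pairs in region_pairs.values():
        covered |= pairs & all_pairs  # ignore pairs outside the chosen group
    return len(covered) / len(all_pairs)
```

For a group of 10 cancer types there are 45 possible pairs, so the 80% criterion corresponds to at least 36 covered pairs, and `pair_coverage(...) >= 0.8` expresses the condition.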

In some embodiments, the at least 10 cancer types include at least 2, 3, 4, 5, 10, 12, 14, 16, 18, or 20 cancer types. In some embodiments, the several cancer types are selected from uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer. In some embodiments, the several cancer types are selected from anal, bladder, colorectal, esophageal, head and neck, liver/bile duct, lung, lymphoma, ovarian, pancreatic, plasma cell tumor, and gastric cancer. In some embodiments, the several cancer types are selected from thyroid cancer, melanoma, sarcoma, myeloid tumor, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer. In some embodiments, the at least 200 genomic regions of interest are selected from any one of lists 1 to 16. In some embodiments, the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 1-16. In some embodiments, the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 1 to 16. In some embodiments, the at least 200 genomic regions of interest are selected from any one of lists 1 to 3. 
In some embodiments, the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 1-3. In some embodiments, the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 1-3. In some embodiments, the at least 200 genomic regions of interest are selected from any one of lists 13 to 16. In some embodiments, the at least 200 target genomic regions comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 13-16. In some embodiments, the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 13-16. In some embodiments, the at least 200 genomic regions of interest are selected from list 12. In some embodiments, the at least 200 genomic regions of interest comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of genomic regions of interest in list 12. In some embodiments, the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in the list 12. In some embodiments, the at least 200 genomic regions of interest are selected from any one of lists 8 to 11. In some embodiments, the at least 200 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the number of target genomic regions in any one of lists 8-11. In some embodiments, the at least 200 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 8-11. In some embodiments, the at least 200 target genomic regions comprise at least 40%, 50%, 60%, or 70% of the number of target genomic regions in table 4. 
In some embodiments, for at least 90% or 100% of all possible pairs of several cancer types selected from a group comprising at least 10 cancer types, the at least 200 genomic regions of interest comprise at least one genomic region of interest that is differentially methylated between the pairs of several cancer types. In some embodiments, the number of decoy oligonucleotides are hybridized to at least 15 nucleotides or at least 30 nucleotides of the number of DNA molecules derived from the at least 200 genomic regions of interest. In some embodiments, the number of DNA molecules derived from the at least 200 genomic regions of interest are converted cfDNA fragments. In some embodiments, the cfDNA fragments are converted by a process comprising: treatment with bisulfite. In some embodiments, the cfDNA fragments are converted by an enzymatic conversion reaction. In some embodiments, the cfDNA fragments are converted by a cytosine deaminase. In some embodiments, each decoy oligonucleotide is conjugated to an affinity moiety. In some embodiments, the affinity moiety is biotin. In some embodiments, each decoy oligonucleotide is between 50 and 300 bases in length, between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length.
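The effect of bisulfite or enzymatic conversion on a cfDNA fragment can be illustrated in silico. The sketch below models a single strand only: unmethylated cytosines are converted to uracil, which is read as thymine after amplification and sequencing, while methylated cytosines remain cytosine. The function name `bisulfite_convert` and the input representation (0-based positions of methylated cytosines) are assumptions made for illustration.

```python
from typing import Set

def bisulfite_convert(seq: str, methylated_positions: Set[int]) -> str:
    """Return the sequence as read after conversion: unmethylated C -> T, methylated C kept."""
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated cytosine is deaminated to uracil, read as T
        else:
            out.append(base)
    return "".join(out)

# The same region yields different converted sequences depending on methylation,
# which is why a decoy is designed against a particular converted form.
fully_methylated = bisulfite_convert("ACGTCG", {1, 4})
unmethylated = bisulfite_convert("ACGTCG", set())
```

In this sketch the fully methylated fragment keeps both CpG cytosines, while the unmethylated fragment loses them, so a decoy complementary to one converted form hybridizes poorly to the other.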

Also provided herein are several compositions comprising: a plurality of different decoy oligonucleotides configured to hybridize to a plurality of DNA molecules derived from at least 100 genomic regions of interest selected from any one of lists 1 to 16.

In some embodiments, the at least 100 genomic regions of interest comprise at least 200 genomic regions of interest. In some embodiments, the at least 100 genomic regions of interest are selected from any one of lists 1 to 16. In some embodiments, the at least 100 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the target genomic regions in any one of lists 1 to 16. In some embodiments, the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 1 to 16. In some embodiments, the at least 100 genomic regions of interest are selected from any one of lists 1 to 3. In some embodiments, the at least 100 target genomic regions comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the target genomic regions in any one of lists 1 to 3. In some embodiments, the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in any one of lists 1 to 3. In some embodiments, the at least 100 genomic regions of interest are selected from list 12. In some embodiments, the at least 100 genomic regions of interest comprise at least 10%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions of interest in list 12. In some embodiments, the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in list 12. In some embodiments, the at least 100 genomic regions of interest are selected from list 8. In some embodiments, the at least 100 genomic regions of interest comprise at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions of interest in list 8. 
In some embodiments, the at least 100 target genomic regions comprise at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 target genomic regions in list 8. In some embodiments, the at least 100 target genomic regions comprise at least 40%, 50%, 60%, or 70% of the target genomic regions listed in table 4. In some embodiments, the DNA molecules derived from the at least 100 genomic regions of interest are converted cfDNA fragments. In some embodiments, the cfDNA fragments are converted by a process comprising treatment with bisulfite. In some embodiments, the composition further comprises a plurality of cfDNA fragments from a test subject. In some embodiments, the cfDNA fragments from the test subject are converted cfDNA fragments. In some embodiments, the cfDNA fragments from the test subject are converted by a process comprising treatment with bisulfite. In some embodiments, each genomic region of interest comprises at least 5 CpG dinucleotides. In some embodiments, each decoy oligonucleotide is between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length. In some embodiments, the plurality of different decoy oligonucleotides comprises a plurality of sets of two or more decoy oligonucleotides, wherein each decoy oligonucleotide in a set is configured to bind to converted DNA molecules from the same target genomic region. In some embodiments, the ratio of the number of decoy oligonucleotides configured to hybridize to hypermethylated target regions to the number of decoy oligonucleotides configured to hybridize to hypomethylated target regions is between 0.5 and 1.0. 
In some embodiments, the decoy oligonucleotides of each set comprise one or more pairs of a first decoy oligonucleotide and a second decoy oligonucleotide, wherein each decoy oligonucleotide comprises a 5' end and a 3' end, wherein a sequence of at least X nucleotide bases at the 3' end of the first decoy oligonucleotide is identical to a sequence of X nucleotide bases at the 5' end of the second decoy oligonucleotide, and wherein X is at least 20, at least 25, or at least 30. In some embodiments, X is 30.
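The overlap between paired decoy oligonucleotides described above can be sketched as a tiling. This is a simplified illustration under stated assumptions: sequences are written 5' to 3', each decoy is represented by the stretch of the target it covers (the complementary strand actually used for hybridization is omitted), and the defaults of 120 bases and X = 30 are taken from the embodiments above. The function names `overlap_ok` and `tile_decoys` are hypothetical.

```python
from typing import List

def overlap_ok(first: str, second: str, x: int = 30) -> bool:
    """True if the last x bases (3' end) of `first` equal the first x bases (5' end) of `second`."""
    if len(first) < x or len(second) < x:
        return False
    return first[-x:] == second[:x]

def tile_decoys(target: str, decoy_len: int = 120, x: int = 30) -> List[str]:
    """Tile a target with decoys of decoy_len so that consecutive decoys share x bases."""
    step = decoy_len - x
    return [target[i:i + decoy_len]
            for i in range(0, len(target) - decoy_len + 1, step)]
```

With a 300-base target, `tile_decoys` yields three 120-base decoys starting at offsets 0, 90, and 180, and each consecutive pair satisfies `overlap_ok` with X = 30.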

Also provided herein are methods for enriching a cfDNA sample, the method comprising: contacting a converted or unconverted cfDNA sample with a bait set as described above; and enriching the sample for cfDNA corresponding to a first set of genomic regions by hybridization capture. In some embodiments, the cfDNA sample is a converted cfDNA sample.

Also provided herein are methods for obtaining sequence information indicative of the presence or absence of a cancer or a type of cancer, comprising: sequencing enriched converted cfDNA prepared by a method comprising contacting a converted or unconverted cfDNA sample with a bait set as described above; and enriching the sample for cfDNA corresponding to a first set of genomic regions by hybridization capture. In some embodiments, the cfDNA sample is a converted cfDNA sample.

Also provided herein are methods for determining the presence or absence of a cancer in a subject, the method comprising the steps of: capturing cfDNA fragments from the subject with the above-described composition, sequencing the captured cfDNA fragments, and applying a trained classifier to the cfDNA sequences to determine the presence or absence of cancer. In some embodiments, the likelihood of a false positive determination of the presence or absence of cancer is less than 1%, and the likelihood of an accurate determination of the presence or absence of cancer is at least 40%. In some embodiments, the cancer is a stage I cancer, the likelihood of a false positive determination of the presence or absence of cancer is less than 1%, and the likelihood of an accurate determination of the presence or absence of cancer is at least 10%. In some embodiments, the cfDNA fragments are converted cfDNA fragments.
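The performance figures above pair a false positive rate (below 1%, that is, a specificity of about 99%) with a detection rate. One common way to evaluate such a pair, sketched here as an assumption rather than as the procedure used in this disclosure, is to choose the score threshold from non-cancer samples and then measure the fraction of cancer samples above it. The function names and example scores are hypothetical.

```python
import math
from typing import Sequence

def threshold_at_specificity(non_cancer_scores: Sequence[float],
                             specificity: float = 0.99) -> float:
    """Smallest observed score with at least `specificity` of non-cancer scores at or below it."""
    ranked = sorted(non_cancer_scores)
    k = math.ceil(specificity * len(ranked))  # non-cancer samples that must test negative
    return ranked[max(k, 1) - 1]

def sensitivity(cancer_scores: Sequence[float], threshold: float) -> float:
    """Fraction of cancer samples called positive (score strictly above the threshold)."""
    return sum(s > threshold for s in cancer_scores) / len(cancer_scores)

# Hypothetical scores: 100 non-cancer samples and 4 cancer samples.
non_cancer = [float(i) for i in range(100)]
cutoff = threshold_at_specificity(non_cancer)        # 1 of 100 non-cancer samples exceeds it
detected = sensitivity([50.0, 99.0, 100.0, 120.0], cutoff)
```

Because the threshold is fixed from non-cancer samples alone, the same cutoff can be reused to report sensitivity separately for each cancer type or stage.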

Also provided herein are methods for detecting a type of cancer, the method comprising the steps of: capturing a plurality of cfDNA fragments from a subject with a composition comprising a plurality of different oligonucleotide decoys, sequencing the captured plurality of cfDNA fragments, and applying a trained classifier to the plurality of cfDNA sequences to determine a cancer type; wherein the plurality of oligonucleotide decoys are configured to hybridize to a plurality of cfDNA fragments derived from a plurality of genomic regions of interest, wherein the plurality of genomic regions of interest are differentially methylated in one or more cancer types compared to in a different cancer type or a non-cancer type, wherein a likelihood of a false positive determination of cancer is less than 1%, and wherein an accurately assigned likelihood of a cancer type is at least 75%, at least 80%, at least 85%, at least 89%, or at least 90%. Some embodiments further comprise: applying the trained classifier to the number of cfDNA sequences to determine the presence or absence of cancer prior to determining the cancer type.

In some embodiments, the cancer type is a stage I cancer and an accurately assigned likelihood is at least 75%. In some embodiments, the cancer type is a stage II cancer and an accurately assigned likelihood is at least 85%. In some embodiments, the cancer type is prostate cancer and an accurately assigned likelihood of prostate cancer is at least 85% or at least 95%. In some embodiments, the cancer type is breast cancer, and an accurately assigned likelihood of breast cancer is at least 90% or at least 95%. In some embodiments, the cancer type is uterine cancer, and an accurately assigned likelihood of uterine cancer is at least 90% or at least 95%. In some embodiments, the cancer type is ovarian cancer, and an accurately assigned likelihood of ovarian cancer is at least 85% or at least 90%. In some embodiments, the cancer types are bladder cancer and urothelial cancer, and an accurately assigned likelihood of bladder cancer and urothelial cancer is at least 90% or at least 95%. In some embodiments, the cancer type is colorectal cancer, and an accurately assigned likelihood of colorectal cancer is at least 65% or at least 70%. In some embodiments, the cancer types are liver cancer and cholangiocarcinoma, and an accurately assigned likelihood of liver cancer and cholangiocarcinoma is at least 90% or at least 95%. In some embodiments, the cancer types are pancreatic cancer and gallbladder cancer, and an accurately assigned likelihood of pancreatic cancer and gallbladder cancer is at least 85% or at least 90%. In some embodiments, the cfDNA fragments are converted cfDNA fragments. 
In some embodiments, the cancer type is selected from uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer. In some embodiments, the cancer type is selected from anal, bladder, colorectal, esophageal, head and neck, liver/bile duct, lung, lymphoma, ovarian, pancreatic, plasma cell tumor, and gastric cancer. In some embodiments, the cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid tumors, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer. In some embodiments, the likelihood of detecting a sarcoma is at least 35% or at least 40%. In some embodiments, the likelihood of detecting stage three or stage four renal cancer is at least 50% or at least 70%. In some embodiments, the likelihood of detecting stage three or stage four breast cancer is at least 70% or at least 85%. In some embodiments, the likelihood of detecting stage three or stage four uterine cancer is at least 50%. In some embodiments, the likelihood of detecting ovarian cancer is at least 60% or at least 80%. In some embodiments, the likelihood of detecting bladder cancer is at least 35% or at least 40%. In some embodiments, the likelihood of detecting anorectal cancer is at least 60% or 70%. 
In some embodiments, the likelihood of detecting head and neck cancer is at least 75% or at least 80%. In some embodiments, the likelihood of detecting stage I head and neck cancer is at least 80%. In some embodiments, the likelihood of detecting colorectal cancer is at least 50% or at least 59%. In some embodiments, the likelihood of detecting liver cancer is at least 75% or at least 80%. In some embodiments, the likelihood of detecting pancreatic cancer and gallbladder cancer is at least 64% or at least 70%. In some embodiments, the likelihood of detecting upper digestive tract cancer is at least 60% or at least 68%. In some embodiments, the likelihood of detecting multiple myeloma is at least 65% or at least 75%. In some embodiments, the likelihood of detecting stage I multiple myeloma is at least 60%. In some embodiments, the likelihood of detecting lymphoma is at least 65% or at least 69%. In some embodiments, the likelihood of detecting lung cancer is at least 50% or at least 58%. In some embodiments, the composition comprising a plurality of oligonucleotide decoys is a composition provided above. In some embodiments, the plurality of genomic regions comprises: no more than 95000 genomic regions, no more than 60000 genomic regions, no more than 40000 genomic regions, no more than 35000 genomic regions, no more than 20000 genomic regions, no more than 15000 genomic regions, no more than 8000 genomic regions, no more than 4000 genomic regions, no more than 2000 genomic regions, or no more than 1400 genomic regions. In some embodiments, the total size of the genomic regions is less than 4MB, less than 2MB, less than 1MB, less than 0.7MB, or less than 0.4MB. In some embodiments, the subject has a high risk of one or more cancer types. In some embodiments, the subject exhibits symptoms associated with one or more cancer types. In some embodiments, the subject has not been diagnosed with cancer. 
In some embodiments, the classifier is trained on a number of converted DNA sequences derived from at least 100 subjects having a first cancer type, at least 100 subjects having a second type of cancer, and at least 100 subjects not having cancer. In some embodiments, the first cancer type is ovarian cancer. In some embodiments, the first cancer type is liver cancer. In some embodiments, the first cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid tumor, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, pancreatic cancer, gall bladder cancer, esophageal cancer, gastric cancer, multiple myeloma, lymphoma, lung cancer, and leukemia. In some embodiments, the classifier is trained over a number of converted DNA sequences derived from at least 1000, at least 2000, or at least 4000 target genomic regions selected from any one of lists 1 to 16.

In some embodiments, the trained classifier determines the presence or absence of cancer or a cancer type by: (a) generating a set of features for a sample, wherein each feature in the set of features comprises a numerical value; (b) inputting the set of features into the classifier, wherein the classifier comprises a multinomial classifier; (c) determining, at the classifier and based on the set of features, a set of probability scores, wherein the set of probability scores comprises one probability score for each cancer type class and each non-cancer type class; and (d) determining a final cancer classification for the sample by thresholding the set of probability scores based on one or more values determined during training of the classifier. In some embodiments, the set of features includes a set of binarized features. In some embodiments, the numerical value comprises a single binary value. In some embodiments, the multinomial classifier comprises a multinomial logistic regression ensemble trained to predict a source tissue for the cancer. In some embodiments, the classifier determines the final cancer classification based on whether the difference between the two highest probability scores exceeds a minimum value, where the minimum value corresponds to a predefined proportion of training cancer samples that were assigned the correct cancer type as the highest score during training of the classifier.
In some embodiments, upon determining that the two highest probability scores differ by more than the minimum value, the classifier assigns a cancer label as the final cancer classification, the cancer label corresponding to the highest probability score determined by the classifier; and upon determining that the difference between the two highest probability scores does not exceed the minimum value, the classifier assigns an indeterminate cancer label as the final cancer classification.
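The top-two thresholding rule described above can be illustrated with a minimal Python sketch. The function name `final_classification`, the label string `"indeterminate"` (standing in for the indeterminate or uncertain label), and the example scores are assumptions made for illustration only; the minimum margin would in practice come from classifier training and is simply passed in here.

```python
from typing import Dict


def final_classification(prob_scores: Dict[str, float], min_margin: float) -> str:
    """Assign the top-scoring label only if it beats the runner-up by more than min_margin.

    prob_scores maps each class label (cancer types and non-cancer) to a probability score.
    min_margin is a hypothetical value assumed to have been fixed during training.
    """
    if len(prob_scores) < 2:
        raise ValueError("at least two classes are required")
    # Rank the classes from highest to lowest probability score.
    ranked = sorted(prob_scores.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, second_score) = ranked[0], ranked[1]
    # Confident call: the two highest scores differ by more than the minimum value.
    if top_score - second_score > min_margin:
        return top_label
    # Otherwise report an indeterminate result rather than a specific label.
    return "indeterminate"
```

For example, with a margin of 0.1, scores of 0.7 and 0.2 for the two leading classes would yield the leading label, while scores of 0.45 and 0.40 would yield the indeterminate label.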

Also provided herein are methods of treating a cancer type in a subject in need thereof, the method comprising the steps of: detecting the type of cancer by the method described above; and applying an anti-cancer therapeutic to the subject. In some embodiments, the anticancer therapeutic is a chemotherapeutic selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, antitumor antibiotics, cytoskeletal disruptors, topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents.

Also provided herein are a plurality of cancer assay detection combinations comprising: at least 500 pairs of probes, wherein each of the at least 500 pairs of probes comprises: two probes configured to overlap with each other by an overlapping sequence, wherein the overlapping sequence comprises a 30-nucleotide sequence, and wherein the 30-nucleotide sequence is configured to hybridize to a converted cfDNA molecule corresponding to, or derived from, one or more genomic regions, wherein each of the several genomic regions comprises at least five methylation sites, and wherein the at least five methylation sites have an aberrant methylation pattern in several cancer samples.

In some embodiments, each of the at least 500 pairs of probes is conjugated to a non-nucleotide affinity moiety. In some embodiments, the non-nucleotide affinity moiety is a biotin moiety. In some embodiments, the plurality of cancer samples are from a plurality of subjects having a cancer selected from the group consisting of breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of the renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, pancreatic cancer, upper gastrointestinal squamous cell carcinoma, upper gastrointestinal cancer other than squamous cell carcinoma, head and neck cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer and lung cancer other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, multiple myeloma, lymphoma, and leukemia. In some embodiments, the aberrant methylation pattern has at least a threshold p-value rarity in the plurality of cancer samples. In some embodiments, each of the plurality of probes is designed to have fewer than 20 off-target genomic regions. In some embodiments, the fewer than 20 off-target genomic regions are identified using a k-mer seeding strategy. In some embodiments, the fewer than 20 off-target genomic regions are identified using a k-mer seeding strategy in combination with local alignment at the seed sites. In some embodiments, the cancer assay detection combination comprises at least 10000, 50000, 100000, 200000, 300000, 400000, 500000, 600000, 700000, or 800000 probes. In some embodiments, the at least 500 pairs of probes collectively comprise at least 2 million, 3 million, 4 million, 5 million, 6 million, 8 million, 12 million, 14 million, or 15 million nucleotides.
In some embodiments, each of the number of probes comprises at least 50, 75, 100, or 120 nucleotides. In some embodiments, each of the number of probes comprises less than 300, 250, 200, or 150 nucleotides. In some embodiments, each of the number of probes comprises 100 to 150 nucleotides. In some embodiments, each of the number of probes comprises less than 20, 15, 10, 8, or 6 methylation sites. In some embodiments, at least 80, 85, 90, 92, 95, or 98% of the at least five methylation sites are methylated or unmethylated in the number of cancer samples. In some embodiments, at least 3%, 5%, 10%, 15%, or 20% of the plurality of probes do not include guanine G. In some embodiments, each of the plurality of probes comprises a plurality of binding sites to the plurality of methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the plurality of binding sites comprise only CpG or CpA. In some embodiments, each of the number of probes is configured to have less than 15, 10, or 8 off-target genomic regions. In some embodiments, at least 30% of the several genomic regions are in exons or introns. In some embodiments, at least 15% of the several genomic regions are in exons. In some embodiments, at least 20% of the several genomic regions are in exons. In some embodiments, less than 10% of the several genomic regions are in intergenic regions. In some embodiments, the number of genomic regions is selected from any one of lists 1-3 or lists 4-16. In some embodiments, the number of genomic regions comprises at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the number of genomic regions in any of lists 1-3 or lists 4-16. In some embodiments, the number of genomic regions comprises at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, 50000, 60000, or 70000 genomic regions in any one of lists 1 to 3 or lists 4 to 16.
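The off-target screening mentioned above can be sketched roughly as follows. This is a simplified illustration of k-mer seeding only, not the method actually used: the helper names `build_kmer_index` and `count_off_target_regions`, the `min_seed_hits` parameter, and the toy sequences are all hypothetical, and a fuller pipeline would follow the seeding step with local alignment at the seed sites.

```python
from collections import defaultdict
from typing import Dict, Set


def build_kmer_index(regions: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the names of the reference regions that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def count_off_target_regions(probe: str, target: str,
                             index: Dict[str, Set[str]], k: int,
                             min_seed_hits: int = 1) -> int:
    """Count regions other than the intended target that share seed k-mers with the probe."""
    seed_hits: Dict[str, int] = defaultdict(int)
    for i in range(len(probe) - k + 1):
        # .get avoids inserting empty entries into the index for unseen k-mers.
        for name in index.get(probe[i:i + k], ()):
            if name != target:
                seed_hits[name] += 1
    return sum(1 for hits in seed_hits.values() if hits >= min_seed_hits)
```

A probe whose count reaches a chosen limit (for example 20, as in the embodiments above) could then be excluded from the design.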

Also provided herein are a plurality of cancer assay detection combinations, comprising a plurality of probes, wherein each of the number of probes is configured to hybridize to a converted cfDNA molecule corresponding to one or more of the number of genomic regions of any of lists 1-3 or 4-16.

In some embodiments, the plurality of probes are collectively configured to hybridize to a plurality of converted cfDNA molecules corresponding to at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the genomic regions of any of lists 1-3 or lists 4-16.

In some embodiments, the number of probes are configured together to hybridize to a number of converted cfDNA molecules corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, or 50000 genomic regions of any of lists 1-3 or lists 4-16. In some embodiments, at least 3%, 5%, 10%, 15%, or 20% of the plurality of probes do not include guanine G. In some embodiments, each of the plurality of probes comprises a plurality of binding sites to a plurality of methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the plurality of binding sites comprise only CpG or CpA. In some embodiments, each of the plurality of probes is conjugated to a non-nucleotide affinity moiety. In some embodiments, the non-nucleotide affinity moiety is a biotin moiety.

Also provided herein are methods of determining a tissue of origin (TOO) of a cancer, the method comprising the steps of: receiving a sample, the sample comprising a plurality of cfDNA molecules; processing the plurality of cfDNA molecules to convert unmethylated cytosine C to uracil U, thereby obtaining a plurality of converted cfDNA molecules; applying a cancer assay detection combination provided above to the plurality of converted cfDNA molecules, thereby enriching a subset of the plurality of converted cfDNA molecules; and sequencing the enriched subset of the converted cfDNA molecules, thereby providing a set of sequence reads.
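The conversion step above can be illustrated in silico. The following Python sketch assumes a single-stranded fragment given as a string, with the positions of methylated cytosines supplied by the caller; converted uracil is written as T, which is how it appears in sequence reads. The function name `convert_fragment` and its arguments are illustrative assumptions.

```python
from typing import Set


def convert_fragment(seq: str, methylated_positions: Set[int]) -> str:
    """Simulate conversion of unmethylated cytosines on one strand.

    Unmethylated C is converted to U and is read as T after sequencing;
    cytosines at the given 0-based methylated positions are protected and stay C.
    """
    converted = []
    for position, base in enumerate(seq):
        if base == "C" and position not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)
```

A fragment whose CpG cytosines are all methylated therefore keeps a C at each CpG, while a fully unmethylated fragment carries a T at every original cytosine, which is the difference the probes are designed to capture.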

Some embodiments further provide the step of: determining a health condition by evaluating the set of sequence reads, wherein the health condition is the presence or absence of cancer; the presence or absence of a cancer of a tissue of origin (TOO); the presence or absence of a cancer cell type; the presence or absence of at least 5, 10, 15 or 20 different types of cancer. In some embodiments, the sample comprises a plurality of cfDNA molecules obtained from a human subject.

Also provided herein are methods of detecting a cancer, comprising the steps of: obtaining a set of sequence reads by sequencing a set of nucleic acid fragments from a subject, wherein the plurality of nucleic acid fragments correspond to or are derived from a plurality of genomic regions selected from any one of lists 1-3 or lists 4-16; determining, for each of the plurality of nucleic acid fragments, a methylation status at a plurality of CpG sites; and detecting a health status of the subject by evaluating the methylation status of the plurality of sequence reads, wherein the health status is: (i) the presence or absence of a cancer; (ii) the presence or absence of a cancer of a source tissue; (iii) the presence or absence of a cancer cell type; or (iv) the presence or absence of at least 5, 10, 15 or 20 different types of cancer.

In some embodiments, the number of genomic regions comprises at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the number of genomic regions in any of lists 1-3 or lists 4-16. In some embodiments, the number of genomic regions comprises at least 500, 1000, 5000, 10000, 15000, 20000, 30000, 40000, 50000, 60000, 70000, or 80000 genomic regions of the number of genomic regions in any one of lists 1 through 3 or lists 4 through 16.

Also provided herein are methods of designing a cancer assay detection set for diagnosing cancer of a tissue of origin (TOO), the method comprising the steps of: identifying a number of genomic regions, wherein each of the number of genomic regions (i) comprises at least 30 nucleotides, and (ii) comprises at least five methylation sites; selecting a subset of the genomic regions, wherein the selecting is performed when a plurality of cfDNA molecules have an aberrant methylation pattern, the plurality of cfDNA molecules corresponding to or derived from each of the genomic regions in a plurality of cancer samples, wherein the aberrant methylation pattern comprises at least five hypomethylated or hypermethylated methylation sites, and designing a cancer assay detection combination comprising a plurality of probes, wherein each of the plurality of probes is configured for hybridization to a converted cfDNA molecule, the converted cfDNA molecule corresponding to or derived from one or more of the subset of genomic regions.

Also provided herein are decoy sets for hybrid capture, a decoy set comprising a plurality of different oligonucleotide-containing probes, wherein each of the plurality of oligonucleotide-containing probes comprises a sequence of at least 30 bases in length that is complementary to any one of: (1) a sequence of a genomic region; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region; and wherein the plurality of different oligonucleotide-containing probes are complementary to a sequence corresponding to a CpG site that is differentially methylated in samples from subjects from a first cancer type as compared to samples from subjects from a second cancer type or a non-cancer type.
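The relationship between a genomic sequence and the sequences in (1) and (2) above can be checked with a short sketch. It assumes both sequences are written on the same strand and that the transition at a cytosine is C to T, as produced by conversion; on the opposite strand the same change would appear as G to A, which this simplified check does not handle. The function name is hypothetical.

```python
def matches_genomic_or_converted(genomic: str, candidate: str) -> bool:
    """Return True if candidate equals genomic, or differs only by C-to-T transitions.

    Both sequences are assumed to be on the same strand and of equal length.
    """
    if len(genomic) != len(candidate):
        return False
    for ref_base, cand_base in zip(genomic, candidate):
        if ref_base == cand_base:
            continue
        # The only permitted difference is a transition at a genomic cytosine.
        if not (ref_base == "C" and cand_base == "T"):
            return False
    return True
```

A probe would then be complementary to a candidate sequence that passes this check, rather than being the candidate itself.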

In some embodiments, the first cancer type and the second cancer type are selected from uterine cancer, upper gastrointestinal squamous cancer, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer.

In some embodiments, the decoy set comprises at least 500, 1000, 2000, 2500, 5000, 6000, 7500, 10000, 15000, 20000, 25000, 50000, 100000, 200000, 300000, 500000, or 800000 different oligonucleotide-containing probes. In some embodiments, for each of the plurality of different oligonucleotide-containing probes, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of any one of lists 1 to 16; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of any one of lists 1 to 3; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of list 5 or list 7; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.
In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of any one of lists 4, 6, or 8 to 12; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of any one of lists 13 to 16; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions of list 4 or list 6; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.
In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 4; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 8; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 9; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 10; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region.
In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 11; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, the sequence of at least 30 bases in length is complementary to any one of: (1) a sequence within a genomic region selected from the genomic regions in list 12; or (2) a sequence that differs from the sequence of (1) only by one or more transitions, wherein each respective transition of the one or more transitions occurs at a cytosine of the genomic region. In some embodiments, each of the plurality of different oligonucleotide-containing probes is conjugated to an affinity moiety. In some embodiments, the affinity moiety is biotin. In some embodiments, at least 80%, 90%, or 95% of the plurality of oligonucleotide-containing probes in the decoy set do not include a sequence of at least 30, at least 40, or at least 45 bases that has 20 or more off-target regions in the genome. In some embodiments, the plurality of oligonucleotide-containing probes in the decoy set does not comprise a sequence of at least 30, at least 40, or at least 45 bases that has 20 or more off-target regions in the genome. In some embodiments, the sequence of at least 30 bases of each of the plurality of probes is at least 40, at least 45, at least 50, at least 60, at least 75, or at least 100 bases in length. In some embodiments, each of the plurality of oligonucleotide-containing probes has a nucleic acid sequence of at least 45, 40, 75, 100, or 120 bases in length. In some embodiments, each of the plurality of oligonucleotide-containing probes has a nucleic acid sequence of no more than 300, 250, 200, or 150 bases in length.
In some embodiments, each of the plurality of different oligonucleotide-containing probes is between 60 and 200 bases in length, between 100 and 150 bases in length, between 110 and 130 bases in length, and/or 120 bases in length. In some embodiments, the plurality of different oligonucleotide-containing probes comprises at least 500, at least 1000, at least 2000, at least 2500, at least 5000, at least 6000, at least 7500, and at least 10000, at least 15000, at least 20000, or at least 25000 different probe pairs, wherein each probe pair comprises a first probe and a second probe, wherein the second probe is different from the first probe and overlaps the first probe by an overlapping sequence that is at least 30, at least 40, at least 50, or at least 60 nucleotides in length. In some embodiments, the decoy set comprises a number of oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the number of genomic regions identified in any of lists 1-16. In some embodiments, the decoy set comprises a number of oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the number of genomic regions identified in any of lists 1-3. In some embodiments, the decoy set comprises a number of oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the number of genomic regions identified in any of lists 4-12. 
In some embodiments, the decoy set comprises a plurality of oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in any of lists 4, 6, or 8-12. In some embodiments, the decoy set comprises a plurality of oligonucleotide-containing probes configured to target at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or 100% of the genomic regions identified in list 8. In some embodiments, the oligonucleotide probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 1-16. In some embodiments, the oligonucleotide probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 1-3. In some embodiments, the oligonucleotide probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 4 to 12. In some embodiments, the oligonucleotide probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in a list selected from any one of lists 4, 6, or 8 to 12.
In some embodiments, the oligonucleotide probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules corresponding to at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions in list 8. In some embodiments, the oligonucleotide-containing probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, at least 25000, at least 30000, at least 50000, or at least 80000 genomic regions of any of lists 1 to 16. In some embodiments, the oligonucleotide-containing probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, at least 25000, at least 30000, at least 50000, or at least 80000 genomic regions of any of lists 1 to 3. In some embodiments, the oligonucleotide-containing probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, at least 25000, at least 30000, at least 50000, or at least 80000 genomic regions of any of lists 4 to 12. In some embodiments, the oligonucleotide-containing probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, at least 25000, at least 30000, at least 50000, or at least 80000 genomic regions of any of lists 4, 6, or 8 to 12.
In some embodiments, the oligonucleotide-containing probes in the decoy set are collectively configured to hybridize to a plurality of fragments obtained from a plurality of cfDNA molecules, the fragments corresponding to at least 500, 1000, 5000, 10000, 15000, 20000, at least 25000, at least 30000, at least 50000, or at least 80000 genomic regions in list 8. In some embodiments, the plurality of oligonucleotide-containing probes comprises at least 500, 1000, 5000, or 10000 distinct probe subsets, wherein each probe subset comprises a plurality of probes collectively extending across a genomic region selected from the genomic regions of any one of lists 1 through 16 in a 2x tiled manner. In some embodiments, the plurality of oligonucleotide-containing probes comprises at least 500, 1000, 5000, or 10000 distinct probe subsets, wherein each probe subset comprises a plurality of probes collectively extending across a genomic region selected from the genomic regions of any one of lists 1 to 4, 6, or 8 to 12 in a 2x tiled manner. In some embodiments, the plurality of probes collectively extending across the genomic region in a 2x tiled manner comprises at least one pair of probes that overlap by a sequence of at least 30 bases, at least 40 bases, at least 50 bases, or at least 60 bases in length. In some embodiments, the plurality of probes collectively extend across a plurality of portions of the genome, the plurality of portions having a combined size of less than 4 MB, less than 2 MB, less than 1 MB, less than 0.7 MB, or less than 0.4 MB. In some embodiments, the plurality of probes collectively extend across a plurality of portions of the genome, the plurality of portions having a combined size of between 0.2 MB and 30 MB, between 0.5 MB and 30 MB, between 1 MB and 30 MB, between 3 MB and 25 MB, between 3 MB and 15 MB, between 5 MB and 20 MB, or between 7 MB and 12 MB.
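One way to picture probes extending across a genomic region in a 2x tiled manner is sketched below. This is an illustrative layout under stated assumptions, not the design procedure itself: probes have a fixed even length (120 bases by default), successive probes start half a probe length apart so that neighbors overlap by 60 bases, and the first probe starts half a probe length upstream of the region so that the region's edges are also covered twice. The names `tile_2x` and `coverage` are hypothetical, and the coordinates are 0-based, half-open intervals that may be negative for regions near the start of a sequence.

```python
from typing import List, Tuple


def tile_2x(region_start: int, region_end: int,
            probe_len: int = 120) -> List[Tuple[int, int]]:
    """Return (start, end) probe intervals covering every base of the region twice."""
    if probe_len <= 0 or probe_len % 2:
        raise ValueError("probe_len must be a positive even number")
    if region_end <= region_start:
        raise ValueError("region must be non-empty")
    step = probe_len // 2              # neighboring probes overlap by half a probe
    probes = []
    start = region_start - step        # begin upstream so the first bases get 2x coverage
    while start < region_end:
        probes.append((start, start + probe_len))
        start += step
    return probes


def coverage(probes: List[Tuple[int, int]], position: int) -> int:
    """Count how many probes cover a given position."""
    return sum(1 for start, end in probes if start <= position < end)
```

With these defaults, a 200-base region is spanned by five probes, and each adjacent pair overlaps by 60 bases, consistent with the overlap lengths recited above.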
In some embodiments, each of the plurality of different oligonucleotide-containing probes comprises less than 20, 15, 10, 8, or 6 CpG detection sites. In some embodiments, at least 80%, 85%, 90%, 92%, 95%, or 98% of the plurality of oligonucleotide-containing probes have only CpG or CpA at all CpG detection sites.

Also provided herein are mixtures comprising: converted cfDNA; and a decoy set as provided above. In some embodiments, the converted cfDNA comprises bisulfite-converted cfDNA.

In some embodiments, the converted cfDNA comprises cfDNA converted via a cytosine deaminase.

Also provided herein are methods for enriching a converted cfDNA sample, the method comprising: contacting the converted cfDNA sample with a decoy set provided above; and enriching the sample for a first set of genomic regions by hybrid capture.

Also provided herein are methods for providing sequence information informative of the presence or absence of a cancer or a type of cancer, comprising the steps of: treating cfDNA from a biological sample with a deaminating agent to produce a cfDNA sample comprising a plurality of deaminated nucleotides; enriching the cfDNA sample for informative cfDNA molecules; and sequencing the enriched cfDNA molecules, thereby obtaining a set of sequence reads informative of the presence or absence of a cancer or a type of cancer.

In some embodiments, enriching the cfDNA comprises: amplifying portions of the cfDNA fragments by PCR using a plurality of primers configured to hybridize to a plurality of genomic regions selected from any one of lists 1-16. In some embodiments, enriching the cfDNA comprises: contacting the cfDNA with a plurality of probes configured to hybridize to a plurality of converted fragments obtained from the cfDNA molecules, the converted fragments corresponding to or derived from the genomic regions of any one of lists 1-16. In some embodiments, enriching the cfDNA comprises: contacting the cfDNA with a plurality of probes configured to hybridize to a plurality of converted fragments obtained from the cfDNA molecules, the converted fragments corresponding to or derived from at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the genomic regions of any one of lists 1-16. In some embodiments, the genomic regions are selected from any one of lists 1-3. In some embodiments, the genomic regions are selected from any one of lists 4 to 12. In some embodiments, the genomic regions are selected from any one of lists 4, 6, or 8 to 12. In some embodiments, the genomic regions are selected from list 8. In some embodiments, the cfDNA sample is enriched by the methods provided above. In some embodiments, the method further comprises: determining a cancer classification by evaluating the set of sequence reads, wherein the cancer classification is the presence or absence of cancer, or the presence or absence of a type of cancer. In some embodiments, the step of determining a cancer classification comprises the steps of: generating a detection feature vector based on the set of sequence reads; and applying the detection feature vector to a classifier.
In some embodiments, the classifier includes a model trained by a training process with a first set of cancer fragments from one or more training subjects having a first cancer type and a second set of cancer fragments from one or more training subjects having a second cancer type, wherein the first set of cancer fragments and the second set of cancer fragments both comprise training fragments. In some embodiments, the cancer classification is the presence or absence of cancer. In some embodiments, the classifier has an area under the receiver operating characteristic curve of at least 0.8. In some embodiments, the cancer classification is a type of cancer. In some embodiments, the type of cancer is selected from at least 12, 14, 16, 18, or 20 cancer types. In some embodiments, the cancer types are selected from uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer. In some embodiments, the cancer types are selected from anal, bladder, colorectal, esophageal, head and neck, liver/bile duct, lung, lymphoma, ovarian, pancreatic, plasma cell tumor, and gastric cancer. In some embodiments, the cancer types are selected from thyroid cancer, melanoma, sarcoma, myeloid tumor, kidney cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer.
In some embodiments, at a specificity of 99%, the method has a sensitivity for head and neck cancer of at least 79% or at least 84%; a sensitivity for liver cancer of at least 82% or at least 85%; a sensitivity for upper digestive tract cancer of at least 62% or at least 68%; a sensitivity for pancreatic cancer or gallbladder cancer of at least 62% or at least 68%; a sensitivity for colorectal cancer of at least 60% or at least 65%; a sensitivity for ovarian cancer of at least 75% or at least 80%; a sensitivity for lung cancer of at least 60% or at least 65%; a sensitivity for multiple myeloma of at least 68% or at least 75%; a sensitivity for lymphoma of at least 65% or at least 70%; a sensitivity for anorectal cancer of at least 60% or at least 65%; and a sensitivity for bladder cancer of at least 40% or at least 44%. In some embodiments, the cancer classification is the presence or absence of a type of cancer. In some embodiments, the step of determining a cancer classification comprises the steps of: generating a test feature vector based on the set of sequence reads; and applying the test feature vector to a classifier.
In some embodiments, the classifier includes a model trained by a training process with a first-cancer-type set of converted DNA sequences from one or more training subjects having a first cancer type and a second-cancer-type set of converted DNA sequences from one or more training subjects having a second cancer type, wherein both sets of converted DNA sequences comprise training converted DNA sequences. In some embodiments, the cancer type is selected from the group consisting of head and neck cancer, liver/bile duct cancer, upper digestive tract cancer, pancreas/gallbladder cancer, colorectal cancer, ovarian cancer, lung cancer, multiple myeloma, lymphoma, melanoma, sarcoma, breast cancer, and uterine cancer. In some embodiments, the type of cancer is head and neck cancer, and the method has a sensitivity of at least 79% or at least 84% at a specificity of 99%. In some embodiments, the type of cancer is liver cancer, and the method has a sensitivity of at least 82% or at least 85% at a specificity of 99%. In some embodiments, the type of cancer is upper digestive tract cancer, and the method has a sensitivity of at least 62% or at least 68% at a specificity of 99%. In some embodiments, the type of cancer is pancreatic cancer or gallbladder cancer, and the method has a sensitivity of at least 62% or at least 68% at a specificity of 99%. In some embodiments, the type of cancer is colorectal cancer, and the method has a sensitivity of at least 60% or at least 65% at a specificity of 99%. In some embodiments, the type of cancer is ovarian cancer, and the method has a sensitivity of at least 75% or at least 80% at a specificity of 99%. In some embodiments, the type of cancer is lung cancer, and the method has a sensitivity of at least 60% or at least 65% at a specificity of 99%.
In some embodiments, the type of cancer is multiple myeloma, and the method has a sensitivity of at least 68% or at least 75% at a specificity of 99%. In some embodiments, the type of cancer is lymphoma, and the method has a sensitivity of at least 65% or at least 70% at a specificity of 99%. In some embodiments, the type of cancer is anorectal cancer, and the method has a sensitivity of at least 60% or at least 65% at a specificity of 99%. In some embodiments, the type of cancer is bladder cancer, and the method has a sensitivity of at least 40% or at least 44% at a specificity of 99%. In some embodiments, the total size of the target genomic regions is less than 4 Mb, less than 2 Mb, less than 1 Mb, less than 0.7 Mb, or less than 0.4 Mb. In some embodiments, the step of determining a cancer classification comprises the steps of: generating a test feature vector based on the set of sequence reads; and applying the test feature vector to a model obtained by a training process with a set of cancer fragments from one or more training subjects with cancer and a set of non-cancer fragments from one or more training subjects without cancer, wherein the set of cancer fragments and the set of non-cancer fragments both comprise training fragments. In some embodiments, the training process comprises the steps of: obtaining sequence information of training fragments from a plurality of training subjects; for each training fragment, determining whether the training fragment is hypomethylated or hypermethylated, wherein each of the hypomethylated and hypermethylated training fragments comprises at least a threshold number of CpG sites, with at least a threshold percentage of the CpG sites being unmethylated or methylated, respectively; for each training subject, generating a training feature vector based on the hypomethylated training fragments and the hypermethylated training fragments; and training the model using the training feature vectors from the one or more training subjects without cancer and the training feature vectors from the one or more training subjects with cancer.
In some embodiments, the training process comprises the steps of: obtaining sequence information of training fragments from a plurality of training subjects; for each training fragment, determining whether the training fragment is hypomethylated or hypermethylated, wherein each of the hypomethylated and hypermethylated training fragments comprises at least a threshold number of CpG sites, with at least a threshold percentage of the CpG sites being unmethylated or methylated, respectively; for each of a plurality of CpG sites in a reference genome: quantifying a count of hypomethylated training fragments that overlap the CpG site and a count of hypermethylated training fragments that overlap the CpG site, and generating a hypomethylation score and a hypermethylation score based on the counts of hypomethylated and hypermethylated training fragments; for each training fragment, generating a total hypomethylation score based on the hypomethylation scores of the CpG sites in the training fragment, and generating a total hypermethylation score based on the hypermethylation scores of the CpG sites in the training fragment; for each training subject: ranking the training fragments based on total hypomethylation score, ranking the training fragments based on total hypermethylation score, and generating a feature vector based on the rankings of the training fragments; obtaining training feature vectors for one or more training subjects without cancer and training feature vectors for one or more training subjects with cancer; and training the model using the feature vectors of the one or more training subjects without cancer and the feature vectors of the one or more training subjects with cancer.
In some embodiments, the model comprises one of a kernel logistic regression classifier, a random forest classifier, a mixture model, a convolutional neural network, and an autoencoder model. In some embodiments, the method further comprises the steps of: obtaining a cancer probability for the test sample based on the model; and comparing the cancer probability to a threshold probability to determine whether the test sample is from a subject with cancer or without cancer. In some embodiments, the method further comprises the steps of: obtaining a cancer type probability for the test sample based on the model; and comparing the cancer type probability to a threshold probability to determine whether the test sample is from a subject with that cancer type, with another cancer type, or without cancer. In some embodiments, the method further comprises: administering an anti-cancer agent to the subject.
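
The ranking-based training process described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the text states only that the per-site scores are "based on the counts", so the score used here (the share of overlapping labelled fragments at each CpG site), the use of the maximum to form each fragment's total score, the `top_k` cutoff, and all function names are assumptions.

```python
from collections import Counter


def cpg_scores(fragments):
    """fragments: list of (label, cpg_positions), label is 'hypo' or 'hyper'.

    Returns {position: (hypo_score, hyper_score)}. ASSUMPTION: each score is
    the fraction of labelled fragments overlapping the site with that label.
    """
    hypo, hyper = Counter(), Counter()
    for label, positions in fragments:
        (hypo if label == "hypo" else hyper).update(positions)
    scores = {}
    for pos in set(hypo) | set(hyper):
        total = hypo[pos] + hyper[pos]
        scores[pos] = (hypo[pos] / total, hyper[pos] / total)
    return scores


def fragment_totals(positions, scores):
    """Total hypo/hyper score of one fragment (ASSUMPTION: max over its sites)."""
    return (max(scores[p][0] for p in positions),
            max(scores[p][1] for p in positions))


def rank_feature_vector(subject_fragments, scores, top_k=2):
    """Rank a subject's fragments by each total score; keep the top_k of each."""
    totals = [fragment_totals(pos, scores) for _, pos in subject_fragments]
    top_hypo = sorted((t[0] for t in totals), reverse=True)[:top_k]
    top_hyper = sorted((t[1] for t in totals), reverse=True)[:top_k]
    return top_hypo + top_hyper


# Toy data: CpG positions are arbitrary integers.
training = [("hypo", [1, 2]), ("hypo", [2, 3]), ("hyper", [3, 4])]
site_scores = cpg_scores(training)
features = rank_feature_vector(training, site_scores)
```

The resulting per-subject feature vectors would then be passed to whichever classifier is chosen; that step is omitted here because it depends on the model.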

Also provided herein is a method for treating a cancer patient, the method comprising: administering an anti-cancer agent to a subject who has been identified as a cancer patient by a method provided above. In some embodiments, the anti-cancer agent is a chemotherapeutic agent selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, antitumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents.

Also provided herein are methods for assessing whether a subject has cancer, the method comprising: obtaining cfDNA from the subject; isolating a portion of the cfDNA from the subject by hybridization capture; obtaining sequence reads derived from the captured cfDNA to determine methylation states of cfDNA fragments; applying a classifier to the sequence reads; and determining whether the subject has cancer based on the application of the classifier; wherein the classifier has an area under the receiver operating characteristic curve of at least 0.80. In some embodiments, the method further comprises: determining a type of cancer, wherein the sensitivity of the method for head and neck cancer is at least 79% or at least 84%; for liver cancer, at least 82% or at least 85%; for upper digestive tract cancer, at least 62% or at least 68%; for pancreatic cancer or gallbladder cancer, at least 62% or at least 68%; for colorectal cancer, at least 60% or at least 65%; for ovarian cancer, at least 75% or at least 80%; for lung cancer, at least 60% or at least 65%; for multiple myeloma, at least 68% or at least 75%; for lymphoma, at least 65% or at least 70%; for anorectal cancer, at least 60% or at least 65%; and for bladder cancer, at least 40% or at least 44%. In some embodiments, the total size of the target genomic regions is less than 4 Mb, less than 2 Mb, less than 1 Mb, less than 0.7 Mb, or less than 0.4 Mb.
In some embodiments, the method further comprises: converting unmethylated cytosines in the cfDNA to uracil prior to isolating the portion of the cfDNA from the subject by hybridization capture. In some embodiments, the classifier is a binary classifier. In some embodiments, the classifier is a mixture model classifier. In some embodiments, isolating a portion of the cfDNA from the subject by hybridization capture comprises: contacting the cfDNA with a bait set comprising a plurality of different oligonucleotide-containing probes. In some embodiments, the bait set is a bait set provided herein.
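
The two performance measures quoted above, area under the receiver operating characteristic curve and sensitivity at a fixed specificity, can be computed from classifier scores as in the sketch below. The function names and the toy scores are hypothetical; the thresholding convention (a sample is called cancer when its score is strictly greater than the threshold) is an assumption, not something the text specifies.

```python
def auc(cancer_scores, non_cancer_scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    wins = 0.0
    for c in cancer_scores:
        for n in non_cancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(non_cancer_scores))


def sensitivity_at_specificity(cancer_scores, non_cancer_scores, specificity=0.99):
    """Return (threshold, sensitivity) at the requested specificity.

    The threshold is chosen on the non-cancer scores so that at most
    (1 - specificity) of them score strictly above it.
    """
    ranked = sorted(non_cancer_scores)
    n = len(ranked)
    # Number of non-cancer samples allowed above the threshold (rounded down).
    allowed_fp = min(int(n * (1.0 - specificity) + 1e-9), n - 1)
    threshold = ranked[n - allowed_fp - 1]
    called = sum(1 for s in cancer_scores if s > threshold)
    return threshold, called / len(cancer_scores)


# Toy scores, for illustration only.
non_cancer = [i / 100 for i in range(100)]
cancer = [0.5, 0.985, 0.99, 0.999]
threshold, sensitivity = sensitivity_at_specificity(cancer, non_cancer, 0.99)
```

With 100 non-cancer samples, a specificity of 99% permits exactly one false positive, so the threshold lands on the second-highest non-cancer score.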

Also provided herein are methods comprising the steps of: obtaining a set of sequence reads of modified test fragments, wherein the modified test fragments are or have been obtained by processing a set of nucleic acid fragments from a test subject, wherein each of the nucleic acid fragments corresponds to or is derived from a plurality of genomic regions selected from any one of Lists 1-16; and applying the set of sequence reads, or a test feature obtained based on the set of sequence reads, to a model obtained by a training process with a first set of fragments from a plurality of training subjects having a first cancer type and a second set of fragments from a plurality of training subjects having a second cancer type, wherein the first set of fragments and the second set of fragments both comprise training fragments.

In some embodiments, the model comprises one of a kernel logistic regression classifier, a random forest classifier, a mixture model, a convolutional neural network, and an autoencoder model. In some embodiments, the set of sequence reads is obtained by using a cancer assay panel provided herein.

Incorporation by Reference

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

Drawings

The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description, which sets forth illustrative embodiments in which the principles of the disclosure are utilized, and to the accompanying drawings, of which:

FIG. 1A depicts a 2 × tiled probe design according to one embodiment, with three probes for a small target area and at least two probes covering each base in a target area (boxed in a dashed rectangle).

FIG. 1B depicts a 2 × tiled probe design according to one embodiment, with more than three probes for a larger target area, and with at least two probes covering each base in a target area (boxed in a dashed rectangle).

FIG. 1C depicts probe designs for several hypomethylated and/or hypermethylated fragments in several genomic regions, according to one embodiment.

FIG. 2 shows a process for generating a cancer assay panel according to one embodiment.

FIG. 3A is a flow diagram describing a process for creating a data structure for a control group according to one embodiment.

FIG. 3B is a flowchart describing an additional step of verifying the data structure for the control group of FIG. 3A, according to one embodiment.

FIG. 4 is a flow chart describing a process for selecting genomic regions for designing probes for a cancer assay panel, according to one embodiment.

FIG. 5 is an illustration of an exemplary p-value score calculation, according to one embodiment.

FIG. 6A is a flow chart describing a process of training a classifier based on hypomethylated and hypermethylated fragments indicative of cancer, according to one embodiment.

FIG. 6B is a flow chart describing a process for determining fragments indicative of cancer by a probabilistic model, according to one embodiment.

FIG. 7A is a flow chart describing a process for sequencing a fragment of cell-free (cf) DNA, according to one embodiment.

FIG. 7B is an illustration of the process of FIG. 7A for sequencing a fragment of cell-free (cf) DNA to obtain a methylation state vector, according to one embodiment.

FIG. 8 shows the degree of bisulfite conversion (upper panel) and the average sequencing coverage/depth (lower panel) at different stages of cancer.

FIG. 9 shows the concentration of cfDNA for each sample at different stages of cancer.

FIG. 10 is a graph of the number of DNA fragments bound to probes according to the size of overlap between the DNA fragments and probes.

FIG. 11A summarizes the frequency of genome annotations for the target genomic regions of List 1 (black) and for randomly selected genomic regions (grey). FIG. 11B summarizes the frequency of genome annotations for the target genomic regions of List 2 (black) and for randomly selected genomic regions (grey). FIG. 11C summarizes the frequency of genome annotations for the target genomic regions of List 3 (black) and for randomly selected genomic regions (grey).

FIG. 12A depicts a flowchart of an apparatus for sequencing a nucleic acid sample according to one embodiment. FIG. 12B depicts an analysis system that analyzes the methylation status of cfDNA according to one embodiment.

FIG. 13 is a shaded matrix representing the number of genomic regions selected to distinguish each target tissue of origin (TOO) (x-axis) from a comparison TOO (y-axis).

FIG. 14 shows validation of the selected genomic regions using cfDNA and WBC gDNA data. The fraction correctly classified (y-axis) is provided for each TOO (x-axis).

FIG. 15A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 4. FIG. 15B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using methylation data of the target genomic regions of List 4.

FIG. 16A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 5. FIG. 16B shows the actual cancer types and predicted cancer types for a classifier generated using the genomic regions of List 5.

FIG. 17A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 6. FIG. 17B shows the actual cancer types and predicted cancer types for a classifier generated using the genomic regions of List 6.

FIG. 18A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 7. FIG. 18B is a confusion matrix describing the accuracy of cancer type classification for subjects using the methylation data of List 7.

FIG. 19A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 8. FIG. 19B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 8.

FIG. 20A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 9. FIG. 20B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 9.

FIG. 21A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 10. FIG. 21B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 10.

FIG. 22A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 11. FIG. 22B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 11.

FIG. 23A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 12. FIG. 23B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 12.

FIG. 24A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 13. FIG. 24B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 13.

FIG. 25A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 14. FIG. 25B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 14.

FIG. 26A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 15. FIG. 26B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 15.

FIG. 27A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of the target genomic regions of List 16. FIG. 27B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using the methylation data of List 16.

FIG. 28A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of a randomly selected 10% subset of the target genomic regions of List 12. FIG. 28B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using methylation data of a randomly selected 10% subset of the target genomic regions of List 12.

FIG. 29A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of a randomly selected 25% subset of the target genomic regions of List 12. FIG. 29B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using methylation data of a randomly selected 25% subset of the target genomic regions of List 12.

FIG. 30A depicts a receiver operating characteristic (ROC) curve showing the sensitivity and specificity of cancer detection using methylation data of a randomly selected 50% subset of the target genomic regions of List 4. FIG. 30B is a confusion matrix describing the accuracy of cancer type classification for subjects with cancer using methylation data of a randomly selected 50% subset of the target genomic regions of List 4.

Detailed Description

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this description belongs. As used herein, the following terms have the meanings ascribed to them below.

As used herein, any reference to "one embodiment" or "an embodiment" means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment; the described embodiments thereby provide a framework in which their various features may work together.

As used herein, "comprises," "comprising," "includes," "including," "has," "having," or any other variation thereof, is intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, "or" means an inclusive or and not an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

Furthermore, the use of "a" or "an" is used to describe elements and components of several embodiments herein. This is done merely for convenience and to give a general sense of the description. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

As used herein, ranges and amounts can be expressed as "about" a particular value or range. "About" also includes the exact amount. Thus, "about 5 micrograms" means "about 5 micrograms" and also means "5 micrograms". Generally, the term "about" includes an amount that would be expected to be within experimental error. In some embodiments, "about" means within plus or minus 20%, 10%, or 5% of the stated number or value. Further, the recitation of ranges herein is intended to serve as a shorthand method of referring to all values within that range, including the recited endpoints. For example, a range of 1 to 50 is understood to include any number, combination of numbers, or subrange from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, and 50.

The term "methylation", as used herein, means the process by which a methyl group is added to a DNA molecule. For example, a hydrogen atom on the pyrimidine ring of a cytosine base may be converted to a methyl group, forming a 5-methylcytosine. The term also refers to the process of adding a hydroxymethyl group to a DNA molecule, for example, by oxidation of the methyl group on the pyrimidine ring of the cytosine base. Methylation and hydroxymethylation tend to occur at the dinucleotides of cytosine and guanine, referred to herein as "CpG sites".

The term "methylation" may also refer to the methylation status of a CpG site. A CpG site having a 5-methylcytosine is methylated. A CpG site having a hydrogen atom in the pyrimidine ring of the cytosine base is unmethylated.

The term also covers the methylation state of a site, i.e., the presence or absence of a methyl group: a site at which a methyl group is present is a methylated site, and a site at which a methyl group is absent is an unmethylated site.

In such embodiments, as is well known in the art, the wet laboratory assay used to detect methylation may differ from that described herein.

The term "methylation site," as used herein, refers to a region of a DNA molecule to which a methyl group can be added. "CpG" sites are the most common sites for methylation, but methylation sites are not limited to CpG sites. For example, DNA methylation can occur at cytosines in CHG and CHH, where H is adenine, cytosine, or thymine. Cytosine methylation in the form of 5-hydroxymethylcytosine, and its features, can also be assessed using the methods and procedures disclosed herein (see, e.g., WO 2010/037001 and WO 2011/127136, incorporated herein by reference).

The term "CpG site" as used herein means a region of a DNA molecule in which a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' to 3' direction. "CpG" is shorthand for 5'-C-phosphate-G-3', that is, cytosine and guanine separated by only one phosphate group. Cytosine in CpG dinucleotides can be methylated to form 5-methylcytosine.
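
As a small illustration of this definition, the sketch below lists the CpG sites in a sequence written 5' to 3'. The function name and the example sequence are hypothetical.

```python
def find_cpg_sites(sequence):
    """Return the 0-based index of each cytosine immediately followed by a guanine.

    The sequence is assumed to be written in the 5' to 3' direction.
    """
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


cpg_positions = find_cpg_sites("ACGTCGCG")
```

Note that "GC" read 5' to 3' is not a CpG site; only the order cytosine then guanine counts.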

The term "CpG detection site" as used herein means a region in a probe configured to hybridize to a CpG site of a target DNA molecule. The CpG sites on the target DNA molecule may include cytosine and guanine separated by a phosphate group, where cytosine is methylated or unmethylated. The CpG sites on the target DNA molecule may include uracil and guanine separated by a phosphate group, wherein the uracil is generated by conversion of unmethylated cytosine.

The term "UpG" is shorthand for 5'-U-phosphate-G-3', that is, uracil and guanine separated by only one phosphate group. UpG can result from a bisulfite treatment of DNA that converts unmethylated cytosines to uracil. Cytosine can also be converted to uracil by other methods known in the art, such as chemical modification, synthesis, or enzymatic conversion.
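
The effect of such a conversion on a sequence can be sketched in silico as below: every unmethylated cytosine becomes uracil, while methylated cytosines are left unchanged, so an unmethylated CpG becomes UpG. The function name and the way methylated positions are supplied are assumptions made for illustration.

```python
def bisulfite_convert(sequence, methylated_positions=()):
    """Simulate conversion of unmethylated cytosines to uracil.

    methylated_positions holds the 0-based indices of methylated cytosines,
    which are protected from conversion.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("U")  # unmethylated cytosine is converted
        else:
            converted.append(base)  # methylated cytosine and other bases kept
    return "".join(converted)


# Cytosine at index 1 is methylated (stays CpG); cytosine at index 4 is not.
converted = bisulfite_convert("ACGTCG", methylated_positions=[1])
```

After amplification, uracil is read as thymine, so in sequence reads an unmethylated CpG typically appears as TG while a methylated CpG still appears as CG.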

The terms "hypomethylation" or "hypermethylation", as used herein, refer to the state of methylation of a DNA molecule containing multiple (e.g., more than 3, 4, 5, 6, 7, 8, 9, 10, etc.) CpG sites, wherein a high proportion (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage in the range of 50% to 100%) of the CpG sites are unmethylated or methylated, respectively.
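
A minimal sketch of this classification is shown below. A fragment is represented as a string with one character per CpG site ("M" for methylated, "U" for unmethylated). The default thresholds of 5 CpG sites and 80% are example values taken from the ranges mentioned above and are used here with "at least" semantics; the function name and the "neither" label are assumptions.

```python
def classify_fragment(states, min_cpgs=5, min_fraction=0.8):
    """Label a fragment 'hypermethylated', 'hypomethylated', or 'neither'.

    states: string of 'M' / 'U', one character per CpG site in the fragment.
    """
    n = len(states)
    if n < min_cpgs:
        return "neither"  # too few CpG sites to call
    if states.count("M") / n >= min_fraction:
        return "hypermethylated"
    if states.count("U") / n >= min_fraction:
        return "hypomethylated"
    return "neither"


labels = [classify_fragment(s) for s in ("MMMMM", "UUUUM", "MMUUM", "MMM")]
```

Fragments with mixed methylation, or with too few CpG sites, fall into neither class and would simply be excluded from hypo/hypermethylation-based features.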

The term "methylation state vector" as used herein refers to a vector comprising a plurality of elements, wherein each element indicates the methylation state of one methylation site in a DNA molecule comprising a plurality of methylation sites, in the order in which the sites occur in the DNA molecule from 5' to 3'. For example, <M_x, M_{x+1}, M_{x+2}>, <M_x, M_{x+1}, U_{x+2}>, ..., <U_x, U_{x+1}, U_{x+2}> can be methylation state vectors of a DNA molecule comprising three methylation sites, wherein M represents a methylated methylation site and U represents an unmethylated methylation site.
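
For illustration, the possible methylation state vectors of a fragment can be enumerated as below; a fragment with n methylation sites has 2^n possible vectors when each site is either M or U. Representing a vector as a string of "M" and "U" characters, and the function name, are assumptions of this sketch.

```python
from itertools import product


def all_state_vectors(n_sites):
    """Enumerate every methylation state vector over n_sites sites.

    Each vector is a string of 'M' (methylated) and 'U' (unmethylated),
    ordered 5' to 3'.
    """
    return ["".join(v) for v in product("MU", repeat=n_sites)]


vectors = all_state_vectors(3)
```

For three sites this yields eight vectors, running from "MMM" through "MMU" to "UUU", matching the example given above.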

The terms "abnormal methylation pattern" or "anomalous methylation pattern", as used herein, refer to a methylation pattern or a methylation state vector of a DNA molecule that is expected to be found in a non-cancer or healthy sample less frequently than a threshold. In one embodiment provided herein, the expectedness of finding a particular methylation state vector in a healthy control group comprising a number of healthy individuals is represented by a p-value. A low p-value score generally corresponds to a methylation state vector that is less expected than other methylation state vectors in samples from healthy individuals. A high p-value score generally corresponds to a methylation state vector that is more expected than other methylation state vectors in samples from healthy individuals in the healthy control group. A methylation state vector having a p-value below a threshold (e.g., 0.1, 0.01, 0.001, 0.0001, etc.) can be defined as an abnormal/anomalous methylation pattern. Various methods known in the art can be used to calculate a p-value or expectedness of a methylation pattern or a methylation state vector. The exemplary methods provided herein involve the use of a Markov chain probability that assumes that the methylation status of a CpG site depends on the methylation status of neighboring CpG sites. The alternative method provided herein calculates the expectedness of observing a particular methylation state vector in healthy individuals by applying a mixture model comprising a plurality of mixture components, each component being an independent site model in which methylation at each CpG site is assumed to be independent of the methylation state at other CpG sites.
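The Markov chain approach described above can be illustrated with a minimal sketch. It assumes a first-order chain over the states M and U, hypothetical transition probabilities (which in practice would be estimated from a healthy control group), and one plausible p-value definition: the total probability of all vectors of the same length that are no more likely than the observed one. The function names and parameters are illustrative and are not taken from this description, and exhaustive enumeration is only practical for short vectors.

```python
from itertools import product


def vector_probability(vector, p_first_m, p_m_given_prev):
    """Probability of a methylation state vector under a first-order Markov chain.

    vector: string of 'M' (methylated) and 'U' (unmethylated) states, 5' to 3'.
    p_first_m: assumed probability that the first site is methylated.
    p_m_given_prev: assumed P(current site is M | previous state), keyed by 'M'/'U'.
    """
    prob = p_first_m if vector[0] == "M" else 1.0 - p_first_m
    for prev, cur in zip(vector, vector[1:]):
        p_m = p_m_given_prev[prev]
        prob *= p_m if cur == "M" else 1.0 - p_m
    return prob


def p_value(vector, p_first_m, p_m_given_prev):
    """Sum the probabilities of all same-length vectors at most as likely as `vector`."""
    observed = vector_probability(vector, p_first_m, p_m_given_prev)
    total = 0.0
    for states in product("MU", repeat=len(vector)):
        prob = vector_probability("".join(states), p_first_m, p_m_given_prev)
        if prob <= observed + 1e-12:  # small tolerance for floating-point ties
            total += prob
    return total


# Hypothetical parameters for a region that is mostly unmethylated in healthy samples
P_FIRST_M = 0.1
P_M_GIVEN_PREV = {"M": 0.6, "U": 0.05}

common = p_value("UUUUU", P_FIRST_M, P_M_GIVEN_PREV)  # expected pattern, high p-value
rare = p_value("MMMMM", P_FIRST_M, P_M_GIVEN_PREV)    # unexpected pattern, low p-value
```

Under these assumed parameters the fully unmethylated vector is the most likely one and receives a p-value of 1, while the fully methylated vector receives a much lower p-value and could be flagged as abnormal depending on the chosen threshold.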

The term "cancer sample" as used herein means a sample comprising genomic DNA from an individual diagnosed with a cancer. The genomic DNA may be, but is not limited to, cfDNA fragments or chromosomal DNA from a subject with a cancer. The genomic DNA may be sequenced and its methylation status may be assessed by methods known in the art, such as bisulfite sequencing. When genomic sequences are obtained from public databases (e.g., The Cancer Genome Atlas (TCGA)) or experimentally obtained by sequencing a genome of an individual diagnosed with a cancer, a cancer sample may mean genomic DNA or cfDNA fragments having the genomic sequences. The term "cancer samples" in the plural means a plurality of samples comprising genomic DNA from a plurality of individuals, each individual diagnosed as having a cancer. In various embodiments, cancer samples from more than 100, 300, 500, 1000, 2000, 5000, 10000, 20000, 40000, 50000 or more individuals diagnosed with cancer are used.

The term "non-cancer sample" as used herein means a sample comprising genomic DNA from an individual who is not diagnosed as having a cancer. The genomic DNA may be, but is not limited to, cfDNA fragments or chromosomal DNA from a subject without a cancer. The genomic DNA may be sequenced and its methylation status may be assessed by methods known in the art, such as bisulfite sequencing. When genomic sequences are obtained from public databases (e.g., The Cancer Genome Atlas (TCGA)) or experimentally obtained by sequencing a genome of an individual without a cancer, a non-cancer sample can mean genomic DNA or cfDNA fragments having the genomic sequences. The term "non-cancer samples" in the plural means a plurality of samples comprising genomic DNA from a plurality of individuals, each individual diagnosed as not having cancer. In various embodiments, healthy samples from more than 100, 300, 500, 1000, 2000, 5000, 10000, 20000, 40000, 50000 or more individuals diagnosed as not having cancer are used.

The term "training sample", as used herein, means a sample used to train a classifier and/or to select one or more genomic regions for cancer detection or for detection of a cancer tissue of origin or cancer cell type as described herein. The training sample may include genomic DNA or modifications thereof from one or more healthy subjects or from one or more subjects having a disease condition (e.g., cancer, a particular type of cancer, a particular stage of cancer, etc.). The genomic DNA may be, but is not limited to, cfDNA fragments or chromosomal DNA. The genomic DNA may be sequenced and its methylation status may be assessed by methods known in the art, such as bisulfite sequencing. When genomic sequences are obtained from public databases (e.g., The Cancer Genome Atlas (TCGA)) or experimentally obtained by sequencing a genome of an individual, a training sample may mean genomic DNA or cfDNA fragments having the genomic sequences.

The term "test sample", as used herein, means a sample from a subject whose health condition has been or will be detected using a classifier and/or a test combination described herein. The test sample may comprise genomic DNA or a modification thereof. The genomic DNA may be, but is not limited to, several cfDNA fragments or chromosomal DNA.

The term "genomic region of interest" as used herein means a region in a genome that is selected for analysis in a test sample. An assay detection assembly is generated having a plurality of probes designed to hybridize to (and optionally pull down) a plurality of nucleic acid fragments derived from the target genomic region or a fragment of the target genomic region. A nucleic acid fragment derived from the genomic region of interest means a nucleic acid fragment generated by degradation, cleavage, bisulfite conversion, or other treatment of DNA from the genomic region of interest.

Various genomic regions of interest are described in terms of their chromosomal location in the sequence listing filed herewith. Chromosomal DNA is double-stranded, and thus a target genomic region includes two DNA strands: one having the sequence provided in the list, and a second strand that is the reverse complement of the sequence in the list. Probes can be designed to hybridize to one or both strands. Optionally, the probes hybridize to a converted sequence resulting from, for example, treatment with sodium bisulfite.

The term "off-target genomic region" as used herein means a region in a genome that is not selected for analysis in a test sample, but that has sufficient homology to a target genomic region to potentially be bound and pulled down by a probe designed against the target genomic region. In one embodiment, an off-target genomic region is a genomic region that aligns to a probe along at least 45 bases with at least 90% identity.

By "converted DNA molecules", "converted cfDNA molecules" and "modified fragments obtained from processing of the cfDNA molecules" is meant DNA molecules obtained by processing DNA or cfDNA molecules in a sample in order to distinguish methylated from unmethylated nucleotides in the DNA or cfDNA molecules. For example, in one embodiment, the sample may be treated with bisulfite ions (e.g., using sodium bisulfite) to convert unmethylated cytosine ("C") to uracil ("U"), as is well known in the art. In another embodiment, conversion of unmethylated cytosine to uracil is accomplished using an enzymatic conversion reaction, for example, using a cytidine deaminase (e.g., APOBEC). After processing, the converted DNA molecules or cfDNA molecules include additional uracil that was not present in the original cfDNA sample. Replication of a DNA strand comprising a uracil by a DNA polymerase results in the addition of adenine to the new complementary strand, rather than the guanine that is normally added as the complement of cytosine or methylcytosine.
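The effect of conversion and subsequent copying can be shown with a small sketch. It assumes an idealized, complete conversion in which every unmethylated cytosine becomes uracil and every methylated cytosine is protected; the function names and the example sequence are illustrative only.

```python
def bisulfite_convert(seq, methylated_positions):
    """Convert every unmethylated C to U; Cs at methylated positions are protected."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


# Base pairing used by a polymerase when copying a template; U pairs with A
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def copy_strand(template):
    """Return the newly synthesized strand (reverse complement, written 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


original = "ACGTCGA"                     # Cs at index 1 and index 4
converted = bisulfite_convert(original, methylated_positions={1})
copy_of_original = copy_strand(original)
copy_of_converted = copy_strand(converted)
```

Here the methylated cytosine at index 1 is retained, the unmethylated cytosine at index 4 becomes uracil, and the copied strand carries an adenine opposite that uracil where the copy of the unconverted strand carries a guanine.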

By "cell-free nucleic acid", "cell-free DNA" or "cfDNA" is meant a nucleic acid fragment that circulates within the body (e.g., the bloodstream) of a subject and is derived from one or more healthy cells and/or from one or more cancer cells. In addition, cfDNA can be from other sources such as viruses, fetuses, and the like.

The terms "circulating tumor DNA" or "ctDNA" or the like refer to nucleic acid fragments derived from tumor cells that may be released into the bloodstream of an individual as a result of a biological process, such as apoptosis or necrosis of dying cells, or that may be actively released by viable tumor cells.

The term "fragment", as used herein, may refer to a fragment of a nucleic acid molecule. For example, in one embodiment, a fragment may refer to a cfDNA molecule in a blood or plasma sample, or a cfDNA molecule extracted from a blood or plasma sample. An amplified product of a cfDNA molecule may also be referred to as a "fragment". In another embodiment, the term "fragment" as described herein means a sequence read, or a group of sequence reads, that has been processed for subsequent analysis (e.g., in machine learning-based classification). For example, as is well known in the art, raw sequence reads can be aligned to a reference genome and the matched paired end sequence reads assembled into a longer fragment for subsequent analysis.

The term "individual" means a human individual. The term "healthy individual" means an individual who is assumed not to have a cancer or disease.

The term "subject" means an individual whose DNA is analyzed. A subject may be a test subject whose DNA is evaluated using a targeted detection combination as described herein to assess whether the person has a cancer or other disease. A subject may also be a member of a control group that is known not to have a cancer or other disease. A subject may also be a member of a cancer or other disease group that is known to have a cancer or other disease. Control and cancer/disease groups can be used to aid in the design or validation of the targeted detection combination.

The term "sequence read," as used herein, means a nucleotide sequence read from a sample. Sequence reads can be obtained via various methods provided herein or known in the art.

The term "sequencing depth" as used herein means a count of the number of times a given target nucleic acid in a sample is sequenced (e.g., a count of sequence reads at a given target region). Increasing the depth of sequencing can reduce the amount of nucleic acid required to assess a disease state (e.g., the state of a cancer or cancer-derived tissue).

The terms "tissue of origin" or "TOO", as used herein, refer to an organ, group of organs, body region, or cell type from which a cancer arises or originates. Identification of a tissue of origin or cancer cell type typically allows identification of the most appropriate next steps in the continuum of cancer care for further diagnosis, staging, and decisions on treatment.

"Transition" generally means a change in base composition from one purine to another purine, or from one pyrimidine to another pyrimidine. By way of example, the following changes are transitions: C → U, U → C, G → A, A → G, C → T and T → C.

"The entirety of probes" or "the entirety of polynucleotide-containing probes" of a detection combination or decoy set generally means all probes delivered with a particular detection combination or decoy set. For example, in some embodiments, a detection combination or decoy set can include (1) probes having the characteristics specified herein (e.g., probes for binding to cell-free DNA fragments corresponding to or derived from genomic regions set forth herein in one or more lists) and (2) additional probes that do not have such characteristic(s). The entirety of probes of a detection combination generally refers to all probes delivered with the detection combination or decoy set, including probes that do not have the specified characteristic(s).

Cancer assay detection combination:

In a first aspect, the present description provides a cancer assay detection combination comprising a plurality of probes or a plurality of probe pairs. The assay detection combinations described herein may alternatively be referred to as decoy sets, or as compositions comprising decoy oligonucleotides. The plurality of probes can be polynucleotide-containing probes specifically designed to target one or more genomic regions that are differentially methylated between cancer and non-cancer samples, between different cancer tissue of origin (TOO) types, between different cancer cell types, or between samples at different stages of cancer, as identified by the methods provided herein. In some embodiments, the target genomic regions (or nucleic acids derived from the target genomic regions) are selected to maximize classification accuracy subject to a size budget (determined by the sequencing budget and the desired sequencing depth).

To design a cancer assay detection combination, the analysis system may collect samples corresponding to the various outcomes under consideration, e.g., samples known to have cancer, samples considered healthy, samples with a known tissue of origin, etc. The source of cfDNA and/or ctDNA used to select genomic regions of interest may vary depending on the purpose of the assay. For example, different sources may be required for assays aimed at detecting cancer in general, specific types of cancer, stages of cancer, or tissues of origin. The samples can be processed by Whole Genome Bisulfite Sequencing (WGBS), or the data can be obtained from public databases such as TCGA. The analysis system may be any general purpose computing system having a computer processor and a computer-readable storage medium storing instructions that, when executed by the computer processor, perform any or all of the operations described herein.

The analysis system can then select genomic regions of interest based on the methylation patterns of the nucleic acid fragments. One approach considers, for each region (or more specifically the CpG sites within a region), its pairwise distinguishability between pairs of outcomes. Another approach considers the distinguishability of a region (or more specifically the CpG sites within a region) for each outcome versus the remaining outcomes. From the selected genomic regions of interest with high discriminatory power, the analysis system can design probes to target fragments from the selected genomic regions. The analysis system may generate cancer assay detection combinations of different sizes, for example, a small cancer assay detection combination including probes targeting the most informative genomic regions, a medium cancer assay detection combination including the probes from the small cancer assay detection combination and additional probes targeting second-tier informative genomic regions, and a large cancer assay detection combination including the probes from the small and medium cancer assay detection combinations and further probes targeting third-tier informative genomic regions. Using data obtained from such cancer assay detection combinations (e.g., the methylation state of nucleic acids captured by the cancer assay detection combinations), the analysis system may train classifiers using various classification techniques to predict the likelihood that a sample has a particular outcome or state, such as cancer, a specific cancer type, other conditions, other diseases, and the like.

An exemplary method for designing a cancer assay detection combination is generally depicted in fig. 2. For example, to design a cancer assay detection combination, an analysis system may collect information on the methylation status of CpG sites of nucleic acid fragments from samples corresponding to the various outcomes under consideration, e.g., samples known to have cancer, samples considered healthy, samples with a known tissue of origin, etc. These samples can be processed (e.g., with Whole Genome Bisulfite Sequencing (WGBS)) to determine the methylation status of the CpG sites, or the information can be obtained from TCGA. The analysis system may be any general purpose computing system having a computer processor and a computer-readable storage medium storing instructions that, when executed by the computer processor, perform any or all of the operations described in the present disclosure.

The analysis system can then select genomic regions of interest based on the methylation patterns of the nucleic acid fragments. One approach considers, for each region (or more specifically, its CpG sites), its pairwise distinguishability between pairs of outcomes. Another approach considers the distinguishability of a region (or more specifically, its CpG sites) for each outcome versus the remaining outcomes. From the selected genomic regions of interest with high discriminatory power, the analysis system can design probes to target fragments from the selected genomic regions. The analysis system may generate cancer assay detection combinations of different sizes, e.g., a small cancer assay detection combination comprising probes targeting the most informative genomic regions, a medium cancer assay detection combination comprising the probes from the small cancer assay detection combination and additional probes targeting second-tier informative genomic regions, and a large cancer assay detection combination comprising the probes from the small and medium cancer assay detection combinations and further probes targeting third-tier informative genomic regions. Using data obtained from such cancer assay detection combinations, the analysis system may train classifiers using various classification techniques to predict the likelihood that a sample has a particular outcome or state, such as cancer, a specific cancer type, other conditions, other diseases, and the like.
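The nested small, medium, and large detection combinations described above can be sketched as prefixes of a single ranking. This is only an illustration: the region names, the informativeness scores, and the tier sizes are hypothetical, and the description does not prescribe how the ranking is cut into tiers.

```python
def build_nested_panels(region_scores, tier_sizes):
    """Rank regions by score (highest first) and cut nested tiers from the ranking.

    region_scores: mapping of region name -> informativeness score (hypothetical).
    tier_sizes: mapping of tier name -> number of top-ranked regions to include.
    """
    ranked = sorted(region_scores, key=region_scores.get, reverse=True)
    return {tier: ranked[:size] for tier, size in tier_sizes.items()}


scores = {
    "region_A": 0.92,
    "region_B": 0.85,
    "region_C": 0.61,
    "region_D": 0.40,
    "region_E": 0.15,
}
panels = build_nested_panels(scores, {"small": 2, "medium": 3, "large": 5})
```

Because each tier is a prefix of the same ranking, every region in the small tier also appears in the medium tier, and every region in the medium tier also appears in the large tier.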

In some embodiments, the cancer assay detection combination comprises at least 500 pairs of probes, wherein each of the at least 500 pairs comprises two probes configured to overlap with each other by an overlapping sequence, wherein the overlapping sequence comprises at least 30 nucleotides, and wherein each probe is configured to hybridize to the same strand of an (optionally converted) DNA molecule (e.g., a cfDNA molecule) corresponding to one or more genomic regions. In some embodiments, each of the genomic regions comprises at least five methylation sites, wherein the at least five methylation sites have an aberrant methylation pattern in cancer samples, or have different methylation states between samples of different TOOs. For example, in one embodiment, the at least five methylation sites are differentially methylated between cancer and non-cancer samples, or between one or more pairs of samples of cancers from different tissues of origin. In some embodiments, each pair of probes includes a first probe and a second probe, wherein the second probe is different from the first probe. The second probe may overlap with the first probe by an overlapping sequence that is at least 30, at least 40, at least 50, or at least 60 nucleotides in length.

The target genomic regions may be selected from any one of lists 1 to 16 (table 1). In some embodiments, the cancer assay detection combination comprises a number of probes, wherein each of the probes is configured to hybridize to a converted cfDNA molecule corresponding to one or more genomic regions in any one of lists 1 to 16. In some embodiments, the different decoy oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of lists 1 to 16. In some embodiments, the different decoy oligonucleotides are configured to hybridize to DNA molecules derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of any one of lists 1 to 16.

The genomic regions of interest may be selected from list 1. The genomic regions of interest may be selected from list 2. The genomic regions of interest may be selected from list 3. The genomic regions of interest may be selected from list 4. The genomic regions of interest may be selected from list 5. The genomic regions of interest may be selected from list 6. The genomic regions of interest may be selected from list 7. The genomic regions of interest may be selected from list 8. The genomic regions of interest may be selected from list 9. The genomic regions of interest may be selected from list 10. The genomic regions of interest may be selected from list 11. The genomic regions of interest may be selected from list 12. The genomic regions of interest may be selected from list 13. The genomic regions of interest may be selected from list 14. The genomic regions of interest may be selected from list 15. The genomic regions of interest may be selected from list 16.

Because the probes are configured to hybridize to converted DNA or cfDNA molecules corresponding to or derived from one or more genomic regions, the probes may have a sequence that differs from the genomic region of interest. For example, a DNA molecule containing an unmethylated CpG site is converted to include UpG instead of CpG, because unmethylated cytosines are converted to uracil by a conversion reaction (e.g., bisulfite treatment). Thus, a probe is configured to hybridize to a sequence that includes UpG rather than the unmethylated CpG originally present. Accordingly, the site in the probe that is complementary to the unmethylated site may include CpA rather than CpG, and some probes that target a hypomethylated region in which all methylation sites are unmethylated may have no guanine (G) bases. In some embodiments, at least 3%, 5%, 10%, 15%, or 20% of the probes do not have CpG sequences.
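The consequence for probe composition can be checked with a short sketch. It assumes complete conversion, that all non-CpG cytosines are unmethylated, and that every CpG site in the target shares one methylation state; the probe is taken to be the reverse complement of the converted target strand. The helper names and the example sequence are illustrative.

```python
# Base pairing used to design a probe against a converted strand; U pairs with A
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def convert_target(seq, cpg_methylated):
    """Convert a target strand: non-CpG Cs always become U (assumed unmethylated);
    CpG Cs become U only when the CpG sites are unmethylated."""
    converted = []
    for i, base in enumerate(seq):
        in_cpg = base == "C" and seq[i + 1:i + 2] == "G"
        if base == "C" and not (in_cpg and cpg_methylated):
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def probe_for(converted_target):
    """Probe sequence complementary to the converted target, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(converted_target))


target = "TACGACTCGGACGTT"  # three CpG sites and one non-CpG cytosine
hypo_probe = probe_for(convert_target(target, cpg_methylated=False))
hyper_probe = probe_for(convert_target(target, cpg_methylated=True))
```

For this example the probe against the fully unmethylated target contains CpA where the CpG sites were and has no guanine at all, whereas the probe against the fully methylated target retains one CpG per methylated site.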

The cancer assay detection combination may be used to detect the presence or absence of cancer in general and/or to provide a classification of the cancer, such as a type of cancer or a stage of cancer (for example, stage I, II, III, or IV), or to provide the TOO from which the cancer is believed to originate. The detection combination may include probes that target genomic regions that are differentially methylated between general cancer (pan-cancer) samples and non-cancer samples, or that are differentially methylated only in cancer samples of a particular cancer type (e.g., lung cancer specific targets). For example, in some embodiments, a cancer assay detection combination is designed to include differentially methylated genomic regions based on sequencing data from converted (e.g., bisulfite-converted) cfDNA of cancer and non-cancer individuals.

Each probe (or pair of probes) can be designed to target one or more target genomic regions. The genomic regions of interest are selected based on several criteria designed to increase selective enrichment of informative nucleic acid fragments while reducing noise and non-specific binding.

In an embodiment, a detection combination can include a number of probes that can selectively bind to and enrich for cfDNA fragments that are differentially methylated in a cancer sample. In this case, sequencing of the enriched fragments can provide information relevant to the detection of cancer. Further, in some embodiments, the probes (or portions thereof) are designed to target genomic regions of interest that are determined to have an aberrant methylation pattern in cancer samples, or in samples from a particular cancer type, tissue type, or cell type. In one embodiment, probes are designed to target genomic regions that are determined to be hypermethylated or hypomethylated in a particular cancer or cancer type to provide additional selectivity and specificity of detection. In some embodiments, a detection combination includes probes that target hypomethylated fragments. In some embodiments, a detection combination includes probes that target hypermethylated fragments. In some embodiments, a detection combination comprises a first set of probes for hypermethylated fragments and a second set of probes for hypomethylated fragments. In some embodiments, a cancer assay detection combination includes not only probes designed to target regions with a first methylation state (e.g., hypomethylation), but also probes designed to hybridize to the same target regions with the opposite methylation state (e.g., hypermethylation). Targeting probes to both the hypomethylated and the hypermethylated fragments of the same region can be referred to as "binary" targeting (see information in the sequence listing) (fig. 1C). 
In some embodiments, the ratio between the probes targeting the first set of hypermethylated fragments and the probes targeting the second set of hypomethylated fragments (hypermethylation:hypomethylation ratio) is between 0.4 and 2, between 0.5 and 1.8, between 0.5 and 1.6, between 0.5 and 1.0, between 1.4 and 1.6, between 1.2 and 1.4, between 1 and 1.2, between 0.8 and 1, between 0.6 and 0.8, or between 0.4 and 0.6. Methods of identifying genomic regions that produce differentially methylated DNA molecules (or aberrantly methylated DNA molecules) between cancer and non-cancer samples, between different cancer tissue of origin (TOO) types, between different cancer cell types, or between samples of different stages of cancer are provided in detail herein, as are methods of identifying aberrantly methylated DNA molecules or fragments that are indicative of cancer.

In a second example, genomic regions may be selected when they produce abnormally methylated DNA molecules in cancer samples or in samples with a known cancer tissue of origin (TOO) type. For example, as described herein, a Markov model trained on a set of non-cancer samples can be used to identify genomic regions that produce aberrantly methylated DNA molecules (i.e., DNA molecules having a methylation pattern below a p-value threshold).

Each of the plurality of probes may target a genomic region comprising at least 30bp (base pair), 35bp, 40bp, 45bp, 50bp, 60bp, 70bp, 80bp, 90bp, 100bp or more. In some embodiments, the several genomic regions may be selected to have fewer than 30, 25, 20, 15, 12, 10, 8, or 6 methylation sites.

In some examples, the genomic regions may be selected when at least 80, 85, 90, 92, 95, or 98% of the at least five methylation (e.g., CpG) sites in the region are methylated or unmethylated in non-cancer samples, in cancer samples, or in cancer samples from a given cancer tissue of origin (TOO).

Genomic regions may be further filtered based on their methylation patterns to select only genomic regions that are likely to be informative, for example, based on CpG sites that are differentially methylated between cancer and non-cancer samples (e.g., abnormally methylated or unmethylated in cancer relative to non-cancer), between cancer samples of one TOO and cancer samples of a different TOO, or only differentially methylated in cancer samples of one TOO. For the selection, calculations may be performed for each CpG site or for several CpG sites. For example, a first count is determined as the number of cancer-containing samples that include a fragment overlapping the CpG site (cancer_count), and a second count is determined as the total number of samples that include a fragment overlapping the CpG site (total). Genomic regions can be selected based on criteria that correlate positively with the number of cancer-containing samples that include a fragment indicative of cancer overlapping the CpG site (cancer_count) and negatively with the total number of samples that include a fragment indicative of cancer overlapping the CpG site (total). In one embodiment, the number of non-cancer samples (n_non-cancer) and the number of cancer samples (n_cancer) having a fragment overlapping a CpG site are counted. The probability that a sample is cancerous is then estimated, for example, as (n_cancer + 1)/(n_cancer + n_non-cancer + 2). This principle applies to other outcomes as well.
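As a brief illustration of the estimate (n_cancer + 1)/(n_cancer + n_non-cancer + 2), the sketch below scores a few CpG sites and ranks them. The site names and counts are made up for the example, and the function name is an assumption.

```python
def cancer_probability(n_cancer, n_non_cancer):
    """Smoothed estimate (n_cancer + 1) / (n_cancer + n_non_cancer + 2)."""
    return (n_cancer + 1) / (n_cancer + n_non_cancer + 2)


# Hypothetical counts per CpG site: (n_cancer, n_non_cancer) samples having a
# fragment that overlaps the site
site_counts = {
    "cpg_1": (40, 2),
    "cpg_2": (5, 5),
    "cpg_3": (0, 0),
    "cpg_4": (1, 30),
}

scores = {site: cancer_probability(*counts) for site, counts in site_counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The added 1 and 2 keep the estimate at 0.5 for a site with no observations at all, so a site supported by a single sample is not scored as if it were perfectly informative.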

CpG sites scored by this metric are ranked and greedily added to a detection combination until the size budget of the detection combination is exhausted. The procedure for selecting genomic regions indicative of cancer is further detailed herein. In some embodiments, different target regions may be selected depending on whether the assay is intended to be a multi-cancer assay (pan-cancer assay) or a single-cancer assay, or depending on the flexibility required in selecting which CpG sites contribute to the detection combination. A detection combination for detecting a particular cancer type can be designed using a similar process. In this example, for each cancer type, and for each CpG site, the information gain is calculated to determine whether to include a probe for that CpG site. The information gain may be calculated for samples of a cancer with a given TOO as compared to all other samples. For example, consider two random variables, "AF" and "CT". "AF" is a binary variable that indicates whether there is an abnormal fragment overlapping a particular CpG site in a particular sample (yes or no). "CT" is a binary random variable that indicates whether a cancer is of a particular type (e.g., lung cancer or a cancer other than lung cancer). Given "AF", the mutual information with "CT" can be calculated. That is, the calculation determines how many bits of information about the type of cancer (e.g., lung cancer or a cancer other than lung cancer) are obtained by knowing whether an abnormal fragment overlaps a specific CpG site. This can be used to rank CpGs based on how lung-specific they are. This procedure is repeated for several cancer types. If a particular region is differentially methylated only in lung cancer (and not in other cancer types or non-cancer), the CpGs in that region will tend to have a high information gain for lung cancer. 
For each cancer type, several CpG sites are ranked by this information gain metric, then greedily added to a detection combination until the size budget for that cancer type is exhausted.
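The information gain ranking and the greedy selection under a size budget can be sketched together. The sketch assumes that AF and CT are tabulated as a 2x2 table of sample counts per CpG site, that mutual information is measured in bits, and that each site carries a fixed cost against the budget; the function names, the cost model, and the choice to stop at the first site that no longer fits are illustrative assumptions.

```python
from math import log2


def mutual_information(counts):
    """Mutual information (in bits) between AF and CT.

    counts[(af, ct)] is the number of samples with abnormal-fragment indicator
    af (0 or 1) and cancer-type indicator ct (0 or 1).
    """
    total = sum(counts.values())
    mi = 0.0
    for af in (0, 1):
        for ct in (0, 1):
            joint = counts.get((af, ct), 0) / total
            if joint == 0:
                continue  # a zero cell contributes nothing
            p_af = sum(counts.get((af, c), 0) for c in (0, 1)) / total
            p_ct = sum(counts.get((a, ct), 0) for a in (0, 1)) / total
            mi += joint * log2(joint / (p_af * p_ct))
    return mi


def greedy_select(site_scores, site_cost, budget):
    """Add sites in descending score order until the next site would exceed the budget."""
    selected, used = [], 0
    for site in sorted(site_scores, key=site_scores.get, reverse=True):
        if used + site_cost[site] > budget:
            break
        selected.append(site)
        used += site_cost[site]
    return selected


# Hypothetical tables: a site that perfectly separates the cancer type,
# and a site that carries no information about it
perfect = {(1, 1): 50, (0, 0): 50}
uninformative = {(1, 1): 25, (1, 0): 25, (0, 1): 25, (0, 0): 25}

site_scores = {
    "cpg_specific": mutual_information(perfect),
    "cpg_noise": mutual_information(uninformative),
}
chosen = greedy_select(site_scores, {"cpg_specific": 1, "cpg_noise": 1}, budget=1)
```

A site whose abnormal fragments occur in exactly the samples of one cancer type yields 1 bit, a site whose abnormal fragments are unrelated to cancer type yields 0 bits, and with a budget of one site only the informative site is selected.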

Further filtering may be performed to select probes that have high specificity (i.e., high binding efficiency) for enrichment of nucleic acids derived from the genomic regions of interest. Probes may be filtered to reduce non-specific binding (or off-target binding) to nucleic acids derived from non-target genomic regions. For example, probes can be filtered to select only those probes that have fewer off-target binding events than a set threshold. In one embodiment, probes may be aligned to a reference genome (e.g., a human reference genome) to select probes that align to fewer than a set threshold number of regions across the genome. For example, probes can be selected that align to fewer than 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, or 8 off-target regions across the reference genome. In other cases, filtering is performed to remove genomic regions when the sequence of the genomic regions of interest occurs more than 5 times, 10 times, 15 times, 20 times, 21 times, 22 times, 23 times, 24 times, 25 times, 26 times, 27 times, 28 times, 29 times, 30 times, 31 times, 32 times, 33 times, 34 times, or 35 times in a genome. 
Further filtering can be performed to select target genomic regions when a probe sequence, or a set of probe sequences, that is 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% homologous to the target genomic regions occurs fewer than 25 times, 24 times, 23 times, 22 times, 21 times, 20 times, 19 times, 18 times, 17 times, 16 times, 15 times, 14 times, 13 times, 12 times, 11 times, 9 times, or 8 times in a reference genome. Alternatively, filtering can be performed to remove target genomic regions when the probe sequence, or the set of probe sequences, designed to enrich for a target genomic region and being 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% homologous to the target genomic regions occurs more than 5 times, 10 times, 15 times, 20 times, 21 times, 22 times, 23 times, 24 times, 25 times, 26 times, 27 times, 28 times, 29 times, 30 times, 31 times, 32 times, 33 times, 34 times, or 35 times in a reference genome. This excludes repetitive probes that may pull down off-target fragments, which are undesirable and may reduce assay efficiency.

In some embodiments, a fragment-probe overlap of at least 45bp is demonstrated to be sufficient to achieve a non-negligible amount of pull-down, as provided in Example 1 (although one of ordinary skill in the art will appreciate that this number is variable). In some embodiments, more than 10% mismatch between the probe and fragment sequences in the region of overlap is sufficient to substantially disrupt binding, and thus pull-down efficiency. Therefore, sequences that can align to the probe along at least 45bp with at least a 90% match rate can be candidates for off-target pull-down. Thus, in one embodiment, the number of such regions is scored. The best probes have a score of 1, meaning that they match in only one place (the intended target region). Probes with an intermediate score (e.g., less than 5 or 10) may be acceptable in some cases, and in some cases any probes above a particular score are discarded. Other cutoff values may be used for particular samples.
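
The scoring described above can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiments: it assumes that an external aligner has already reported, for each genomic location where a probe aligns, the aligned length and the number of matching bases. The function names, the tuple layout, and the `max_score` cutoff are hypothetical.

```python
# Hypothetical input: one (aligned_length_bp, matching_bases) tuple per
# genomic location where an external aligner placed the probe.
def off_target_score(hits, min_overlap=45, min_identity=0.90):
    """Count locations that align along >= 45bp with >= 90% match rate."""
    return sum(
        1
        for aligned_length, matches in hits
        if aligned_length >= min_overlap
        and matches / aligned_length >= min_identity
    )


def keep_probe(hits, max_score=1):
    """Keep a probe only if its score does not exceed the chosen cutoff."""
    return off_target_score(hits) <= max_score


# Intended target (120/120), one plausible off-target site (46/50),
# one hit that is too short (44bp) and one with too many mismatches (50/60).
example_hits = [(120, 120), (50, 46), (44, 44), (60, 50)]
```

With `example_hits`, the probe scores 2, so it would be discarded under a cutoff of 1 but retained under an intermediate cutoff such as 5.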

Once the probes have hybridized to and captured DNA fragments corresponding to or derived from a target genomic region, the hybridized probe-DNA fragment intermediates are pulled down (or isolated), and the target DNA is amplified and sequenced. The sequence reads provide information relevant to the detection of cancer. For this purpose, a detection combination is designed to include probes that can capture fragments that together provide information relevant to the detection of cancer. In some embodiments, a detection combination comprises at least 500, 1000, 2000, 2500, 5000, 6000, 7500, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 50000, 60000, 70000, or 80000 pairs of probes. In other embodiments, a detection combination comprises at least 1000, 2000, 5000, 10000, 12000, 15000, 20000, 30000, 40000, 50000, 100000, 200000, 250000, 300000, 400000, 500000, 550000, 600000, 700000, or 800000 probes. The probes may collectively comprise at least 0.2 million, 0.4 million, 0.6 million, 0.8 million, 1 million, 2 million, 3 million, 4 million, 5 million, 6 million, 7 million, 8 million, 9 million, 12 million, 14 million, 15 million, 20 million, or 25 million nucleotides.

The selected genomic regions may be located at various positions in a genome, including but not limited to exons, introns, intergenic regions, and other portions (see Fig. 11). In some embodiments, probes directed to non-human genomic regions, such as probes directed to viral genomic regions, may be added.

In some cases, primers may be used to specifically amplify (e.g., by PCR) targets/biomarkers of interest, thereby enriching the sample for the desired targets/biomarkers (optionally without hybrid capture). For example, forward and reverse primers can be prepared for each genomic region of interest and used to amplify fragments corresponding to or derived from the desired genomic region. Thus, while the present disclosure pays particular attention to cancer assay detection combinations and bait sets for hybrid capture, the disclosure is broad enough to encompass other methods for enrichment of cell-free DNA. Accordingly, a skilled artisan, having the benefit of this disclosure, will recognize that methods similar to those described herein in connection with hybrid capture can alternatively be accomplished by replacing hybrid capture with other enrichment strategies, such as PCR amplification of cell-free DNA fragments corresponding to genomic regions of interest. In some embodiments, bisulfite padlock probe capture is used to enrich regions of interest, as described in Zhang et al (US 2016/0340740). In some embodiments, additional or alternative methods are used for enrichment (e.g., non-targeted enrichment), such as reduced representation bisulfite sequencing, methylation restriction enzyme sequencing, methylated DNA immunoprecipitation sequencing, methyl-CpG-binding domain protein sequencing, methyl DNA capture sequencing, or microdroplet PCR.

Probes:

A cancer assay detection combination (alternatively referred to as a "bait set") provided herein is a detection combination that includes a set of hybridization probes (also referred to herein as "probes") designed to target and pull down nucleic acid fragments of interest during enrichment for use in the assay. In some embodiments, the probes are designed to hybridize to and enrich for DNA or cfDNA molecules from cancer samples that have been treated to convert unmethylated cytosines (C) to uracil (U). In other embodiments, the probes are designed to hybridize to and enrich for DNA or cfDNA molecules from cancer samples of a TOO (or of several TOOs) that have been treated to convert unmethylated cytosines (C) to uracil (U). The probes may be designed to bind (or hybridize) to a target (complementary) strand of DNA or RNA. The target strand may be a "positive" strand (e.g., a strand that is transcribed into mRNA and subsequently translated into a protein) or a complementary "negative" strand. In a particular embodiment, a cancer assay detection combination may include sets of two probes, one probe targeting the positive strand and the other targeting the negative strand of a target genomic region.

For each genomic region of interest, four possible probe sequences can be designed. The DNA molecules of each target region are double-stranded, and thus a probe or set of probes can be directed against either the "positive" or forward strand, or its reverse complement (the "negative" strand). Furthermore, in some embodiments, the probes or probe sets are designed to enrich for DNA molecules or fragments that have been treated to convert unmethylated cytosine (C) to uracil (U). Because the probes or probe sets are designed to enrich for converted DNA molecules corresponding to or derived from the target regions, the probe sequences can be designed to enrich for DNA molecules or fragments in which unmethylated C has been converted to U (by using A in place of G at the positions of unmethylated cytosines in the DNA molecules or fragments corresponding to or derived from the target regions). In one embodiment, the probes are designed to bind or hybridize to DNA molecules or fragments (e.g., hypermethylated or hypomethylated DNA molecules) from genomic regions known to contain cancer-specific methylation patterns, thereby enriching for cancer-specific DNA molecules or fragments. Targeting genomic regions, or cancer-specific methylation patterns, may advantageously allow specific enrichment of DNA molecules or fragments identified as informative for cancer or cancer TOO, and thus reduce sequencing requirements and sequencing costs. In other embodiments, two probe sequences (one probe per DNA strand) can be designed for each genomic region of interest. In still other cases, probes are designed to enrich for all DNA molecules or fragments corresponding to or derived from a target region (i.e., regardless of strand or methylation state). This may be because the cancer methylation state is not highly methylated or unmethylated, or because the probes are designed to target small mutations or other variations, rather than methylation changes, that similarly indicate the presence or absence of a cancer, or the presence or absence of a cancer of one or more TOOs. In this case, all four possible probe sequences may be included for each target genomic region.
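
The four candidate probe sequences for a region can be illustrated with a short in-silico conversion. This is a simplified sketch under stated assumptions: CpG cytosines are treated as either all methylated or all unmethylated, every non-CpG cytosine is treated as unmethylated, converted uracil is written as T (as it would read after amplification), and each probe is taken to be the reverse complement of the converted strand it should capture. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def convert(strand, methylated):
    """Simulate conversion: C -> T unless it is a methylated CpG cytosine."""
    out = []
    for i, base in enumerate(strand):
        is_cpg = base == "C" and strand[i + 1:i + 2] == "G"
        if base == "C" and not (is_cpg and methylated):
            out.append("T")  # unmethylated C is read as T after conversion
        else:
            out.append(base)
    return "".join(out)


def four_probes(region):
    """Return probes keyed by (strand, methylation state) for one region."""
    probes = {}
    strands = (("forward", region), ("reverse", reverse_complement(region)))
    for strand_name, strand in strands:
        for state, methylated in (("methylated", True), ("unmethylated", False)):
            converted = convert(strand, methylated)
            probes[(strand_name, state)] = reverse_complement(converted)
    return probes
```

For the toy region `ACGTC`, the forward-strand probes are `AACGT` (methylated) and `AACAT` (unmethylated); the second contains A where the first has G, which matches the substitution described above.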

The probes may be 10s, 100s, 200s, or 300s of base pairs in length. The probes may comprise at least 50, 75, 100, or 120 nucleotides. The probes may comprise fewer than 300, 250, 200, or 150 nucleotides. In one embodiment, the probes comprise 100 to 150 nucleotides. In a particular embodiment, the probes comprise 120 nucleotides.

In some embodiments, the probes are designed in a "2x tiled" fashion to cover overlapping portions of a target region. Each probe optionally overlaps in coverage, at least partially, with another probe in the library. In such embodiments, the detection combination contains multiple pairs of probes, each probe of a pair overlapping the other by at least 25, 30, 35, 40, 45, 50, 60, 70, 75, or 100 nucleotides. In some embodiments, the overlapping sequence may be designed to be complementary to a genomic region of interest (or cfDNA derived from the genomic region of interest), or to a sequence having homology to a region of interest or cfDNA. Thus, in some embodiments, at least two probes are complementary to the same sequence in a target genomic region, and a nucleotide fragment corresponding to or derived from the target genomic region can be bound and pulled down by at least one of the probes. Other levels of tiling are possible, such as 3x tiling, 4x tiling, etc., where each nucleotide in a target region can bind to more than two probes.

In one embodiment, each base in a target genomic region is overlapped by exactly two probes, as depicted in FIG. 1B. Probes that extend beyond a target genomic region in both directions can be used to pull down cfDNA fragments that comprise a portion of the target genomic region and DNA sequences adjacent to the target genomic region. In some cases, even relatively small target regions can be targeted by three probes (see FIG. 1A). A probe set comprising three or more probes can optionally be used to capture a larger genomic region (see FIG. 1B). In some embodiments, subsets of the probes collectively extend across an entire genomic region (e.g., may be complementary to unconverted or converted fragments from the genomic region). A tiled probe set optionally includes probes that collectively provide at least two probes overlapping each nucleotide in the genomic region. This ensures that cfDNA fragments that include only a small portion of a target genomic region at one end will have substantial overlap with at least one probe that extends into the adjacent non-target genomic region, providing efficient capture.

For example, it can be ensured that a 100bp cfDNA fragment comprising 30 nucleotides (nt) of the target genomic region has an overlap of at least 65bp with at least one of the overlapping probes. Other levels of tiling are possible. For example, to increase target size and add more probes to a detection combination, probes can be designed to expand a 30bp target region by at least 70bp, 65bp, 60bp, 55bp, or 50bp. To capture any fragment that overlaps the target region at all (even by only 1bp), the probes can be designed to extend beyond the ends of the target region on both sides.
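
The tiling scheme can be sketched as a simple interval layout. The sketch below is illustrative only and assumes half-open genomic intervals, a fixed probe length that is divisible by the tiling level, and evenly spaced probe starts; `tile_probes`, `coverage`, `best_overlap`, and the `pad` parameter are hypothetical names.

```python
def tile_probes(target_start, target_end, probe_len=120, tiling=2, pad=0):
    """Lay out probes so every base of the (padded) target lies under
    exactly `tiling` probes. Intervals are half-open: [start, end)."""
    if probe_len % tiling != 0:
        raise ValueError("probe_len must be divisible by tiling")
    step = probe_len // tiling
    lo, hi = target_start - pad, target_end + pad
    start = lo - (probe_len - step)  # first probe ends one step past `lo`
    probes = []
    while start < hi:
        probes.append((start, start + probe_len))
        start += step
    return probes


def coverage(probes, pos):
    """Number of probes overlapping a single base position."""
    return sum(s <= pos < e for s, e in probes)


def best_overlap(probes, frag_start, frag_end):
    """Largest overlap (bp) between a fragment and any single probe."""
    return max(max(0, min(e, frag_end) - max(s, frag_start)) for s, e in probes)
```

For a 30bp target at positions 1000 to 1030 and 120bp probes at 2x tiling, this layout yields two probes, (940, 1060) and (1000, 1120), both of which extend beyond the target so that fragments reaching into flanking sequence still overlap a probe substantially.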

The probes are designed to analyze the methylation status of genomic regions of interest (e.g., of a human or other organism) suspected of being associated with the presence or absence of cancer in general, the presence or absence of a particular type of cancer, the stage of cancer, or the presence or absence of other types of disease.

Further, the probes are designed to efficiently bind and pull down cfDNA fragments containing a genomic region of interest. In some embodiments, the probes are designed to cover overlapping portions of a target region, such that the probes are "tiled" in coverage and each probe at least partially overlaps in coverage with another probe in the library. In such embodiments, the detection combination comprises a plurality of probe pairs, each pair comprising at least two probes that overlap each other by an overlapping sequence of at least 25, 30, 35, 40, 45, 50, 60, 70, 75, or 100 nucleotides. In some embodiments, the overlapping sequence may be designed to be complementary to a target genomic region (or a converted version of a target genomic region), such that a nucleotide fragment derived from or comprising the target genomic region may be bound and pulled down by at least one of the probes. In addition, probes can be designed to cover both strands of a double-stranded cfDNA sequence.

In one embodiment, the smallest genomic region of interest is 30bp or 31bp. When a new target region is added to the detection combination (based on the greedy selection described above), the new 30bp target region can be centered on a particular CpG site of interest. Each edge of the new target is then checked to see whether it is close enough to other targets that they can be merged. This is based on a "merge distance" parameter, which can default to 200bp but can be adjusted. This allows target regions that are close but distinct to be enriched with overlapping probes. Depending on whether a sufficiently close target exists to the left or right of the new target, the new target may be merged with nothing (increasing the number of targets in the detection combination by one), with only one existing target, either to the left or to the right (leaving the number of targets unchanged), or with existing targets on both the left and the right (decreasing the number of targets by one).
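
The three merge outcomes described above can be sketched as follows. This is a minimal illustration that assumes half-open intervals, treats "close enough" as a gap of at most the merge distance, and assumes that existing targets are already farther apart from one another than the merge distance; `add_target` and its parameters are hypothetical names.

```python
def add_target(targets, cpg_pos, size=30, merge_distance=200):
    """Add a `size`-bp region centered on `cpg_pos` to a list of
    (start, end) targets, merging it with any sufficiently close target."""
    start = cpg_pos - size // 2
    end = start + size
    kept = []
    for s, e in sorted(targets):
        # Gap between the new region and this target (negative if overlapping).
        gap = max(s - end, start - e)
        if gap <= merge_distance:
            start, end = min(start, s), max(end, e)  # absorb the neighbor
        else:
            kept.append((s, e))
    kept.append((start, end))
    return sorted(kept)
```

Adding a CpG at position 1000 to targets `[(100, 130), (500, 530)]` creates a third target; adding one at position 300 instead bridges both neighbors into the single target `(100, 530)`, reducing the target count by one.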

Methods for selecting genomic regions of interest:

In another aspect, methods are provided for selecting target genomic regions for detecting cancer and/or a TOO. The target genomic regions may be used to design and manufacture probes for a cancer assay detection combination. The methylation status of DNA or cfDNA molecules corresponding to or derived from the genomic regions of interest can be screened using the cancer assay detection combination. Alternative methods, such as WGBS or other methods known in the art, may also be applied to detect the methylation state of DNA molecules or fragments corresponding to or derived from the genomic regions of interest.

Sample treatment:

FIG. 7A is a flowchart of a process 100 for processing a nucleic acid sample and generating methylation state vectors for DNA fragments, according to one embodiment. The method includes, but is not limited to, the following steps. For example, any step of the method may include a quantitation sub-step for quality control, or other laboratory test procedures known to those of ordinary skill in the art.

In step 105, a nucleic acid sample (DNA or RNA) is extracted from a subject. In the present disclosure, DNA and RNA may be used interchangeably unless otherwise indicated. That is, the embodiments described herein may be applicable to both DNA and RNA types of nucleic acid sequences. However, the examples described herein focus on DNA for the sake of brevity and explanation. The sample may be any subset of the human genome, including the whole genome. The sample may include blood, plasma, serum, urine, feces, saliva, other types of bodily fluids, or any combination thereof. In some embodiments, methods for drawing a blood sample (e.g., a syringe or finger stick) may be less invasive than procedures for obtaining a tissue biopsy, which may require surgery. The extracted sample may comprise cfDNA and/or ctDNA. In healthy individuals, the human body can naturally clear cfDNA and other cellular debris. If a subject has a cancer or disease, cfDNA and/or ctDNA in an extracted sample may be present at a level sufficient to detect the cancer or disease.

In step 110, the cfDNA fragments are treated to convert unmethylated cytosines to uracils. In some embodiments, the method uses a bisulfite treatment of the DNA, which converts unmethylated cytosines to uracils without converting methylated cytosines. For example, a commercial kit such as the EZ DNA Methylation™-Gold kit, the EZ DNA Methylation™-Direct kit, or the EZ DNA Methylation™-Lightning kit (available from Zymo Research Corp (Irvine, CA)) is used for the bisulfite conversion. In another embodiment, the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic reaction. For example, the conversion can be performed using a commercially available kit for converting unmethylated cytosines to uracils, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).

In step 115, a sequencing library is prepared. In a first step, a ssDNA adaptor is added to the 3'-OH end of a bisulfite-converted ssDNA molecule using a ssDNA ligation reaction. In some embodiments, the ssDNA ligation reaction uses CircLigase II (Epicentre) to ligate the ssDNA adaptor to the 3'-OH end of a bisulfite-converted ssDNA molecule, where the 5' end of the adaptor is phosphorylated and the bisulfite-converted ssDNA is dephosphorylated (i.e., the 3' end has a hydroxyl group). In another embodiment, the ssDNA ligation reaction uses a thermostable 5' AppDNA/RNA ligase (available from New England Biolabs (Ipswich, MA)) to ligate the ssDNA adaptor to the 3'-OH end of the bisulfite-converted ssDNA molecule. In this example, the first UMI adaptor is adenylated at the 5' end and blocked at the 3' end. In another embodiment, the ssDNA ligation reaction uses T4 RNA ligase (available from New England Biolabs) to ligate the ssDNA adaptor to the 3'-OH end of the bisulfite-converted ssDNA molecule. In a second step, a second strand of DNA is synthesized in an extension reaction. For example, an extension primer that hybridizes to a primer sequence included in the ssDNA adaptor is used in a primer extension reaction to form a double-stranded bisulfite-converted DNA molecule. Optionally, in one embodiment, the extension reaction uses an enzyme that is capable of reading through uracil residues in the bisulfite-converted template strand. Optionally, in a third step, a dsDNA adaptor is added to the double-stranded bisulfite-converted DNA molecule. Finally, the double-stranded bisulfite-converted DNA is amplified to add sequencing adaptors. For example, PCR amplification using a forward primer comprising a P5 sequence and a reverse primer comprising a P7 sequence is used to add the P5 and P7 sequences to the bisulfite-converted DNA. Optionally, unique molecular identifiers (UMIs) may be added to the nucleic acid molecules (e.g., DNA molecules) via adaptor ligation during library preparation. UMIs are short nucleic acid sequences (e.g., 4 to 10 base pairs) that are added to the ends of DNA fragments during adaptor ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads derived from a particular DNA fragment. During PCR amplification following adaptor ligation, the UMIs are replicated along with the attached DNA fragments, providing a way to identify sequence reads that came from the same original fragment in downstream analysis.
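
How UMIs let downstream analysis recognize reads from the same original fragment can be sketched as a grouping step. This is an illustrative sketch only: keying each read family on the UMI together with the alignment start position is an assumption made here for the example, and the read tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def group_by_umi(reads):
    """Group reads that share a UMI and alignment start position.

    `reads` is an iterable of (umi, start_position, read_id) tuples.
    Reads in one group are presumed to be PCR copies of one fragment.
    """
    families = defaultdict(list)
    for umi, start, read_id in reads:
        families[(umi, start)].append(read_id)
    return dict(families)


example_reads = [
    ("ACGT", 100, "r1"),
    ("ACGT", 100, "r2"),  # same UMI and position as r1: same original fragment
    ("TTAG", 100, "r3"),  # same position, different UMI: different fragment
    ("ACGT", 250, "r4"),
]
```

In this example, `r1` and `r2` fall into one family, while `r3` and `r4` each form their own.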

In step 120, DNA sequences of interest may be enriched from the library. This is used, for example, when a targeted detection combination assay is performed on the samples. During enrichment, hybridization probes (also referred to herein as "probes") are used to target and pull down nucleic acid fragments that are informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin). For a given workflow, the probes may be designed to bind (or hybridize) to a target (complementary) strand of DNA or RNA. The target strand may be a "positive" strand (e.g., a strand that is transcribed into mRNA and subsequently translated into a protein) or a complementary "negative" strand. The probes may range in length from 10s to 100s or 1000s of base pairs. In addition, the probes may cover overlapping portions of a target region.

After hybridization in step 120, the hybridized nucleic acid fragments are captured and may also be amplified using PCR (enrichment 125). For example, the target sequences can be enriched to obtain enriched sequences, which can then be sequenced. In general, any method known in the art can be used to isolate and enrich for target nucleic acids hybridized to the probes. For example, as is well known in the art, a biotin moiety may be added to the 5' ends of the probes (i.e., the probes are biotinylated) to facilitate isolation of target nucleic acids hybridized to the probes using a streptavidin-coated surface (e.g., streptavidin-coated beads).

In step 130, sequence reads are generated from the enriched DNA sequences. Sequencing data can be obtained from the enriched DNA sequences by methods known in the art. For example, the methods may include Next Generation Sequencing (NGS) techniques, including sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.

In step 140, methylation state vectors are generated from the sequence reads. To do so, a sequence read is aligned to a reference genome. The reference genome provides the context for where in the human genome the cfDNA fragment originates. In a simplified example, the sequence read is aligned such that its three CpG sites correspond to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for convenience of description). After alignment, information is available on both the methylation status of all CpG sites on the cfDNA fragment and the positions in the human genome to which those CpG sites map. With the methylation status and location, a methylation state vector can be generated for the cfDNA fragment.
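
The simplified example with CpG sites 23, 24, and 25 can be sketched as follows. This is an illustration under stated assumptions, not the described system's implementation: the read is assumed to be a converted, forward-strand read already aligned without gaps, a C at a CpG cytosine position is taken as methylated ("M"), a T as unmethylated ("U"), and anything else as indeterminate ("I"). The `cpg_index` mapping and the function name are hypothetical.

```python
def methylation_state_vector(read, read_start, cpg_index):
    """Return [(cpg_site_id, state), ...] for the CpG sites a read covers.

    `cpg_index` maps the genomic position of each CpG cytosine to a CpG
    site identifier; `read_start` is the genomic position of read[0].
    """
    vector = []
    for offset, base in enumerate(read):
        site = cpg_index.get(read_start + offset)
        if site is None:
            continue  # not a CpG cytosine position
        state = {"C": "M", "T": "U"}.get(base, "I")
        vector.append((site, state))
    return vector


# Three CpG sites with arbitrary identifiers 23, 24 and 25.
cpg_index = {105: 23, 112: 24, 118: 25}
read = "TTGATCGATTAGTGTTTACG"  # aligned starting at position 100
```

Here the read carries C at positions 105 and 118 and T at position 112, so the resulting vector is M at site 23, U at site 24, and M at site 25.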

Generation of the data structure:

FIG. 3A is a flow diagram describing a process 300 for generating a data structure for a healthy control group, according to one embodiment. To create a healthy control group data structure, the analysis system obtains information about the methylation status of CpG sites on sequence reads derived from DNA molecules or fragments from healthy subjects. The methods provided herein for creating a healthy control group data structure may be similarly performed on subjects with cancer, subjects with cancer of a TOO, subjects with cancer of a known cancer type, or subjects with another known disease state. A methylation state vector is generated for each DNA molecule or fragment, for example, by the process 100.

The analysis system subdivides 310 the methylation state vector of each cfDNA fragment into strings of CpG sites. In one embodiment, the analysis system subdivides 310 the methylation state vector so that the resulting strings are all no longer than a given length. For example, subdividing a methylation state vector of length 11 into strings of length less than or equal to 3 results in 9 strings of length 3, 10 strings of length 2, and 11 strings of length 1. In another example, subdividing a methylation state vector of length 7 into strings of length less than or equal to 4 results in 4 strings of length 4, 5 strings of length 3, 6 strings of length 2, and 7 strings of length 1. If the methylation state vector generated from a DNA fragment is shorter than or the same length as the specified string length, the methylation state vector can be converted to a single string containing all CpG sites of the vector.
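
The string counts in the two examples above can be reproduced with a short sliding-window sketch. It assumes a methylation state vector is represented as a starting CpG site index plus a string of "M"/"U" states over consecutive CpG sites; that representation and the name `subdivide` are assumptions made for illustration.

```python
def subdivide(start_site, states, max_len):
    """Split a methylation state vector into strings of length <= max_len.

    Returns (first_cpg_site, state_string) pairs. A vector no longer than
    max_len is kept as a single string.
    """
    if len(states) <= max_len:
        return [(start_site, states)]
    return [
        (start_site + i, states[i:i + k])
        for k in range(max_len, 0, -1)
        for i in range(len(states) - k + 1)
    ]


def count_by_length(strings):
    """Tally how many strings of each length were produced."""
    counts = {}
    for _, s in strings:
        counts[len(s)] = counts.get(len(s), 0) + 1
    return counts
```

A vector of length n yields n - k + 1 strings of each length k up to the maximum, which gives 9, 10, and 11 strings for n = 11 and a maximum length of 3.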

The analysis system tallies 320 the strings by counting, for each possible CpG site and each possibility of methylation states in the vector, the number of strings present in the control group that have the specified CpG site as the first CpG site in the string and that possibility of methylation states. For example, there are 2^3 or 8 possible string configurations for a string of length 3 at a given CpG site. For each CpG site, the analysis system tallies 320 how many times each possible methylation state vector occurs in the control group. This may involve tallying the following quantities for each starting CpG site x in the reference genome: <M_x, M_{x+1}, M_{x+2}>, <M_x, M_{x+1}, U_{x+2}>, ..., <U_x, U_{x+1}, U_{x+2}>. The analysis system creates 330 a data structure that stores the tallied counts for each starting CpG site and each string possibility.
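
One possible shape for such a data structure is a counter keyed by starting CpG site and state string. The sketch below is a minimal illustration, assuming vectors are given as a starting CpG site index plus a string of "M"/"U" states over consecutive sites and that vectors no longer than the maximum string length are counted as a single string; the names are hypothetical.

```python
from collections import Counter


def tally_strings(vectors, max_len=3):
    """Count control-group strings keyed by (starting CpG site, states).

    `vectors` is an iterable of (start_site, state_string) pairs.
    """
    counts = Counter()
    for start_site, states in vectors:
        if len(states) <= max_len:
            counts[(start_site, states)] += 1  # short vector: one string
            continue
        for k in range(1, max_len + 1):
            for i in range(len(states) - k + 1):
                counts[(start_site + i, states[i:i + k])] += 1
    return counts


control_vectors = [(23, "MMU"), (23, "MMU"), (23, "MUU"), (22, "UMMU")]
control_counts = tally_strings(control_vectors)
```

In this toy control group, the string <M_23, M_24, U_25> is observed three times: twice as a whole vector and once as a substring of the longer vector starting at site 22.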

There are several benefits to setting an upper limit on the string length. First, depending on the maximum string length, the size of the data structure created by the analysis system may increase substantially. For example, a maximum string length of 4 means that up to 2^4 numbers need to be tallied for each CpG site. Increasing the maximum string length to 5 doubles the number of possible methylation states to tally. Reducing the string size helps reduce the computational and data storage burden of the data structure. In some embodiments, the string size is 3. In some embodiments, the string size is 4. A second reason to limit the maximum string size is to avoid overfitting downstream models. If a long string of CpG sites does not have a strong biological effect on the outcome (e.g., a prediction of anomalousness that is predictive of the presence of cancer), calculating probabilities based on long strings of CpG sites can be problematic, because it requires a large amount of data that may not be available, and the counts would therefore be too sparse for a model to perform properly. For example, calculating the probability of an abnormality/cancer conditioned on the previous 100 CpG sites would require counts of strings of length 100 in the data structure, ideally with some exactly matching the previous 100 methylation states. If only sparse counts of strings of length 100 are available, the data will be insufficient to determine whether a given string of length 100 in a test sample is abnormal.

Verification of the data structure:

Once the data structure has been created, the analysis system may seek to validate 340 the data structure and/or any downstream models that use the data structure.

A first type of validation ensures that potentially cancerous samples are removed from the healthy control group so as not to affect the purity of the control group. This type of validation checks consistency within the control group's data structure. For example, a healthy control group may contain a sample from an individual who has not been diagnosed with cancer but whose sample contains many anomalously methylated fragments. The analysis system may perform various calculations to determine whether to exclude data from a subject who appears to have undiagnosed cancer.

A second type of verification checks the probabilistic model used to calculate p-values against the counts from the data structure itself (i.e., from the healthy control group). A procedure for p-value calculation is described below in connection with FIG. 5. Once the analysis system generates a p-value for each of the methylation state vectors in the validation set, the analysis system constructs a cumulative density function (CDF) from the p-values. With the CDF, the analysis system can perform various calculations to validate the data structure of the control group. One test exploits the fact that the CDF should ideally be at or below the identity function, such that CDF(x) ≤ x. Conversely, a CDF above the identity function reveals some deficiency in the probabilistic model used for the control group's data structure. For example, if 1/100 of the fragments have a p-value score of 1/1000, meaning that CDF(1/1000) = 1/100 > 1/1000, then the second type of verification fails, indicating a problem with the probabilistic model. See, for example, U.S. patent application No. 16/325,602, published as U.S. publication No. 2019/0287652, which is incorporated herein by reference in its entirety.
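
The identity-function test can be sketched with an empirical CDF. This is a minimal sketch, assuming the CDF is estimated as the fraction of p-values at or below x and that the comparison is made at the observed p-values only (the step function can exceed the identity line only at those points); the function names and the `tolerance` parameter are hypothetical.

```python
import bisect


def empirical_cdf(p_values):
    """Return a function x -> fraction of p-values that are <= x."""
    ordered = sorted(p_values)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def cdf_within_identity(p_values, tolerance=0.0):
    """Check CDF(x) <= x (plus an optional tolerance) at every observed x."""
    cdf = empirical_cdf(p_values)
    return all(cdf(x) <= x + tolerance for x in set(p_values))


well_behaved = [i / 10 for i in range(1, 11)]            # 0.1, 0.2, ..., 1.0
suspect = [0.001] + [i / 100 for i in range(2, 101)]     # 1 in 100 at p = 0.001
```

The `suspect` list reproduces the failing case in the text: one fragment in 100 has a p-value of 1/1000, so CDF(1/1000) = 1/100, which lies above the identity function.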

A third type of verification uses a healthy set of verification samples that is separate from the samples used to construct the data structure. The third type of verification tests whether the data structure is properly constructed and the model works. One exemplary procedure for performing this type of verification is described below in connection with FIG. 3B. The third type of verification can quantify how well the healthy control group generalizes to the distribution of healthy samples. If the third type of verification fails, the healthy control group does not generalize well to the healthy distribution.

A fourth type of verification tests with samples from a non-healthy verification group. The analysis system calculates p-values for the non-healthy verification group and constructs the CDF. For a non-healthy verification group, the analysis system expects to see CDF(x) > x for at least some samples, in other words, the opposite of what is expected for the healthy control group and the healthy verification group in the second and third types of verification. If the fourth type of verification fails, this indicates that the model is not properly identifying the anomaly that it was designed to identify.

FIG. 3B is a flow chart describing the additional step 340 of validating the data structure for the control group of FIG. 3A, according to one embodiment. In this embodiment of the step 340 of validating the data structure, the analysis system performs the fourth type of verification test as described above, which uses a verification group whose composition of subjects, samples, and/or fragments is assumed to be similar to that of the control group. For example, if the analysis system selects healthy subjects without cancer for the control group, the analysis system also uses healthy subjects without cancer in the verification group.

The analysis system takes the validation set and generates a set of methylation state vectors via the process 100, as described in FIG. 3A. The analysis system performs a p-value calculation for each methylation state vector from the validation set. The p-value calculation procedure is further described in conjunction with FIGS. 4 to 5. For each possibility of the methylation state vector, the analysis system calculates a probability from the control group's data structure. Once the probabilities for the possibilities of the methylation state vector are calculated, the analysis system calculates 350 a p-value score for the methylation state vector based on the calculated probabilities. The p-value score represents the expectation of finding that particular methylation state vector, or other possible methylation state vectors with even lower probabilities, in the control group. Thus, a low p-value score generally corresponds to a methylation state vector that is relatively unexpected compared with other methylation state vectors in the control group, and a high p-value score generally corresponds to a methylation state vector that is relatively more expected compared with other methylation state vectors found in the control group. Once the analysis system generates p-value scores for the methylation state vectors in the validation set, the analysis system constructs 360 a cumulative density function (CDF) from the p-value scores of the validation set. The analysis system verifies 370 the consistency of the CDF in the fourth type of verification test as described above.

Abnormally methylated fragments:

According to one embodiment, which is outlined in FIG. 4, aberrantly methylated fragments having aberrant methylation patterns in a sample from a cancer patient, a subject with a cancer of a given TOO, a subject with an unknown cancer, or a subject with another known disease state are selected as target genomic regions. An exemplary procedure 440 for selecting aberrantly methylated fragments is visually illustrated in FIG. 5 and further described below in the description of FIG. 4. In procedure 400, the analysis system generates 100 methylation state vectors from the cfDNA fragments of the sample. The analysis system processes each methylation state vector as follows.

For a given methylation state vector, the analysis system enumerates 410 all possibilities of methylation state vectors having the same starting CpG site and the same length (i.e., the same set of CpG sites) as the methylation state vector. Because each methylation state may be methylated or unmethylated, there are only two possible states at each CpG site, so the count of distinct possibilities is a power of 2: a methylation state vector of length n is associated with 2^n possible methylation state vectors.

The analysis system calculates 420 the probability of observing each possible methylation state vector for the identified starting CpG site and methylation state vector length by accessing the data structure of the healthy control group. In one embodiment, calculating the probability of observing a given possibility uses a Markov chain probability to model the joint probability calculation, which is described in more detail below with reference to FIG. 5. In other embodiments, a calculation method other than the Markov chain probability is used to determine the probability of observing each possible methylation state vector.

The analysis system calculates 430 a p-value score for the methylation state vector using the probabilities calculated for each possibility. In one embodiment, this includes identifying the calculated probability corresponding to the possibility that matches the methylation state vector under consideration. Specifically, this is the possibility having the same set of CpG sites, or equivalently the same starting CpG site and length, as the methylation state vector. The analysis system then sums the calculated probabilities of all possibilities whose probability is less than or equal to the identified probability to generate the p-value score.

This p-value score represents the probability of observing, in the healthy control group, the methylation state vector of the fragment or other methylation state vectors that are even less likely. Thus, a low p-value score generally corresponds to a methylation state vector that is rare in a healthy individual, and results in the fragment being flagged as abnormally methylated relative to the healthy control group. A high p-value score generally corresponds to a methylation state vector that is, in a relative sense, expected to be present in a healthy subject. For example, if the healthy control group is a non-cancer group, a low p-value score indicates that the fragment is abnormally methylated relative to the non-cancer group, and thus may indicate the presence of cancer in the test subject.
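The enumeration and summation described above can be sketched in a few lines. This is a minimal illustration, not the described system's implementation: the function names and the toy independent-site probability model (`toy_prob`, with an assumed methylation rate of 0.9) are hypothetical stand-ins for probabilities that would come from the control-group data structure.

```python
from itertools import product

def p_value_score(observed, prob_of):
    """Sum the probabilities of all same-length vectors that are
    no more likely than the observed vector."""
    p_obs = prob_of(observed)
    total = 0.0
    for candidate in product("MU", repeat=len(observed)):  # 2**n possibilities
        p = prob_of(candidate)
        if p <= p_obs:
            total += p
    return total

def toy_prob(vector, p_m=0.9):
    """Hypothetical model: every site is methylated independently with rate p_m."""
    out = 1.0
    for state in vector:
        out *= p_m if state == "M" else 1.0 - p_m
    return out

score_common = p_value_score(tuple("MMMM"), toy_prob)  # most likely vector
score_rare = p_value_score(tuple("UUUU"), toy_prob)    # least likely vector
```

Under this toy model the fully methylated vector receives a score near 1, while the fully unmethylated vector receives a score equal to its own probability, so only the latter would be flagged at a threshold such as 0.001.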

As above, the analysis system calculates a p-value score for each of a number of methylation state vectors, each of which represents a cfDNA fragment in the test sample. To identify which of the fragments are aberrantly methylated, the analysis system can filter 440 the set of methylation state vectors based on their p-value scores. In one embodiment, filtering is performed by comparing each p-value score to a threshold and retaining only those fragments whose score is below the threshold. The threshold p-value score may be on the order of 0.1, 0.01, 0.001, 0.0001, or the like.

P-value score calculation:

FIG. 5 is a depiction 500 of an exemplary p-value score calculation, according to an embodiment. To calculate a p-value score for a detected methylation state vector 505, the analysis system takes the detected methylation state vector 505 and enumerates 410 the possibilities of methylation state vectors. In this illustrative example, the detected methylation state vector 505 is <M23, M24, M25, U26>. Because the length of the detected methylation state vector 505 is 4, there are 2^4 possibilities for a methylation state vector covering CpG sites 23 through 26. In general, the number of possible methylation state vectors is 2^n, where n is the length of the detected methylation state vector or, alternatively, the length of the sliding window (described further below).

The analysis system calculates 420 probabilities 515 for the enumerated possibilities of methylation state vectors. Since methylation is conditionally dependent on the methylation state of nearby CpG sites, one way to calculate the probability of observing a given possible methylation state vector is to use a Markov chain model. In general, a methylation state vector such as <S1, S2, ..., Sn> (where S represents the methylation state, whether methylated (denoted M), unmethylated (denoted U), or indeterminate (denoted I)) has a joint probability that can be expanded using the chain rule of probabilities to:

$$P(\langle S_1, S_2, \ldots, S_n \rangle) = P(S_n \mid S_1, \ldots, S_{n-1}) \cdot P(S_{n-1} \mid S_1, \ldots, S_{n-2}) \cdots P(S_2 \mid S_1) \cdot P(S_1)$$

A Markov chain model can be used to make the calculation of the conditional probabilities of each possibility more efficient. In one embodiment, the analysis system selects a Markov chain order k, corresponding to how many prior CpG sites in the vector (or window) are considered in the conditional probability calculation, such that the conditional probability is modeled as

$$P(S_n \mid S_1, \ldots, S_{n-1}) \sim P(S_n \mid S_{n-k-2}, \ldots, S_{n-1}).$$

To calculate the Markov-modeled probability for each possible methylation state vector, the analysis system accesses the data structure of the control group, in particular the counts of the various strings of CpG sites and states. To calculate P(M_n | S_{n-k-2}, ..., S_{n-1}), the analysis system takes the stored count of strings in the data structure matching <S_{n-k-2}, ..., S_{n-1}, M_n> and divides it by the sum of the stored counts of strings matching <S_{n-k-2}, ..., S_{n-1}, M_n> and <S_{n-k-2}, ..., S_{n-1}, U_n>. Thus, P(M_n | S_{n-k-2}, ..., S_{n-1}) is a calculated ratio having the form:

$$P(M_n \mid S_{n-k-2}, \ldots, S_{n-1}) = \frac{\#\langle S_{n-k-2}, \ldots, S_{n-1}, M_n \rangle}{\#\langle S_{n-k-2}, \ldots, S_{n-1}, M_n \rangle + \#\langle S_{n-k-2}, \ldots, S_{n-1}, U_n \rangle}$$

The calculation may additionally smooth the counts by applying a prior distribution. In one embodiment, the prior distribution is a uniform prior, as in Laplace smoothing. As an example, a constant is added to the numerator of the above equation and another constant (e.g., twice the constant in the numerator) is added to the denominator of the above equation. In other embodiments, an algorithmic technique such as Kneser-Ney smoothing is used.
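As a rough illustration of the count ratio with Laplace smoothing, the sketch below assumes a hypothetical control-group data structure: a dictionary keyed by `(first CpG site, state string)` holding observed counts. The key layout, the function names, and the choice to condition each site on at most the k immediately preceding sites are assumptions made for the example, and the counts are invented.

```python
def conditional_prob(counts, first_site, context, state, alpha=1.0):
    """P(state | context) from stored string counts, with a uniform prior:
    alpha is added to the numerator and 2 * alpha to the denominator."""
    n_m = counts.get((first_site, context + "M"), 0)
    n_u = counts.get((first_site, context + "U"), 0)
    n_state = n_m if state == "M" else n_u
    return (n_state + alpha) / (n_m + n_u + 2 * alpha)

def markov_probability(counts, start_site, vector, k=1, alpha=1.0):
    """Joint probability of `vector` as a product of conditionals, each
    conditioned on at most the k preceding sites (assumed indexing)."""
    prob = 1.0
    for i, state in enumerate(vector):
        lo = max(0, i - k)
        prob *= conditional_prob(counts, start_site + lo, vector[lo:i], state, alpha)
    return prob

# Invented counts for strings starting at CpG sites 23 and 24.
counts = {
    (23, "M"): 9, (23, "U"): 1,
    (23, "MM"): 8, (23, "MU"): 1, (23, "UM"): 1, (23, "UU"): 0,
    (24, "MM"): 7, (24, "MU"): 2, (24, "UM"): 0, (24, "UU"): 1,
}
p_mmm = markov_probability(counts, 23, "MMM", k=1)
```

Because every smoothed conditional is normalized over the two states, the probabilities of all 2^n possibilities still sum to 1, and strings that were never observed keep a small non-zero probability.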

In the illustration, the formula expressed above is applied to the detected methylation state vector 505 covering sites 23-26. Once the probabilities 515 have been calculated, the analysis system calculates 430 a p-value score 525 by summing the probabilities that are less than or equal to the probability of the possibility that matches the detected methylation state vector 505.

In one embodiment, the computational burden of calculating the probabilities and/or p-value scores may be further reduced by caching at least some of the calculations. For example, the analysis system may cache, in temporary or persistent memory, the calculated probabilities of the possibilities of methylation state vectors (or windows thereof). Caching the probabilities allows p-value scores to be calculated efficiently, without recalculating the underlying probabilities, when other fragments have the same CpG sites. Likewise, the analysis system can calculate a p-value score for each of the possibilities of methylation state vectors associated with a set of CpG sites from the vector (or window thereof). The analysis system may cache these p-value scores for use in determining the p-value scores of other fragments that include the same CpG sites. In general, the p-value scores of the possibilities of methylation state vectors having the same CpG sites can be used to determine the p-value score of a different one of the possibilities from the same set of CpG sites.
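One way to picture the caching is to memoize the probability table per CpG window, so that any fragment covering the same sites reuses it. The sketch below uses Python's `functools.lru_cache`; the per-site rates in `SITE_RATE` and the independent-site probability model are hypothetical placeholders for the control-group lookup.

```python
from functools import lru_cache
from itertools import product

# Hypothetical per-site methylation rates in the control group (illustrative only).
SITE_RATE = {23: 0.9, 24: 0.8, 25: 0.9, 26: 0.7}

@lru_cache(maxsize=None)
def window_table(start, length):
    """Probability of every possible vector over sites start..start+length-1.
    Computed once per window and then served from the cache."""
    table = {}
    for vec in product("MU", repeat=length):
        p = 1.0
        for offset, state in enumerate(vec):
            rate = SITE_RATE[start + offset]
            p *= rate if state == "M" else 1.0 - rate
        table[vec] = p
    return table

def cached_p_value(start, vector):
    table = window_table(start, len(vector))  # reused across fragments on the same sites
    p_obs = table[tuple(vector)]
    return sum(p for p in table.values() if p <= p_obs)

p_first = cached_p_value(23, "MMMU")
p_second = cached_p_value(23, "UUUU")  # same CpG sites, so the table is a cache hit
```

The second call does not re-enumerate the 2^4 possibilities; only the final comparison and summation are repeated.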

Sliding window:

In one embodiment, the analysis system uses 435 a sliding window to determine the possibilities of methylation state vectors and to calculate p-values. Rather than enumerating possibilities and calculating p-values for the entire methylation state vector, the analysis system does so only for a window of consecutive CpG sites, where the window is shorter in length (in CpG sites) than at least some fragments (otherwise, the window would serve no purpose). The window length may be static, user-determined, dynamic, or otherwise selected.

When calculating a p-value for a methylation state vector that is longer than the window, the window identifies a contiguous set of CpG sites from the vector, starting with the first CpG site in the vector. The analysis system calculates a p-value score for the window that includes the first CpG site. The analysis system then "slides" the window to the second CpG site in the vector and calculates another p-value score for the second window. Thus, for a window of size l and a methylation state vector of length m, each methylation state vector yields m-l+1 p-value scores. Upon completion of the p-value calculation for each portion of the vector, the lowest p-value score from all sliding windows is taken as the overall p-value score for the methylation state vector. In another embodiment, the analysis system aggregates the p-value scores of the windows of the methylation state vector to produce an overall p-value score.

The use of the sliding window helps to reduce the number of enumerated possibilities of methylation state vectors and the corresponding probability calculations that would otherwise need to be performed. An example probability calculation is shown in FIG. 5, but in general the number of possible methylation state vectors grows exponentially, as a power of 2, with the size of the methylation state vector. To give a realistic example, a fragment may have upwards of 54 CpG sites. Instead of calculating probabilities for 2^54 (about 1.8 x 10^16) possibilities to generate a single p-value score, the analysis system may use, for example, a window of size 5, which for a vector of 54 CpG sites results in 50 p-value calculations, one for each of the 50 windows of the methylation state vector. Each of the 50 calculations enumerates 2^5 (32) possibilities of the methylation state vector, for a total of 50 x 2^5 (1.6 x 10^3) probability calculations. This is a substantial reduction in the calculations to be performed, with no meaningful loss in the accuracy of identifying anomalous fragments. This additional step may also be applied when validating 340 the control group with the methylation state vectors of the validation group.
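The arithmetic in the example above (54 CpG sites, window of size 5) can be checked with a short sketch that also takes the minimum window score as the overall score. The per-window p-value here uses a hypothetical independent-site model with an assumed methylation rate of 0.9; in the described system the probabilities would come from the control-group data structure.

```python
from itertools import product

def window_p_value(window_vec, p_m=0.9):
    """p-value score of one window under a toy independent-site model (assumption)."""
    def prob(vec):
        out = 1.0
        for state in vec:
            out *= p_m if state == "M" else 1.0 - p_m
        return out
    p_obs = prob(window_vec)
    candidates = product("MU", repeat=len(window_vec))  # 2**window possibilities
    return sum(p for p in map(prob, candidates) if p <= p_obs)

def sliding_window_scores(vector, window):
    """One p-value score per window position: m - l + 1 scores in total."""
    m = len(vector)
    return [window_p_value(vector[i:i + window]) for i in range(m - window + 1)]

vector = "M" * 50 + "UUUU"           # a fragment covering 54 CpG sites
scores = sliding_window_scores(vector, window=5)
overall = min(scores)                # lowest window score is the overall score
enumerated = len(scores) * 2 ** 5    # total possibilities actually enumerated
```

Here `len(scores)` is 50 and `enumerated` is 1600, compared with 2^54 possibilities for the whole vector; the overall score is driven by the final windows, where the unmethylated run makes the pattern unlikely under the toy model.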

Identifying fragments indicative of cancer:

The analysis system identifies 450 a number of DNA fragments indicative of cancer from the filtered set of aberrant methylated fragments.

Hypomethylated and hypermethylated fragments:

According to a first method, the analysis system may identify, from the filtered set of aberrantly methylated fragments, DNA fragments deemed to be hypomethylated or hypermethylated as fragments indicative of cancer. Hypermethylated or hypomethylated fragments can be defined as fragments covering at least a particular number of CpG sites (e.g., more than 3, 4, 5, 6, 7, 8, 9, 10, etc.) that have, respectively, a high percentage of methylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage in the range of 50% to 100%) or a high percentage of unmethylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage in the range of 50% to 100%).

Probability model:

According to a second method described herein, the analysis system identifies fragments indicative of cancer by applying probability models fitted to the methylation patterns of each cancer type and of the non-cancer type. Using the DNA fragments in the genomic regions, the analysis system calculates log likelihood ratios for a sample, considering the various cancer types, with the fitted probability model for each cancer type and the non-cancer type. The analysis system may determine that a DNA fragment is indicative of cancer based on whether at least one of the log likelihood ratios considered relative to the various cancer types is above a threshold.

In one embodiment of partitioning the genome, the analysis system partitions the genome into regions in several stages. In a first stage, the analysis system separates the genome into blocks of CpG sites. A new block begins wherever the separation between two adjacent CpG sites exceeds some threshold, e.g., more than 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, or 1000bp. In a second stage, the analysis system subdivides each block into regions of a particular length, e.g., 500bp, 600bp, 700bp, 800bp, 900bp, 1000bp, 1100bp, 1200bp, 1300bp, 1400bp, or 1500bp. The analysis system may further have adjacent regions overlap by a percentage of the region length, e.g., 10%, 20%, 30%, 40%, 50%, or 60%.
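The two-stage partitioning can be sketched as follows. The function names, the half-open `(start, stop)` region convention, and the default values (500bp gap, 1000bp regions, 50% overlap, all taken from the example ranges above) are choices made for this illustration only.

```python
def split_into_blocks(cpg_positions, max_gap=500):
    """Stage 1: start a new block wherever two adjacent CpG sites
    are separated by more than max_gap base pairs."""
    if not cpg_positions:
        return []
    blocks, current = [], [cpg_positions[0]]
    for prev, pos in zip(cpg_positions, cpg_positions[1:]):
        if pos - prev > max_gap:
            blocks.append(current)
            current = []
        current.append(pos)
    blocks.append(current)
    return blocks

def tile_block(block, region_len=1000, overlap=0.5):
    """Stage 2: cover a block with fixed-length regions; adjacent regions
    overlap by the given fraction of the region length."""
    step = max(1, int(region_len * (1.0 - overlap)))
    first, last = block[0], block[-1]
    regions, pos = [], first
    while True:
        regions.append((pos, min(pos + region_len, last + 1)))  # half-open interval
        if pos + region_len > last:
            break
        pos += step
    return regions

blocks = split_into_blocks([100, 150, 300, 2000, 2100, 3900])
regions = tile_block([1000, 1400, 1800, 2200, 2600, 3000, 3400])
```

With these inputs the six CpG positions fall into three blocks, and the 2400bp block is covered by four regions that each share half their length with the next.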

The analysis system analyzes several sequence reads derived from several DNA fragments for each region. The analysis system can process several samples from tissue and/or high signal cfDNA. High signal cfDNA samples can be determined by a binary classification model, by cancer stage, or by other metrics.

For each cancer type and non-cancer, the analysis system fits a separate probability model for several fragments. In one embodiment, each probability model is a mixture model comprising a combination of mixture components, and each mixture component is an independent site model in which methylation at each CpG site is assumed to be independent of the methylation status at other CpG sites.

In several alternative embodiments, the calculations are performed for each CpG site. Specifically, a first count is determined, which is the number of cancer samples comprising an abnormally methylated DNA fragment overlapping the CpG site (cancer_count), and a second count is determined, which is the total number (sum) of samples in the group containing fragments overlapping the CpG site. Genomic regions may be selected based on these counts, for example, based on a criterion that correlates positively with the number of cancer samples comprising a DNA fragment overlapping the CpG site (cancer_count) and negatively with the total number (sum) of samples in the group containing fragments overlapping the CpG site.

The various types of cancers with different TOOs may be selected from the group consisting of breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of the renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, anal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, liver/bile duct cancer, esophageal cancer, pancreatic cancer, gastric cancer, upper gastrointestinal squamous cell cancer, upper gastrointestinal cancer other than squamous cell cancer, head and neck cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer and lung cancers other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, plasma cell neoplasm, multiple myeloma, myeloid neoplasm, lymphoma, and leukemia.

In some embodiments, classification and labeling of the various cancer types can be performed using classification methods available in the art, such as the International Classification of Diseases for Oncology (ICD-O-3) (code.iarc.fr) or the Surveillance, Epidemiology, and End Results Program (SEER). In other embodiments, the cancer types are classified along three orthogonal codes: (i) topographical codes, (ii) morphological codes, or (iii) behavioral codes. According to the behavioral code, a benign tumor is 0, uncertain behavior is 1, carcinoma in situ is 2, malignant primary site is 3, and malignant metastatic site is 6.

In some embodiments, a cancer TOO, which will be used for staging the detected cancer, may be selected from a group defined by guidelines. For example, see Amin, M.B., Edge, S., Greene, F., Byrd, D.R., Brookland, R.K., Washington, M.K., Gershenwald, J.E., Compton, C.C., Hess, K.R., Sullivan, D.C., Jessup, J.M., Brierley, J.D., Gaspar, L.E., Schilsky, R.L., Balch, C.M., Winchester, D.P., Asare, E.A., Madera, M., Gress, D.M., Meyer, L.R. (Eds.), AJCC Cancer Staging Manual, 8th edition, Springer, 2017, which defines the different cancer stages according to a standard set of criteria. Staging is often the next step in cancer management after cancer detection and diagnosis.

Considering the various cancer types, with the fitted probability model for each cancer type and the non-cancer type or a cancer TOO, the analysis system may further calculate a log likelihood ratio ("R") for a fragment, indicating the likelihood that the fragment is indicative of cancer. The two probabilities in the ratio may be taken from the probability models fitted for each cancer type and the non-cancer type, each probability model being defined to calculate the likelihood of observing a methylation pattern on a fragment given the respective cancer type or the non-cancer type. For example, a separate probability model may be fitted for each of the cancer types and for the non-cancer type.
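To make the log likelihood ratio concrete, the sketch below evaluates a fragment under two mixture models whose components are independent-site models, as described for one embodiment above. The representation of a model as a list of `(weight, {site: methylation probability})` pairs, the parameter values, and the convention that R is the cancer-type log likelihood minus the non-cancer log likelihood are all assumptions made for the example.

```python
import math

def fragment_likelihood(states, sites, model):
    """Likelihood of a fragment under a mixture of independent-site components.
    model: list of (weight, {site: P(methylated)}) pairs (hypothetical layout)."""
    total = 0.0
    for weight, beta in model:
        p = weight
        for site, state in zip(sites, states):
            p *= beta[site] if state == "M" else 1.0 - beta[site]
        total += p
    return total

def log_likelihood_ratio(states, sites, cancer_model, non_cancer_model):
    """R = log P(fragment | cancer type) - log P(fragment | non-cancer)."""
    return (math.log(fragment_likelihood(states, sites, cancer_model))
            - math.log(fragment_likelihood(states, sites, non_cancer_model)))

sites = [101, 102, 103]
cancer_model = [(0.5, {101: 0.9, 102: 0.9, 103: 0.9}),
                (0.5, {101: 0.5, 102: 0.5, 103: 0.5})]
non_cancer_model = [(1.0, {101: 0.1, 102: 0.1, 103: 0.1})]

r_methylated = log_likelihood_ratio("MMM", sites, cancer_model, non_cancer_model)
r_unmethylated = log_likelihood_ratio("UUU", sites, cancer_model, non_cancer_model)
```

With these made-up parameters a fully methylated fragment gets a positive R and would be counted as indicative of the cancer type at any threshold below about 6, while a fully unmethylated fragment gets a negative R.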

Selection of genomic regions indicative of cancer:

In some embodiments, the analysis system may identify 460 genomic regions indicative of cancer. To identify these informative regions, the analysis system calculates an information gain for each genomic region, or more specifically for each CpG site, that describes the ability to discriminate between the various outcomes.

One method for identifying genomic regions capable of distinguishing between cancer and non-cancer types uses a trained classification model that can be applied to the set of aberrantly methylated DNA molecules or fragments corresponding to, or derived from, a cancer or non-cancer group. The trained classification model may be trained to identify any condition of interest that can be identified from the methylation state vectors.

In one embodiment, the trained classification model is a binary classifier trained on abnormal methylation state vectors from cfDNA fragments or genomic sequences obtained from a population of subjects with cancer or a cancer TOO and from a population of healthy subjects without cancer. The binary classifier is then used to classify the probability that a test subject has cancer, a cancer TOO, or no cancer. In other embodiments, different classifiers may be trained using populations of subjects known to have a particular cancer (e.g., breast cancer, lung cancer, prostate cancer, etc.), known to have a cancer of a particular TOO from which the cancer is believed to originate, or known to have different stages of a particular cancer (e.g., breast cancer, lung cancer, prostate cancer, etc.). In these embodiments, the different classifiers may be trained using sequence reads obtained from tumor-cell-enriched samples from a population of subjects known to have a particular cancer (e.g., breast cancer, lung cancer, prostate cancer, etc.). The ability of each genomic region to discriminate between a cancer type and a non-cancer type in the classification model is used to rank the genomic regions from most informative to least informative with respect to classification performance. The analysis system can identify genomic regions from the ranking according to the information gain of classifying between non-cancer and cancer types.

Calculating information gain from hypomethylated and hypermethylated fragments indicative of cancer:

According to one embodiment, using fragments indicative of cancer, the analysis system may train a classifier according to a process 600 illustrated in FIG. 6A. The process 600 accesses two training groups of samples, a non-cancer group and a cancer group, and obtains 605 a non-cancer set of methylation state vectors and a cancer set of methylation state vectors comprising the aberrantly methylated fragments of each group, e.g., via step 440 of procedure 400.

The analysis system determines 610, for each methylation state vector, whether the methylation state vector is indicative of cancer. Here, fragments indicative of cancer may be defined as hypermethylated or hypomethylated fragments if at least some number of CpG sites have a specific state (methylated or unmethylated, respectively) and/or at least a threshold percentage of sites have that specific state (again, methylated or unmethylated, respectively). In one embodiment, a cfDNA fragment is identified as hypermethylated or hypomethylated if the fragment overlaps at least 5 CpG sites and at least 80%, 90%, or 100% of its CpG sites are methylated or at least 80%, 90%, or 100% of its CpG sites are unmethylated, respectively.
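A minimal sketch of the labeling rule in this embodiment (at least 5 CpG sites, at least 80% in one state) is shown below. The function name, the string encoding of the vector, and the decision to ignore indeterminate sites (marked "I") when counting are assumptions for illustration.

```python
def methylation_label(vector, min_sites=5, min_fraction=0.8):
    """Return "hyper", "hypo", or None for a methylation state vector.
    Sites marked "I" (indeterminate) are ignored here, which is an assumption."""
    n_m = vector.count("M")
    n_u = vector.count("U")
    called = n_m + n_u
    if called < min_sites:
        return None                      # too few CpG sites to label
    if n_m / called >= min_fraction:
        return "hyper"                   # predominantly methylated
    if n_u / called >= min_fraction:
        return "hypo"                    # predominantly unmethylated
    return None

labels = [methylation_label(v) for v in ("MMMMU", "UUUUU", "MMUUM", "MMMM")]
```

For the four example vectors this yields hypermethylated, hypomethylated, no label (mixed states), and no label (fewer than 5 sites).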

In an alternative embodiment, the analysis system considers portions of the methylation state vector and determines whether each portion is hypomethylated or hypermethylated. This alternative addresses the problem of missing methylation state vectors that are large in size but contain at least one densely hypomethylated or hypermethylated region. This procedure for defining hypomethylation and hypermethylation can be applied in step 450 of FIG. 4. In another embodiment, the fragments indicative of cancer may be defined according to the likelihoods output by the trained probability models.

In one embodiment, the analysis system generates 620 a hypomethylation score (P_hypo) and a hypermethylation score (P_hyper) for each CpG site in the genome. To generate the two scores at a given CpG site, the classifier takes four counts at that CpG site: (1) the count of (methylation state) vectors of the cancer group, labeled as hypomethylated, that overlap the CpG site; (2) the count of vectors of the cancer group, labeled as hypermethylated, that overlap the CpG site; (3) the count of vectors of the non-cancer group, labeled as hypomethylated, that overlap the CpG site; and (4) the count of vectors of the non-cancer group, labeled as hypermethylated, that overlap the CpG site. In addition, the process can normalize these counts for each group to account for the difference in group size between the non-cancer group and the cancer group. In alternative embodiments where fragments indicative of cancer are defined more generally, the scores may be more broadly defined as the counts of fragments indicative of cancer at each genomic region and/or CpG site.

In one embodiment, to generate 620 the hypomethylation score at a given CpG site, the process takes the ratio of (1) to the sum of (1) and (3). Similarly, the hypermethylation score is calculated as the ratio of (2) to the sum of (2) and (4). Further, these ratios may be calculated with the additional smoothing techniques discussed above. The hypomethylation score and the hypermethylation score relate to an estimate of the probability of cancer given the presence of a hypomethylated or hypermethylated fragment, respectively, at the CpG site.
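The four counts and the two ratios can be sketched as follows. The input layout (a list of `(label, sites)` pairs per training group), the function name, and the optional `alpha` smoothing constant are assumptions for the example, and group-size normalization is omitted for brevity.

```python
from collections import Counter

def site_scores(cancer_frags, non_cancer_frags, label, alpha=0.0):
    """Per-CpG score for one label ("hypo" or "hyper"): the cancer count divided
    by the cancer plus non-cancer count, optionally smoothed with alpha."""
    def overlaps(frags):
        return Counter(site for lab, sites in frags if lab == label for site in sites)
    n_cancer, n_non = overlaps(cancer_frags), overlaps(non_cancer_frags)
    scores = {}
    for site in set(n_cancer) | set(n_non):
        scores[site] = (n_cancer[site] + alpha) / (n_cancer[site] + n_non[site] + 2 * alpha)
    return scores

# Invented labeled fragments: (label, CpG sites overlapped by the fragment).
cancer = [("hypo", [1, 2, 3]), ("hypo", [2, 3]), ("hyper", [7, 8])]
non_cancer = [("hypo", [3]), ("hyper", [8]), ("hyper", [8])]

p_hypo = site_scores(cancer, non_cancer, "hypo")
p_hyper = site_scores(cancer, non_cancer, "hyper")
```

In this toy data, site 3 is overlapped by two hypomethylated cancer fragments and one hypomethylated non-cancer fragment, giving a hypomethylation score of 2/3.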

The analysis system generates 630 a total hypomethylation score and a total hypermethylation score for each aberrant methylation state vector. The total hypermethylation and hypomethylation score is determined based on the plurality of hypermethylation and hypomethylation scores for the plurality of CpG sites in the methylation state vector. In one embodiment, the aggregate hypermethylation and hypomethylation scores are assigned as the maximum hypermethylation and hypomethylation scores, respectively, for the number of sites in each state vector. However, in several alternative embodiments, the several total scores may be based on an average, median, or other calculation of several hypermethylation/hypomethylation scores using the several sites in each state vector.

The analysis system then ranks 640 all of a subject's methylation state vectors by their total hypomethylation scores and by their total hypermethylation scores, resulting in two rankings per subject. The process selects total hypomethylation scores from the hypomethylation ranking and total hypermethylation scores from the hypermethylation ranking. Based on the selected scores, the classifier generates 650 a single feature vector for each subject. In one embodiment, the scores selected from the two rankings are selected in a fixed order that is the same for the feature vector generated for each subject in each of the training groups. As an example, in one embodiment, the classifier takes the first, second, fourth, and eighth total hypermethylation scores from the hypermethylation ranking, and likewise the first, second, fourth, and eighth total hypomethylation scores from the hypomethylation ranking, and writes these scores into the feature vector for the subject.
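A small sketch of building the feature vector from the two rankings follows. The function name and the choice to pad with 0.0 when a subject has fewer scored vectors than the requested rank are assumptions; the example scores are invented.

```python
def feature_vector(hyper_scores, hypo_scores, ranks=(1, 2, 4, 8)):
    """Pick the scores at fixed rank positions from each ranking (highest first).
    Missing positions are padded with 0.0, which is an assumption."""
    def pick(scores):
        ordered = sorted(scores, reverse=True)
        return [ordered[r - 1] if r <= len(ordered) else 0.0 for r in ranks]
    return pick(hyper_scores) + pick(hypo_scores)

hyper = [0.1, 0.9, 0.5, 0.7, 0.3, 0.2, 0.8, 0.6, 0.4]   # total hypermethylation scores
hypo = [0.5, 0.4]                                       # total hypomethylation scores
features = feature_vector(hyper, hypo)
```

Using the same rank positions for every subject keeps the feature vectors the same length and comparable across the training groups.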

The analysis system trains 660 a binary classifier to distinguish the feature vectors of the cancer training group from those of the non-cancer training group. In general, any of a number of classification techniques may be used. In one embodiment, the classifier is a non-linear classifier. In a particular embodiment, the classifier is a non-linear classifier that employs L2-regularized kernel logistic regression with a Gaussian radial basis function (RBF) kernel.

In particular, in one embodiment, the number of non-cancer samples or samples of different cancer type(s) (n_other) and the number of cancer samples or samples of the cancer type(s) (n_cancer) having an aberrantly methylated fragment overlapping a CpG site are counted. The probability that a sample is cancer is then estimated by a score ("S") that correlates positively with n_cancer and negatively with n_other. The score may be calculated using the equation (n_cancer + 1)/(n_cancer + n_other + 2) or (n_cancer)/(n_cancer + n_other). The analysis system calculates 670 an information gain for each cancer type and for each genomic region or CpG site to determine whether the genomic region or CpG site is indicative of cancer. The information gain is calculated for training samples of a given cancer type compared to all other samples. For example, two random variables, "abnormal fragment" ("AF") and "cancer type" ("CT"), are used. In one embodiment, AF is a binary variable indicating whether an abnormal fragment, as determined for the abnormality score/feature vector above, overlaps a given CpG site in a given sample. CT is a random variable indicating whether the cancer is of a particular type. The analysis system calculates the mutual information with respect to CT given AF, that is, how many bits of information about the cancer type are gained by knowing whether an abnormal fragment overlaps a particular CpG site.
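The score S and the mutual information between AF and CT can be illustrated from a 2 x 2 table of sample counts. The function names and the argument order of the contingency table are assumptions for this sketch; the mutual information is the standard definition in bits.

```python
import math

def smoothed_score(n_cancer, n_other):
    """S = (n_cancer + 1) / (n_cancer + n_other + 2)."""
    return (n_cancer + 1) / (n_cancer + n_other + 2)

def mutual_information(af_ct, af_not_ct, no_af_ct, no_af_not_ct):
    """Mutual information (bits) between AF (abnormal fragment overlaps the CpG site)
    and CT (sample is of the given cancer type), from sample counts."""
    table = [[af_ct, af_not_ct], [no_af_ct, no_af_not_ct]]
    total = sum(sum(row) for row in table)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] == 0:
                continue                      # 0 * log(0) contributes nothing
            p_ij = table[i][j] / total
            p_i = sum(table[i]) / total
            p_j = (table[0][j] + table[1][j]) / total
            mi += p_ij * math.log2(p_ij / (p_i * p_j))
    return mi

mi_perfect = mutual_information(50, 0, 0, 50)    # AF fully determines CT
mi_none = mutual_information(25, 25, 25, 25)     # AF independent of CT
score = smoothed_score(3, 1)
```

A CpG site where abnormal fragments appear in every sample of the given cancer type and in no other sample yields 1 bit, while a site where abnormal fragments are equally common in both groups yields 0 bits.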

For a given cancer type, the analysis system uses this information to rank CpG sites by how cancer-specific they are. This procedure is repeated for all cancer types under consideration. If a particular region is commonly abnormally methylated in training samples of a given cancer type but not in training samples of other cancer types or in healthy training samples, then the CpG sites overlapped by those abnormal fragments will tend to have high information gain for the given cancer type. From the ranked CpG sites for each cancer type, CpG sites are greedily added (selected), based on their rank, to a selected set of CpG sites for use in the cancer classifier.

Calculating pairwise information gain from fragments indicative of cancer identified from the probability model:

With fragments indicative of cancer identified according to the second method described herein, the analysis system may identify genomic regions according to procedure 680 in FIG. 6B. The analysis system defines 690 a feature vector for each sample by counting, for each region and for each cancer type, the DNA fragments whose calculated log likelihood ratio indicates that the fragment is indicative of cancer, i.e., is above a threshold, wherein each count is a value in the feature vector. In one embodiment, the analysis system counts, for each cancer type, the number of fragments present in a sample in a region that have a log likelihood ratio above each of one or more possible thresholds. The analysis system uses the defined feature vectors to calculate an informative score for each genomic region that describes the ability of the genomic region to distinguish between each pair of cancer types. For each pair of cancer types, the analysis system ranks the regions based on the informative scores. The analysis system may select regions based on the ranking according to the informative scores.

The analysis system calculates 695 an informative score for each region that describes the region's ability to distinguish between each pair of cancer types. For each distinct pair of cancer types, the analysis system may assign one type as the positive type and the other as the negative type. In one embodiment, the ability of a region to distinguish between the positive type and the negative type is based on mutual information, calculated using the estimated fraction of cfDNA samples of the positive type and of the negative type for which the feature is expected to be non-zero in the final assay, i.e., for which at least one fragment of that tier would be sequenced in a targeted methylation assay. These fractions are estimated using the observed rates at which the feature occurs in healthy cfDNA and in high-signal cfDNA and/or tumor samples of each cancer type. For example, if a feature occurs frequently in healthy cfDNA, the feature is also expected to occur frequently in cfDNA of any cancer type, and will likely result in a low informative score. The analysis system may select a particular number of regions, e.g., 1024, from the ranking for each pair of cancer types.

In a number of additional embodiments, the analysis system further identifies, from the ranking of regions, regions that are predominantly hypermethylated or hypomethylated. For a region identified as informative, the analysis system may load the set of fragments of the positive type(s). The analysis system evaluates whether the loaded fragments are predominantly hypermethylated or hypomethylated. If the loaded fragments are predominantly hypermethylated or hypomethylated, the analysis system may select probes corresponding to the predominant methylation pattern. If the loaded fragments are not predominantly hypermethylated or hypomethylated, the analysis system may use a mixture of probes directed to both hypermethylation and hypomethylation. The analysis system may further identify a minimal set of CpG sites that overlap more than some percentage of the fragments.

In other embodiments, after ranking the regions based on their information scores, the analysis system labels each region with its lowest (i.e., best) informativeness rank across all pairs of cancer types. For example, if a region is the 10th most informative region for breast versus lung cancer and the 5th most informative region for breast versus colorectal cancer, then the region is given an overall label of "5". The analysis system may design probes starting from the regions with the lowest labels and add regions to the detection combination, e.g., until the size budget of the detection combination is exhausted.
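The labeling and budgeted selection described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the per-region `sizes` mapping (e.g., base pairs of probe coverage), and the rule of stopping at the first region that does not fit are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def label_regions(rankings: Dict[Tuple[str, str], List[str]]) -> Dict[str, int]:
    """Label each region with its lowest (best) rank over all cancer-type pairs.

    `rankings` maps a (positive type, negative type) pair to region identifiers
    ordered from most to least informative (rank 1 first).
    """
    labels: Dict[str, int] = {}
    for ranked_regions in rankings.values():
        for rank, region in enumerate(ranked_regions, start=1):
            labels[region] = min(rank, labels.get(region, rank))
    return labels


def select_regions(labels: Dict[str, int], sizes: Dict[str, int], budget: int) -> List[str]:
    """Add regions in order of increasing label until the size budget is exhausted."""
    chosen: List[str] = []
    used = 0
    # Ties are broken by region name so the result is deterministic (an assumption).
    for region in sorted(labels, key=lambda r: (labels[r], r)):
        if used + sizes[region] > budget:
            break  # hypothetical stopping rule: stop at the first region that does not fit
        chosen.append(region)
        used += sizes[region]
    return chosen
```

With a region ranked 10th for one pair and 5th for another, `label_regions` returns 5 for that region, matching the example in the text.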

Off-target genomic regions:

In some embodiments, probes directed to the selected genomic regions are further filtered 475 based on their number of off-target regions. This filtering screens out probes that pull down too many cfDNA fragments corresponding to or derived from off-target genomic regions. Excluding probes with many off-target regions can be valuable because it reduces the off-target rate and increases target coverage for a given amount of sequencing.

An off-target genomic region is a genomic region that has sufficient homology to a target genomic region that DNA molecules or fragments derived from the off-target genomic region are hybridized to, and pulled down by, a probe designed to hybridize to the target genomic region. An off-target genomic region may align to a probe along at least 35bp, 40bp, 45bp, 50bp, 60bp, 70bp, or 80bp with a match rate of at least 80%, 85%, 90%, 95%, or 97%. In some embodiments, an off-target genomic region is a genomic region (or a converted sequence of that region) that aligns to a probe along at least 45bp with a match rate of at least 90%. Various methods known in the art can be employed to screen for off-target genomic regions.

Exhaustively searching the genome to find all off-target genomic regions can be computationally challenging. In some embodiments, a k-mer seeding strategy (which may allow one or more mismatches) is combined with local alignment at the seed sites. In this case, an exhaustive search for good alignments can be guaranteed based on the k-mer length, the number of mismatches allowed, and the number of k-mer seed hits at a particular location. This requires dynamic-programming local alignment at a large number of locations, so the approach is well suited to vector CPU instructions (e.g., AVX2, AVX512) and can also be parallelized across many cores of a machine and across many machines connected by a network. One of ordinary skill in the art will recognize that modifications and variations of this approach may be applied for the purpose of identifying off-target genomic regions.
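The seeding idea can be illustrated with a deliberately simplified sketch. It assumes exact k-mer seeds and replaces the dynamic-programming local alignment with an ungapped comparison along each seeded diagonal, so it is not an exhaustive search. The function names and default thresholds (45 bp, 90% match, taken from one embodiment in the text) are choices made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the genome to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_candidate_sites(probe: str, genome: str, index: Dict[str, List[int]], k: int,
                         min_len: int = 45, min_identity: float = 0.90,
                         min_seeds: int = 1) -> List[int]:
    """Return genome offsets where the probe aligns (ungapped) above the thresholds.

    Simplification: the whole probe/genome overlap on a diagonal is scored,
    instead of running a local alignment around each seed.
    """
    # Count exact seed hits per diagonal (genome position minus probe position).
    seeds_per_diagonal: Dict[int, int] = defaultdict(int)
    for p in range(len(probe) - k + 1):
        for g in index.get(probe[p:p + k], ()):
            seeds_per_diagonal[g - p] += 1

    sites: List[int] = []
    for diagonal, n_seeds in sorted(seeds_per_diagonal.items()):
        if n_seeds < min_seeds:
            continue
        start = max(0, -diagonal)                       # first probe index inside the genome
        end = min(len(probe), len(genome) - diagonal)   # one past the last such index
        length = end - start
        if length < min_len:
            continue
        matches = sum(probe[i] == genome[diagonal + i] for i in range(start, end))
        if matches / length >= min_identity:
            sites.append(diagonal)
    return sites
```

Each returned offset other than the intended target would count as one off-target region for that probe. Because each diagonal is scored independently, the loop over diagonals is the part that would be vectorized or distributed.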

In some embodiments, probes having sequence homology with more than a threshold number of off-target genomic regions, or with DNA molecules corresponding to or derived from more than a threshold number of off-target genomic regions, are excluded (or filtered) from the detection combination. For example, probes having sequence homology with more than 30, more than 25, more than 20, more than 18, more than 15, more than 12, more than 10, or more than 5 off-target regions, or with DNA molecules corresponding to or derived from that many off-target regions, are excluded.

In some embodiments, probes are divided into 2, 3, 4, 5, 6, or more separate groups depending on their number of off-target regions. For example, probes having sequence homology with no off-target regions, or with DNA molecules corresponding to or derived from no off-target regions, are assigned to the high quality group; probes having sequence homology with 1 to 18 off-target regions, or with DNA molecules corresponding to or derived from 1 to 18 off-target regions, are assigned to the low quality group; and probes having sequence homology with more than 19 off-target regions, or with DNA molecules corresponding to or derived from more than 19 off-target regions, are assigned to the poor quality group. Other cutoff values may be used for the grouping.
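As a small illustration of this grouping, the sketch below assigns a probe to a group from its off-target count and summarizes the composition of a detection combination. The cutoffs follow the example in the text. A count of exactly 19 is not covered by the stated ranges, so placing it in the poor quality group is an assumption, as are the function names.

```python
from collections import Counter
from typing import Dict, Iterable


def quality_group(n_off_target: int) -> str:
    """Assign a probe to a quality group based on its number of off-target regions."""
    if n_off_target < 0:
        raise ValueError("off-target count cannot be negative")
    if n_off_target == 0:
        return "high"
    if n_off_target <= 18:
        return "low"
    return "poor"  # assumption: a count of exactly 19 is treated as poor quality


def group_fractions(off_target_counts: Iterable[int]) -> Dict[str, float]:
    """Fraction of probes in each quality group for one detection combination."""
    groups = Counter(quality_group(n) for n in off_target_counts)
    total = sum(groups.values())
    if total == 0:
        raise ValueError("no probes supplied")
    return {name: groups.get(name, 0) / total for name in ("high", "low", "poor")}
```

The fractions returned by `group_fractions` can be compared against composition limits such as "more than 80% high quality probes".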

In some embodiments, probes in the lowest quality group are excluded. In some embodiments, probes in all groups other than the highest quality group are excluded. In some embodiments, a separate detection combination is made for the probes in each group. In some embodiments, all probes are placed on the same detection combination, but separate analyses are performed based on the assigned group.

In some embodiments, a detection combination includes a greater number of high quality probes than probes in the lower quality groups. In some embodiments, a detection combination includes a smaller number of poor quality probes than probes in the other groups. In some embodiments, more than 95%, 90%, 85%, 80%, 75%, or 70% of the probes in a detection combination are high quality probes. In some embodiments, less than 35%, 30%, 20%, 10%, 5%, 4%, 3%, 2%, or 1% of the probes in a detection combination are low quality probes. In some embodiments, less than 5%, 4%, 3%, 2%, or 1% of the probes in a detection combination are poor quality probes. In some embodiments, no poor quality probes are included in a detection combination.

In some embodiments, probes having less than 50%, less than 40%, less than 30%, less than 20%, less than 10%, or less than 5% are removed. In some embodiments, probes having greater than 30%, greater than 40%, greater than 50%, greater than 60%, greater than 70%, greater than 80%, or greater than 90% are selectively included in a detection combination.

Methods of using cancer assay detection combinations:

In yet another aspect, methods of using a cancer assay detection combination (alternatively referred to as "decoys") are provided. The method may comprise the steps of: treating DNA molecules or fragments (e.g., using bisulfite treatment) to convert unmethylated cytosines to uracils (as described herein); applying a cancer assay detection combination to the converted DNA molecules or fragments; enriching for the subset of converted DNA molecules or fragments that bind to the probes in the detection combination; and sequencing the enriched cfDNA fragments. In some embodiments, the sequence reads may be compared to a reference genome (e.g., a human reference genome), allowing identification of the methylation state at a plurality of CpG sites in the DNA molecules or fragments, and thus providing information relevant to cancer detection.

Analysis of sequence reads:

In some embodiments, the sequence reads can be aligned to a reference genome using methods known in the art to determine alignment position information. The alignment position information may indicate a start position and an end position of a region in the reference genome that corresponds to a start nucleotide base and an end nucleotide base of a given sequence read. The alignment position information may also include a sequence read length, which can be determined from the start position and the end position. A region in the reference genome can be associated with a gene or a segment of a gene.

In various embodiments, a sequence read comprises a read pair denoted $R_1$ and $R_2$. For example, the first read $R_1$ can be sequenced from a first end of a nucleic acid fragment, and the second read $R_2$ can be sequenced from the second end of the nucleic acid fragment. Thus, the nucleotide bases of the first read $R_1$ and the second read $R_2$ can be aligned consistently (e.g., in opposite orientations) with the nucleotide bases of the reference genome. Alignment position information derived from the read pair $R_1$ and $R_2$ may include a start position in the reference genome corresponding to an end of the first read (e.g., $R_1$) and an end position in the reference genome corresponding to an end of the second read (e.g., $R_2$). In other words, the start position and the end position in the reference genome represent the likely location in the reference genome to which the nucleic acid fragment corresponds. An output file in SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for further analysis.

From the sequence reads, the location and methylation state of each CpG site can be determined based on alignment to a reference genome. Further, a methylation state vector can be generated for each fragment, specifying the location of the fragment in the reference genome (e.g., as specified by the position of the first CpG site in the fragment, or another similar metric), the number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment: methylated (e.g., denoted M), unmethylated (e.g., denoted U), or intermediate (e.g., denoted I). The methylation state vectors can be stored in temporary or persistent computer memory for later use and processing. Further, duplicate reads or duplicate methylation state vectors from a single subject may be removed. In an additional embodiment, a particular fragment may be determined to have one or more CpG sites with an intermediate methylation state. Such fragments may be excluded from subsequent processing, or selectively included where a downstream data model accounts for such intermediate methylation states.

FIG. 7B is an illustration of the process 100 of FIG. 7A of sequencing a cfDNA fragment to obtain a methylation state vector, according to an embodiment. As an example, the analysis system takes a cfDNA fragment 112. In this example, the cfDNA fragment 112 includes three CpG sites. As shown, the first and third CpG sites of the cfDNA fragment 112 are methylated 114. In the processing step 120, the cytosine of the unmethylated second CpG site is converted to a uracil. However, the first and third CpG sites are not converted.

After conversion, a sequencing library 130 is prepared and sequenced 140, resulting in a sequence read 142. The analysis system aligns 150 the sequence read 142 to a reference genome 144. The reference genome 144 provides the context as to where in a human genome the cfDNA fragment originates. In this simplified example, the analysis system aligns 150 the sequence read such that the three CpG sites correlate to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for ease of description). The analysis system thus obtains the methylation state of all CpG sites on the cfDNA fragment 112 and the positions in the human genome to which those CpG sites map. As shown, the methylated CpG sites on the sequence read 142 are read as cytosines. In this example, cytosines appear only at the first and third CpG sites in the sequence read 142, allowing the inference that the first and third CpG sites in the original cfDNA fragment are methylated. The second CpG site is read as a thymine (U is converted to T during the sequencing process), and thus it can be inferred that the second CpG site is unmethylated in the original cfDNA fragment. With both the methylation state and the location, the analysis system generates 160 a methylation state vector 152 for the cfDNA fragment 112. In this example, the resulting methylation state vector 152 is $\langle M_{23}, U_{24}, M_{25} \rangle$, where M corresponds to a methylated CpG site, U corresponds to an unmethylated CpG site, and the subscript numbers correspond to the position of each CpG site in the reference genome.
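The inference in this example can be mimicked with a short sketch. It assumes an idealized forward-strand read with no sequencing errors, and it assumes the read offsets of the CpG cytosines and their reference identifiers are already known from alignment. The function names and the toy 12-base fragment are hypothetical, and treating any base other than C or T as "I" is an assumption.

```python
from typing import List, Set


def bisulfite_read(fragment: str, methylated: Set[int]) -> str:
    """Simulate conversion plus sequencing: unmethylated C is read as T."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(fragment)
    )


def methylation_state_vector(read: str, cpg_offsets: List[int], cpg_ids: List[int]) -> List[str]:
    """Call each CpG site as M (read as C), U (read as T), or I (anything else)."""
    states: List[str] = []
    for offset, cpg_id in zip(cpg_offsets, cpg_ids):
        base = read[offset]
        if base == "C":
            states.append(f"M{cpg_id}")
        elif base == "T":
            states.append(f"U{cpg_id}")
        else:
            states.append(f"I{cpg_id}")  # assumption: unexpected base gives an intermediate call
    return states


# Toy fragment with CpG cytosines at offsets 1, 5, and 9; the first and third are methylated.
fragment = "ACGTTCGAACGT"
read = bisulfite_read(fragment, methylated={1, 9})
vector = methylation_state_vector(read, cpg_offsets=[1, 5, 9], cpg_ids=[23, 24, 25])
```

Here `vector` holds the three calls for CpG sites 23, 24, and 25 in the same order as the methylation state vector 152 in the example.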

FIGS. 8A-8B show three graphs of data validating the consistency of sequencing from a control group. The first graph 170 shows the accuracy of conversion of unmethylated cytosines to uracils (step 120) on cfDNA fragments obtained from test samples of subjects at different cancer stages (stage 0, stage I, stage II, stage III, stage IV, and non-cancer). As shown, the conversion of unmethylated cytosines on cfDNA fragments to uracils is consistent. The overall conversion accuracy was 99.47%, with a precision of ± 0.024%. The second graph 180 compares the coverage (depth of sequencing) across the different stages of cancer. Only sequence reads that reliably map to a reference genome were counted, and the mean coverage was about 34 for all groups. The third graph 190 shows the cfDNA concentration per sample across the different stages of cancer.

Detection of cancer:

The sequence reads obtained by the methods provided herein can be further processed by automated algorithms. For example, the analysis system is used to receive sequencing data from a sequencer and to perform various aspects of the processing described herein. The analysis system may be one of a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a tablet PC, or a mobile device. A computing device may be communicatively coupled to the sequencer by wireless communication technology, wired communication technology, or a combination of the two. Generally, the computing device is configured with a processor and a memory that stores computer instructions. The computer instructions, when executed by the processor, cause the processor to perform the steps described in the remainder of this document. Generally, the amount of genetic data and of data derived from it is so large, and the computing power required so great, that the processing cannot be performed on paper or by the human mind alone.

Clinical interpretation of the methylation states of genomic regions of interest is a process that includes classifying the clinical effect of each methylation state, or of a combination of methylation states, and reporting the results in a manner that is meaningful to a medical professional. The clinical interpretation can be based on a comparison of the sequence reads with a database specific to cancer or non-cancer subjects, and/or on the number and type of cfDNA fragments with cancer-specific methylation patterns identified from a sample. In some embodiments, target genomic regions are ranked or classified based on their likelihood of being differentially methylated in cancer samples, and the ranking or classification is used in the interpretation process. The ranking and classification may include (1) the type of clinical effect, (2) the strength of evidence for the effect, and (3) the size of the effect. Various methods of clinical analysis and interpretation of genomic data can be used to analyze the sequence reads. In some other embodiments, the clinical interpretation of the methylation states of such differentially methylated regions can be based on a machine learning approach that interprets a current sample using a classification or regression method trained on the methylation states of such differentially methylated regions in samples from cancer and non-cancer patients with known cancer status, cancer type, cancer stage, and tissue of origin (TOO), among others.

Clinically significant information can include the presence or absence of cancer in a broad sense, the presence or absence of a particular type of cancer, the stage of cancer, or the presence or absence of other types of disease. In some embodiments, the information is related to the presence or absence of one or more cancer types selected from the group consisting of breast cancer, endometrial cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of the renal pelvis, renal cell carcinoma, prostate cancer, anorectal cancer, anal cancer, colorectal cancer, hepatocellular cancer, liver/bile duct cancer, cancer of the bile duct and hepatic bile duct, pancreatic cancer, squamous cell carcinoma of the upper gastrointestinal tract, esophageal squamous cell carcinoma, head and neck cancer, lung cancer, squamous cell lung cancer, lung adenocarcinoma, small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, plasma cell tumor, multiple myeloma, myeloid tumor, lymphoma, and leukemia. In some embodiments, the information is related to the presence or absence of one or more cancer types selected from the group consisting of uterine cancer, upper gastrointestinal squamous carcinoma, all other upper gastrointestinal cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary cancer (hcc), hepatobiliary cancer, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer. 
In some embodiments, the information is related to the presence or absence of one or more cancer types selected from the group consisting of anal cancer, bladder cancer, colorectal cancer, esophageal cancer, head and neck cancer, liver/bile duct cancer, lung cancer, lymphoma, ovarian cancer, pancreatic cancer, plasma cell tumor, and gastric cancer. In some embodiments, the information is related to the presence or absence of one or more cancer types selected from the group consisting of thyroid cancer, melanoma, sarcoma, myeloid neoplasm, renal cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head and neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gall bladder cancer, upper digestive tract cancer, multiple myeloma, lymphoma, and lung cancer. In some embodiments, the number of samples are not cancerous and are from subjects with clonal expansion of white blood cells or no cancer.

Cancer classifier:

In some embodiments, the assay detection combinations described herein may be used with a cancer type classifier that predicts a disease state for a sample, such as a cancer type or non-cancer prediction, a tissue of origin (TOO) prediction, and/or an intermediate prediction. In some embodiments, the classifier may generate features from sequence reads by accounting for methylated and unmethylated fragments of DNA at particular genomic regions of interest. For example, if the cancer type classifier determines that the methylation pattern of a fragment is similar to that of a particular cancer type, the classifier can set a feature for that fragment to 1, and if no such fragment exists, the feature can be set to 0. In this way, the cancer type classifier can produce a set of binary features (30000 features, by way of example only) for each sample. Further, in some embodiments, all or a portion of the set of binary features of a sample may be input into the cancer type classifier to provide a set of probability scores, such as one probability score for each cancer type class and one for the non-cancer class. Further, in some examples, the cancer type classifier may incorporate or be used with thresholds to determine whether a sample should be called cancer or non-cancer, and/or intermediate thresholds to reflect confidence in a particular TOO call. Such methods are further described below.

To train the cancer type classifier, the analysis system (e.g., analysis system 800, FIG. 12B) may obtain a set of training samples. In some embodiments, each training sample comprises fragment file(s) (e.g., files containing sequence read data), a label corresponding to the cancer type (TOO) or non-cancer status of the sample, and/or the sex of the individual from whom the sample was obtained. The analysis system may use the training set to train the cancer type classifier to predict the disease state of a sample.

In some embodiments, for training, the analysis system divides the genome (e.g., the whole genome) or a subset of the genome (e.g., the targeted methylation regions) into regions. By way of example only, portions of the genome may be divided into "blocks" of CpGs, with a new block beginning whenever the distance between two nearest-neighbor CpGs is at least a minimum separation distance (e.g., at least 500 bp). Further, in some embodiments, each block may be subdivided into 1000 bp regions positioned such that adjacent regions overlap by a certain amount (e.g., 50% or 500 bp).
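A minimal sketch of this partitioning follows, using the example values from the text (500 bp minimum separation, 1000 bp regions, 500 bp overlap). The function names, the half-open interval convention, and letting the last region extend past the final CpG of a block are assumptions made for the example.

```python
from typing import List, Tuple


def cpg_blocks(cpg_positions: List[int], min_gap: int = 500) -> List[List[int]]:
    """Group CpG positions into blocks; a gap of at least `min_gap` starts a new block."""
    blocks: List[List[int]] = []
    for pos in sorted(cpg_positions):
        if blocks and pos - blocks[-1][-1] < min_gap:
            blocks[-1].append(pos)
        else:
            blocks.append([pos])
    return blocks


def tile_block(start: int, end: int, size: int = 1000, step: int = 500) -> List[Tuple[int, int]]:
    """Cover [start, end] with half-open regions of `size` bp that overlap by `size - step` bp."""
    regions: List[Tuple[int, int]] = []
    pos = start
    while True:
        regions.append((pos, pos + size))
        if pos + size > end:  # this region already contains the last CpG of the block
            break
        pos += step
    return regions
```

For instance, a block spanning positions 100 to 2300 would be tiled by regions starting at 100, 600, 1100, and 1600, each overlapping its neighbor by 500 bp.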

Further, in some examples, the analysis system may divide the training set into K subsets, or folds, to be used in K-fold cross-validation. In some embodiments, the folds may be balanced for cancer/non-cancer status, tissue of origin, cancer stage, age (e.g., grouped in 10-year buckets), and/or smoking status. In some examples, the training set is divided into 5 folds and five separate classifiers are trained, each trained on 4/5 of the training samples with the remaining 1/5 used for validation.
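One simple way to obtain balanced folds is to deal samples out round-robin within each stratum, as sketched below. The stratum tuples, the function names, and the absence of shuffling are assumptions made for the example. A practical pipeline would typically shuffle within strata first.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def assign_folds(strata: List[Hashable], k: int = 5) -> List[int]:
    """Return a fold index per sample so that each stratum is spread evenly over folds.

    `strata[i]` is any hashable summary of sample i, e.g. a tuple of
    (cancer status, tissue of origin, stage, age bucket, smoking status).
    """
    by_stratum: Dict[Hashable, List[int]] = defaultdict(list)
    for i, stratum in enumerate(strata):
        by_stratum[stratum].append(i)

    folds = [0] * len(strata)
    next_fold = 0  # carried across strata so overall fold sizes also stay even
    for indices in by_stratum.values():
        for i in indices:
            folds[i] = next_fold
            next_fold = (next_fold + 1) % k
    return folds


def split(folds: List[int], held_out: int) -> Tuple[List[int], List[int]]:
    """Indices used for training and for validation when `held_out` is withheld."""
    train = [i for i, f in enumerate(folds) if f != held_out]
    valid = [i for i, f in enumerate(folds) if f == held_out]
    return train, valid
```

With K = 5, calling `split` once per fold yields the five train/validation partitions, each training on 4/5 of the samples.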

When training with the training set, the analysis system can, for each cancer type (and for healthy cfDNA), fit a probability model to the fragments derived from samples of that type. As used herein, a "probability model" is any mathematical model capable of assigning a probability to a sequence read based on the methylation state at one or more sites on the sequence read. During training, the analysis system fits sequence reads derived from one or more samples from subjects with a known disease, and the fitted model can then be applied to methylation information or methylation state vectors to determine sequence read probabilities indicative of a disease state. In particular, in some cases, the analysis system determines the observed methylation ratio for each CpG site in a sequence read. The methylation ratio represents the fraction or percentage of base pairs that are methylated at a CpG site. The trained probability model may be parameterized by a product of the methylation ratios. In general, any known probability model for assigning probabilities to sequence reads from a sample can be used. For example, the probability model may be a binomial model, in which every site (e.g., CpG site) on a nucleic acid fragment is assigned the same probability of methylation, or an independent sites model, in which the methylation of each CpG is specified by a distinct methylation probability and methylation at one site is assumed to be independent of methylation at one or more other sites on the nucleic acid fragment.

In some embodiments, the probability model is a Markov model, in which the probability of methylation at each CpG site depends on the methylation state at some number of preceding CpG sites in the sequence read, or in the nucleic acid molecule from which the sequence read is derived. See, for example, U.S. patent application No. 16/352,602, entitled "Anomalous Fragment Detection and Classification" and filed March 13, 2019, which is incorporated herein by reference in its entirety and may be used in various embodiments.

In some embodiments, the probability model is a "mixture model" that is fitted using a mixture of components from underlying models. For example, in some embodiments, the mixture components can be determined using multiple independent sites models, in which methylation (e.g., the methylation ratio) at each CpG site is assumed to be independent of methylation at other CpG sites. Under an independent sites model, the probability assigned to a sequence read, or to the nucleic acid molecule from which it is derived, is the product of the methylation probability at each CpG site where the sequence read is methylated and one minus the methylation probability at each CpG site where the sequence read is unmethylated. According to this example, the analysis system determines the methylation ratios of each of the mixture components. The mixture model is parameterized by a sum of the mixture components, each associated with a product of the methylation ratios. A probability model Pr of $n$ mixture components can be represented by:

$$\Pr(\text{fragment} \mid \{\beta_{ki}, f_k\}) = \sum_{k=1}^{n} f_k \prod_{i} \beta_{ki}^{m_i} \left(1 - \beta_{ki}\right)^{1 - m_i}$$

For an input fragment, $m_i \in \{0, 1\}$ represents the observed methylation state of the fragment at position $i$ of a reference genome, with 0 indicating unmethylated and 1 indicating methylated. The fraction assigned to each mixture component $k$ is $f_k$, where $f_k \ge 0$ and $\sum_{k=1}^{n} f_k = 1$. The methylation probability at position $i$ in a CpG site of mixture component $k$ is $\beta_{ki}$. Thus, the probability of non-methylation is $1 - \beta_{ki}$. The number $n$ of mixture components may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, etc.
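The mixture of independent sites components described above (a weighted sum, over components, of per-site products of methylation probabilities) can be evaluated with a few lines of code. This is a sketch under the assumption that each component's methylation probabilities have already been restricted to the CpG sites covered by the fragment. The function and argument names are illustrative.

```python
from typing import Sequence


def fragment_probability(m: Sequence[int], f: Sequence[float],
                         beta: Sequence[Sequence[float]]) -> float:
    """Probability of an observed methylation pattern under a mixture model.

    m[i]       observed state at site i (1 = methylated, 0 = unmethylated)
    f[k]       fraction assigned to mixture component k (non-negative, sums to 1)
    beta[k][i] methylation probability of component k at site i
    """
    total = 0.0
    for f_k, beta_k in zip(f, beta):
        component = 1.0
        for m_i, b in zip(m, beta_k):
            # Methylated sites contribute beta, unmethylated sites contribute 1 - beta.
            component *= b if m_i == 1 else (1.0 - b)
        total += f_k * component
    return total


# Two components over three CpG sites (illustrative numbers only).
p = fragment_probability(
    m=[1, 0, 1],
    f=[0.7, 0.3],
    beta=[[0.9, 0.1, 0.9], [0.2, 0.8, 0.2]],
)
```

Because each component is a product of independent per-site terms, the probabilities over all possible methylation patterns of a fragment sum to one.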

In some embodiments, the analysis system fits the probability model using maximum-likelihood estimation to identify a set of parameters $\{\beta_{ki}, f_k\}$ that maximizes the log-likelihood of all fragments derived from a disease state, subject to a regularization penalty applied to each methylation probability with regularization strength $r$. The maximized quantity for $N$ total fragments can be expressed as:
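One form consistent with this description is sketched below. The first term, the log-likelihood summed over the $N$ fragments, follows directly from the text. The second term, a penalty that keeps each methylation probability away from 0 and 1, is an assumed choice of regularizer and is not taken from the text above.

```latex
\sum_{j=1}^{N} \ln \Pr\left(\text{fragment}_j \mid \{\beta_{ki}, f_k\}\right)
\;+\; r \sum_{k,i} \ln\left(\beta_{ki}\,(1 - \beta_{ki})\right)
```

Under this assumed form, a larger $r$ pulls every $\beta_{ki}$ toward 0.5, which limits overfitting at sites covered by few fragments.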

In some examples, the analysis system performs the fit separately for each cancer type and for healthy cfDNA. As will be understood by those skilled in the art, other means may be used to fit the probability models or to identify parameters that maximize the log-likelihood of all sequence reads derived from the reference samples. For example, in some embodiments, Bayesian fitting (using, for example, Markov chain Monte Carlo) is used, in which each parameter is not assigned a single value but is instead associated with a distribution. In some embodiments, gradient-based optimization is used, in which the gradient of the likelihood (or log-likelihood) with respect to the parameter values is used to step through the parameter space towards an optimum. In other embodiments, expectation-maximization is used, in which a set of latent variables (e.g., the identity of the mixture component from which each fragment is derived) is set to its expected value under the previous model parameters, and the model parameters are then assigned to maximize the likelihood given the assumed values of those latent variables. This two-step procedure is repeated until convergence.

Further, in some examples, the analysis system may generate features for each sample in the training set. For example, for each sample (regardless of label), in each region, for each cancer type, and for each fragment, the analysis system may evaluate the log-likelihood ratio $R$ using the fitted probability models according to the following equation:

$$R_{\text{cancer type}}(\text{fragment}) = \ln \frac{\Pr(\text{fragment} \mid \text{cancer type})}{\Pr(\text{fragment} \mid \text{healthy cfDNA})}$$

Then, for each sample, for each region, and for each cancer type, the analysis system may count, for each of a set of "tier" values, the number of fragments with $R_{\text{cancer type}} > \text{tier}$, and assign those counts as non-negative integer-valued features. For example, the tiers include thresholds of 1, 2, 3, 4, 5, 6, 7, 8, and 9, resulting in 9 features per region for each cancer type.
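The tier-count features can be sketched as follows for a single sample, region, and cancer type. The sketch assumes that the log-likelihood ratio compares the fitted cancer-type model against the fitted healthy cfDNA model, which is the natural reading of the surrounding text but is stated here as an assumption. The function names are illustrative.

```python
import math
from typing import Dict, Iterable, Sequence


def log_likelihood_ratio(p_cancer_type: float, p_healthy: float) -> float:
    """R for one fragment (assumed form: cancer-type model versus healthy cfDNA model)."""
    return math.log(p_cancer_type) - math.log(p_healthy)


def tier_counts(ratios: Iterable[float], tiers: Sequence[int] = tuple(range(1, 10))) -> Dict[int, int]:
    """For each tier, count the fragments whose R exceeds that tier."""
    values = list(ratios)
    return {tier: sum(r > tier for r in values) for tier in tiers}


def binarize(counts: Dict[int, int]) -> Dict[int, int]:
    """Convert counts to 0/1 features: any count above zero becomes 1."""
    return {tier: int(count > 0) for tier, count in counts.items()}


# Four fragments from one region, scored against one cancer type (illustrative values).
counts = tier_counts([0.5, 1.5, 3.2, 9.7])
```

A fragment with a very high ratio contributes to every tier below it, so the counts never increase from one tier to the next.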

In some embodiments, the analysis system may select particular features for inclusion in a feature vector for each sample. For example, for each distinct pair of cancer types, the analysis system may designate one type as the "positive type" and the other as the "negative type" and rank the features by their ability to distinguish between the two types. In some cases, the ranking is based on mutual information calculated by the analysis system. For example, the mutual information may be calculated using the estimated fraction of samples of the positive type and of the negative type (e.g., cancer types A and B) for which the feature is expected to be non-zero in the resulting assay. For example, if a feature occurs frequently in healthy cfDNA, the analysis system determines that the feature is also likely to occur frequently in cfDNA associated with the various cancer types. Thus, the feature may be a weak measure for distinguishing between disease states. In calculating the mutual information $I$, the variable $X$ is a particular feature (e.g., a binary feature) and the variable $Y$ represents a disease state, e.g., cancer type A or B:

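$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right)$$

$$p(1 \mid A) = f_A + f_H - f_H f_A$$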

The joint probability mass function of $X$ and $Y$ is $p(x, y)$, and the marginal probability mass functions are $p(x)$ and $p(y)$. The analysis system may assume that the absence of a feature is informative and that each disease state is equally likely a priori, for example, $p(Y = A) = p(Y = B) = 0.5$. The probability of observing (e.g., in cfDNA) a given binary feature for cancer type A is denoted $p(1 \mid A)$, where $f_A$ is the probability of observing the feature in a ctDNA sample (or high-signal cfDNA sample) from a tumor associated with cancer type A, and $f_H$ is the probability of observing the feature in a healthy or non-cancer cfDNA sample.
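For a binary feature and two equally likely disease states, the mutual information can be computed directly from the two conditional probabilities, as sketched below. The function names are illustrative, and reporting the result in bits (base-2 logarithm) is a choice made for the example.

```python
import math


def p_feature(f_tumor: float, f_healthy: float) -> float:
    """p(1 | A) = f_A + f_H - f_H * f_A: the feature arises from tumor or healthy cfDNA."""
    return f_tumor + f_healthy - f_healthy * f_tumor


def mutual_information(p1_a: float, p1_b: float, prior_a: float = 0.5) -> float:
    """Mutual information (in bits) between a binary feature X and disease state Y in {A, B}."""
    p_y = {"A": prior_a, "B": 1.0 - prior_a}
    p1_given_y = {"A": p1_a, "B": p1_b}

    def joint(x: int, y: str) -> float:
        return p_y[y] * (p1_given_y[y] if x == 1 else 1.0 - p1_given_y[y])

    mi = 0.0
    for x in (0, 1):
        p_x = joint(x, "A") + joint(x, "B")  # marginal of X
        for y in ("A", "B"):
            p_xy = joint(x, y)
            if p_xy > 0.0:  # terms with zero joint probability contribute nothing
                mi += p_xy * math.log2(p_xy / (p_x * p_y[y]))
    return mi


# A feature common in healthy cfDNA is nearly as frequent in both cancer types,
# so it carries little information about which type is present.
low_info = mutual_information(p_feature(0.3, 0.8), p_feature(0.1, 0.8))
high_info = mutual_information(p_feature(0.3, 0.0), p_feature(0.0, 0.0))
```

The comparison of `low_info` with `high_info` reflects the point made in the text: the same tumor-derived rates yield a weaker score when the healthy background rate is high.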

In some embodiments, only features corresponding to the positive type are included in the ranking, and only when the expected rate of occurrence of those features is higher in the positive type than in the negative type. For example, if "liver" is the positive type and "breast" is the negative type, then only the "liver_x" features are considered, and only if their expected rate of occurrence in liver cfDNA is greater than their expected rate of occurrence in breast cfDNA. Further, in some embodiments, for each region and for each cancer type pair (including non-cancer as a negative type), the analysis system keeps only the best-performing tier. Further, in some embodiments, the analysis system binarizes the feature values, such that any feature value greater than 0 is set to 1 and all features are either 0 or 1.

In some examples, the analysis system trains a multinomial logistic regression classifier on the training data of a fold and generates predictions for the held-out data. For example, for each of the K folds, a logistic regression may be trained for each combination of hyper-parameters. Such hyper-parameters may include the L2 penalty and/or topK (e.g., the number of highly ranked regions retained per tissue type pair (including non-cancer), as ranked by the mutual information procedure outlined above). For each set of hyper-parameters, performance is evaluated on the cross-validated predictions of the full training set, and the set of hyper-parameters with the best performance is selected for retraining on the full training set. In some examples, the analysis system uses log-loss as the performance metric, where the log-loss is calculated by taking the negative logarithm of the predicted probability of the correct label for each sample and then summing over samples (i.e., a perfect prediction of 1.0 for the correct label gives a log-loss of 0).
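The log-loss metric and the selection of the best hyper-parameter set can be sketched as follows. The data layout (one dictionary of class probabilities per sample) and the function names are assumptions made for the example. The classifier training itself is not shown.

```python
import math
from typing import Dict, Hashable, List


def log_loss(predictions: List[Dict[str, float]], labels: List[str]) -> float:
    """Sum over samples of the negative log of the probability given to the correct label."""
    return -sum(math.log(pred[label]) for pred, label in zip(predictions, labels))


def best_hyperparameters(cv_log_loss: Dict[Hashable, float]) -> Hashable:
    """Pick the hyper-parameter set with the lowest cross-validated log-loss."""
    return min(cv_log_loss, key=cv_log_loss.get)


# Cross-validated log-loss for three hypothetical (L2 penalty, topK) combinations.
chosen = best_hyperparameters({(0.1, 64): 41.2, (1.0, 64): 38.9, (1.0, 256): 39.5})
```

A confident wrong prediction is penalized heavily because the logarithm diverges as the probability assigned to the correct label approaches zero.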

To generate a prediction for a new sample, the feature values are calculated using the same method described above, but restricted to the features (region/positive class combinations) selected at the chosen topK value. The resulting features are then used to create a prediction using the logistic regression model trained above.

In some embodiments, the analysis system trains a two-stage classifier. For example, the analysis system trains a binary cancer classifier on the feature vectors of the training samples to distinguish between the labels cancer and non-cancer. In this case, the binary classifier outputs a prediction score that indicates the likelihood of the presence or absence of cancer. In another embodiment, the analysis system trains a multiclass cancer classifier to distinguish between many cancer types. The multiclass cancer classifier is trained to determine a cancer prediction comprising a prediction value for each of the cancer types being classified. The prediction values may correspond to the likelihood that a given sample has each of the cancer types. For example, the cancer classifier can return a cancer prediction for a test sample that includes a prediction score for breast cancer, lung cancer, and/or non-cancer.

The analysis system may train the cancer classifier according to any one of several methods. As an example, the binary cancer classifier may be an L2-regularized logistic regression classifier trained using a log-loss function. As another example, the multi-cancer (TOO) classifier may be a multinomial logistic regression. In practice, both types of cancer classifier may be trained using other techniques. These techniques are numerous and include kernel methods and machine learning algorithms such as multi-layer neural networks, among others. In particular, methods as described in PCT/US2019/022122 and U.S. patent application No. 16/352,602, which are incorporated herein by reference in their entirety, may be used for various embodiments. Furthermore, in some examples, the TOO classifier is trained only on samples that are successfully called as cancer by the binary classifier. In some examples, the binary classifier is trained on training samples different from those used for the TOO classifier.
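The two-stage arrangement can be sketched as follows, assuming that both classifiers have already produced their scores. The function name, the 0.5 threshold, and the toy scores are assumptions made for illustration only and are not values from the disclosure.

```python
def two_stage_predict(binary_score, too_scores, cancer_threshold=0.5):
    """Stage 1 calls cancer vs. non-cancer; stage 2 assigns a TOO only to cancer calls.

    binary_score: hypothetical cancer probability from the binary classifier.
    too_scores: dict mapping each TOO label to the multi-class prediction value.
    cancer_threshold: assumed cutoff; a real system would calibrate this value.
    """
    if binary_score < cancer_threshold:
        return "non-cancer", None
    # The TOO with the highest prediction value is reported for a cancer call.
    return "cancer", max(too_scores, key=too_scores.get)

call = two_stage_predict(0.92, {"breast": 0.7, "lung": 0.2, "colorectal": 0.1})
```

Keeping the two stages separate means the TOO scores are only consulted for samples that the binary stage has called as cancer.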

Exemplary sequencer and analysis system:

FIG. 12A is a flow diagram of systems and devices for sequencing nucleic acid samples according to one embodiment. The exemplary flow diagram includes several devices, such as a sequencer 820 and an analysis system 800. The sequencer 820 and the analysis system 800 may work together to perform one or more steps in the processes described herein.

In various embodiments, the sequencer 820 receives an enriched nucleic acid sample 810. As shown in FIG. 12A, the sequencer 820 may include a graphical user interface 825 that allows user interaction for specific tasks (e.g., initiating or terminating sequencing), and one or more loading stations 830 for loading a sequencing cartridge that includes the enriched fragment samples and/or the buffers necessary for performing the sequencing assay. Thus, once a user of the sequencer 820 has provided the necessary reagents and sequencing cartridge to the loading station 830 of the sequencer 820, the user can initiate sequencing by interacting with the graphical user interface 825 of the sequencer 820. Once initiated, the sequencer 820 performs sequencing and outputs sequence reads of the enriched fragments from the nucleic acid sample 810.

In some embodiments, the sequencer 820 is communicatively coupled with the analysis system 800. The analysis system 800 includes a number of computing devices for processing the sequence reads for various applications, such as assessing methylation status at one or more CpG sites, variant calling, or quality control. The sequencer 820 can provide the sequence reads in BAM file format to the analysis system 800. The analysis system 800 may be coupled to the sequencer 820 via wireless communication technology, wired communication technology, or a combination of both. Generally, the analysis system 800 is configured with a processor and a non-transitory computer-readable storage medium that stores computer instructions. The computer instructions, when executed by the processor, cause the processor to perform one or more steps of any of the methods or processes disclosed herein.

In some embodiments, the sequence reads can be aligned to a reference genome using methods known in the art to determine alignment position information. The alignment positions can generally describe a start position and an end position of a region in the reference genome corresponding to the first nucleotide base and the last nucleotide base of a given sequence read. For methylation sequencing, the alignment position information can be summarized to indicate the first CpG site and the last CpG site included in the sequence read, based on the alignment to the reference genome. The alignment information may further indicate the methylation status and position of all CpG sites in a given sequence read. A region in the reference genome can be associated with a gene or a segment of a gene. As such, the analysis system 800 can label a sequence read with one or more genes to which the sequence read aligns. In one embodiment, the fragment length (or size) is determined from the start and end positions.

In various embodiments, for example, when a paired-end sequencing process is used, one sequence read comprises one read pair, denoted as R_1 and R_2. For example, the first read R_1 can be sequenced from a first end of a double-stranded DNA (dsDNA) molecule, and the second read R_2 can be sequenced from a second end of the dsDNA molecule. Thus, the nucleotide bases of the first read R_1 and the second read R_2 can be aligned consistently (e.g., in opposite orientations) with the nucleotide bases of the reference genome. The alignment position information derived from the read pair R_1 and R_2 can include a start position in the reference genome corresponding to one end of a first read (e.g., R_1) and an end position in the reference genome corresponding to one end of a second read (e.g., R_2). In other words, the start and end positions in the reference genome represent the likely location in the reference genome to which the nucleic acid fragment corresponds. In one embodiment, the read pair R_1 and R_2 can be combined into a fragment, and the fragment is used for subsequent analysis and/or classification. An output file in SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for further analysis.
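Combining a read pair into a fragment can be sketched as follows. The `AlignedRead` type, its field names, and the 0-based half-open coordinate convention are assumptions chosen for illustration; an actual pipeline would read these values from the SAM/BAM records.

```python
from typing import NamedTuple

class AlignedRead(NamedTuple):
    chrom: str
    start: int        # 0-based leftmost reference position (inclusive)
    end: int          # 0-based rightmost reference position (exclusive)
    is_reverse: bool  # True if the read aligns to the reverse strand

def merge_pair(r1, r2):
    """Combine reads R_1 and R_2 into one fragment: (chrom, start, end, length)."""
    # A consistent pair lies on one chromosome, in opposite orientations.
    if r1.chrom != r2.chrom or r1.is_reverse == r2.is_reverse:
        raise ValueError("reads are not a consistently aligned pair")
    start = min(r1.start, r2.start)
    end = max(r1.end, r2.end)
    return r1.chrom, start, end, end - start

fragment = merge_pair(AlignedRead("chr1", 1000, 1100, False),
                      AlignedRead("chr1", 1075, 1175, True))
```

Here the two 100-base reads overlap by 25 bases, so the combined fragment spans 175 bases of the reference.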

Referring now to FIG. 12B, FIG. 12B is a block diagram of an analysis system 800 for processing DNA samples according to one embodiment. The analysis system employs one or more computing devices for use in analyzing a plurality of DNA samples. The analysis system 800 includes a sequence processor 840, a sequence database 845, a model database 855, a plurality of models 850, a parameter database 865, and a scoring engine 860. In some embodiments, the analysis system 800 performs one or more steps of the process 300 of fig. 3A, the process 340 of fig. 3B, the process 400 of fig. 4, the process 500 of fig. 5, the process 600 of fig. 6A, or the process 680 of fig. 6B, among other processes described herein.

The sequence processor 840 generates methylation state vectors for fragments from a sample. Via the process 300 of FIG. 3A, the sequence processor 840 generates, for each fragment, a methylation state vector that specifies the position of the fragment in the reference genome, the number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment, whether methylated, unmethylated, or indeterminate. The sequence processor 840 may store the methylation state vectors for the fragments in the sequence database 845. The data in the sequence database 845 can be organized such that the methylation state vectors from a sample are associated with each other.
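A methylation state vector of this kind can be sketched as a small record. The sketch assumes a forward-strand, ungapped alignment of a bisulfite-converted read, in which a C observed at a CpG cytosine indicates methylation and a T indicates no methylation; the function name, the dictionary keys, and the 'M'/'U'/'I' codes are hypothetical.

```python
def methylation_state_vector(chrom, frag_start, read_seq, cpg_positions):
    """Build a state vector for one fragment.

    cpg_positions: sorted 0-based reference coordinates of CpG cytosines.
    Each covered CpG is coded 'M' (methylated), 'U' (unmethylated) or
    'I' (indeterminate, e.g. an N or an unexpected base).
    """
    frag_end = frag_start + len(read_seq)
    covered = [p for p in cpg_positions if frag_start <= p < frag_end]
    states = "".join({"C": "M", "T": "U"}.get(read_seq[p - frag_start], "I")
                     for p in covered)
    return {
        "chrom": chrom,
        "first_cpg": covered[0] if covered else None,  # position of the fragment
        "n_cpgs": len(covered),                        # number of CpG sites
        "states": states,                              # state of each CpG site
    }

vector = methylation_state_vector("chr1", 100, "ACGTTGNG", [101, 104, 106, 200])
```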

Further, a plurality of different models 850 may be stored in the model database 855 and retrieved for use with test samples. In one embodiment, a model is a trained cancer classifier for determining a cancer prediction for a test sample using a feature vector derived from a plurality of anomalous fragments. The training and use of the cancer classifier is discussed elsewhere herein. The analysis system 800 may train one or more models 850 and store the trained parameters in the parameter database 865, and the analysis system 800 stores the models 850 along with their functions in the model database 855.

During inference, the scoring engine 860 uses the one or more models 850 to return outputs. The scoring engine 860 accesses the models 850 in the model database 855 along with the trained parameters from the parameter database 865. For each model, the scoring engine receives an input appropriate for that model and calculates an output based on the received input, the parameters, and a function of the model relating the input and the output. In some use cases, the scoring engine 860 further computes metrics associated with a confidence in the output computed by the model. In other use cases, the scoring engine 860 calculates other intermediate values for use in the model.

Cancer and therapy monitoring:

In particular embodiments, the first time point is prior to cancer treatment (e.g., prior to resection surgery or therapeutic intervention), the second time point is after cancer treatment (e.g., after resection surgery or therapeutic intervention), and the method is used to monitor treatment effectiveness. For example, if the second likelihood or probability score is lower than the first likelihood or probability score, the treatment is considered successful. However, if the second likelihood or probability score is higher than the first likelihood or probability score, the treatment is considered unsuccessful. In other embodiments, both the first time point and the second time point precede cancer treatment (e.g., prior to resection surgery or therapeutic intervention). In other embodiments, the first time point and the second time point are both after cancer treatment (e.g., after resection surgery or therapeutic intervention), and the method is used to monitor the effectiveness of the treatment or the loss of effectiveness of the treatment. In other embodiments, cfDNA samples can be obtained from a cancer patient at a first time point and a second time point and analyzed, for example, to monitor cancer progression, to determine whether the cancer is in remission (e.g., post-treatment), to monitor or detect residual disease or disease recurrence, or to monitor treatment (e.g., therapeutic) efficacy.

One skilled in the art will readily appreciate that test samples can be obtained from a cancer patient at any desired time points and analyzed according to the methods of the present invention to monitor the cancer status of the patient. In some embodiments, the first and second time points are separated by a period of time ranging from about 15 minutes to about 30 years, e.g., about 30 minutes; for example, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or about 24 hours; for example, about 1, 2, 3, 4, 5, 10, 15, 20, 25, or about 30 days; for example, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 months; or, for example, about 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5, 19, 19.5, 20, 20.5, 21, 21.5, 22, 22.5, 23, 23.5, 24, 24.5, 25, 25.5, 26, 26.5, 27, 27.5, 28.5, 29.5, or about 30 years. In other embodiments, the test sample may be obtained from the patient at least once every 3 months, at least once every 6 months, at least once a year, at least once every 2 years, at least once every 3 years, at least once every 4 years, or at least once every 5 years.

Treatment:

In another embodiment, information obtained from any of the methods described herein (e.g., a likelihood or probability score) can be used to make or influence a clinical decision (e.g., cancer diagnosis, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, if the likelihood or probability score exceeds a threshold, the physician may prescribe an appropriate treatment (e.g., resection surgery, radiation therapy, chemotherapy, and/or immunotherapy). In some embodiments, information such as a likelihood or probability score may be provided as a readout to a physician or subject.

A classifier (as described herein) can be used to determine a likelihood or probability score that a sample feature vector is from a subject with cancer. In one embodiment, when the likelihood or probability score exceeds a threshold, an appropriate treatment (e.g., resection surgery or a therapeutic agent) is given. For example, in one embodiment, if the likelihood or probability score is greater than or equal to 60, one or more appropriate treatments are given. In another embodiment, if the likelihood or probability score is greater than or equal to 65, greater than or equal to 70, greater than or equal to 75, greater than or equal to 80, greater than or equal to 85, greater than or equal to 90, or greater than or equal to 95, then one or more appropriate treatments are given. In other embodiments, a cancer log-odds ratio may indicate the effectiveness of a cancer treatment. For example, an increase in the cancer log-odds ratio over time (e.g., at a second time point, after treatment) may indicate that the treatment is ineffective. Similarly, a decrease in the cancer log-odds ratio over time (e.g., at a second time point, after treatment) may indicate successful treatment. In another embodiment, one or more appropriate treatments are given if the cancer log-odds ratio is greater than 1, greater than 1.5, greater than 2, greater than 2.5, greater than 3, greater than 3.5, or greater than 4.
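The comparison of scores between two time points can be sketched as below. The sketch assumes that the cancer log-odds ratio is the natural log of p / (1 - p) for a classifier probability p, which is an assumption rather than a definition given here; the function names and the example probabilities are hypothetical.

```python
import math

def log_odds(probability):
    """Natural-log odds of a probability strictly between 0 and 1 (assumed definition)."""
    return math.log(probability / (1.0 - probability))

def trend(first_log_odds, second_log_odds):
    """Direction of change between the first and the second time point."""
    if second_log_odds > first_log_odds:
        return "increase"   # may indicate that the treatment is ineffective
    if second_log_odds < first_log_odds:
        return "decrease"   # may indicate successful treatment
    return "no change"

before = log_odds(0.9)   # hypothetical score before treatment
after = log_odds(0.6)    # hypothetical score after treatment
result = trend(before, after)
exceeds_cutoff = before > 2   # one of the log-odds cutoffs listed above
```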

In some embodiments, the treatment is one or more cancer therapeutic agents selected from the group consisting of a chemotherapeutic agent, a targeted cancer therapeutic agent, a differentiation therapeutic agent, a hormonal therapeutic agent, and an immunotherapeutic agent. For example, the therapeutic agent can be one or more chemotherapeutic agents selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, antitumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents, and any combination thereof. In some embodiments, the treatment is one or more targeted cancer therapeutic agents selected from the group consisting of signal transduction inhibitors (e.g., tyrosine kinase and growth factor receptor inhibitors), histone deacetylase (HDAC) inhibitors, retinoic acid receptor agonists, proteasome inhibitors, angiogenesis inhibitors, and monoclonal antibody conjugates. In some embodiments, the treatment is one or more differentiation therapeutic agents including retinoids, e.g., tretinoin, alitretinoin, and bexarotene. In some embodiments, the treatment is one or more hormonal therapeutic agents selected from the group consisting of antiestrogens, aromatase inhibitors, progestins, antiandrogens, gonadotropin-releasing hormone agonists, or the like. In some embodiments, the treatment is one or more immunotherapeutic agents selected from the group comprising monoclonal antibody therapies such as rituximab (RITUXAN) and alemtuzumab (CAMPATH), non-specific immunotherapies and adjuvants such as BCG, interleukin-2 (IL-2), and interferon alpha, and immunomodulatory drugs such as thalidomide and lenalidomide (REVLIMID). A skilled physician or oncologist is able to select an appropriate cancer therapeutic agent based on the type of tumor, the stage of the cancer, previous exposure to a cancer treatment or therapeutic agent, and other characteristics of the cancer.

Examples:

The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how the present disclosure may be made and used, and are not intended to limit the scope of what the inventors regard as their description, nor are they intended to represent that the experiments below are all or the only experiments performed. Efforts have been made to ensure accuracy with respect to the numbers used (e.g., amounts, temperature, etc.), but some experimental error and deviation should be accounted for.

Example 1: Analysis of probe measurements

To test how much overlap is needed between a cfDNA fragment and a probe to achieve a non-negligible amount of pull-down, various overlap lengths were tested using detection combinations designed to include three different types of probes (VID3, VID4, VIE2). The three types of probes had various overlaps with 175 bp target DNA fragments specific for each probe. The overlaps tested ranged from 0 bp to 120 bp. Samples comprising the 175 bp target DNA fragments were applied to the detection combination and washed, and the DNA fragments bound to the probes were then collected. The amount of DNA fragments collected was measured and plotted as density versus size of overlap, as provided in FIG. 10.

When the overlap is less than 45bp, there is no significant binding and pull-down of the target DNA fragment. These results show that a fragment-probe overlap of at least 45bp is generally required to achieve a non-negligible amount of pull-down, although this number may vary depending on assay conditions.

Further, it was shown that a mismatch rate of more than 10% between the probe and fragment sequences in the overlapping region was sufficient to substantially disrupt binding and thus pull-down efficiency. Thus, sequences that can align to the probe along at least 45 bp with a match rate of at least 90% are candidates for off-target pull-down.

Therefore, for each probe, we performed an exhaustive search for all genomic regions that align over at least 45 bp with a match rate of at least 90% (i.e., off-target regions). Specifically, we combined a k-mer seeding strategy (which may allow for one or more mismatches) with local alignment at the seed positions. Given the k-mer length, the number of mismatches allowed, and the number of k-mer seed hits required at a particular location, this guarantees that no good alignment is missed. The search involves performing dynamic programming local alignment at a large number of locations, so the approach is well suited to vector CPU instructions (e.g., AVX2, AVX512) and to parallelization across the many cores of one machine and across many machines connected by a network. This allows an exhaustive search, which is valuable when designing a high-performance detection combination (i.e., a low off-target rate and high target coverage for a given amount of sequencing).
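The guarantee that seeding does not miss a qualifying alignment can be illustrated with a simplified sketch. It assumes exact-match seeds and an ungapped (mismatch-only) comparison on the forward strand, in place of the mismatch-tolerant seeds and dynamic programming local alignment described above; all function names are hypothetical. A 45 bp window at 90% identity has at most 4 mismatches, and 4 mismatches cannot touch all 5 disjoint 9-mers in the window, so at least one 9-mer matches exactly.

```python
def seed_params(window=45, min_identity_pct=90):
    """Return (max mismatches m, seed length k) for an ungapped window."""
    m = window * (100 - min_identity_pct) // 100  # 45 bp at 90% -> 4 mismatches
    k = window // (m + 1)                         # m + 1 disjoint seeds fit in the window
    return m, k

def find_offtarget_windows(probe, genome, window=45, min_identity_pct=90):
    """Return sorted (probe_offset, genome_offset) windows meeting the identity cutoff."""
    m, k = seed_params(window, min_identity_pct)
    index = {}                                    # k-mer -> genome positions
    for i in range(len(genome) - k + 1):
        index.setdefault(genome[i:i + k], []).append(i)
    hits = set()
    for w in range(len(probe) - window + 1):      # every window of the probe
        for j in range(m + 1):                    # every disjoint seed in the window
            q = w + j * k
            for g in index.get(probe[q:q + k], ()):
                gs = g - j * k                    # genome start implied by this seed hit
                if gs < 0 or gs + window > len(genome):
                    continue
                mismatches = sum(a != b for a, b in
                                 zip(probe[w:w + window], genome[gs:gs + window]))
                if mismatches <= m:
                    hits.add((w, gs))
    return sorted(hits)
```

Each seed hit is verified independently, which is why the verification step parallelizes readily across cores or machines.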

After the exhaustive search, each probe was scored based on the number of its off-target regions. Most probes have a score of 1, meaning that they match only one location. Probes with a score between 2 and 19 were accepted, but probes with a score of more than 20 were discarded. Other cutoff values may be used for particular samples. Probes that target hypermethylated regions tend to have fewer off-target regions than probes that target other regions.
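The scoring and filtering step can be sketched as follows. The function names and the input layout are hypothetical. The text leaves a score of exactly 20 unspecified, so the sketch exposes the cutoff as a parameter and, as an assumption, keeps only scores below it.

```python
def score_probes(alignment_hits):
    """Score = number of genomic locations a probe aligns to (1 = target only).

    alignment_hits: dict mapping a probe name to the list of locations found
    by the exhaustive search (hypothetical input layout).
    """
    return {probe: len(locations) for probe, locations in alignment_hits.items()}

def accept_probes(scores, cutoff=20):
    """Keep probes whose score is below the cutoff (assumed handling of the boundary)."""
    return sorted(probe for probe, score in scores.items() if score < cutoff)

hits = {
    "probe_a": ["chr1:100"],                         # matches only one location
    "probe_b": [f"chr2:{i}" for i in range(19)],     # accepted
    "probe_c": [f"chr3:{i}" for i in range(25)],     # discarded
}
accepted = accept_probes(score_probes(hits))
```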

Example 2: Annotation of target genomic regions

The target genomic regions determined by the process outlined in FIG. 4 were analyzed to understand their characteristics. Specifically, the selected target genomic regions were aligned with a reference genome to determine alignment positions. Alignment position information was collected for each selected target genomic region, including the chromosome number, the starting nucleotide base, the terminating nucleotide base, and the genome annotation of the given genomic region. A target genomic region may be located in an intron, an exon, an intergenic region, a 5' UTR, a 3' UTR, or a regulatory region such as a promoter or enhancer. The number of target genomic regions within each genome annotation was counted and plotted in the chart provided in FIG. 11. FIG. 11 compares the number of selected target genomic regions (black bars) with the number of randomly selected genomic regions (gray bars) within each genome annotation.

The analysis indicated that the selected target genomic regions were not random in their genomic distribution, and that, compared with randomly selected targets of the same size, the selected target genomic regions were enriched in regulatory and functional elements (e.g., promoters and 5' UTRs) and depleted in intergenic sequences. For example, the target genomic regions are more often found in a promoter, 5' UTR, exon, intron/exon boundary, intron, 3' UTR, or enhancer than in an intergenic region.
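The comparison between selected and randomly chosen regions can be expressed as a fold enrichment per annotation category. The following is a minimal sketch; the function name and the toy annotation lists are hypothetical and do not reproduce the data of FIG. 11.

```python
from collections import Counter

def annotation_enrichment(selected, random_regions):
    """Fold enrichment of each annotation among selected vs. random regions.

    Each argument is a list with one annotation label per region. Categories
    absent from the random set are skipped to avoid division by zero.
    """
    s, r = Counter(selected), Counter(random_regions)
    return {a: (s[a] / len(selected)) / (r[a] / len(random_regions))
            for a in s if r[a]}

# Toy labels (assumed values for illustration only).
selected = ["promoter"] * 6 + ["intron"] * 2 + ["intergenic"] * 2
background = ["promoter"] * 1 + ["intron"] * 3 + ["intergenic"] * 6
enrichment = annotation_enrichment(selected, background)
```

A value above 1 indicates a category that is over-represented among the selected regions, and a value below 1 indicates an under-represented category.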

Example 3: Cancer assay detection combination for detecting cancer and cancer type

Samples for genomic region selection: the DNA samples for this work were from different sources.

The Circulating Cell-free Genome Atlas study (CCGA; ClinicalTrials.gov identifier NCT02889978) is a prospective, multicenter, case-control, observational study with longitudinal follow-up. De-identified biological samples were collected from approximately 15000 participants at 142 sites.

The Cancer Genome Atlas ("TCGA") is a public resource developed by the National Cancer Institute (NCI) in collaboration with the National Human Genome Research Institute (NHGRI).

Disseminated tumor cells (DTC) were obtained from Conversant.

Non-cancer cells were provided by Yuval Dor and Ben Glaser (Hebrew University) and were derived from human tissue obtained by standard clinical procedures. For example, breast luminal and basal epithelial cells were from breast reduction surgery; colonic epithelial cells were from tissue near the site of re-implantation after resection of a localized pathological segment of the colon; bone marrow cells were from joint replacement surgery; vascular and arterial endothelial cells were from vascular surgery; and head and neck epithelium was from tonsillectomy.

WGBS was performed on genomic DNA (gDNA) from over 1000 samples taken from healthy individuals and from individuals diagnosed with cancer at various stages and with various tissues of origin. The samples included formalin-fixed, paraffin-embedded (FFPE) tissue blocks, disseminated tumor cells (DTC) of cancers of different TOO, bone marrow mononuclear cells (BMMC), white blood cells (WBC), and peripheral blood mononuclear cells (PBMC). Prior to isolation of gDNA, the DTCs were negatively selected using a negative selection kit to remove WBCs, fibroblasts, and endothelial cells. Negative selection resulted in purified tumor cells, allowing the differentially methylated regions to be identified more clearly.

The TCGA data were collected by hybridizing the bisulfite-converted DNA fragments from 8809 samples to methylation-sensitive oligonucleotide arrays. The β values in this study represent the relative abundance of methylation at 480000 CpG sites. After excluding CpG sites with cross-hybridizing probes (45000) and CpG sites from noisy genomic regions (360000), 75000 CpG sites were analyzed. The TCGA data were analyzed using different algorithms because the TCGA data describe the methylation of individual CpG sites, while the WGBS data reveal the methylation pattern of strings of adjacent CpG sites on DNA fragments.

Tissue of origin classes: Each sample was classified into one of twenty-five (25) different tissue of origin (TOO) classes (i.e., cancer types): breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of the renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, pancreatic cancer, upper gastrointestinal squamous cell cancer, upper gastrointestinal cancer other than squamous cell cancer, head and neck cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer, and cancers other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, multiple myeloma, lymphoma, and leukemia. These TOO classes cover 97% of the cancer incidence reported by the Surveillance, Epidemiology, and End Results program (SEER; seer.cancer.gov) after filtering out fluid, brain, small intestine, vagina and vulva, and penis and testis. Rarely occurring cancers such as sarcomas and neuroendocrine cancers are pooled to prevent misclassification. The site, morphology, and behavior codes of the International Classification of Diseases for Oncology (ICD-O-3) and the site nomenclature of the World Health Organization (WHO) are used to classify individual samples into the TOO classes. For example, as shown in Table 1, 34 TCGA studies were mapped to the 25 TOO classes. The TOO classes were iteratively optimized based on the observed classification performance.

Table 1: TCGA study to tissue of origin (TOO) classes

Region selection: For target selection, fragments with anomalous methylation patterns in cancer samples were selected using one or more of the methods described herein. The use of these methods allows low-noise regions to be identified as candidate targets. Within these low-noise regions, the fragments that are most informative in distinguishing cancer types are ranked and selected.

In particular, in some embodiments, when WGBS data are used, the fragment sequences in the database are filtered based on p-value using a non-cancer profile, and only fragments with a p-value less than 0.001 are retained, as described herein. In some cases, the selected cfDNA fragments are further filtered to retain only fragments that are at least 90% methylated or at least 90% unmethylated. Next, for each CpG site in the selected fragments, the number of cancer samples and the number of non-cancer samples that include fragments overlapping that CpG site are counted. Specifically, P(cancer | overlapping fragment) is calculated for each CpG site, and genomic sites with high values of this probability are selected as general cancer targets. By design, the selected fragments have very low noise (i.e., few non-cancer fragments overlap). To find cancer type-specific targets, a similar selection process is performed. CpG sites are ranked by their information gain, comparing the numbers of (i) samples of a particular TOO versus other samples, including non-cancer samples and samples of a different TOO, (ii) samples of a particular TOO versus non-cancer samples, and/or (iii) samples of a particular TOO versus samples of a different TOO, that include fragments overlapping the CpG site. The procedure is applied to each of the 25 TOOs and to all pairwise combinations of the 25 TOOs. For example, P(cancer | overlapping fragment) is calculated for one TOO and then compared with P(cancer | overlapping fragment) for a different TOO. Outlier fragments that are more likely under a cancer of one TOO than under a cancer of a different TOO are selected as targets for that TOO. Thus, the genomic regions selected by pairwise comparison include genomic regions that are differentially methylated between a target TOO and a comparison TOO.
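The per-CpG calculation of P(cancer | overlapping fragment) can be sketched from sample counts. The pseudocount smoothing, the 0.9 selection threshold, the function names, and the toy counts are assumptions made for illustration; the text above does not specify them.

```python
def p_cancer_given_overlap(n_cancer, n_noncancer, pseudocount=1):
    """Estimate P(cancer | overlapping fragment) for one CpG site.

    n_cancer / n_noncancer: numbers of cancer and non-cancer samples with a
    retained fragment overlapping the site. The pseudocount is an assumed
    smoothing term that keeps sparsely covered sites near 0.5.
    """
    return (n_cancer + pseudocount) / (n_cancer + n_noncancer + 2 * pseudocount)

def select_cancer_targets(counts, min_probability=0.9):
    """Return CpG identifiers ranked by probability, keeping those above the cutoff."""
    scored = {cpg: p_cancer_given_overlap(*n) for cpg, n in counts.items()}
    kept = [cpg for cpg, p in scored.items() if p >= min_probability]
    return sorted(kept, key=lambda cpg: -scored[cpg])

# Toy counts: (cancer samples, non-cancer samples) per CpG site.
counts = {"cpg_a": (48, 0), "cpg_b": (10, 10), "cpg_c": (18, 0)}
targets = select_cancer_targets(counts)
```

The same function can be reused for a pairwise comparison by counting samples of one TOO in place of cancer samples and samples of a different TOO in place of non-cancer samples.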

Additional genomic regions of interest were selected according to the method described in the section above entitled "calculating paired information obtained from cancer indicative fragments identified from a probability model". The number of genomic regions used to distinguish each target TOO (x-axis) from a comparative TOO (y-axis) is provided in FIG. 13.

When TCGA data are used, CpG β values indicating methylation intensity are used to identify the target genomic regions. This is because the array data are not at the CpG site level and therefore they are prone to false positives. To avoid false positives, the CpG sites were grouped into bins of 350 bp spanning the genome. The β value of each bin is calculated as the average of the CpG β values in the bin. Bins with fewer than 2 CpGs were excluded from the analysis. Bins are then selected that (i) have a β difference greater than 0.95 between a particular TOO and other samples, wherein the other samples include non-cancer samples and samples of different TOOs, (ii) have a β difference greater than 0.95 between a particular TOO and non-cancer samples, and/or (iii) have a β difference greater than 0.95 between a particular TOO and samples of different TOOs, wherein the different TOOs include fragments that overlap the CpG site.
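The binning and selection of array β values can be sketched as follows. The function names, the tuple layout of the input, the use of an absolute difference (so that both hypermethylated and hypomethylated bins qualify), and the toy values are assumptions made for illustration.

```python
from collections import defaultdict

def bin_beta_values(cpg_betas, bin_size=350, min_cpgs=2):
    """Average CpG beta values within fixed-width genomic bins.

    cpg_betas: iterable of (chrom, position, beta). Bins holding fewer than
    min_cpgs CpG sites are dropped. Returns {(chrom, bin_index): mean beta}.
    """
    groups = defaultdict(list)
    for chrom, pos, beta in cpg_betas:
        groups[(chrom, pos // bin_size)].append(beta)
    return {b: sum(v) / len(v) for b, v in groups.items() if len(v) >= min_cpgs}

def select_bins(too_bins, other_bins, min_diff=0.95):
    """Bins whose mean beta differs by more than min_diff between the two groups."""
    return sorted(b for b in too_bins
                  if b in other_bins and abs(too_bins[b] - other_bins[b]) > min_diff)

# Toy values: one strongly differential bin, one weak bin, one single-CpG bin.
too = bin_beta_values([("chr1", 10, 0.99), ("chr1", 20, 0.97),
                       ("chr1", 400, 0.5), ("chr1", 500, 0.5), ("chr1", 800, 0.9)])
other = bin_beta_values([("chr1", 10, 0.01), ("chr1", 20, 0.01),
                         ("chr1", 400, 0.4), ("chr1", 500, 0.4), ("chr1", 800, 0.0)])
selected_bins = select_bins(too, other)
```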

The genomic regions selected as described above were then filtered according to the number of off-target genomic regions, as specified in 4.4.7. Specifically, the number of genomic positions aligning over ≥ 45 bp with ≥ 90% identity was calculated as the number of off-target genomic regions. Genomic regions with more than 20 off-target genomic regions were discarded.

A list of the various target genomic regions selected as described in this section is shown in Table 2. The lists contain distinct but overlapping sets of target genomic regions. They differ in the total number of target genomic regions, the total length of the target genomic regions, and the chromosomal locations of the target genomic regions. Lists 1 to 3 are small, medium, and large detection combinations, respectively. The target genomic regions of lists 4 to 16 are subsets of the CpG methylation sites found in the target genomic regions of list 3. Lists 4, 6, and 8 through 16 are filtered to exclude previously known target genomic regions.

Table 2: SEQ ID NOs corresponding to lists 1 to 16. For each list, the table identifies the total number of target genomic regions in the list, the range of SEQ ID NOs corresponding to all target genomic regions of the list as found in the sequence listing filed with the application, and the sum of the lengths of all target genomic regions in the list. The sequence listing identifies the chromosomal location of each target genomic region, whether the cfDNA fragments to be enriched from that region are hypermethylated or hypomethylated, and the sequence of one DNA strand of the target genomic region. Chromosome numbers and start and stop positions are provided relative to the known human reference genome hg19. The sequence of the human reference genome hg19 is available from the Genome Reference Consortium under reference number GRCh37/hg19, and also from the Genome Browser provided by the UC Santa Cruz Genomics Institute.

SEQ ID NOs: 452706-483478 provide further information on certain hypermethylated or hypomethylated target genomic regions. These SEQ ID NOs record target genomic regions identified as differentially methylated in samples from a particular pair of cancer types. The target genomic regions of SEQ ID NOs: 452706-483478 were taken from list 6. Many of the same target genomic regions are also found in lists 1 to 5 and 7 to 16. Each SEQ ID entry indicates the chromosomal location of the target genomic region relative to hg19, whether the cfDNA fragments to be enriched from that region are hypermethylated or hypomethylated, the sequence of one DNA strand of the target genomic region, and one or more pairs of cancer types that are differentially methylated in that genomic region. Since the methylation state of some target genomic regions distinguishes between more than one pair of cancer types, each entry identifies a first cancer type and one or more second cancer types, as shown in Table 3.

Table 3: SEQ ID NOs identifying target genomic regions that are differentially methylated between pairs of cancer types

Validation of several genomic regions selected:

some selected genomic regions have been validated by: (1) no reference (using cfDNA in CCGA 130X WGBS database, limited to cfDNA from samples with a log probability ratio greater than 0.9 indicating cancer); or (2) reference (using tissue and WBC samples). FIG. 14 provides verification results based on correctly classified parts (fractions). The results are from (1) results of validation with cfDNA on several genomic regions trained on cfDNA; (2) results of validation with cfDNA on genomic regions trained on all different types of samples used herein; (3) validation results were performed using tissue and WBC gDNA samples in selected genomic regions. The validation data is summarized in table 4, and additionally includes validation data for all samples. Validation results indicate that the genomic regions selected by the methods described herein can provide information for the detection of cancer and various cancer types.

Table 4: validation data

Example 4: generation of a mixture model classifier

To maximize performance, the predictive cancer model described in this example was trained using sequence data obtained from samples of known cancer types and non-cancers from the CCGA sub-studies (CCGA1 and CCGA2), tissue samples of known cancers obtained from CCGA1, and non-cancer samples from the STRIVE study (see ClinicalTrials.gov identifier NCT03085888 (//clinicaltrials.gov/ct2/show/NCT03085888)). The STRIVE study is a prospective, multicenter, observational cohort study to validate a test for early detection of breast cancer and other invasive cancers, from which additional non-cancer training samples were obtained to train the classifiers described herein. Known cancer types included from the CCGA sample set include the following: breast cancer, lung cancer, prostate cancer, colorectal cancer, kidney cancer, uterine cancer, pancreatic cancer, esophageal cancer, lymphatic cancer, head and neck cancer, ovarian cancer, liver and gall bladder cancer, melanoma, cervical cancer, multiple myeloma, leukemia, thyroid cancer, bladder cancer, stomach cancer, and anorectal cancer. Thus, a model may be a multi-cancer model (or a multi-cancer classifier) for detecting one or more, two or more, three or more, four or more, five or more, ten or more, or 20 or more different types of cancer.

The classifier performance data shown below were reported for a locked classifier trained on cancer and non-cancer samples obtained from CCGA2, a CCGA sub-study, and on non-cancer samples from STRIVE. The individuals in the CCGA2 sub-study were different from the individuals in the CCGA1 sub-study, whose samples were used to select the target genomic regions. In the CCGA2 study, blood samples were collected from individuals diagnosed with untreated cancer (including 20 tumor types and all cancer stages) and from healthy individuals without a cancer diagnosis (control group). For STRIVE, blood samples were collected from women within 28 days of their screening mammogram. Cell-free DNA (cfDNA) was extracted from each sample and treated with bisulfite to convert unmethylated cytosines to uracil. The bisulfite-treated cfDNA was enriched for informative cfDNA molecules using hybridization probes designed to enrich bisulfite-converted nucleic acids derived from each of the target genomic regions in an assay panel including all genomic regions of Lists 1 through 16. The enriched bisulfite-converted nucleic acid molecules were sequenced using paired-end sequencing on an Illumina platform (San Diego, California) to obtain a set of sequence reads for each of the training samples. The resulting read pairs were aligned to the reference genome and assembled into fragments, and methylated and unmethylated CpG sites were identified.

Mixture model-based featurization:

for each cancer type (including non-cancer), a probability mixture model is trained and applied to assign a probability to each fragment from each cancer and non-cancer sample based on how likely a fragment is to be observed in a given sample type.

Fragment level analysis:

Briefly, for each sample type (each cancer type and non-cancer) and for each region, a probabilistic model was fitted to the fragments derived from the training samples. Each region was used as is if shorter than 1 kb (kilobase) and was otherwise subdivided into regions 1 kb in length with 50% overlap (i.e., a 500-base overlap) between adjacent regions. The probabilistic model trained for each sample type was a mixture model in which each of the three mixture components is an independent-sites model, that is, methylation at each CpG is assumed to be independent of methylation at other CpGs. Fragments were excluded from the model if they had a p-value greater than 0.01 (from a non-cancer Markov model), were marked as duplicate fragments, had a bag size greater than 1 (for targeted methylation samples only), did not cover at least one CpG site, or were longer than 1000 bases. Retained training fragments were assigned to a region if they overlapped at least one CpG from that region. A fragment that overlapped CpGs in multiple regions was assigned to all of those regions.
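The region tiling and fragment assignment described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the half-open coordinate convention, and the handling of a shorter final window are assumptions.

```python
def tile_region(start, end, window=1000, overlap=0.5):
    """Split the half-open interval [start, end) into overlapping windows.

    Regions shorter than `window` are returned unchanged. The last window
    is truncated at `end` (an assumption; the text does not say how a
    remainder is handled).
    """
    if end - start < window:
        return [(start, end)]
    step = int(window * (1 - overlap))  # 500 bases for 50% overlap
    tiles = []
    pos = start
    while True:
        tiles.append((pos, min(pos + window, end)))
        if pos + window >= end:
            break
        pos += step
    return tiles


def assign_fragment(cpg_positions, tiles):
    """Return every tile containing at least one CpG covered by the fragment."""
    return [t for t in tiles if any(t[0] <= p < t[1] for p in cpg_positions)]
```

A fragment whose CpGs fall in the overlap between two windows is assigned to both, which matches the rule that a fragment overlapping CpGs in multiple regions belongs to all of them.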

Local source model:

Each probability model was fitted using maximum-likelihood estimation to identify the set of parameters that maximizes the log-likelihood of all fragments derived from each sample type, subject to a regularization penalty.

Specifically, in each classification region, a set of probability models was trained, one for each training label (i.e., one for each cancer type and one for non-cancer). Each model takes the form of a Bernoulli mixture model with three components. Mathematically:

$$\Pr(\text{fragment} \mid \{\beta_{ki}, f_k\}) = \sum_{k=1}^{n} f_k \prod_{i} \beta_{ki}^{m_i} (1 - \beta_{ki})^{1 - m_i} \qquad (1)$$

where $n$ is the number of mixture components, set to 3; $m_i \in \{0, 1\}$ is the observed methylation of the fragment at position $i$; $f_k$ is the fraction assigned to component $k$ ($f_k \geq 0$ and $\sum_k f_k = 1$); and $\beta_{ki}$ is the methylation fraction at CpG $i$ in component $k$. The product over $i$ includes only positions at which a methylation state could be identified from the sequencing. The maximum-likelihood values of the parameters $\{f_k, \beta_{ki}\}$ of each model were estimated by using an RPROP algorithm (e.g., as described in Riedmiller M, Braun H. RPROP: A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992) to maximize the total log-likelihood of the fragments from one training label, subject to a regularization penalty on $\beta_{ki}$ that takes the form of a beta-distributed prior. Mathematically, the maximized quantity is:

$$\sum_{j} \ln \Pr(\text{fragment}_j \mid \{\beta_{ki}, f_k\}) + \sum_{k,i} r \ln\bigl(\beta_{ki}(1 - \beta_{ki})\bigr) \qquad (2)$$

where $r$ is the regularization strength, which is set to 1.
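As an illustration of equations (1) and (2), the sketch below evaluates the Bernoulli mixture likelihood of a single fragment and the penalized objective for a set of fragments. It only evaluates the objective; the RPROP optimization itself is not shown. The function names and the dictionary representation of a fragment are assumptions made for this example.

```python
import math


def fragment_likelihood(m, f, beta):
    """Equation (1): mixture of independent-site Bernoulli components.

    m    -- dict {cpg_index: 0 or 1}; only CpGs with a called state appear
    f    -- list of mixture fractions (non-negative, summing to 1)
    beta -- beta[k][i] is the methylation fraction at CpG i in component k
    """
    total = 0.0
    for f_k, beta_k in zip(f, beta):
        prod = 1.0
        for i, m_i in m.items():
            prod *= beta_k[i] if m_i == 1 else 1.0 - beta_k[i]
        total += f_k * prod
    return total


def penalized_log_likelihood(fragments, f, beta, r=1.0):
    """Equation (2): total log-likelihood plus the penalty on every beta_ki."""
    log_lik = sum(math.log(fragment_likelihood(m, f, beta)) for m in fragments)
    penalty = sum(r * math.log(b * (1.0 - b)) for beta_k in beta for b in beta_k)
    return log_lik + penalty
```

Because the penalty term tends to negative infinity as any `beta[k][i]` approaches 0 or 1, it keeps the fitted methylation fractions away from the boundaries.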

Featurization:

Once the probability models were trained, a set of numerical features was computed for each sample. Specifically, in each region, features were extracted for each fragment from each training sample, for each cancer type and non-cancer. The extracted features were tallies of outlier fragments (i.e., abnormally methylated fragments), defined as fragments whose log-likelihood under a first cancer model exceeded their log-likelihood under a second cancer model or the non-cancer model by at least a threshold tier value. Outlier fragments were tallied separately for each genomic region, sample model (i.e., cancer type), and tier (tiers 1, 2, 3, 4, 5, 6, 7, 8, and 9), yielding 9 features per region for each sample type. In this way, each feature is defined by three properties: a genomic region, a "positive" cancer type label (excluding non-cancer), and a tier value selected from the set {1, 2, 3, 4, 5, 6, 7, 8, 9}. The numerical value of each feature is defined as the number of fragments in the region such that:

$$\ln \frac{\Pr(\text{fragment} \mid \text{positive cancer type})}{\Pr(\text{fragment} \mid \text{non-cancer})} \geq \text{tier} \qquad (3)$$

where the probabilities are defined by equation (1) using the parameter values corresponding to the "positive" cancer type (in the numerator of the logarithm) or to non-cancer (in the denominator).
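The tier features can be illustrated with a short counting routine. This is a sketch under stated assumptions: the function names are hypothetical, the fragment probabilities are taken as already computed from equation (1), and treating the tier comparison as "greater than or equal" is an assumption based on the phrase "by at least a threshold tier value".

```python
import math


def log_likelihood_ratio(p_positive, p_non_cancer):
    """Log ratio of a fragment's probability under the positive cancer type
    model to its probability under the non-cancer model."""
    return math.log(p_positive) - math.log(p_non_cancer)


def tier_features(log_ratios, tiers=range(1, 10)):
    """For one region and one positive cancer type, count the fragments
    whose log-likelihood ratio reaches each tier (assumed inclusive)."""
    return {tier: sum(1 for lr in log_ratios if lr >= tier) for tier in tiers}
```

For example, four fragments with log ratios 0.5, 1.0, 3.2 and 9.7 contribute three counts to tier 1, two to tiers 2 and 3, and one to each of tiers 4 through 9.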

Feature ranking:

For each set of pairwise features, the features were ranked using mutual information, based on their ability to distinguish the first cancer type (the cancer type whose log-likelihood model the feature was derived from) from the second cancer type or from non-cancer. Specifically, two ranked lists of features were compiled for each unique pair of class labels: one with the first label designated "positive" and the second "negative", and the other with the positive/negative designations swapped (except for the non-cancer label, which was permitted only as a negative label). For each of these ranked lists, only features whose positive cancer type label (as in equation (3)) matched the positive label under consideration were included in the ranking. For each such feature, the proportion of training samples with a non-zero feature value was calculated separately for the positive and negative labels. Features for which this proportion was greater in the positive label were ranked by their mutual information with respect to that pair of class labels.
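The ranking step can be sketched as below. The text does not state which variables the mutual information is computed over; this example assumes it is computed between the binary class label (positive versus negative) and the binary indicator "feature value is non-zero". The function names and input layout are hypothetical.

```python
import math


def mutual_information(pos_values, neg_values):
    """Mutual information (in nats) between the class label and the
    indicator 'feature value > 0', estimated from sample counts."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    n = n_pos + n_neg
    pos_on = sum(1 for v in pos_values if v > 0)
    neg_on = sum(1 for v in neg_values if v > 0)
    joint = {
        (1, 1): pos_on, (1, 0): n_pos - pos_on,
        (0, 1): neg_on, (0, 0): n_neg - neg_on,
    }
    label_total = {1: n_pos, 0: n_neg}
    feature_total = {1: pos_on + neg_on, 0: n - pos_on - neg_on}
    mi = 0.0
    for (label, feat), count in joint.items():
        if count > 0:
            mi += (count / n) * math.log(
                count * n / (label_total[label] * feature_total[feat])
            )
    return mi


def rank_features(features):
    """features: {name: (pos_values, neg_values)}.

    Keep only features that are non-zero in a larger proportion of positive
    samples than negative samples, then sort by mutual information, most
    informative first.
    """
    kept = []
    for name, (pos, neg) in features.items():
        frac_pos = sum(1 for v in pos if v > 0) / len(pos)
        frac_neg = sum(1 for v in neg if v > 0) / len(neg)
        if frac_pos > frac_neg:
            kept.append((mutual_information(pos, neg), name))
    return [name for _, name in sorted(kept, key=lambda item: -item[0])]
```

A feature that is non-zero in every positive sample and zero in every negative sample reaches the maximum of ln 2 for balanced classes, while a feature that is equally frequent in both labels scores zero and is dropped before ranking.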

From each pairwise comparison, the 256 top-ranked features were identified and added to the final feature set for each cancer type and non-cancer. To avoid redundancy, if more than one feature was selected from the same positive type and genomic region (i.e., selected for multiple negative types), only the feature assigned the lowest (most informative) rank for its cancer type pair was retained, breaking ties by selecting the higher tier value. The features in the final feature set for each sample (cancer type and non-cancer) were binarized (any feature value greater than 0 was set to 1, so that all features were either 0 or 1).

Training a classifier:

The training samples were then divided into distinct 5-fold cross-validation training sets, and a two-stage classifier was trained for each fold, in each case training on 4/5 of the training samples and using the remaining 1/5 for validation.

In the first stage of training, a binary (two-class) logistic regression model for detecting the presence of cancer is trained to distinguish the several cancer samples (regardless of TOO) from non-cancer samples. When this binary classifier is trained, a sample weight is assigned to the male non-cancer samples to offset the gender imbalance in the training set. For each sample, the binary classifier outputs a prediction score indicating a likelihood of the presence or absence of cancer.

In the second stage of training, a parallel multi-class logistic regression model for determining the tissue of origin of the cancer was trained with the TOO as the target label. Only cancer samples that received a first-stage classifier score above the 95th percentile of the non-cancer samples were included in training this multi-class classifier. For each cancer sample used in training, the multi-class classifier outputs prediction scores for the classified cancer types, where each prediction score is a likelihood that the sample has a particular cancer type. For example, the cancer classifier can return a cancer prediction for a test sample that includes a prediction score for breast cancer, a prediction score for lung cancer, and/or a prediction score for no cancer.

Both the binary and multi-class classifiers were trained by mini-batch stochastic gradient descent, and in each case training was stopped early when performance on the validation fold (as assessed by cross-entropy loss) began to deteriorate. For predictions on samples outside the training set, the scores assigned by the five cross-validated classifiers were averaged in each stage. Scores assigned to gender-inappropriate cancer types were set to zero and the remaining values were renormalized to sum to one.
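The score averaging and renormalization described above can be sketched as follows. The function names and the dictionary layout (cancer type mapped to score) are assumptions for illustration only.

```python
def average_fold_scores(fold_scores):
    """Average the per-class scores produced by the cross-validated classifiers.

    fold_scores -- list of dicts {cancer_type: score}, one dict per fold model
    """
    classes = fold_scores[0].keys()
    return {c: sum(s[c] for s in fold_scores) / len(fold_scores) for c in classes}


def mask_and_renormalize(scores, excluded):
    """Set the scores of excluded cancer types to zero and rescale the
    remaining scores so that they sum to one."""
    masked = {c: (0.0 if c in excluded else s) for c, s in scores.items()}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("all remaining scores are zero")
    return {c: s / total for c, s in masked.items()}
```

Here `excluded` would hold the cancer types that are not applicable to the individual, for example prostate cancer for a sample from a female participant.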

The scores assigned to the validation folds within the training set were retained for use in specifying cutoff values (thresholds) that target particular performance metrics. In particular, the probability scores assigned to the non-cancer samples in the training set were used to define thresholds corresponding to particular specificity levels. For example, for a desired specificity target of 99.4%, the threshold was set at the 99.4th percentile of the cross-validated cancer detection probability scores assigned to the non-cancer samples in the training set. Training samples with a probability score above the threshold were called positive for cancer.
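A specificity-based threshold of this kind can be sketched as below. The percentile convention used here (the smallest observed score at or below which the target fraction of non-cancer scores falls) is an assumption; the text does not specify how the percentile is interpolated. The function names are hypothetical.

```python
import math


def specificity_threshold(non_cancer_scores, specificity=0.994):
    """Return the smallest observed non-cancer score such that at least
    `specificity` of the non-cancer scores are less than or equal to it."""
    ordered = sorted(non_cancer_scores)
    k = math.ceil(specificity * len(ordered))  # scores that must stay negative
    return ordered[max(k, 1) - 1]


def is_cancer_call(score, threshold):
    """A sample is called positive only if its score is above the threshold."""
    return score > threshold
```

With this convention, at most a fraction `1 - specificity` of the non-cancer training samples score above the returned threshold.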

Subsequently, for each training sample determined to be positive for cancer, a TOO or cancer type assessment was made using the multi-class classifier. First, the multi-class logistic regression classifier assigned a set of probability scores to each sample, one for each candidate cancer type. Next, the confidence of these scores was evaluated as the difference between the highest and second-highest scores assigned to each sample by the multi-class classifier. Then, the cross-validated training set scores were used to identify the lowest threshold such that, of the cancer samples in the training set whose top two scores differed by more than the threshold, 90% had been assigned the correct TOO label as their highest score. In this manner, the scores assigned to the validation folds during training were further used to determine a second threshold for distinguishing between confident and indeterminate TOO calls.
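The search for the second threshold can be sketched as a scan over candidate margins. This is one possible reading of the procedure: the candidate set (zero plus every observed top-two margin) and the function names are assumptions for this example.

```python
def top_two_margin(scores):
    """Difference between the highest and second-highest class scores."""
    ordered = sorted(scores.values(), reverse=True)
    return ordered[0] - ordered[1]


def too_confidence_threshold(samples, target_accuracy=0.90):
    """Find the lowest candidate threshold such that, among samples whose
    top-two margin exceeds it, at least `target_accuracy` have the true
    label as their highest-scoring class.

    samples -- list of (scores_dict, true_label) for cancer samples
    Returns None if no candidate threshold reaches the target.
    """
    records = [
        (top_two_margin(scores), max(scores, key=scores.get) == label)
        for scores, label in samples
    ]
    candidates = sorted({0.0} | {margin for margin, _ in records})
    for threshold in candidates:
        kept = [correct for margin, correct in records if margin > threshold]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return threshold
    return None
```

Raising the threshold discards low-margin samples, which are the ones most likely to be misassigned, so accuracy among the retained samples tends to rise until the target is met.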

At prediction time, samples that received a score below the predefined threshold from the binary (first-stage) classifier were assigned a "non-cancer" label. Of the remaining samples, those for which the difference between the top two TOO scores from the second-stage classifier was below the second predefined threshold were assigned an "uncertain cancer" label. All other samples were assigned the cancer label to which the TOO classifier assigned the highest score.
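The two-threshold decision rule can be written as a short function. This is a sketch: the function and parameter names are hypothetical, and the treatment of scores exactly equal to a threshold is an assumption, since the text does not address ties.

```python
def predict_label(cancer_score, too_scores, detection_threshold, too_threshold):
    """Combine the first-stage and second-stage outputs into a final label.

    cancer_score        -- score from the binary (first-stage) classifier
    too_scores          -- dict {cancer_type: score} from the second stage
    detection_threshold -- cutoff for calling cancer (ties count as non-cancer)
    too_threshold       -- minimum top-two margin for a confident TOO call
    """
    if cancer_score <= detection_threshold:
        return "non-cancer"
    ordered = sorted(too_scores.values(), reverse=True)
    if ordered[0] - ordered[1] < too_threshold:
        return "uncertain cancer"
    return max(too_scores, key=too_scores.get)
```

The first check uses only the binary classifier score; the TOO scores are consulted only for samples that pass the cancer detection cutoff.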

Example 5: classifier using target genomic regions of lists 4 to 16

The discriminatory value of the target genomic regions of Lists 4 to 16 was evaluated by testing the ability of a cancer classifier to detect cancer, and any of 20 different cancer types, based on the methylation status of the target genomic regions. As shown in Table 5, performance was assessed across 1532 cancer samples and 1521 non-cancer samples that were not used to train the classifier. For each sample, differentially methylated cfDNA was enriched using a single bait set that included all the genomic regions of interest of Lists 1 to 16. The classifier was then restricted to make cancer determinations based only on the methylation status of the target genomic regions of the list being evaluated.

Table 5: cancer diagnosis of individuals whose cfDNA is used to train classifiers

The classifier performance results for Lists 4 to 16 are shown in FIGs. 15 to 27. In each figure, part A is a receiver operating characteristic (ROC) curve showing the true positive and false positive rates for the cancer versus non-cancer determination. The asymmetric shape of these ROC curves indicates that the classifier was designed to minimize false positives. The areas under the curve (AUC) are tightly clustered between 0.78 and 0.83, as shown in Table 6. These results indicate that the use of smaller panels of less than 1 Mb (e.g., Lists 8, 9, and 13) does not significantly affect the determination of cancer compared with larger panels of greater than 10 Mb (e.g., List 6).

Table 6: AUC of the cancer versus non-cancer determination for each list of target genomic regions

Target genomic regions    AUC
List 4                    0.81
List 5                    0.83
List 6                    0.81
List 7                    0.83
List 8                    0.80
List 9                    0.81
List 10                   0.81
List 11                   0.81
List 12                   0.81
List 13                   0.78
List 14                   0.79
List 15                   0.80
List 16                   0.80

Classifier performance was also evaluated for randomly selected subsets of the target genomic regions in List 4 and List 12, as shown in FIGs. 28 to 30 and Table 7. Again, the results for the smallest panel (a random 10% of List 12, 0.36 Mb) were similar to the results for the largest panel (List 4, 4.63 Mb), indicating that the methylation status of at least the vast majority of the target regions in all lists is informative of the presence or absence of cancer.

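Table 7: AUC of the cancer versus non-cancer determination for Lists 4 and 12 and their random subsets

Target genomic regions    AUC
List 4                    0.81
Random 50% of List 4      0.81
List 12                   0.81
Random 10% of List 12     0.78
Random 25% of List 12     0.79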

A cancer type (i.e., TOO) determination was attempted for all samples that received a cancer determination. Part B of each of FIGs. 15 to 30 shows the accuracy of these determinations as a confusion matrix. For example, the value in the upper right corner of FIG. 15B indicates that 151 samples classified as lung cancer according to the methylation status of the genomic regions of interest in List 4 were from subjects known to have lung cancer. The value "3" three positions to the left in the same confusion matrix indicates that three samples predicted to have lung cancer were from subjects who actually had an upper gastrointestinal cancer. Overall, the vast majority of cancer type determinations made using the target genomic regions of any of Lists 4 through 16 fall on the diagonal of the confusion matrix, indicating that the classifier determined the correct cancer type. Similar results were obtained using randomly selected target genomic regions from List 4 and List 12.

Tables 8 to 23 further summarize the results of these classifiers, showing the accuracy of cancer detection and cancer type determination at a specificity of 0.990, which corresponds to a false positive rate of 1%. These results are broken down by cancer stage. Cancer detection and cancer type determination are better in samples from individuals with late-stage cancer (e.g., stage IV) than in samples from individuals with early-stage cancer (e.g., stage I). Across all cancer stages (without separation by stage), the cancer type determination is accurate about 90% of the time for all of the lists of target genomic regions and for the random subsets of Lists 4 and 12. For stage I cancers, an accurate cancer type determination is made about 75% of the time. Notably, 75.6% of the cancer type determinations were accurate for the smallest assay panel (List 8), which has only 1370 target genomic regions with a total size of 395 kb.

Table 24 breaks down the same accuracy results by cancer type, showing highly accurate cancer type determination using the target genomic regions of all lists for common cancers (e.g., liver cancer and cholangiocarcinoma), rare cancers (e.g., sarcoma), and difficult-to-detect cancers (e.g., breast cancer).

The sensitivities for detecting 20 different cancer types using the target genomic regions of Lists 4 to 16, or randomly selected portions of Lists 4 and 12, are shown in Tables 25 to 40. The sensitivity results were obtained at a specificity of 0.990 (false positive rate of 1%). Sensitivity is shown for all cancers of a given cancer type and separately for stage I through stage IV cancers. Sensitivity is generally higher for late-stage cancers. For stage IV cancers, the sensitivity is greater than 60% for all cancers with more than one sample, and greater than 90% for breast, ovarian, bladder and urothelial, head and neck, colorectal, liver, pancreatic and gall bladder, upper digestive tract, lymphoma, and lung cancers. For stage II cancers, sensitivity is highest for head and neck cancer, liver cancer, pancreatic and gall bladder cancer, upper digestive tract cancer, lymphoma, and lung cancer. List 8 is the smallest set of target genomic regions that provides a sensitivity of at least 50% for stage II cancers.

Table 8: classification accuracy using the genomic regions of List 4. At a specificity of 0.990, the cancer presence and cancer type data show percent accuracy, the 95% confidence interval (in square brackets), and the number correctly assigned out of the total (in parentheses).

Table 9: classification accuracy using the genomic regions of List 5

Table 10: classification accuracy using the genomic regions of List 6

Table 11: classification accuracy using the genomic regions of List 7

Table 12: classification accuracy using the genomic regions of List 8

Table 13: classification accuracy using the genomic regions of List 9

Table 14: classification accuracy using the genomic regions of List 10

Table 15: classification accuracy using the genomic regions of List 11

Table 16: classification accuracy using the genomic regions of List 12

Table 17: classification accuracy using the genomic regions of List 13

Table 18: classification accuracy using the genomic regions of List 14

Table 19: classification accuracy using the genomic regions of List 15

Table 20: classification accuracy using the genomic regions of List 16

Table 21: classification accuracy using a randomly selected 10% subset of the genomic regions of List 12

Table 22: classification accuracy using a randomly selected 25% subset of the genomic regions of List 12

Table 23: classification accuracy using a randomly selected 50% subset of the genomic regions of List 4

Example 6: cancer detection using cancer assay detection combinations

Blood samples are collected from a group of individuals previously diagnosed with a cancer of a given TOO ("test group") and from other groups of individuals who either do not have cancer or have been diagnosed with a different type of cancer ("other groups"). cfDNA fragments are extracted from the blood samples and treated with bisulfite to convert unmethylated cytosines to uracil. The cancer assay panel described herein is applied to the bisulfite-treated samples. Unbound cfDNA fragments are washed away, and cfDNA fragments bound to the probes are collected. The collected cfDNA fragments are amplified and sequenced. The sequence reads confirm that the probes specifically enrich for cfDNA fragments having a methylation pattern indicative of a cancer of that TOO, with significantly more differentially methylated cfDNA fragments in samples from the test group than in samples from the other groups.

While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.
