Virus-associated cancer risk stratification

Document No.: 1894721 Publication date: 2021-11-26

Abstract: This technology, Virus-associated cancer risk stratification, was created by 卢煜明, 赵慧君, 陈君赐, 江培勇, 林伟棋 and 吉璐 on 2020-04-01. Provided herein are methods and systems for stratifying a subject's risk of developing a pathogen-associated disease based on analysis of cell-free nucleic acid molecules from a biological sample of the subject. In various examples, the screening frequency is determined based on the risk analysis. Also provided herein are methods and systems for analyzing patterns of variation of a pathogen genome in cell-free nucleic acid molecules.

1. A method of screening for a pathogen-associated disease in a subject, the method comprising:

receiving data from a first analysis performed at a first point in time, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a quantity, a methylation status, a pattern of variation, a fragment size, or a relative abundance compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample, and wherein the characteristic is indicative of a risk of the subject developing a disease associated with the pathogen; and

determining a second analysis to be performed at a second point in time, based on the characteristic, to screen the subject for the pathogen-associated disease, wherein an interval between the first point in time and the second point in time is inversely correlated with the risk.

2. A method of predicting a pathogen-associated disease in a subject, the method comprising:

receiving data from a first analysis, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a number, a methylation state, a pattern of variation, a fragment size, or a relative abundance compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample; and

generating a report indicating a risk of the subject developing the pathogen-associated disease, based on the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen and one or more of the following: the age of the subject, the smoking habits of the subject, the subject's family history of the pathogen-associated disease, a plurality of genotypic factors of the subject, the race of the subject, or the dietary history of the subject.

3. The method of claim 1, wherein the results of the first analysis do not result in a medical treatment of the subject for the pathogen-associated disease.

4. The method of claim 3, wherein the medical treatment comprises treatment with a plurality of therapeutic agents, radiation treatment, or surgical treatment.

5. The method of claim 1, 3 or 4, wherein the subject is diagnosed as not having the pathogen-associated disease prior to a second time point as determined by a clinical diagnostic test with a false positive rate of less than 1%.

6. The method of claim 5, wherein the clinical diagnostic test comprises a physical examination, an invasive biopsy, an endoscopy, magnetic resonance imaging, positron emission tomography, computed tomography, or x-ray imaging.

7. The method of claim 5, wherein the clinical diagnostic test comprises an invasive biopsy comprising a histological analysis, a cytological analysis, or a cellular nucleic acid analysis.

8. The method of any one of claims 1 or 3 to 7, wherein the interval is at least about 2 months, 4 months, 6 months, 8 months, 10 months, or 12 months.

9. The method of claim 8, wherein the interval is at least about 12 months.

10. The method of any one of claims 1 to 9, further comprising performing the first analysis.

11. The method of claim 10, wherein the step of performing the first analysis comprises:

(i) obtaining a first biological sample from the subject; and

(ii) measuring a first quantity of cell-free nucleic acid molecules from the pathogen in the first biological sample.

12. The method of claim 11, wherein the step of measuring the first quantity comprises: measuring a copy number of the plurality of cell-free nucleic acid molecules from the pathogen in the first biological sample.

13. The method of claim 11 or 12, wherein said measuring comprises Polymerase Chain Reaction (PCR).

14. The method of claim 11 or 12, wherein the measuring comprises quantitative polymerase chain reaction (qPCR).

15. The method of claim 11, wherein the step of measuring the first quantity comprises: measuring a first percentage of the plurality of cell-free nucleic acid molecules from the pathogen in the first biological sample.

16. The method of any of claims 11 to 15, wherein the first analysis further comprises the steps of:

(iii) obtaining a second biological sample from the subject if the first quantity is above a threshold, and measuring a second quantity of cell-free nucleic acid molecules from the pathogen in the second biological sample.

17. The method of claim 16, wherein the second biological sample is obtained about 4 weeks after the first biological sample.

18. The method of claim 16 or 17, wherein the interval between the first point in time and the second point in time is shorter if both the first quantity and the second quantity are above the threshold than if the second quantity is below the threshold.

19. The method of any one of claims 16 to 18, wherein the interval between the first point in time and the second point in time is longer if the first quantity is below the threshold than if the first quantity is above the threshold.

20. The method of any one of claims 16 to 19, wherein the interval between the first point in time and the second point in time is about 1 year if both the first quantity and the second quantity are above the threshold.

21. The method of any one of claims 16 to 20, wherein the interval between the first point in time and the second point in time is about 2 years if the second quantity is below the threshold.

22. The method of any one of claims 16 to 21, wherein the interval between the first point in time and the second point in time is about 4 years if the first quantity is below the threshold.

23. The method of claim 10, wherein said first analysis comprises the steps of:

determining a methylation state of a plurality of cell-free nucleic acid molecules from the pathogen in the biological sample.

24. The method of claim 23, wherein the step of determining the methylation state comprises: treating the plurality of cell-free nucleic acid molecules in the biological sample with a methylation sensitive restriction enzyme or bisulfite.

25. The method of claim 23, wherein the step of determining the methylation state comprises: performing methylation-aware sequencing of the plurality of cell-free nucleic acid molecules in the biological sample of the subject.

26. The method of claim 25, wherein the methylation-aware sequencing comprises bisulfite conversion of unmethylated cytosine to uracil.

27. The method of claim 25, wherein the methylation-aware sequencing comprises treatment with a methylation-sensitive restriction enzyme.

28. The method of claim 10, wherein said first analysis comprises the steps of:

determining a fragment size distribution of a plurality of cell-free nucleic acid molecules from the pathogen in the biological sample.

29. The method of claim 28, wherein the step of determining the fragment size distribution comprises: sequencing a plurality of cell-free nucleic acid molecules in the biological sample, and determining a fragment size of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample based on a plurality of sequence reads mapped to a reference genome of the pathogen.

30. The method of claim 10, wherein said first analysis comprises the steps of:

determining a pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample.

31. The method of claim 30, wherein the step of determining the pattern of variation comprises: sequencing a plurality of cell-free nucleic acid molecules in the biological sample, and determining the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample based on a plurality of sequence reads mapped to a reference genome of the pathogen.

32. The method of claim 30 or 31, wherein the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen comprises a plurality of single nucleotide variations.

33. The method of claim 32, wherein the step of determining the pattern of variation comprises:

determining a level of similarity between sequence reads mapped to the reference genome of the pathogen and a disease-associated reference genome of the pathogen.

34. The method of claim 33, wherein the disease-associated reference genome of the pathogen comprises a genome of the pathogen that is identified in a diseased tissue.

35. The method of claim 33 or 34, wherein the step of determining a level of similarity comprises:

dividing the reference genome of the pathogen into a plurality of bins; and

determining a similarity index for each bin of the plurality of bins relative to the disease-associated reference genome of the pathogen, wherein the similarity index is associated with the proportion of variation sites within the respective bin at which at least one of the plurality of sequence reads mapped to the reference genome of the pathogen has the same nucleotide variation as the disease-associated reference genome of the pathogen.

36. The method of claim 35, wherein said disease-associated reference genome of said pathogen comprises a plurality of disease-associated reference genomes of said pathogen, and wherein said step of determining a level of similarity comprises:

determining a respective similarity index for each of the plurality of bins relative to each of the plurality of disease-associated reference genomes of the pathogen; and

determining a bin score for each bin of the plurality of bins based on the proportion of the plurality of disease-associated reference genomes relative to which the respective similarity index of that bin is above a cutoff value.

37. The method of claim 35 or 36, wherein each of the plurality of bins has a length of about 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 base pairs.

38. The method of any one of claims 10 to 37, wherein the first analysis comprises the steps of: determining the methylation status, the fragment size distribution, or the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample.

39. The method of any one of the preceding claims, further comprising: calculating a risk score for the subject developing the pathogen-associated disease using a classifier applied to a data input, wherein the data input comprises the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample, the classifier is configured to apply a function to the data input to generate an output comprising the risk score, and the risk score assesses the risk of the subject developing the pathogen-associated disease.

40. The method of claim 39, wherein the classifier is trained using a labeled data set.

41. The method of claim 1, further comprising performing the second analysis at the second point in time.

42. The method of claim 41, wherein the second analysis is the same as the first analysis.

43. The method of claim 41, wherein the second analysis comprises an analysis of the plurality of cell-free nucleic acid molecules from the subject, an invasive biopsy of the subject, an endoscopy of the subject, or a magnetic resonance imaging examination of the subject.

44. A method of analyzing a plurality of nucleic acid molecules from a biological sample of a subject, the method comprising:

obtaining, in a computer system, sequence reads of cell-free nucleic acid molecules from the biological sample of the subject, wherein the biological sample comprises cell-free nucleic acid molecules from the subject and potentially from a pathogen;

aligning, in the computer system, the sequence reads of the cell-free nucleic acid molecules with a reference genome of the pathogen; and

identifying, in the computer system, a pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen, wherein the pattern of variation characterizes a nucleotide variation of the sequence reads mapped to the reference genome of the pathogen at each of a plurality of variation sites on the reference genome of the pathogen, wherein the plurality of variation sites comprises at least 30 sites across the reference genome of the pathogen, and wherein the pattern of variation indicates a status or a risk of the pathogen-associated disease in the subject.

45. The method of claim 44, wherein the plurality of variation sites comprises at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, or at least 1200 sites across the reference genome of the pathogen.

46. The method of claim 44, wherein the plurality of variation sites comprises at least 600 sites across the reference genome of the pathogen.

47. The method of claim 44, wherein the plurality of variation sites comprises about 660 sites across the reference genome of the pathogen.

48. The method of claim 44, wherein the plurality of variation sites comprises at least 1000 sites across the reference genome of the pathogen.

49. The method of claim 44, wherein the plurality of variation sites comprises about 1100 sites across the reference genome of the pathogen.

50. The method of claim 44, wherein the plurality of variation sites consists of all sites at which the plurality of sequence reads mapped to the reference genome of the pathogen have a nucleotide that differs from the reference genome of the pathogen.

51. The method of any one of claims 44 to 50, wherein the step of aligning the plurality of sequence reads is configured to allow a maximum mismatch of 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 bases between the plurality of sequence reads mapped to the reference genome of the pathogen and the reference genome of the pathogen.

52. The method of any one of claims 44 to 50, wherein the step of aligning the plurality of sequence reads is configured to allow a maximum mismatch of 2 bases between the plurality of sequence reads mapped to the reference genome of the pathogen and the reference genome of the pathogen.

53. The method of any one of claims 44 to 52, further comprising:

diagnosing, predicting, or monitoring the pathogen-associated disease of the subject based on the pattern of variation of the plurality of sequence reads mapped to the reference genome of the pathogen.

54. The method of any one of claims 44 to 53, wherein said pattern of variation of said plurality of cell-free nucleic acid molecules from said pathogen comprises a plurality of single nucleotide variations.

55. The method of any one of claims 44 to 54, wherein the step of identifying the pattern of variation comprises:

determining a level of similarity between sequence reads mapped to the reference genome of the pathogen and a disease-associated reference genome of the pathogen.

56. The method of claim 55, wherein the disease-associated reference genome of the pathogen comprises a genome of the pathogen that is identified in a diseased tissue.

57. The method of claim 55 or 56, wherein the step of determining a level of similarity comprises:

dividing the reference genome of the pathogen into a plurality of bins; and

determining a similarity index for each bin of the plurality of bins relative to the disease-associated reference genome of the pathogen, wherein the similarity index is associated with the proportion of variation sites within the respective bin at which at least one of the plurality of sequence reads mapped to the reference genome of the pathogen has the same nucleotide variation as the disease-associated reference genome of the pathogen.

58. The method of claim 57, wherein said disease-associated reference genome of said pathogen comprises a plurality of disease-associated reference genomes of said pathogen, and wherein said step of determining a level of similarity comprises:

determining a respective similarity index for each of the plurality of bins relative to each of the plurality of disease-associated reference genomes of the pathogen; and

determining a bin score for each bin of the plurality of bins based on the proportion of the plurality of disease-associated reference genomes relative to which the respective similarity index of that bin is above a cutoff value.

59. The method of claim 58, wherein said cutoff value is about 0.9.

60. The method of any one of claims 57 to 59, wherein each of the plurality of bins has a length of about 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 base pairs.

61. The method of any one of claims 44 to 60, further comprising: calculating a risk score for the subject developing the pathogen-associated disease using a classifier applied to a data input, wherein the data input comprises the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen, the classifier is configured to apply a function to the data input to generate an output comprising the risk score, and the risk score assesses the risk of the subject developing the pathogen-associated disease.

62. The method of claim 61, wherein the classifier is trained using a labeled data set.

63. The method of claim 61 or 62, wherein the classifier comprises a mathematical model using a naive Bayes model, logistic regression, random forest, decision tree, gradient boosting tree, neural network, deep learning, linear/kernel Support Vector Machine (SVM), linear/non-linear regression, or linear discriminant analysis.

64. The method of any one of claims 44 to 63, wherein the pathogen is a virus.

65. The method of claim 64, wherein the virus is Epstein-Barr virus (EBV).

66. The method of claim 65, wherein said pathogen-associated disease comprises nasopharyngeal carcinoma, NK cell lymphoma, Burkitt's lymphoma, post-transplant lymphoproliferative disorder, or Hodgkin's lymphoma.

67. The method of claim 65 or 66, wherein the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites comprising at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2).

68. The method of claim 67, wherein the plurality of variation sites comprises a genomic site as listed in Table 6 relative to an EBV reference genome (AJ507799.2).

69. The method of claim 65 or 66, wherein the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites being randomly selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2).

70. The method of claim 65 or 66, wherein the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites comprising at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites randomly selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2).

71. The method of claim 64, wherein said virus is a Human Papilloma Virus (HPV).

72. The method of claim 71, wherein said pathogen-associated disease comprises cervical cancer, oropharyngeal cancer, or head and neck cancer.

73. The method of claim 64, wherein said virus is a Hepatitis B Virus (HBV).

74. The method of claim 73, wherein the pathogen-associated disease comprises cirrhosis or hepatocellular carcinoma (HCC).

75. The method of any one of claims 44 to 74, wherein said pattern of variation indicates a status of said pathogen-associated disease in said subject, wherein said status of said pathogen-associated disease comprises a presence of said pathogen-associated disease in said subject, a quantity of a tumor tissue in said subject, a size of a tumor tissue in said subject, a stage of a tumor in said subject, a tumor burden in said subject, or a presence of tumor metastasis in said subject.

76. The method of any one of claims 44 to 74, wherein the biological sample is selected from the group consisting of: whole blood, plasma, serum, urine, cerebrospinal fluid, buffy coat, vaginal fluid, vaginal irrigation fluid, saliva, oral irrigation fluid, nasal irrigation fluid, a nasal brush sample, and combinations thereof.

77. A non-transitory computer-readable medium containing machine-executable code that, when executed by one or more computer processors, performs the method of any one of claims 1-76.

78. A computer product comprising a computer-readable medium storing instructions for controlling a computer system to perform operations of a method according to any one of claims 1 to 76.

79. A system comprising:

the computer product of claim 78; and

one or more processors configured to execute a plurality of instructions stored on the computer-readable medium.
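As an illustration of the variation-pattern analysis recited in claims 44, 50 and 52, the following minimal Python sketch keeps only reads with at most two mismatches against the pathogen reference and records every site at which a retained read differs from the reference. It is a sketch under stated assumptions, not the claimed implementation: reads are assumed to be already placed on the reference without gaps, positions are 0-based, and the function name `variation_pattern` and its parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def variation_pattern(
    reference: str,
    reads: List[Tuple[int, str]],
    max_mismatches: int = 2,
) -> Dict[int, Set[str]]:
    """Map each variation site to the non-reference bases seen in retained reads.

    reads: (start, sequence) pairs, assumed to be ungapped placements on
    `reference` with 0-based start positions (an illustrative simplification).
    """
    pattern: Dict[int, Set[str]] = {}
    for start, seq in reads:
        ref_segment = reference[start:start + len(seq)]
        if start < 0 or len(ref_segment) != len(seq):
            continue  # read does not lie fully on the reference
        mismatches = [
            (start + i, base)
            for i, (base, ref_base) in enumerate(zip(seq, ref_segment))
            if base != ref_base
        ]
        if len(mismatches) > max_mismatches:
            continue  # exceeds the allowed mismatch budget; drop the read
        for pos, base in mismatches:
            pattern.setdefault(pos, set()).add(base)
    return pattern


# Toy data for illustration only (not from the source).
toy_reference = "ACGTACGTAC"
toy_reads = [(0, "ACGA"), (2, "GTTC"), (4, "TTTT")]
toy_pattern = variation_pattern(toy_reference, toy_reads)
```

In practice the mismatch limit would normally be enforced by the aligner's own settings; the filter is written out here only to make the rule explicit.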

Background

Many diseases and conditions are associated with infection by pathogens (e.g., viruses). Nasopharyngeal carcinoma (NPC) is one of the most common cancers in southern China and Southeast Asia, and its pathogenesis is closely related to Epstein-Barr virus (EBV) infection. In areas with a high incidence of NPC, nearly all NPC tumors contain the EBV genome. Based on this close relationship between EBV and NPC, plasma EBV DNA has been developed as a biomarker for NPC. Using real-time polymerase chain reaction (PCR) analysis, the plasma EBV DNA assay was 95% sensitive and 93% specific for the detection of NPC (Lo et al., Cancer Res. 1999; 59: 1188-91). Non-invasive or minimally invasive diagnostic assays that stratify the risk of pathogen-associated diseases based on analysis of cell-free nucleic acid molecules of those pathogens in biological samples can be of significant clinical benefit.

Disclosure of Invention

In some aspects, provided herein is a method of screening for a pathogen-associated disease in a subject, the method comprising: receiving data from a first analysis performed at a first point in time, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a quantity, a methylation state, a pattern of variation, a fragment size, or a relative abundance compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample, and wherein the characteristic is indicative of a risk of the subject developing a disease associated with the pathogen; and determining a second analysis to be performed at a second point in time, based on the characteristic, to screen the subject for the pathogen-associated disease, wherein an interval between the first point in time and the second point in time is inversely correlated with the risk.

In some aspects, provided herein is a method of predicting a pathogen-associated disease in a subject, the method comprising: receiving data from a first analysis, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a quantity, a methylation state, a pattern of variation, a fragment size, or a relative abundance compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample; and generating a report indicating a risk of the subject developing the pathogen-associated disease, based on the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen and one or more of the following: the age of the subject, the smoking habits of the subject, the subject's family history of the pathogen-associated disease, a plurality of genotypic factors of the subject, the race of the subject, or the dietary history of the subject.

In some cases, the results of the first analysis do not result in a medical treatment of the subject for the pathogen-associated disease. In certain instances, the medical treatment comprises treatment with a plurality of therapeutic agents, radiation therapy, or surgical treatment. In some cases, the subject is diagnosed as not having the pathogen-associated disease prior to the second point in time, as determined by a clinical diagnostic test with a false positive rate of less than 1%. In some cases, the clinical diagnostic test comprises a physical examination, an invasive biopsy, an endoscopy, magnetic resonance imaging, positron emission tomography, computed tomography, or x-ray imaging. In some cases, the clinical diagnostic test comprises an invasive biopsy comprising a histological analysis, a cytological analysis, or a cellular nucleic acid analysis. In some cases, the interval is at least about 2 months, 4 months, 6 months, 8 months, 10 months, or 12 months. In some cases, the interval is at least about 12 months.

In some cases, the method also includes performing the first analysis. In some cases, the step of performing the first analysis includes: (i) obtaining a first biological sample from the subject; and (ii) measuring a first quantity of cell-free nucleic acid molecules from the pathogen in the first biological sample. In some cases, the step of measuring the first quantity comprises: measuring a copy number of the plurality of cell-free nucleic acid molecules from the pathogen in the first biological sample. In some cases, the measuring comprises polymerase chain reaction (PCR). In some cases, the measuring comprises quantitative polymerase chain reaction (qPCR). In some cases, the step of measuring the first quantity comprises: measuring a first percentage of the plurality of cell-free nucleic acid molecules from the pathogen in the first biological sample. In some cases, the first analysis further comprises: (iii) obtaining a second biological sample from the subject if the first quantity is above a threshold, and measuring a second quantity of cell-free nucleic acid molecules from the pathogen in the second biological sample. In some cases, the second biological sample is obtained about 4 weeks after the first biological sample. In some cases, the interval between the first point in time and the second point in time is shorter if both the first quantity and the second quantity are above the threshold than if the second quantity is below the threshold. In some cases, the interval between the first point in time and the second point in time is longer if the first quantity is below the threshold than if the first quantity is above the threshold. In some cases, the interval between the first point in time and the second point in time is about 1 year if both the first quantity and the second quantity are above the threshold. In some cases, the interval between the first point in time and the second point in time is about 2 years if the second quantity is below the threshold. In some cases, the interval between the first point in time and the second point in time is about 4 years if the first quantity is below the threshold.

In some cases, the first analysis comprises: determining a methylation state of a plurality of cell-free nucleic acid molecules from the pathogen in the biological sample. In some cases, the step of determining the methylation state comprises: treating the plurality of cell-free nucleic acid molecules in the biological sample with a methylation-sensitive restriction enzyme or bisulfite. In some cases, the step of determining the methylation state comprises: performing methylation-aware sequencing of the plurality of cell-free nucleic acid molecules in the biological sample of the subject. In some cases, the methylation-aware sequencing comprises bisulfite conversion of unmethylated cytosine to uracil. In some cases, the methylation-aware sequencing comprises treatment with a methylation-sensitive restriction enzyme.

In some cases, the first analysis comprises: determining a fragment size distribution of a plurality of cell-free nucleic acid molecules from the pathogen in the biological sample. In some cases, the step of determining the fragment size distribution comprises: sequencing a plurality of cell-free nucleic acid molecules in the biological sample, and determining a fragment size of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample based on a plurality of sequence reads mapped to a reference genome of the pathogen.
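The two-sample threshold rule described above for choosing the screening interval (about 1, 2, or 4 years) can be sketched as a small decision function. This is only an illustrative reading of the text: the function name, the parameter names, and the handling of a quantity exactly equal to the threshold (treated here as not above it) are assumptions, and no particular threshold value is implied.

```python
from typing import Optional


def screening_interval_years(
    first_quantity: float,
    second_quantity: Optional[float],
    threshold: float,
) -> int:
    """Return an approximate interval, in years, until the second analysis.

    Assumption: a quantity equal to the threshold counts as "not above" it;
    the source text does not specify this boundary case.
    """
    if first_quantity <= threshold:
        # First sample not above the threshold: no follow-up sample is taken,
        # and the longest interval applies.
        return 4
    if second_quantity is None:
        raise ValueError(
            "a second sample is needed when the first quantity is above the threshold"
        )
    if second_quantity > threshold:
        # Both samples above the threshold: highest risk, shortest interval.
        return 1
    # First sample above, follow-up sample not above the threshold.
    return 2
```

The returned intervals shrink as the measured risk grows, which is the inverse relationship between interval and risk stated for the screening method.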

In some cases, the first analysis comprises: determining a pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample. In some cases, the step of determining the pattern of variation comprises: sequencing a plurality of cell-free nucleic acid molecules in the biological sample, and determining the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample based on a plurality of sequence reads mapped to a reference genome of the pathogen. In certain instances, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen comprises a plurality of single nucleotide variations. In some cases, the step of identifying the variant pattern comprises: determining a level of similarity between sequence reads mapped to the reference genome of the pathogen and a disease-associated reference genome of the pathogen. In some cases, the disease-associated reference genome of the pathogen comprises a genome of the pathogen that is recognized in a diseased tissue. In some cases, the step of determining a level of similarity comprises: isolating the reference genome of the pathogen into a plurality of bins; and determining a similarity index for each bin of the plurality of bins relative to the disease-related reference genome of the pathogen, wherein the similarity index is associated with a proportion of a plurality of variation sites within a respective bin, at least one of the plurality of sequence reads in the respective bin that maps to the reference genome of the pathogen having a same nucleotide variation as the disease-related reference genome of the pathogen. 
In some cases, the disease-associated reference genome of the pathogen comprises a plurality of disease-associated reference genomes of the pathogen, and the step of determining a level of similarity comprises: determining a respective similarity index for each of the plurality of bins relative to each of the plurality of disease-associated reference genomes of the pathogen; and determining a bin score for each bin of the plurality of bins based on the proportion of the plurality of disease-associated reference genomes for which the respective similarity index of the bin is above a cutoff value. In some cases, each of the plurality of bins has a length of about 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 base pairs. In some cases, the first analysis comprises the steps of: determining the methylation status, the fragment size distribution, or the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample.
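
As a non-limiting illustration, the similarity index and bin score described above can be sketched in Python. The function names, the dictionary-based inputs, and the 500 bp default bin length are hypothetical choices made for this sketch. It also assumes that only variation sites covered by at least one read enter the denominator of the similarity index, and that only reference genomes with a defined index for a bin enter the denominator of the bin score; the disclosure does not fix either choice.

```python
from collections import defaultdict


def similarity_index_per_bin(observed, reference_variants, bin_size=500):
    """Similarity index of each bin relative to one disease-associated genome.

    observed: dict mapping genome position -> set of bases seen in mapped reads.
    reference_variants: dict mapping position -> variant base carried by the
        disease-associated reference genome at that variation site.
    Returns a dict mapping bin index -> proportion of covered variation sites
    in that bin at which at least one read carries the same variant base.
    """
    covered = defaultdict(int)
    matched = defaultdict(int)
    for pos, variant_base in reference_variants.items():
        bases_seen = observed.get(pos)
        if not bases_seen:
            continue  # assumption: sites without read coverage are skipped
        bin_index = pos // bin_size
        covered[bin_index] += 1
        if variant_base in bases_seen:
            matched[bin_index] += 1
    return {b: matched[b] / covered[b] for b in covered}


def bin_scores(observed, disease_references, bin_size=500, cutoff=0.9):
    """Proportion of disease-associated genomes whose similarity index
    for a bin is above the cutoff value."""
    per_reference = [
        similarity_index_per_bin(observed, ref, bin_size)
        for ref in disease_references
    ]
    all_bins = set().union(*(index.keys() for index in per_reference))
    scores = {}
    for b in all_bins:
        # assumption: references with no covered site in this bin are ignored
        values = [index[b] for index in per_reference if b in index]
        scores[b] = sum(v > cutoff for v in values) / len(values)
    return scores


# Toy data: three covered sites, two hypothetical disease-associated genomes.
observed = {100: {"A"}, 250: {"G", "T"}, 700: {"C"}}
ref1 = {100: "A", 250: "T", 700: "G", 900: "A"}
ref2 = {100: "C", 250: "G", 700: "C"}
print(similarity_index_per_bin(observed, ref1))
print(bin_scores(observed, [ref1, ref2]))
```

The 0.9 default mirrors the cutoff value of about 0.9 mentioned elsewhere in this disclosure; any other cutoff can be passed explicitly.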

In some cases, the method further comprises: calculating a risk score for the subject developing the pathogen-associated disease using a classifier applied to a data input, the data input comprising the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen in the biological sample, wherein the classifier is configured to apply a function to the data input to generate an output, the output comprising the risk score, the risk score assessing the risk of the subject developing the disease. In some cases, the classifier is trained using a labeled data set.

In some cases, the method further includes performing the second analysis at the second point in time. In some cases, the second analysis is the same as the first analysis. In some cases, the second analysis comprises an analysis of the plurality of cell-free nucleic acid molecules from the subject, an invasive biopsy of the subject, an endoscopic examination of the subject, or a magnetic resonance imaging examination of the subject.

In some aspects, provided herein is a method of analyzing a plurality of nucleic acid molecules from a biological sample of a subject, the method comprising: obtaining, in a computer system, sequence reads of cell-free nucleic acid molecules from the biological sample of the subject, wherein the biological sample comprises cell-free nucleic acid molecules from the subject and potentially from a pathogen; aligning, in the computer system, the sequence reads of the cell-free nucleic acid molecules with a reference genome of the pathogen; and identifying, in the computer system, a pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen, the pattern of variation characterizing a nucleotide variation mapped to the plurality of sequence reads of the reference genome of the pathogen at each of a plurality of variation sites on the reference genome of the pathogen, wherein the plurality of variation sites comprises at least 30 sites across the reference genome of the pathogen, and the pattern of variation is indicative of a status or a risk of the pathogen-associated disease in the subject.

In certain instances, the plurality of variant sites comprises at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, or at least 1200 sites across the reference genome of the pathogen. In certain instances, the plurality of variant sites comprises at least 600 sites across the reference genome of the pathogen. In certain instances, the plurality of variant sites comprises at least 660 sites across the reference genome of the pathogen. In certain instances, the plurality of variant sites comprises at least 1000 sites across the reference genome of the pathogen. In some cases, the plurality of variant sites comprises about 1100 sites across the reference genome of the pathogen. In some cases, the plurality of variation sites consists of all sites at which the plurality of sequence reads mapped to the reference genome of the pathogen have a different nucleotide variation from the reference genome of the pathogen. In some cases, the step of aligning the plurality of sequence reads is configured to allow a maximum mismatch of 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 bases between a plurality of sequence reads mapped to the reference genome of the pathogen and the reference genome of the pathogen. In some cases, the step of aligning the plurality of sequence reads is configured to allow a maximum mismatch of 2 bases between a plurality of sequence reads mapped to the reference genome of the pathogen and the reference genome of the pathogen. In some cases, the method further comprises: diagnosing, predicting, or monitoring the pathogen-associated disease of the subject based on the pattern of variation of the plurality of sequence reads mapped to the reference genome of the pathogen. 
In certain instances, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen comprises a plurality of single nucleotide variations. In some cases, the step of identifying the pattern of variation comprises: determining a level of similarity between the sequence reads mapped to the reference genome of the pathogen and a disease-associated reference genome of the pathogen. In some cases, the disease-associated reference genome of the pathogen comprises a genome of the pathogen that is identified in a diseased tissue. In some cases, the step of determining a level of similarity comprises: dividing the reference genome of the pathogen into a plurality of bins; and determining a similarity index for each bin of the plurality of bins relative to the disease-associated reference genome of the pathogen, wherein the similarity index is associated with a proportion of the variation sites within the respective bin at which at least one of the plurality of sequence reads mapped to the reference genome of the pathogen has the same nucleotide variation as the disease-associated reference genome of the pathogen. In some cases, the disease-associated reference genome of the pathogen comprises a plurality of disease-associated reference genomes of the pathogen, and the step of determining a level of similarity comprises: determining a respective similarity index for each of the plurality of bins relative to each of the plurality of disease-associated reference genomes of the pathogen; and determining a bin score for each bin of the plurality of bins based on the proportion of the plurality of disease-associated reference genomes for which the respective similarity index of the bin is above a cutoff value. In some cases, the cutoff value is about 0.9. In some cases, each of the plurality of bins has a length of about 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 base pairs.
In some cases, the method further comprises: calculating a risk score for the subject developing the pathogen-associated disease using a classifier applied to a data input, the data input comprising the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen, wherein the classifier is configured to apply a function to the data input to generate an output, the output comprising the risk score, the risk score assessing the risk of the subject developing the disease. In some cases, the classifier is trained using a labeled data set. In some cases, the classifier includes a mathematical model that uses a naive Bayes model, logistic regression, random forest, decision tree, gradient boosting tree, neural network, deep learning, linear/kernel support vector machine (SVM), linear/nonlinear regression, or linear discriminant analysis.
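
As a non-limiting illustration of a classifier that applies a function to a data input to generate a risk score, the following Python sketch evaluates a logistic regression model, one of the model types listed above. The function name `risk_score`, the site labels, the weights, and the intercept are hypothetical values chosen for this sketch; in practice the coefficients would be obtained by training on a labeled data set. The sketch assumes that sites without genotype information simply contribute nothing to the score.

```python
import math


def risk_score(genotypes, weights, intercept=0.0):
    """Logistic-regression risk score in the range (0, 1).

    genotypes: dict mapping site label -> 1 if the high-risk genotype is
        observed at that site, 0 otherwise; sites not covered by any read
        are simply absent from the dict.
    weights: dict mapping site label -> trained coefficient (hypothetical here).
    """
    linear_term = intercept + sum(
        weights[site] * value
        for site, value in genotypes.items()
        if site in weights  # assumption: sites without a weight are ignored
    )
    return 1.0 / (1.0 + math.exp(-linear_term))


# Hypothetical coefficients; site_B has no genotype because no read covers it.
weights = {"site_A": 2.0, "site_B": 1.5, "site_C": -1.0}
sample = {"site_A": 1, "site_C": 0}
print(round(risk_score(sample, weights, intercept=-1.0), 4))
```

A higher score corresponds to a higher assessed risk, so the score can be compared against a chosen cutoff or used directly for stratification.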

In some cases, the pathogen is a virus. In some cases, the virus is Epstein-Barr virus (EBV). In some cases, the pathogen-associated disease comprises nasopharyngeal carcinoma, NK cell lymphoma, Burkitt's lymphoma, post-transplant lymphoproliferative disorder, or Hodgkin's lymphoma. In certain instances, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of a plurality of variation sites comprising at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2). In some cases, the plurality of variation sites comprises the genomic sites relative to an EBV reference genome (AJ507799.2) as set forth in Table 6. In certain instances, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites being randomly selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2). In certain instances, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites comprising at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites randomly selected from a plurality of genomic sites listed in Table 6 relative to an EBV reference genome (AJ507799.2).

In some cases, the virus is a Human Papilloma Virus (HPV). In some cases, the pathogen-associated disease comprises cervical cancer, oropharyngeal cancer, or head and neck cancer. In some cases, the virus is a Hepatitis B Virus (HBV). In some cases, the pathogen-associated disease comprises a cirrhosis or a hepatocellular carcinoma (HCC). In some cases, the pattern of variation indicates a status of the pathogen-associated disease in the subject, the status of the pathogen-associated disease comprising a presence of the pathogen-associated disease in the subject, a quantity of a tumor tissue in the subject, a size of a tumor tissue in the subject, a stage of a tumor in the subject, a tumor burden in the subject, or a presence of tumor metastasis in the subject. In some cases, the biological sample is selected from the group consisting of: whole blood, plasma, serum, urine, cerebrospinal fluid, buffy coat, vaginal fluid, vaginal irrigation fluid, saliva, oral irrigation fluid, nasal irrigation fluid, a nasal brush sample, and combinations thereof.

In some aspects, provided herein is a non-transitory computer-readable medium containing machine-executable code that, when executed by one or more computer processors, performs any of the above-described methods.

In some aspects, provided herein is a computer product comprising a computer-readable medium storing a plurality of instructions for controlling a computer system to perform the operations of any of the above-described methods.

In some aspects, provided herein is a system comprising: a computer product as described herein; and one or more processors configured to execute a plurality of instructions stored on the computer-readable medium.

In some aspects, provided herein is a system comprising: apparatus for performing any of the above-described methods.

In some aspects, provided herein is a system configured to perform any of the above-described methods.

In some aspects, provided herein is a system comprising a plurality of modules that respectively perform the steps of any of the methods described above.

Incorporation by reference

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference in their entirety, to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

Drawings

The novel features believed characteristic of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages described herein may be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles described herein are utilized, and the accompanying drawings of which:

FIG. 1 is a schematic design of a nasopharyngeal carcinoma (NPC) screening study on a population of more than 20000 subjects.

Fig. 2 shows an exemplary schematic diagram of a nasopharyngeal carcinoma screening protocol according to the present disclosure.

FIG. 3 summarizes phylogenetic tree analysis based on EBV variant profiles from samples of nasopharyngeal carcinoma patients and non-nasopharyngeal carcinoma subjects.

FIG. 4 summarizes phylogenetic tree analysis based on EBV variant profiles from samples of nasopharyngeal carcinoma patients and non-nasopharyngeal carcinoma subjects (excluding 29 reported variants).

FIG. 5 summarizes phylogenetic tree analysis based on EBV variant profiles from samples of nasopharyngeal carcinoma patients, non-nasopharyngeal carcinoma subjects, and pre-nasopharyngeal carcinoma subjects.

FIG. 6 summarizes phylogenetic tree analysis based on EBV variant profiles from samples of nasopharyngeal carcinoma patients, non-nasopharyngeal carcinoma subjects, and pre-nasopharyngeal carcinoma subjects (excluding 29 reported variants).

FIG. 7 illustrates the principle of block-based variation pattern analysis.

FIG. 8 summarizes block-based analysis of EBV DNA mutation patterns in 13 nasopharyngeal carcinoma, 16 non-nasopharyngeal carcinoma, and 4 pre-nasopharyngeal carcinoma samples.

FIG. 9 summarizes block-based analysis of EBV DNA variation patterns in 13 nasopharyngeal carcinoma, 16 non-nasopharyngeal carcinoma, and 4 pre-nasopharyngeal carcinoma samples (excluding 29 reported variations).

Fig. 10A shows nasopharyngeal cancer risk scores calculated using a trained classifier based on analysis of all EBV variants using block-based variant analysis. Figure 10B shows nasopharyngeal carcinoma risk scores calculated using a trained classifier based on analysis of 29 reported EBV variations. Figure 10C shows nasopharyngeal carcinoma risk scores calculated using a trained classifier based on analysis of all EBV variants using block-based variant analysis (but excluding 29 reported variants).

FIG. 11 summarizes the methylation levels in nasopharyngeal carcinoma patients and in non-nasopharyngeal carcinoma subjects with transiently positive or persistently positive EBV DNA.

FIG. 12 is a schematic diagram showing the size change, induced by methylation-sensitive enzyme digestion, of plasma DNA from a non-cancer subject who is positive for plasma EBV DNA. Filled and unfilled lollipops represent methylated and unmethylated CpG sites, respectively. Yellow horizontal bars represent plasma EBV DNA molecules. After enzyme digestion, the size distribution shifts to the left.

FIG. 13 is a schematic diagram showing the size change, induced by methylation-sensitive enzyme digestion, of plasma DNA from a nasopharyngeal carcinoma patient who is positive for plasma EBV DNA. Filled and unfilled lollipops represent methylated and unmethylated CpG sites, respectively. Yellow horizontal bars represent plasma EBV DNA molecules. After enzyme digestion, the size distribution shifts to the left.

FIG. 14 shows the size profiles of plasma EBV DNA with and without in silico digestion by the methylation-sensitive restriction enzyme HpaII.

FIG. 15 shows the cumulative size distribution of plasma EBV DNA digested with or without methylation sensitive restriction enzyme for nasopharyngeal carcinoma patients and non-nasopharyngeal carcinoma subjects.

FIG. 16A is a schematic showing three hypothetical sites A, B, and C associated with nasopharyngeal carcinoma in a training set of 661 SNV sites in the EBV genome. The nasopharyngeal carcinoma risk score of the test sample is determined from the genotype patterns at the subset of the 661 SNV sites that is covered by plasma EBV DNA reads (e.g., sites with available genotype information). In the plasma sequencing data of the test sample, genotype information is available only for sites A and C, but not for site B, because site B is not covered by any sequenced EBV DNA read. FIG. 16B is a schematic showing how genotype weights for sites A and C are obtained by analyzing the genotypes at these 2 sites in all 63 nasopharyngeal carcinoma samples and 88 non-nasopharyngeal carcinoma samples in the training set. A logistic regression model is built to provide the weights of the high-risk genotypes at sites A and C. FIG. 16C is a schematic diagram showing the process of deriving the nasopharyngeal carcinoma risk score for a test sample based on the genotypes at sites A and C, weighted by the corresponding coefficients derived from the training model. FIG. 16D shows the distribution of 5678 SNVs in the EBV genome in the training set of nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples (showing the total number of variations in a sliding window of 1000 nucleotides along the EBV genome).

FIGS. 17A and 17B are graphs summarizing nasopharyngeal carcinoma risk scores in a training set using the leave-one-out method. FIG. 17A shows nasopharyngeal carcinoma risk scores for nasopharyngeal carcinoma and non-nasopharyngeal carcinoma plasma samples in the training set. FIG. 17B shows ROC curve analysis for differentiating nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples using the nasopharyngeal carcinoma risk score.

FIGS. 18A and 18B are graphs summarizing nasopharyngeal carcinoma risk scores in a test set. FIG. 18A shows nasopharyngeal carcinoma risk scores for nasopharyngeal carcinoma and non-nasopharyngeal carcinoma plasma samples in the test set. FIG. 18B shows ROC curve analysis for differentiating nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples using the nasopharyngeal carcinoma risk score.

FIGS. 19A and 19B are graphs summarizing nasopharyngeal carcinoma risk analysis based on the genotype patterns of the EBER region. FIG. 19A shows nasopharyngeal carcinoma risk scores, derived from the genotype patterns of the EBER region, for nasopharyngeal carcinoma and non-nasopharyngeal carcinoma plasma samples in the test set. FIG. 19B shows ROC curve analysis for differentiating nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples using the EBER-region nasopharyngeal carcinoma risk score.

FIGS. 20A and 20B are graphs summarizing nasopharyngeal carcinoma risk analysis based on the genotype patterns of the BALF2 region. FIG. 20A shows nasopharyngeal carcinoma risk scores, derived from the genotype patterns of the BALF2 region, for nasopharyngeal carcinoma and non-nasopharyngeal carcinoma plasma samples in the test set. FIG. 20B shows ROC curve analysis for differentiating nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples using the BALF2-region nasopharyngeal carcinoma risk score.

FIG. 21 shows a computer control system that can be programmed or otherwise configured to carry out the various methods provided herein.

Fig. 22 shows a schematic diagram of various methods and various systems disclosed herein.

Detailed Description

Overview

In various aspects, methods and systems are provided for screening a subject for a pathogen-associated disease. The methods and systems can provide an assessment of the risk of the subject developing a disease associated with the pathogen based on characteristics of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject. In some methods and systems, the risk prediction is used to determine an appropriate screening frequency. Appropriate and timely follow-up screening can not only reduce costs for the subject, but also enable early disease detection. For example, shifting the detection of EBV-associated nasopharyngeal carcinoma (EBV-NPC) to an earlier stage can significantly improve the progression-free survival of patients with nasopharyngeal carcinoma.

A subject's risk of developing a pathogen-associated disease may refer to the likelihood that the subject is predisposed to developing a pathogen-associated disease. In certain instances, risk as described herein refers to the likelihood of a pathogen-associated disease in a subject developing a state that can be clinically detected at a future point in time ("clinically detectable disease"). In certain instances, a subject is screened at a first time point by a screening assay that tests a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, and when the subject is diagnosed as not having a clinically detectable pathogen-associated disease at the first time point, a characteristic of the plurality of cell-free nucleic acid molecules of the pathogen in the biological sample of the subject may indicate that the subject is at risk of a clinically detectable disease at a future time point.

Clinically detectable disease refers to a disease that exhibits symptoms and can be detected by one or more established clinical diagnostic tests. In certain instances, an established clinical diagnostic test comprises a medical test or analysis with a low false positive rate for detecting the pathogen-associated disease, e.g., less than 30%, 20%, 10%, 8%, 7%, 6%, 5%, 4%, 3%, 2.5%, 2%, 1%, 0.8%, 0.5%, 0.25%, 0.15%, 0.1%, 0.08%, 0.05%, 0.02%, 0.01%, 0.005%, 0.002%, 0.001%, or even lower. Established clinical diagnostic tests, which comprise a variety of medical tests and assays, may also have a high sensitivity for detecting pathogen-associated diseases, e.g., at least 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 92%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 100%. In some cases, the pathogen-associated disease is a pathogen-associated proliferative disease, such as cancer, that can be clinically diagnosed with high confidence and a low false positive rate by one or more of: an invasive biopsy followed by histological or other examination of the biopsy tissue (e.g., histology, cytology, or cellular DNA or protein analysis), an imaging examination (such as X-ray, magnetic resonance imaging (MRI), positron emission tomography (PET), computed tomography (CT), or positron emission tomography-computed tomography (PET-CT)), a laboratory examination (such as a blood or urine examination), or a physical examination. Diagnosis of pathogen-associated diseases can be made by certified physicians based on the above or other established clinical findings. In some cases, the results of the first screening assay do not lead to the subject receiving medical treatment for a pathogen-associated disease, because the subject is diagnosed as not having the disease by an established clinical diagnostic test.

Based on the assessed risk, in some cases, the methods comprise determining a frequency of screening assays for the pathogen-associated disease in the subject. The frequency of screening assays may be positively correlated with risk, and the interval between two screening assays (e.g., a screening assay described herein and a subsequent screening assay) may be inversely correlated with risk. In some cases, the method includes receiving data from a first screening assay performed at a first point in time. The first screening assay may comprise determining a characteristic of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject. For example, the first screening assay comprises obtaining a biological sample from a subject, the biological sample comprising a plurality of cell-free nucleic acid molecules (e.g., cell-free DNA) from the subject (and possibly from the pathogen). The first screening assay can further comprise determining a characteristic of a plurality of cell-free nucleic acid molecules of the pathogen in the biological sample. Non-limiting characteristics of the plurality of cell-free nucleic acid molecules from a pathogen used in the various methods and systems provided herein include the number (e.g., copy number or percentage), methylation state, fragment size, variation pattern, and relative abundance compared to the plurality of cell-free nucleic acid molecules of the subject in the biological sample. As described herein, a point in time associated with a biological sample from a subject, or a point in time at which a subject undergoes an examination or analysis, may refer to the point in time at which the subject receives the examination or at which the biological sample is obtained from the subject, rather than the point in time at which the analysis is actually performed on the biological sample.

In certain instances, methods provided herein comprise (a) receiving data from a first analysis performed at a first point in time, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample from the subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a number (e.g., copy number or percentage), a methylation state, a pattern of variation, a fragment size, or a relative abundance compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample, wherein the characteristic is indicative of a risk of the subject developing a disease associated with the pathogen; and (b) determining a second analysis to be performed at a second time point based on the features to screen the subject for the pathogen-associated disease, wherein an interval between the first time point and the second time point is inversely correlated with the risk.
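
As a non-limiting illustration of an interval that is inversely correlated with risk, the following Python sketch maps a risk score to a follow-up interval. The function name, the linear mapping, and the 1-year lower bound are hypothetical choices made for this sketch; the 4-year upper bound merely echoes the interval of about 4 years mentioned elsewhere in this disclosure for subjects below a threshold. Any monotonically decreasing mapping would satisfy the same relationship.

```python
def screening_interval_years(risk, min_years=1.0, max_years=4.0):
    """Return a follow-up interval that shrinks as the assessed risk grows.

    risk: assessed risk expressed as a value between 0 and 1.
    min_years, max_years: hypothetical bounds on the screening interval.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie between 0 and 1")
    # Linear interpolation: risk 0 -> max_years, risk 1 -> min_years.
    return max_years - risk * (max_years - min_years)


for risk in (0.0, 0.5, 1.0):
    print(risk, screening_interval_years(risk))
```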

In certain instances, one or more characteristics of a plurality of cell-free nucleic acid molecules in a biological sample of a subject as described herein enable non-invasive methods to be employed to assess the status of a pathogen-associated disease (e.g., cancer) in a subject or the risk of the subject developing a pathogen-associated disease in the future. Without wishing to be bound by any particular theory, there may be at least two possible scenarios underlying the association between the one or more characteristics of the cell-free nucleic acid molecules used in the methods and systems and a subject's risk of developing a pathogen-associated disease. In one possible scenario, diseased tissue affected by the pathogen-associated disease (e.g., a pathogen-associated tumor) may already be present at the time of the initial screening (e.g., the first screening assay). However, the diseased tissue (e.g., a tumor) may be too small to be detected by other traditional medical examination methods (e.g., methods that detect pathogen-associated diseases with false positive rates of less than 10%, 5%, 2%, 1%, 0.5%, 0.1%, or 0.05%, such as endoscopy and magnetic resonance imaging (MRI)). As the disease progresses, the growth of the diseased tissue (e.g., an increase in tumor size) may allow the more advanced diseased tissue (e.g., an enlarged tumor) to be detected in a subsequent screening (e.g., the second screening assay). In another possible scenario, nucleic acid molecules of a pathogen, such as EBV DNA, are released by cells in an initial diseased state, e.g., precancerous cells, which may subsequently develop into diseased cells, e.g., cancer cells. Regardless of the exact scenario underlying the association, the subject matter described herein may be used to stratify subjects according to their risk of subsequently developing a clinically detectable disease such as nasopharyngeal carcinoma.

In some cases, the actual time intervals used by the particular screening programs described herein are adjusted based on economic health considerations (e.g., screening costs), subject preferences (e.g., more frequent screening intervals may be more disruptive to the lifestyle of certain subjects), and other clinical parameters (e.g., multiple genotypes (e.g., HLA status) of the individual (Bei et al, Nat Genet journal, 2010; 42: 599-.

In some cases, the methods provided herein comprise: receiving data from a first analysis, the first analysis comprising determining a characteristic of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample of a subject, wherein the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen comprises a number (e.g., copy number or percentage), a methylation state, a pattern of variation, a fragment size, coordinates of fragment ends, a sequence motif of fragment ends, or a relative abundance as compared to the plurality of cell-free nucleic acid molecules from the subject in the biological sample; and generating a report indicating a risk of the subject developing the pathogen-associated disease, based on the characteristic of the plurality of cell-free nucleic acid molecules from the pathogen and one or more of the following: the age of the subject, the smoking habits of the subject, the subject's family history of the pathogen-associated disease, the subject's genotypic factors, or the subject's dietary history.

In various aspects, provided herein are various methods and various systems for analyzing a plurality of nucleic acid molecules in a biological sample from a subject. Examples of the various methods and various systems may involve analyzing a variation pattern of a plurality of nucleic acid molecules from a pathogen in a biological sample. In some cases, the plurality of nucleic acid molecules of the pathogen in the biological sample comprises a plurality of cell-free nucleic acid molecules. Variation pattern analysis may involve comparison of sequences of a plurality of nucleic acid molecules in a biological sample, the sequences of the plurality of nucleic acid molecules in the biological sample being identified as originating from a pathogen having one or more reference genomes, and subsequently determining a pattern of nucleotide variation in the plurality of nucleic acid molecules from the pathogen in the biological sample.

In some cases, the methods and systems provided herein include determining a status or risk of a pathogen-associated disease in a subject based on a pattern of variation in a plurality of nucleic acid molecules from the pathogen in a biological sample. For example, genetic variation of the EBV genome detected in plasma can be used to predict the risk of future nasopharyngeal carcinoma development. It has been previously reported that the EBV strains present in EBV-associated tumors and control samples from different geographic locations may vary (Palser et al., J Virol 2015; 89:5222-37). Given the geographic variation of EBV, it is difficult to tell whether the variations identified in a tumor sample are geography-related or disease-related.

In certain instances, the variation pattern analysis described herein involves a genome-wide comparison between a plurality of nucleic acid molecules from a pathogen in a biological sample and one or more reference genomes of the pathogen. Genome-wide comparisons may involve sequence alignment across the entire genome of the pathogen and subsequent clustering of nucleotide variation patterns. In some cases, genome-wide comparisons involve analysis of nucleotide variations at a large number of sites spanning a reference genome of the pathogen. These sites may comprise all sites spanning the entire genome of the pathogen. Alternatively, the sites or variation sites across the reference genome of the pathogen may comprise at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, at least 1200, at least 1300, at least 1400, at least 1500, at least 1600, at least 1700, at least 1800, at least 1900, at least 2000, at least 3000, at least 4000, or at least 5000 sites at which nucleotide variation is commonly found. The nucleotide variations described herein may comprise a plurality of single nucleotide variations (SNVs). The plurality of variation sites provided herein for variation pattern analysis can comprise common SNVs identified in the genome of the pathogen. In some cases, the variation sites may comprise insertions, deletions, and fusions.
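
As a non-limiting illustration, the following Python sketch derives a simple variation pattern by comparing aligned reads with a reference sequence at each position. The function name and the toy sequences are hypothetical, and the sketch assumes ungapped alignments given as a 0-based start coordinate plus the read sequence, so insertions, deletions, and base quality are deliberately ignored.

```python
def variation_pattern(reference, aligned_reads, sites=None):
    """Collect non-reference bases observed in reads at each genome position.

    reference: reference genome sequence as a string.
    aligned_reads: iterable of (start, sequence) pairs, 0-based, ungapped.
    sites: optional set of positions to restrict the analysis to, e.g. a
        predefined list of variation sites; None means genome-wide.
    Returns a dict mapping position -> set of variant bases seen in reads.
    """
    pattern = {}
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pos = start + offset
            if pos >= len(reference):
                break  # read extends past the end of the reference
            if sites is not None and pos not in sites:
                continue
            if base != reference[pos]:
                pattern.setdefault(pos, set()).add(base)
    return pattern


# Toy example with a 10-base reference and two overlapping reads.
reference = "ACGTACGTAC"
reads = [(0, "ACGAAC"), (4, "ACCTAC")]
print(variation_pattern(reference, reads))
print(variation_pattern(reference, reads, sites={6}))
```

Passing `sites=None` corresponds to considering all positions across the genome, while passing a set restricts the pattern to a chosen list of variation sites.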

The genome-wide variation pattern analysis provided herein may be superior to analysis of individual single nucleotide polymorphisms (SNPs). In exemplary cases, while SNPs at a fixed number of loci may be associated with particular lineage(s) or subtype(s) of a pathogen that may cause a pathology in a subject, risk assessment based on analysis of these individual SNPs may be limited to those particular lineage(s) or subtype(s), and may not provide an accurate risk assessment if other pathogenic lineage(s) or subtype(s) of the pathogen are present. In another exemplary case, the genome-wide variation pattern analysis provided herein is beneficial when the pathogen nucleic acid molecules in a biological sample are rare, for example, when analyzing a plurality of cell-free nucleic acid molecules in a biological sample such as plasma. In such samples, the available pathogen nucleic acid molecules may not cover a significant portion of the pathogen genome. Thus, a genome-wide variation pattern analysis involving a large number of variation sites across the entire genome of the pathogen can provide a relatively more comprehensive readout of the genotype characteristics of the cell-free nucleic acid molecules of the pathogen in a biological sample, whereas an analysis involving a fixed number of individual polymorphisms is limited to one or more relatively small regions of the genome, and thus can provide only a relatively limited readout of those genotype characteristics.

In some cases, the variation pattern analysis provided herein comprises a block-based pattern analysis that involves dividing a reference genome of a pathogen into a plurality of bins and analyzing a plurality of sequence reads relative to each bin of the plurality of bins. In some cases, the method includes determining a respective similarity index for each of the plurality of bins relative to a disease-associated reference genome of the pathogen. The similarity index is associated with the proportion of variation sites within the respective bin at which at least one of the plurality of sequence reads mapped to the reference genome of the pathogen has the same nucleotide variation as the disease-associated reference genome of the pathogen. In some cases, the disease-associated reference genome of the pathogen comprises a plurality of disease-associated reference genomes of the pathogen, and the method comprises determining, for each of the plurality of bins, a respective similarity index relative to each of the plurality of disease-associated reference genomes; and determining a bin score for each bin based on the proportion of the plurality of disease-associated reference genomes for which the respective similarity index of that bin is above a cutoff value.
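As an illustration only, the similarity index and bin score described above might be computed along the following lines. This is a minimal sketch, not the implementation of the disclosure: the function names, the data layout (dictionaries keyed by genome position), and the choice to count only variation sites covered by at least one read in the denominator are all assumptions made for this example.

```python
def similarity_index(bin_sites, read_alleles, ref_alleles):
    """Proportion of covered variation sites in one bin at which at least one
    sample read carries the same allele as a disease-associated reference.

    bin_sites    -- positions of variation sites in the bin (hypothetical layout)
    read_alleles -- dict: position -> set of alleles seen in the sample reads
    ref_alleles  -- dict: position -> allele in the disease-associated reference
    """
    # Assumption: sites with no read coverage are excluded from the denominator.
    covered = [pos for pos in bin_sites if pos in read_alleles]
    if not covered:
        return None  # nothing to compare in this bin
    matches = sum(1 for pos in covered if ref_alleles.get(pos) in read_alleles[pos])
    return matches / len(covered)


def bin_score(bin_sites, read_alleles, disease_refs, cutoff):
    """Proportion of disease-associated references whose similarity index
    for this bin is above the cutoff value."""
    indices = [similarity_index(bin_sites, read_alleles, ref) for ref in disease_refs]
    indices = [s for s in indices if s is not None]
    if not indices:
        return None
    return sum(1 for s in indices if s > cutoff) / len(indices)


# Toy data: four variation sites in one bin, three of them covered by reads.
sites = [10, 20, 30, 40]
reads = {10: {"A"}, 20: {"C", "T"}, 40: {"G"}}
ref_a = {10: "A", 20: "T", 30: "G", 40: "G"}  # matches at all covered sites
ref_b = {10: "C", 20: "T", 30: "G", 40: "A"}  # matches at one covered site
score = bin_score(sites, reads, [ref_a, ref_b], cutoff=0.9)
```

In this toy case one of the two references exceeds the cutoff, so the bin score is 0.5. The cutoff of 0.9 is arbitrary and is used only to make the example concrete.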

Analysis of cell-free nucleic acid molecules

The analysis for screening a plurality of cell-free nucleic acid molecules from a biological sample of a subject may be any suitable nucleic acid analysis. For example, a variety of sequencing methods can be used to analyze the quantity (e.g., copy number or percentage), methylation state, fragment size, or relative abundance of a plurality of cell-free nucleic acid molecules. Alternatively or additionally, amplification-based or hybridization-based methods, such as various polymerase chain reaction (PCR) methods or microarray-based methods, can also be used. In some cases, immunoprecipitation methods are used to analyze the methylation state of nucleic acid molecules.

In some examples of the disclosure, a screening assay for detecting a plurality of cell-free pathogen nucleic acid molecules (e.g., cell-free EBV DNA) comprises more than one test performed at different time points, and the detectability of the cell-free pathogen nucleic acid molecules across the tests may indicate a risk of the subject developing a pathogen-associated disease. For example, the assay may comprise a two-step assay, or an assay protocol comprising 3, 4, 5, 6, 7, 8, 9, 10, or even more tests. Some of the tests may be performed at the same time point while others are performed at different time point(s). Alternatively, all tests may be performed at different time points.

The timing or frequency of the different screening assays can be determined by the various methods and systems provided herein. The interval between the first screening assay and the second screening assay can be at least about 2 months, 4 months, 6 months, 8 months, 10 months, or 12 months. In some cases, the interval is at least about 12 months. The interval between the first screening assay and the second screening assay can be about 1 year, 1.5 years, 2 years, 2.5 years, 3 years, 3.5 years, 4 years, 4.5 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, or more. As long as the subject has been found by an established clinical diagnostic method not to have a pathogen-associated disease (e.g., not to have a clinically detectable pathogen-associated disease), the interval can be long, even if the first screening assay gives a positive result indicating the presence of a pathogen-associated disease. The methods and systems provided herein are capable of predicting a subject's risk of developing a pathogen-associated disease in the future (e.g., within 6 months, 12 months, 2 years, 3 years, 5 years, or 10 years). Based on the assessed risk, an appropriate follow-up time point may be determined.

The time between obtaining the sample and performing the analysis can be optimized to increase the sensitivity and/or specificity of the analysis or method. In some embodiments, the sample may be obtained immediately before the analysis is performed (e.g., a first sample is obtained before the first analysis is performed, and a second sample is obtained after the first analysis is performed but before the second analysis is performed). In some embodiments, the sample may be obtained and stored for a period of time (e.g., hours, days, or weeks) before performing the analysis. In some embodiments, the sample can be analyzed 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 5 weeks, 6 weeks, 7 weeks, 8 weeks, 3 months, 4 months, 5 months, 6 months, 1 year, or more than 1 year after the sample is obtained from the subject.

The time between performing the analysis (e.g., the first analysis or the second analysis) and determining whether the sample contains a marker or set of markers indicative of a disease (e.g., a tumor) may vary. In some cases, the time may be optimized to increase the sensitivity and/or specificity of the assay or method. In some embodiments, determining whether a sample contains a marker or set of markers indicative of a tumor may occur within at most 0.1 hour, 0.5 hour, 1 hour, 2 hours, 4 hours, 8 hours, 12 hours, 24 hours, 2 days, 3 days, 4 days, 5 days, 6 days, 1 week, 2 weeks, 3 weeks, or 1 month of performing the analysis.

Sequencing analysis of a biological sample as described herein can be used to analyze one or more characteristics of a plurality of cell-free nucleic acid molecules from a pathogen. Various methods provided herein can comprise sequencing a plurality of nucleic acid molecules (e.g., a plurality of cell-free nucleic acid molecules, a plurality of cellular nucleic acid molecules, or both) from a biological sample. In some examples, the methods provided herein comprise analyzing sequencing results, e.g., a plurality of sequencing reads, of a plurality of nucleic acid molecules from a biological sample. The various methods and systems provided herein may or may not involve an active step of sequencing. The methods and systems may include or provide methods for receiving and processing sequencing data from a sequencer. The methods and systems may also include or provide methods of sending commands to a sequencer to adjust parameter(s) of a sequencing program, e.g., commands based on analysis of the sequencing results.

Commercial sequencing equipment can be used in the various methods provided in this disclosure, such as the Illumina sequencing platform and the 454/Roche platform. The nucleic acid can be sequenced using any method known in the art. For example, sequencing may comprise next generation sequencing. In certain instances, chain termination sequencing, hybridization sequencing, Illumina sequencing (e.g., using reversible terminator dyes), ion torrent semiconductor sequencing, mass spectrophotometry sequencing, massively parallel signature sequencing (MPSS), Maxam-Gilbert sequencing, nanopore sequencing, polony sequencing, pyrosequencing, shotgun sequencing, single molecule real-time (SMRT) sequencing, SOLiD sequencing (hybridization using four fluorescently labeled di-base probes), universal sequencing, or any combination thereof can be used.

One sequencing method that can be used in the various methods provided herein involves paired-end sequencing, e.g., using the Illumina "Paired-End Module" with its Genome Analyzer. With this module, after the Genome Analyzer completes the first sequencing read, the Paired-End Module can direct the resynthesis of the original templates and a second round of cluster generation. By using paired-end reads in the various methods provided herein, sequence information can be obtained from both ends of a nucleic acid molecule and mapped to a reference genome, e.g., the genome of a pathogen or the genome of a host organism. After mapping both ends, one can determine a pathogen integration profile according to some embodiments of the methods provided herein.

During paired-end sequencing, the sequence read from the first end of the nucleic acid molecule can comprise at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 105, at least 110, at least 115, at least 120, at least 125, at least 130, at least 135, at least 140, at least 145, at least 150, at least 155, at least 160, at least 165, at least 170, at least 175, or at least 180 consecutive nucleotides. The sequence read from the first end of the nucleic acid molecule can comprise at most 24, at most 28, at most 32, at most 38, at most 42, at most 48, at most 52, at most 58, at most 62, at most 68, at most 72, at most 78, at most 82, at most 88, at most 92, at most 98, at most 102, at most 108, at most 122, at most 128, at most 132, at most 138, at most 142, at most 148, at most 152, at most 158, at most 162, at most 168, at most 172, or at most 180 consecutive nucleotides. The sequence read from the first end of the nucleic acid molecule can comprise about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, about 85, about 90, about 95, about 100, about 105, about 110, about 115, about 120, about 125, about 130, about 135, about 140, about 145, about 150, about 155, about 160, about 165, about 170, about 175, or about 180 consecutive nucleotides.
The sequence read from the second end of the nucleic acid molecule can comprise at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 105, at least 110, at least 115, at least 120, at least 125, at least 130, at least 135, at least 140, at least 145, at least 150, at least 155, at least 160, at least 165, at least 170, at least 175, or at least 180 consecutive nucleotides. The sequence read from the second end of the nucleic acid molecule can comprise at most 24, at most 28, at most 32, at most 38, at most 42, at most 48, at most 52, at most 58, at most 62, at most 68, at most 72, at most 78, at most 82, at most 88, at most 92, at most 98, at most 102, at most 108, at most 122, at most 128, at most 132, at most 138, at most 142, at most 148, at most 152, at most 158, at most 162, at most 168, at most 172, or at most 180 consecutive nucleotides. The sequence read from the second end of the nucleic acid molecule can comprise about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, about 85, about 90, about 95, about 100, about 105, about 110, about 115, about 120, about 125, about 130, about 135, about 140, about 145, about 150, about 155, about 160, about 165, about 170, about 175, or about 180 consecutive nucleotides. In some cases, the sequence read from the first end of the nucleic acid molecule can comprise at least 75 consecutive nucleotides. In some cases, the sequence read from the second end of the nucleic acid molecule can comprise at least 75 consecutive nucleotides. The sequence reads from the first and second ends of the nucleic acid molecule may be of the same length or of different lengths. The sequence reads of the plurality of nucleic acid molecules from the biological sample may have the same length or different lengths.

Sequencing in the various methods provided herein can be performed at different sequencing depths. Sequencing depth refers to the number of times a locus is covered by sequence reads aligned to that locus. The locus may be as small as a single nucleotide, as large as a chromosomal arm, or as large as the entire genome. The sequencing depth in the various methods provided herein can be 50-fold (50x), 100-fold (100x), etc., where the number before "fold (x)" refers to the number of times a locus is covered by sequence reads. Sequencing depth may also be applied to multiple loci or to the entire genome, in which case the fold (x) may refer to the average number of times the loci, the haploid genome, or the entire genome, respectively, is sequenced. In certain instances, ultra-deep sequencing is performed in the various methods described herein, which may refer to a sequencing depth of at least 100-fold.
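To make the definition of sequencing depth concrete, the following minimal sketch computes per-position depth and mean depth over a region from a list of read alignments. The function names and the representation of alignments as 0-based, half-open `(start, end)` intervals are assumptions made for this illustration and do not come from the disclosure.

```python
def per_base_depth(alignments, region_length):
    """Number of aligned reads covering each position of a region.

    alignments    -- list of (start, end) pairs, 0-based, end exclusive
                     (hypothetical representation of aligned reads)
    region_length -- length of the locus or genome in nucleotides
    """
    depth = [0] * region_length
    for start, end in alignments:
        # Clip each alignment to the region before counting.
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


def mean_depth(alignments, region_length):
    """Average number of times each position in the region is covered."""
    return sum(per_base_depth(alignments, region_length)) / region_length


# Two 5-nucleotide reads on a 10-nucleotide region, overlapping at positions 3-4.
example_depth = per_base_depth([(0, 5), (3, 8)], 10)
example_mean = mean_depth([(0, 5), (3, 8)], 10)
```

Here ten aligned bases are spread over ten positions, so the mean depth is 1x even though the two overlapping positions are covered twice and the last two positions are not covered at all. This is why a mean depth alone does not guarantee coverage of every locus.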

The number or average number of reads of a particular nucleotide within a nucleic acid during a sequencing process (e.g., the depth of sequencing) may be several times greater than the length of the nucleic acid being sequenced. In some cases, when the sequencing depth is significantly (e.g., at least 5 times) greater than the length of the nucleic acid, the sequencing can be referred to as deep sequencing. In some examples, the sequencing depth can be, on average, at least about 5-fold greater, at least about 10-fold greater, at least about 20-fold greater, at least about 30-fold greater, at least about 40-fold greater, at least about 50-fold greater, at least about 60-fold greater, at least about 70-fold greater, at least about 80-fold greater, at least about 90-fold greater, or at least about 100-fold greater than the length of the nucleic acid being sequenced. In certain instances, a sample can be enriched for a particular analyte (e.g., a nucleic acid fragment or a cancer-specific nucleic acid fragment).

A sequence read (or a plurality of sequence reads) generated in the various methods provided herein can refer to a string of nucleotides sequenced from any portion or all of a nucleic acid molecule. For example, a sequence read can be a short string of nucleotides (e.g., 20-150 nucleotides) complementary to a nucleic acid fragment, a string of nucleotides complementary to an end of a nucleic acid fragment, or a string of nucleotides complementary to the entire nucleic acid fragment present in the biological sample. Sequence reads can be obtained in a variety of ways, e.g., using a variety of sequencing techniques.

Quantity/detectability

One of the characteristics of the plurality of cell-free nucleic acid molecules that can be used in the various methods and the various systems is the number (e.g., copy number or percentage) of the plurality of cell-free nucleic acid molecules from the pathogen. Some aspects of the present disclosure relate to risk stratification of a subject for developing a pathogen-associated disease based on an assessment of the number (e.g., copy number or percentage) of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject.

The copy number of nucleic acid molecules in a biological sample is related to the detectability of the nucleic acid molecules. For a given assay, the detectability of a nucleic acid template can be correlated with the copy number of the template molecule: a copy number below the detection limit of the assay is undetectable, while a copy number at or above the detection limit can be referred to as "detectable." For example, a quantitative polymerase chain reaction (qPCR) assay typically has a lower detection limit, below which the signal of the template molecule is indistinguishable from background noise. Thus, in certain instances, the various methods and systems provided herein rely directly on the detectability of a plurality of cell-free nucleic acid molecules in a biological sample, which can be correlated with their copy number in the biological sample. In some cases, the copy number of the plurality of cell-free nucleic acid molecules in the biological sample is measured directly. In other cases, copy number is measured implicitly, or inferred, by detecting the cell-free nucleic acid molecules themselves.

A variety of detection assays, such as polymerase chain reaction (PCR) or quantitative polymerase chain reaction (qPCR), can be performed to assess the presence or absence, or the copy number, of a plurality of cell-free nucleic acid molecules of a pathogen in a biological sample. Various probes can be designed to target pathogen-specific genomic regions, for example, EBV-specific genomic DNA sequences, human papillomavirus (HPV)-specific genomic DNA sequences, or hepatitis B virus (HBV)-specific genomic DNA sequences.

While various examples and embodiments are provided herein, additional techniques and embodiments related to, for example, copy number and nasopharyngeal carcinoma, can be found in PCT international patent application No. PCT/AU2011/001562, filed November 30, 2011, which is incorporated herein by reference in its entirety. Nasopharyngeal carcinoma may be closely associated with Epstein-Barr virus (EBV) infection. In southern China, the EBV genome is found in the tumor tissues of almost all patients with nasopharyngeal carcinoma. Plasma EBV DNA derived from nasopharyngeal carcinoma tissue has been developed as a tumor marker for nasopharyngeal carcinoma (Lo et al., Cancer Res 1999; 59:1188-1191). In particular, real-time qPCR targeting the BamHI-W fragment of the EBV genome can be used for plasma EBV DNA analysis. There may be about 6 to 12 repeats of the BamHI-W fragment per EBV genome and about 50 EBV genomes per nasopharyngeal carcinoma tumor cell (Longnecker et al., Fields Virology, 5th edition, Chapter 61, Epstein-Barr Virus; Tierney et al., J Virol 2011; 85:12362-12375). In other words, there may be about 300 to 600 (e.g., about 500) PCR target copies per nasopharyngeal carcinoma tumor cell. Such a high target number per tumor cell may explain why plasma EBV DNA is a highly sensitive marker for the detection of early-stage nasopharyngeal carcinoma. Nasopharyngeal carcinoma cells are capable of depositing EBV DNA fragments into the blood of a subject. This tumor marker was used for monitoring (Lo et al., Cancer Res 1999; 59:5452-.
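The estimate of PCR target copies per tumor cell follows from multiplying the two approximate figures quoted above. The short sketch below simply reproduces that arithmetic; the function name is hypothetical and the input values are the approximate ranges stated in the text, not measured data.

```python
def pcr_targets_per_cell(repeats_per_genome, genomes_per_cell):
    """PCR target copies per cell = BamHI-W repeats per EBV genome
    multiplied by EBV genomes per tumor cell."""
    return repeats_per_genome * genomes_per_cell


# Approximate figures from the text: 6 to 12 repeats, about 50 genomes per cell.
low_estimate = pcr_targets_per_cell(6, 50)    # 300 target copies
high_estimate = pcr_targets_per_cell(12, 50)  # 600 target copies
```

The figure of about 500 copies cited in the text corresponds to roughly 10 BamHI-W repeats per genome at about 50 genomes per cell.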

qPCR analysis can also be used to measure the amount of HPV, HBV, or any other viral DNA in a sample in a manner similar to that described herein for EBV. Such an assay is particularly useful for screening for cervical cancer (CC), head and neck squamous cell carcinoma (HNSCC), cirrhosis, or hepatocellular carcinoma (HCC). In one example, the qPCR assay targets a region (e.g., 200 nucleotides) within the polymorphic L1 region of the HPV genome. More specifically, contemplated herein is the use of qPCR primers that selectively hybridize to sequences encoding one or more hypervariable surface loops in the L1 region.

Alternatively, a plurality of cell-free nucleic acid molecules from a pathogen can be detected and quantified using sequencing techniques. For example, cfDNA fragments can be sequenced, aligned to an HPV reference genome, and quantified. In other examples, sequence reads of cfDNA fragments are aligned to a reference genome of EBV or HBV and quantified.

The detectability or copy number of a plurality of cell-free nucleic acid molecules from a pathogen, as measured by the analysis provided herein, can indicate a risk of the subject developing a pathogen-associated disease. In some examples, the higher the copy number of the cell-free nucleic acid molecules from the pathogen, the higher the risk of the subject developing a pathogen-associated disease. In certain instances, the detectability of cell-free nucleic acid molecules from a pathogen in one or more analyses at a particular time point or time points indicates a risk of the subject developing a pathogen-associated disease. When cell-free nucleic acid molecules from a pathogen are detectable in a biological sample from a subject, the subject may be considered to be at higher risk of a pathogen-associated disease than when the molecules are not detectable by the analysis provided herein. A multi-step detection analysis may be performed at the times described above.

In some examples of the present disclosure, a two-step analysis is performed to detect a plurality of cell-free pathogen nucleic acid molecules in a biological sample. In some cases, a first test of the two-step analysis is performed, and a second test is then performed or not depending on the result of the first test. For example, if the first test provides a positive result, e.g., a cell-free pathogen nucleic acid molecule is detected in a first biological sample, the second test of the two-step analysis may be performed; if a negative result is obtained from the first test, the second test may not be performed. In other cases, the second test is performed regardless of the result of the first test. In some examples, a case where both tests of a two-step analysis give a positive result is referred to as persistently positive, and a case where only the first or the second test gives a positive result is referred to as transiently positive. In one illustrative example, a "positive" analysis result indicates that the subject is at a higher risk of developing a pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma) than a "negative" analysis result, and a "persistently positive" analysis result indicates a higher risk than a "transiently positive" analysis result. In some illustrative examples, a shorter interval may be set between the first time point and the second time point when a persistently positive result is obtained from the two-step analysis performed at the first time point than when a transiently positive result is obtained. For example, in EBV-associated nasopharyngeal carcinoma screening, if a persistently positive result is obtained from the two-step analysis of a first screening assay, a subsequent second screening assay can be recommended within about one year after the first screening assay. Conversely, if a transiently positive result is obtained from the two-step analysis of the first screening assay, a subsequent second screening assay can be performed within about two years of the first screening assay. If negative results are obtained, the time between subsequent screening assays may be four years or even longer. In some cases, a previous positive result indicating a higher risk may override the interval that would otherwise be set by a subsequent result indicating a lower risk. For example, if a persistently positive result is obtained in year 1, the subject will be followed up every year for the next 4 years, regardless of the results of the follow-up analyses performed during those 4 years. An illustrative example is given in FIG. 2 and described in more detail in Example 2. As with the detectability analysis, risk assessment based on other characteristics of a plurality of cell-free nucleic acid molecules of a pathogen may also follow this exemplary screening protocol or a similar one.
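The interval selection described above can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the screening protocol of the disclosure: the names `Result`, `classify`, and `next_interval_years` are hypothetical, the intervals of 1, 2, and 4 years are taken from the illustrative EBV example in the text, and the 4-year override window is modeled on the single example in which a result positive in both tests in year 1 leads to yearly follow-up for the next 4 years.

```python
from enum import Enum


class Result(Enum):
    NEGATIVE = "negative"
    TRANSIENT_POSITIVE = "positive in only one of the two tests"
    PERSISTENT_POSITIVE = "positive in both tests"


# Illustrative intervals (years) from the EBV screening example in the text.
INTERVAL_YEARS = {
    Result.PERSISTENT_POSITIVE: 1,
    Result.TRANSIENT_POSITIVE: 2,
    Result.NEGATIVE: 4,
}
OVERRIDE_WINDOW_YEARS = 4  # assumption based on the year-1 example in the text


def classify(first_positive, second_positive):
    """Combine the two tests of a two-step analysis into one result."""
    if first_positive and second_positive:
        return Result.PERSISTENT_POSITIVE
    if first_positive or second_positive:
        return Result.TRANSIENT_POSITIVE
    return Result.NEGATIVE


def next_interval_years(history):
    """Years until the next screening assay.

    history -- list of (year, Result) tuples ordered by year; the last
               entry is the most recent screening assay.
    """
    current_year, current_result = history[-1]
    interval = INTERVAL_YEARS[current_result]
    # An earlier higher-risk result overrides a later lower-risk result
    # while the override window is still open.
    for year, result in history[:-1]:
        within_window = current_year - year < OVERRIDE_WINDOW_YEARS
        if result is Result.PERSISTENT_POSITIVE and within_window:
            interval = min(interval, INTERVAL_YEARS[Result.PERSISTENT_POSITIVE])
    return interval


# A subject positive in both tests in year 1 and negative in year 2
# is still followed up after one year.
follow_up = next_interval_years(
    [(1, Result.PERSISTENT_POSITIVE), (2, Result.NEGATIVE)]
)
```

With these assumptions, the subject in the usage example is screened again in years 3, 4, and 5; a negative result in year 5 then returns the subject to the 4-year interval.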

The second test may be performed within hours, days, or weeks after the first test. In one example, the second test may be performed immediately after the first test. In other instances, the second test can be performed 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 5 weeks, 6 weeks, 7 weeks, 8 weeks, 3 months, 4 months, 5 months, 6 months, 1 year, or more than 1 year after the first test. In a particular example, the second test may be performed within 2 weeks of the first test. In general, a second test can be used to increase the specificity of detecting a pathogen-associated disease, such as a tumor, in a patient. The time between performing the first test and the second test may be determined experimentally. In some embodiments, the method may include 2 or more tests that use the same sample (e.g., a single sample is obtained from a subject (e.g., a patient) prior to performing the first test and is preserved for a period of time until the second test is performed). For example, two tubes of blood may be obtained from a subject at the same time. The first tube may be used for the first test. The second tube may be used only if the subject's first test result is positive. The sample may be preserved using any method known to those skilled in the art (e.g., cryogenically). Such preservation may be beneficial in certain situations, for example, where a subject receives a positive test result (e.g., the first test indicates cancer) and prefers not to wait for a second test, choosing instead to seek a second opinion.

Methylation state

Some aspects of the present disclosure relate to stratification of risk of a subject developing a pathogen-associated disease based on an assessment of methylation status of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject.

Methylation of a plurality of cell-free pathogen nucleic acid molecules can distinguish between samples from subjects having a pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma or HPV-associated cervical carcinoma) and samples from subjects not having the disease (e.g., non-nasopharyngeal carcinoma subjects). For example, the methylation status of plasma EBV DNA associated with nasopharyngeal carcinoma can be different from the methylation status of plasma EBV DNA detected in a non-nasopharyngeal carcinoma subject, as shown in U.S. patent application No. 16/046,795, which is incorporated herein by reference in its entirety. When analyzed by bisulfite sequencing, there may be differentially methylated regions between the plasma EBV DNA of nasopharyngeal carcinoma patients and that of non-nasopharyngeal carcinoma subjects in whom EBV DNA is detectable. Thus, analysis of the methylation status of these differentially methylated regions can distinguish between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects. As described herein, the nasopharyngeal carcinoma-associated EBV DNA methylation status can also predict the risk of nasopharyngeal carcinoma development and can be used to adjust nasopharyngeal carcinoma screening intervals. For example, subjects with an EBV DNA methylation pattern associated with nasopharyngeal carcinoma may be screened more frequently than subjects without such a pattern. In some cases, other types of methylation-aware sequencing can be used instead of bisulfite sequencing, for example, single molecule sequencing systems such as the Pacific Biosciences sequencing system (Kelleher et al., Methods Mol Biol 2018; 1681:127-. In other cases, molecular methods that are methylation-aware but not based on sequencing can be used, such as methylation-specific PCR (Herman et al., Proc Natl Acad Sci USA 1996; 93:9821-6), detection systems based on methylation-sensitive enzymes such as restriction enzymes, bisulfite conversion followed by mass spectrometry (van den Boom et al., Methods Mol Biol 2009; 507:207-27; Nygren et al., Clin Chem 2010; 56:1627-35), and methods based on differential precipitation of DNA molecules according to their methylation state, e.g., using anti-methylated cytosine antibodies (Shen et al., Nature 2018; 563:579-83; Zhou et al., PLoS One 2018; 13:e0201586) or methyl-binding proteins (Zhang et al, Nature 5631: 2013; 2017).

In certain instances, methylation patterns of a plurality of cell-free pathogen nucleic acid molecules (e.g., plasma EBV DNA) can be used to detect a pathogen-associated disease (e.g., a pathogen-associated cancer, such as nasopharyngeal carcinoma), or to predict a future risk of having a clinically detectable disease. As described above, one approach is to treat a plurality of nucleic acid molecules with bisulfite to convert unmethylated cytosines to uracil. Methylated cytosines are not altered by bisulfite and remain cytosines. Subsequent examination of the bisulfite-treated nucleic acid molecules, such as by sequencing, can be used to determine the methylation state of the nucleic acid molecules in the biological sample.
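The logic of bisulfite-based methylation calling can be illustrated with a small in silico sketch. It is a simplification under stated assumptions: only one strand is modeled, uracil is represented as "T" because that is how a converted base is read after amplification and sequencing, the read is assumed to align to the reference without gaps, and the function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment of one DNA strand.

    Unmethylated cytosines are converted (shown here as 'T', the base read
    after amplification); methylated cytosines remain 'C'.
    """
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """Infer the methylation state at each reference cytosine.

    Returns a dict: position -> True (methylated) or False (unmethylated).
    Assumes the read aligns to the reference position by position.
    """
    calls = {}
    for index, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[index] = True   # protected from conversion, so methylated
        elif read_base == "T":
            calls[index] = False  # converted, so unmethylated
    return calls


reference = "ACGTCCGGA"
read = bisulfite_convert(reference, methylated_positions={1, 5})
calls = call_methylation(reference, read)
```

In this toy example the cytosines at positions 1 and 5 are called methylated and the cytosine at position 4 is called unmethylated.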

In one example, differences in plasma EBV DNA methylation levels are determined using methylation-sensitive restriction enzyme analysis. One non-limiting example of a methylation-sensitive restriction enzyme is HpaII, which can cleave molecules carrying an unmethylated "CCGG" motif but leaves intact molecules that lack a "CCGG" motif or carry a methylated "CCGG" motif. Alternatively, other methylation-sensitive restriction enzymes can be used. In one example, the EBV DNA in plasma from a non-cancer subject may be more susceptible to cleavage by methylation-sensitive restriction enzymes because of its lower methylation level. The susceptibility to enzymatic digestion can be determined by, for example but not limited to, massively parallel sequencing, gel electrophoresis, capillary electrophoresis, polymerase chain reaction (PCR), and real-time PCR.

When sequencing (e.g., massively parallel sequencing) is used to analyze the degree of digestion by methylation-sensitive restriction enzymes, the degree of digestion can be reflected by the size distribution of a plurality of cell-free nucleic acid molecules of the pathogen (e.g., plasma EBV DNA) with and without enzymatic digestion. As shown in FIGS. 12 and 13, a leftward shift of the size distribution curve indicates that the plasma EBV DNA fragments have become shorter. The further left the curve shifts, the higher the degree of enzymatic digestion, indicating a lower level of DNA methylation.

The methylation state of a plurality of cell-free pathogen nucleic acid molecules as described herein can comprise the methylation density of individual methylation sites, the distribution of methylated/unmethylated sites over adjacent regions of the pathogen genome, the methylation pattern or level of each individual methylation site within one or more specific regions of the pathogen genome or throughout the pathogen genome, and non-CpG methylation. In some cases, the methylation status comprises the methylation levels (or methylation densities) of individual differentially methylated sites, which can be identified, for example, by comparing samples from patients having a pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma or HPV-associated cervical carcinoma) with samples from subjects without the disease (e.g., non-nasopharyngeal carcinoma subjects). For a given methylation site, methylation density can refer to the number of nucleic acid molecules methylated at that site as a fraction of the total number of nucleic acid molecules of interest that comprise the site. For example, the methylation density of a first methylation site in liver tissue can refer to the fraction of liver DNA molecules methylated at the first site among all liver DNA molecules covering that site. In certain instances, the methylation state comprises a correspondence (e.g., a pattern or haplotype) of methylated/unmethylated states between individual methylation sites.
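The methylation density defined above is a simple fraction, and can be sketched as follows. The function names and the input format (one boolean call per molecule covering a site) are assumptions chosen for this illustration; how the per-molecule calls are obtained is outside the scope of the sketch.

```python
def methylation_density(calls):
    """Fraction of molecules covering one site that are methylated there.

    calls -- iterable of booleans, one per molecule covering the site
             (True = methylated, False = unmethylated)
    """
    calls = list(calls)
    if not calls:
        return None  # no molecule covers this site
    return sum(calls) / len(calls)


def regional_methylation_density(site_calls):
    """Pooled methylation density over all sites in a region.

    site_calls -- dict: site position -> list of boolean calls
    """
    total = sum(len(calls) for calls in site_calls.values())
    if total == 0:
        return None
    methylated = sum(sum(calls) for calls in site_calls.values())
    return methylated / total


# Three of four molecules covering the site are methylated: density 0.75.
site_density = methylation_density([True, True, False, True])
region_density = regional_methylation_density(
    {100: [True, False], 105: [True, True]}
)
```

Pooling calls over a region, as in the second function, weights each site by the number of molecules covering it; averaging per-site densities instead would weight all sites equally. Which of the two is appropriate depends on the analysis and is not specified here.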

In certain instances, a screening assay (e.g., a first assay or a second assay) as described herein can comprise determining the methylation state of a plurality of cell-free nucleic acid molecules by any available technique, such as, but not limited to, methylation-aware sequencing, methylation-sensitive amplification, or methylation-sensitive precipitation. While examples and embodiments are provided herein, additional techniques and embodiments related to, for example, determining methylation status can be found in PCT patent application No. PCT/AU2013/001088, filed September 20, 2013, which is incorporated herein by reference in its entirety.

Fragment size

Some aspects of the present disclosure relate to stratification of risk of a subject developing a pathogen-associated disease based on an assessment of fragment sizes of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject.

The fragment size distribution and/or relative abundance of the plurality of cell-free pathogen nucleic acid molecules can distinguish between samples of patients having a pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma or HPV-associated cervical carcinoma) and samples of subjects without the disease (e.g., non-nasopharyngeal carcinoma subjects). For example, the size distribution of plasma EBV DNA molecules and the ratio of circulating DNA molecules mapped to the EBV genome to those mapped to the human genome, as determined by massively parallel sequencing, help to distinguish nasopharyngeal carcinoma patients from non-nasopharyngeal carcinoma subjects in whom plasma EBV DNA is detectable (Lam et al., Proc Natl Acad Sci USA 2018; 115: E5115-E5124), which is incorporated herein by reference in its entirety. According to some examples of the disclosure, the nasopharyngeal carcinoma-associated size distribution and the relative abundance of circulating DNA mapped to the EBV and human genomes can also be used to predict the risk of developing a clinically detectable nasopharyngeal carcinoma in the future. In one embodiment, a subject whose plasma DNA sequencing shows these nasopharyngeal carcinoma-associated features but in whom no nasopharyngeal carcinoma has been detected may be followed up more frequently than a subject having detectable plasma EBV DNA but without these nasopharyngeal carcinoma-associated features. Compared with the two-step assay described above, a potential practical advantage of stratifying the risk of nasopharyngeal carcinoma using such a sequencing-based assay is that the collection of another blood sample from the patient can be eliminated.

In certain instances, an analysis (e.g., a first analysis or a second analysis) can include performing an assay (e.g., a next generation sequencing assay) to analyze nucleic acid fragment size (e.g., the fragment size of plasma EBV DNA). In some cases, sequencing is used to assess the size of cell-free viral nucleic acids in a sample. For example, the size of each sequenced plasma DNA molecule can be derived from the start and end coordinates of the sequence, where the coordinates can be determined by mapping (aligning) the sequence reads to the viral genome. In various examples, the start and end coordinates of a DNA molecule can be determined from the two reads of a paired-end read pair or from a single read covering both ends (which can be achieved in single molecule sequencing). In some cases, amplification-based or hybridization-based methods can also be used for fragment size analysis. For example, probes can be designed to target genomic regions of different lengths, and amplification (e.g., PCR or qPCR) or hybridization signals can indicate the number of cell-free nucleic acid fragments that cover the target genomic region (and are therefore equal to or greater in length than the target region), from which the distribution of fragment sizes can be deduced. Various methods for fragment size analysis (assays and analyses) may include the methods described in U.S. patent application publication No. US 2018/0208999 A1, which is incorporated herein by reference in its entirety.

The fragment size distribution may be displayed as a histogram with the sizes of the nucleic acid fragments on the horizontal axis. The number of nucleic acid fragments of each size (e.g., at a resolution of 1 base pair (bp)) can be determined and plotted on the vertical axis (e.g., as a raw count or as a percentage frequency). The size resolution may be coarser than 1bp (e.g., 2, 3, 4, or 5bp resolution). The following analysis of the size distribution (also referred to as the size profile) shows that viral DNA fragments in cell-free mixtures from subjects with nasopharyngeal carcinoma are statistically longer than those in subjects without an overt condition. In one illustrative example, the fragment size distribution curve obtained from plasma EBV DNA analysis of nasopharyngeal carcinoma patients may show a characteristic peak at 166bp (a nucleosomal pattern), while plasma EBV DNA of non-cancer subjects does not show a typical nucleosomal pattern.
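The histogram described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `size_distribution` and `modal_size` are hypothetical, the input is a plain list of fragment sizes in bp, and frequencies are reported as percentages.

```python
from collections import Counter

def size_distribution(sizes, resolution=1):
    """Return {bin_start_bp: percentage of fragments} for a list of sizes.

    `resolution` is the bin width in bp (1 bp by default; 2 to 5 bp are
    the coarser resolutions mentioned in the text).
    """
    if not sizes:
        raise ValueError("at least one fragment size is required")
    counts = Counter((size // resolution) * resolution for size in sizes)
    total = len(sizes)
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}

def modal_size(distribution):
    """Return the bin with the highest frequency (the peak of the curve)."""
    return max(distribution, key=distribution.get)

# Toy data only; real inputs would hold the sizes of mapped fragments.
example_sizes = [166, 166, 166, 150, 120, 80]
dist_1bp = size_distribution(example_sizes)
dist_5bp = size_distribution(example_sizes, resolution=5)
```

With the toy sizes above, half of the fragments fall at 166 bp, so `modal_size(dist_1bp)` returns 166; at 5 bp resolution the same fragments are counted in the bin starting at 165 bp.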

In some cases, the relative abundance of a plurality of cell-free nucleic acid molecules from a pathogen relative to a plurality of cell-free nucleic acid molecules from a subject is calculated to assess risk. In some cases, the relative abundance is analyzed within a size range. In various examples, the size ratio of pathogen fragments to cell-free fragments from a subject refers to the quantitative ratio between cell-free nucleic acid fragments from the pathogen and cell-free nucleic acid fragments from the subject. For example, the size ratio of EBV DNA fragments between 80 and 110 base pairs may be:

$$\text{size ratio} = \frac{\text{proportion of EBV DNA fragments within 80-110 bp}}{\text{proportion of autosomal DNA fragments within 80-110 bp}}$$

In various cases, a cutoff value or threshold is set for the evaluation. For example, there may be a size threshold for determining the size ratio between pathogen fragments and autosomal fragments of the subject. Alternatively, in some cases, a size threshold is set such that fragments with sizes below or above the threshold are considered indicative of the subject's risk of developing a pathogen-associated disease. It should be understood that the size threshold may be any value. The size threshold may be at least about 10bp, 20bp, 25bp, 30bp, 35bp, 40bp, 45bp, 50bp, 55bp, 60bp, 65bp, 70bp, 75bp, 80bp, 85bp, 90bp, 95bp, 100bp, 105bp, 110bp, 115bp, 120bp, 125bp, 130bp, 135bp, 140bp, 145bp, 150bp, 155bp, 160bp, 165bp, 170bp, 175bp, 180bp, 185bp, 190bp, 195bp, 200bp, 210bp, 230bp, 240bp, 250bp, or greater than 250bp. For example, the size threshold may be 150bp. In another example, the size threshold may be 180bp. In some embodiments, upper and lower size thresholds (e.g., a range of values) may be used. In some embodiments, upper and lower size thresholds can be used to select nucleic acid fragments having lengths between the upper and lower thresholds. In some embodiments, upper and lower size thresholds can be used to select nucleic acid fragments having a length greater than the upper threshold or less than the lower threshold. In some cases, a size ratio cutoff value is used to determine whether a subject is at risk, or how high the subject's risk is, of developing a pathogen-associated disease (e.g., nasopharyngeal carcinoma). For example, in the size range of 80-110bp, subjects with nasopharyngeal carcinoma have a lower size ratio than subjects with false-positive plasma EBV DNA results.
In some cases, the cut-off value for the size ratio can be about 0.1, about 0.5, about 1, about 2, about 3, about 4, about 5, about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, about 20, about 25, about 50, about 100, or greater than about 100. In some cases, the cutoff value for the size indicator may be about or at least 10, about or at least 2, about or at least 1, about or at least 0.5, about or at least 0.333, about or at least 0.25, about or at least 0.2, about or at least 0.167, about or at least 0.143, about or at least 0.125, about or at least 0.111, about or at least 0.1, about or at least 0.091, about or at least 0.083, about or at least 0.077, about or at least 0.071, about or at least 0.067, about or at least 0.063, about or at least 0.059, about or at least 0.056, about or at least 0.053, about or at least 0.05, about or at least 0.04, about or at least 0.02, about or at least 0.001, or less than about 0.001.
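A size ratio and its comparison against a cutoff can be sketched as below. The sketch assumes that the size ratio is the proportion of pathogen (e.g., EBV) fragments within 80-110 bp divided by the proportion of autosomal fragments within the same range; that definition, the function names, and the cutoff of 7 (one of the values listed above) are assumptions chosen for illustration.

```python
def proportion_in_range(sizes, low=80, high=110):
    """Fraction of fragments whose size lies in [low, high] bp (inclusive)."""
    if not sizes:
        raise ValueError("at least one fragment size is required")
    return sum(low <= size <= high for size in sizes) / len(sizes)

def size_ratio(pathogen_sizes, autosomal_sizes, low=80, high=110):
    """Assumed definition: proportion of pathogen fragments in the size
    range divided by the proportion of autosomal fragments in that range."""
    denominator = proportion_in_range(autosomal_sizes, low, high)
    if denominator == 0:
        raise ValueError("no autosomal fragments fall within the size range")
    return proportion_in_range(pathogen_sizes, low, high) / denominator

def below_cutoff(ratio, cutoff):
    """True when the ratio is below the cutoff; the text reports lower size
    ratios for nasopharyngeal carcinoma than for false-positive subjects."""
    return ratio < cutoff

# Toy data: 2 of 4 EBV fragments and 1 of 8 autosomal fragments are 80-110 bp.
ebv_sizes = [90, 100, 150, 166]
autosomal_sizes = [85, 166, 166, 170, 180, 150, 160, 140]
ratio = size_ratio(ebv_sizes, autosomal_sizes)  # 0.5 / 0.125
flagged = below_cutoff(ratio, cutoff=7)
```

Any real cutoff would be chosen from reference cohorts rather than fixed in advance as it is in this toy example.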

Various statistical values of the size distribution of nucleic acid fragments can be determined. For example, a mean (average), mode, or median of the size distribution may be used. Other statistical values may be used, such as the cumulative frequency for a given size or various ratios of the numbers of nucleic acid fragments of different sizes. The cumulative frequency may correspond to the proportion (e.g., percentage) of DNA fragments that are of a given size or smaller, or that are larger than a given size. These statistical values provide information about the size distribution of the nucleic acid fragments for comparison against one or more cutoff values for determining the level of a pathogen-caused disorder. The cutoff values may be determined using cohorts of healthy subjects, subjects known to have one or more disorders, subjects who are false positive for the disorder associated with the pathogen, and other subjects mentioned herein. One skilled in the art would know how to determine such cutoff values based on the description herein.

In some examples, a first statistical value of pathogen fragment size may be compared to a reference statistical value of human genomic fragment size. For example, a separation value (e.g., a difference or ratio) may be determined between the first statistical value and a reference statistical value, e.g., one determined from other regions in a pathogen reference genome or from human nucleic acids. The separation value may also be determined from other values. For example, the reference value may be determined from statistical values of a plurality of regions. The separation value can be compared to a size threshold to obtain a size classification (e.g., whether the DNA fragments are shorter than, longer than, or the same as those of a normal region).

Some examples may calculate a parameter (separation value) that may be defined as the difference in the proportion of short DNA fragments between a reference pathogen genome and a reference human genome, using the following equation:

$$\Delta F = P(\le 150\,\text{bp})_{\text{test}} - P(\le 150\,\text{bp})_{\text{reference}}$$

where $P(\le 150\,\text{bp})_{\text{test}}$ represents the proportion of sequenced fragments from the test region having a size of ≤150bp, and $P(\le 150\,\text{bp})_{\text{reference}}$ represents the proportion of sequenced fragments from the reference region having a size of ≤150bp. In other embodiments, other size thresholds may be used, such as, but not limited to, 100bp, 110bp, 120bp, 130bp, 140bp, 160bp, and 166bp. In other embodiments, the size threshold may be expressed in bases, nucleotides, or other units.

The size-based z-score can be calculated using the mean and SD values of the control subjects.

In some embodiments, a z-score >3 based on size indicates an increased proportion of short fragments of the pathogen, and a z-score < -3 based on size indicates a decreased proportion of short fragments of the pathogen. Other size thresholds may be used. For more details on the size-based approach, see U.S. patent nos. 8,620,593 and 8,741,811 and U.S. patent application publication No. 2013/0237431, each of which is incorporated by reference in its entirety.
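The separation value and the size-based z-score can be sketched as follows. The z-score is assumed here to take the standard form, (sample ΔF minus the mean ΔF of control subjects) divided by the SD of ΔF in control subjects, using the sample standard deviation; the text states only that the mean and SD of the control subjects are used, so that form and the function names are assumptions.

```python
from statistics import mean, stdev

def short_fraction(sizes, threshold=150):
    """Proportion of fragments with size <= threshold (in bp)."""
    if not sizes:
        raise ValueError("at least one fragment size is required")
    return sum(size <= threshold for size in sizes) / len(sizes)

def delta_f(test_sizes, reference_sizes, threshold=150):
    """Delta F = P(<= threshold) in test minus P(<= threshold) in reference."""
    return (short_fraction(test_sizes, threshold)
            - short_fraction(reference_sizes, threshold))

def size_z_score(sample_delta_f, control_delta_fs):
    """Assumed standard z-score relative to Delta F values of controls."""
    return (sample_delta_f - mean(control_delta_fs)) / stdev(control_delta_fs)

# Toy data: 3 of 4 test fragments and 1 of 4 reference fragments are <= 150 bp.
test_sizes = [100, 120, 140, 160]
reference_sizes = [140, 160, 170, 180]
sample_df = delta_f(test_sizes, reference_sizes)
z = size_z_score(sample_df, [0.1, 0.2, 0.3])
```

With these toy values, ΔF is 0.5 and the z-score is approximately 3, which sits at the boundary the text associates with an increased proportion of short pathogen fragments.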

To determine the size of nucleic acid fragments, at least some examples of the disclosure may work with any single molecule analysis platform in which chromosomal origin and molecular length can be analyzed, such as electrophoresis and optical methods (e.g., optical mapping and variations thereof, en.wikipedia.org/wiki/Optical_mapping#cite_note-Nanocoding-3, and Jo et al., Proc Natl Acad Sci USA 2007; 104: 2673-). As an example for mass spectrometry, longer molecules will have a larger mass (one example of a size value).

In one example, nucleic acid molecules can be randomly sequenced using a paired-end sequencing protocol. The two reads at both ends may be mapped (aligned) to a reference genome, which may be repeat-masked (e.g., when aligning to the human genome). The size of the DNA molecule can be determined from the distance between the genomic locations to which the two reads map.
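Deriving a fragment size from the mapped positions of the two reads can be sketched as below. The sketch assumes 1-based, inclusive alignment coordinates on the same reference sequence; the function name and coordinate convention are illustrative assumptions, not a description of any particular aligner's output.

```python
def fragment_size(read1_start, read1_end, read2_start, read2_end):
    """Size of the original molecule from the two mate alignments.

    Coordinates are assumed to be 1-based and inclusive, and both reads
    are assumed to map to the same reference sequence. The fragment spans
    from the leftmost aligned base to the rightmost aligned base.
    """
    leftmost = min(read1_start, read2_start)
    rightmost = max(read1_end, read2_end)
    return rightmost - leftmost + 1

# Toy example: two 75 bp reads facing each other on the reference.
size = fragment_size(1000, 1074, 1091, 1165)
```

In the toy example the fragment spans positions 1000 to 1165, giving a size of 166 bp. The order in which the two reads are passed does not matter.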

Analysis of variant patterns

Some aspects of the present disclosure relate to stratification of risk of a subject developing a pathogen-associated disease based on an assessment of a pattern of variation of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject. Genetic variations in the genome of a pathogen detected in a biological sample can be used to predict the risk of future development of a pathogen-associated disease.

The pattern of variation of pathogen nucleic acid molecules in diseased tissue of a patient having a pathogen-associated disease (e.g., a pathogen-associated malignancy) may differ from that in a sample from a subject without the pathogen-associated disease. It has been reported that the EBV strains present in EBV-associated tumors and in control samples may differ (Palser et al., J Virol 2015; 89:5222-37). However, in that study, the tumor and control samples were from different geographical locations. Given the potential geographic variation of EBV variants, it is difficult to tell whether the variations identified in the tumor samples are associated with geography or with disease. Previous attempts have been made to determine nasopharyngeal carcinoma-associated EBV variations by analyzing nasopharyngeal carcinoma tumor samples. In a genome-wide association study (GWAS), nasopharyngeal carcinoma tumor samples and saliva samples from individuals without EBV-associated disease in the same geographic region were analyzed (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049), and 29 polymorphisms (single nucleotide polymorphisms (SNPs) or insertions or deletions (indels)) were identified with a false discovery rate-adjusted P value of less than 0.05. These 29 nasopharyngeal carcinoma-associated EBV variations were present in more than 90% of nasopharyngeal carcinoma cases, but in only 40-50% of control cases.

In contrast to analyses of individual EBV polymorphisms associated with the development of nasopharyngeal carcinoma (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049; Feng et al., Chin J Cancer 2015; 34:61), aspects of the present disclosure provide various methods and systems for analyzing the pattern of variation of pathogen nucleic acid molecules in a genome-wide manner. Furthermore, rather than analyzing tumor and cell line samples to identify disease-associated EBV variants (Palser et al., J Virol 2015; 89:5222-37; Correia et al., J Virol 2018; 92: e01132-18; Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049), aspects of the present disclosure provide methods and systems for analyzing pathogen variation patterns by analyzing cell-free pathogen nucleic acid molecules in, for example, blood (e.g., plasma or serum), nasal washes, nasal brush samples, or other bodily fluids obtained by non-invasive or minimally invasive procedures, as opposed to invasive biopsy of tumors. In one example, the low abundance and fragmented nature of EBV DNA molecules in blood can pose technical challenges for analysis. Non-invasive analysis of the variation pattern of cell-free viral DNA molecules can improve clinical applications (including screening, predictive medicine, risk stratification, monitoring, and prognostication). In one example, the analysis can be used to distinguish between subjects with different virus-associated conditions, e.g., between nasopharyngeal carcinoma patients and non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA in a screening setting. In another example, it may be used for risk prediction of a disease or cancer.

Different methods may be used to obtain the variation patterns. Non-limiting analytical methods can include massively parallel sequencing (MPS), Sanger sequencing (e.g., Lorenzetti et al., J Clin Microbiol 2012; 50:609-18), microarray-based SNP analysis (e.g., Wang et al., Proc Natl Acad Sci USA 2002; 99:15687-92), hybridization analysis, and mass spectrometry analysis. In one illustrative example, a sequencing method, such as targeted sequencing with capture enrichment, MPS, or Sanger sequencing, is used, and a plurality of sequence reads are analyzed at each nucleotide against a reference genome of the pathogen (e.g., an EBV reference genome). The method can comprise obtaining a plurality of sequence reads for a plurality of cell-free nucleic acid molecules from a biological sample of a subject. The method can further comprise aligning the plurality of sequence reads to a reference genome of the pathogen. The method can further include analyzing a pattern of nucleotide variations across the reference genome of the pathogen by analyzing nucleotide variations between the plurality of sequence reads mapped to the reference genome of the pathogen and the reference genome. The variation patterns provided herein characterize a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of a plurality of variation sites on the reference genome of the pathogen. The plurality of variation sites can comprise at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1100, or at least 1200 sites across the reference genome of the pathogen. In some cases, the plurality of variation sites comprises at least 1000 sites across the reference genome of the pathogen. In some cases, the plurality of variation sites comprises about 1100 sites across the reference genome of the pathogen. 
In some cases, the plurality of variation sites comprises at least 600 sites across the reference genome of the pathogen. In some cases, the plurality of variation sites comprises about 660 sites across the reference genome of the pathogen. In some cases, the plurality of variation sites comprises at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites selected from the plurality of genomic sites listed in table 6 relative to the EBV reference genome (AJ507799.2). In some cases, the plurality of variation sites comprises the genomic sites listed in table 6 relative to the EBV reference genome (AJ507799.2).

In some cases, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, which are randomly selected from the plurality of genomic sites listed in table 6 relative to the EBV reference genome (AJ507799.2). In some cases, the methods provided herein comprise a step of randomly selecting variation sites from among the plurality of genomic sites listed in table 6 relative to the EBV reference genome (AJ507799.2). The method can further comprise analyzing a pattern of nucleotide variation across the reference genome of the pathogen by analyzing nucleotide variations between the sequence reads mapped to the reference genome of the pathogen and the reference genome.

In some cases, the pattern of variation of the plurality of cell-free nucleic acid molecules from the pathogen characterizes a nucleotide variation of the plurality of sequence reads mapped to the reference genome of the pathogen at each of the plurality of variation sites, the plurality of variation sites comprising at least 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600 sites randomly selected from the plurality of genomic sites listed in table 6 relative to the EBV reference genome (AJ507799.2).

In some cases, the plurality of variation sites consists of all sites at which the plurality of sequence reads mapped to the reference genome of the pathogen carry a nucleotide that differs from the reference genome of the pathogen.
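Characterizing the nucleotide at each variation site from mapped reads can be sketched as below. This is a simplified illustration, not the disclosed pipeline: reads are assumed to be gapless alignments given as `(start, sequence)` with 0-based positions, the majority base at each site is taken as the call, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def variation_pattern(reference, aligned_reads, sites):
    """Return {site: call} for each requested variation site.

    call is the majority base when it differs from the reference,
    "=" when the majority base matches the reference, and None when
    no read covers the site.
    """
    wanted = set(sites)
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if position in wanted:
                pileup[position][base] += 1
    pattern = {}
    for position in sorted(wanted):
        if position not in pileup:
            pattern[position] = None  # site not covered by any read
            continue
        majority_base, _ = pileup[position].most_common(1)[0]
        matches = majority_base == reference[position]
        pattern[position] = "=" if matches else majority_base
    return pattern

# Toy reference and reads; a real analysis would use a pathogen reference
# genome and aligner output.
toy_reference = "ACGTACGTAC"
toy_reads = [(0, "ACGAAC"), (4, "ACGTAT")]
toy_pattern = variation_pattern(toy_reference, toy_reads, [3, 5, 7, 9])
```

In the toy data, sites 3 and 9 carry non-reference bases (A and T), while sites 5 and 7 match the reference. Restricting `sites` to a predefined list corresponds to using a fixed set of variation sites; passing every reference position and keeping only non-reference calls would correspond to using all sites that differ from the reference.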

In some cases, the wild-type pathogen genome is used as a reference genome. For example, a wild-type EBV genome (GenBank: AJ507799.2) can be used as the reference EBV genome. In other cases, other pathogen genomes are used as reference genomes. In another example, multiple pathogen genomes (e.g., EBV genomes) are used as references. In another example, a consensus sequence is used as a reference. A consensus sequence can be established by combining variations of different pathogen genomic sequences, such as the EBV genomic consensus sequence described in de Jesus et al., J Gen Virol 2003; 84:1443-50.

Sequence alignment (e.g., for analyzing copy number, methylation state, fragment size, relative abundance, or pattern of variation) used in the various methods and systems provided herein can be performed by any suitable bioinformatic algorithm, program, toolkit, or software package. For example, the Short Oligonucleotide Analysis Package (SOAP) may be used as an alignment tool for various applications of the various methods and systems provided herein. Examples of short sequence read analysis tools that can be used in the various methods and systems provided herein include Arioc, BarraCUDA, BBMap, BFAST, BigBWA, BLASTN, BLAT, Bowtie2, BWA-PSSM, CASHX, Cloudburst, CUDA-EC, CUSHAW2, CUSHAW2-GPU, CUSHAW3, drFAST, SOAP, ERNE, GASSST, GEM, Genalice, Geneious Assembler, Gensearch, GMAP and GSNAP, GNUMAP, HIVE-hexagon, Isaac, LAST, MAQ, mrFAST, mrsFAST, MOM, MOSAIK, and Novoalign and NovoalignCS.

A number of consecutive nucleotides in a sequence read (a "sequence segment") can be aligned with a reference genome to make an alignment call. For example, the alignment can comprise aligning at least 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 25, 26, 28, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 52, 54, 55, 56, 58, 60, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 80, 82, 84, 85, 86, 88, 90, 92, 94, 95, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 145, 146, 148, or 150 consecutive nucleotides of a sequence read to a reference genome (e.g., a reference genome of a pathogen or of a host organism). 
In some cases, an alignment as described herein can comprise aligning at most 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 80, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, or 151 consecutive nucleotides of a sequence read to a reference genome (e.g., a reference genome of a pathogen or of a host organism). 
In some examples, an alignment as described herein comprises aligning about 20, 22, 24, 25, 26, 28, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 52, 54, 55, 56, 58, 60, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 80, 82, 84, 85, 86, 88, 90, 92, 94, 95, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 132, 134, 136, 138, 140, 142, 145, 146, 148, 150, 152, 154, 155, 156, 158, 160, 162, 164, 165, 166, 168, 170, 172, 174, 175, 176, 178, 180, 185, 190, 195, or 200 consecutive nucleotides of a sequence read to a reference genome.

In certain instances, an alignment call is made when a sequence segment has at least 80%, at least 85%, at least 90%, at least 95%, at least 98%, at least 99%, or 100% sequence identity or complementarity to a particular region of a reference genome (e.g., a human reference genome) over the entire sequence read. In some cases, an alignment call is made when a sequence segment has at least 80% sequence identity or complementarity to a particular region of a reference genome (e.g., a human reference genome) over the entire sequence read. In some cases, an alignment call is made when a sequence segment is identical or complementary to a particular region of a reference genome (e.g., a human reference genome) with no more than 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 base mismatches, or with zero mismatches. In some cases, an alignment call is made when a sequence segment is identical or complementary to a particular region of a reference genome (e.g., a human reference genome) with no more than 2 base mismatches. The maximum number or percentage of mismatches, or the minimum number or percentage of matches, may vary as a selection criterion depending on the intended use and context of the various methods and systems provided herein.
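The identity and mismatch criteria above can be sketched as a simple check on a gapless alignment. The text presents the identity threshold and the mismatch limit as alternative criteria; combining both in one function, the default values, and the function names are assumptions made for illustration.

```python
def count_mismatches(read, reference, start):
    """Count mismatching bases for a gapless alignment at `start` (0-based)."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(a != b for a, b in zip(read, window))

def alignment_call(read, reference, start, min_identity=0.80, max_mismatches=2):
    """Return True when the read meets both the assumed identity threshold
    and the assumed mismatch limit at the given position."""
    mismatches = count_mismatches(read, reference, start)
    identity = 1 - mismatches / len(read)
    return identity >= min_identity and mismatches <= max_mismatches

# Toy reference; real use would rely on an aligner rather than a fixed offset.
toy_ref = "ACGTACGTACGT"
call_good = alignment_call("ACGTTCGT", toy_ref, 0)  # 1 mismatch in 8 bases
call_bad = alignment_call("AAAAACGT", toy_ref, 0)   # 3 mismatches in 8 bases
```

The first toy read has one mismatch (87.5% identity) and is called as aligned; the second has three mismatches (62.5% identity) and is not. Loosening `max_mismatches` or `min_identity` changes the selection criterion in the way the paragraph describes.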

In some cases, alignment of the sequence reads to a reference genome of the pathogen allows a maximum of no more than 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 mismatched bases. A mismatch between a mapped sequence read and the reference genome of the pathogen may indicate the presence of a nucleotide variation in the pathogen genomic sequence in the biological sample and, in other cases, may indicate a sequencing error. Without wishing to be bound by theory, more than one nucleotide variation identified at a given genomic site in a biological sample may result from sequencing errors or from heterogeneity among the diseased cells that are the source of the cell-free pathogen nucleic acid molecules. In some cases, if more than 1, 2, or 3 different nucleotide variations are identified at a genomic site in a given biological sample, that genomic site is excluded from the analysis.
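The exclusion of sites that show several different nucleotide variations can be sketched as a filter over per-site base counts. The pileup layout (`{position: Counter of observed bases}`), the function name, and the default limit of one variant per site are assumptions for illustration.

```python
from collections import Counter

def filter_ambiguous_sites(pileup, reference, max_variants=1):
    """Keep sites with at most `max_variants` distinct non-reference bases.

    `pileup` maps a 0-based position to a Counter of bases observed in the
    reads covering it. Returns {position: set of non-reference bases} for
    the sites that are kept.
    """
    kept = {}
    for position, base_counts in pileup.items():
        variants = {base for base in base_counts if base != reference[position]}
        if len(variants) <= max_variants:
            kept[position] = variants
    return kept

# Toy data: site 2 shows two different non-reference bases and is dropped.
toy_reference_seq = "ACGT"
toy_pileup = {
    0: Counter({"A": 5}),
    1: Counter({"C": 3, "T": 2}),
    2: Counter({"A": 2, "T": 2, "G": 1}),
}
kept_sites = filter_ambiguous_sites(toy_pileup, toy_reference_seq)
```

Raising `max_variants` to 2 or 3 gives the more permissive versions of the filter mentioned in the text.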

In one illustrative example, targeted sequencing with capture enrichment is used to analyze circulating cell-free viral DNA molecules in nasopharyngeal carcinoma patients and in non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA. A plurality of capture probes can be designed to cover the entire EBV genome. In other cases, only a portion of the EBV genome may be analyzed, and the plurality of capture probes are designed to cover only that portion of the EBV genome. In the same assay, capture probes targeting genomic regions of the human genome may also be included. For example, the plurality of probes can comprise probes targeting common human single nucleotide polymorphism (SNP) sites and SNPs of a plurality of human leukocyte antigens (HLA). In one embodiment, further probes may be designed to hybridize to other viral genomic sequences, e.g., the HPV or HBV genome.

In some cases, the pattern of variation of the pathogen genome is analyzed by directly comparing sequence reads mapped to a reference genome with the reference genome. The comparison results may be further processed in any suitable manner, for example, by clustering analysis or phylogenetic tree analysis. Useful bioinformatic tools for these analyses include MEGA4, MEGA5, CLUSTALW, Phylip, RAxML, BEAST, PhyML, TreeView, MAFFT, MrBayes, BIONJ, MLTreeMap, Newick Utilities, and Phylo. Clustering analysis or phylogenetic tree analysis can compare sequence reads mapped to a pathogen reference genome with one or more pathogen genomes that are obtained from diseased tissue or from healthy subjects, that are indicated as being capable or incapable of causing a pathogen-associated disease, or that are indicated as being effective or ineffective in causing a pathogen-associated disease.

In an illustrative example, the methods and systems provided herein include block-based variation pattern analysis. Block-based analysis of variation patterns may comprise dividing a reference genome of a pathogen into a plurality of bins ("blocks"). Within each of the plurality of bins, the sequence reads mapped to the pathogen reference genome are compared to pathogen genomes associated with the disease. In some cases, a plurality (e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 300, 400, 500, 600, 700, 800, 900, or 1000) of different pathogen genomes is included in the comparison of the block-based analysis, comprising pathogen genomes associated with the disease and, optionally, pathogen genomes known or indicated to be unable to cause, or ineffective at causing, the pathogen-associated disease (disease-unrelated pathogen genomes). In each bin of the plurality of bins, a similarity index is calculated based on the nucleotide variations shared between the sequence reads mapped to the pathogen reference genome and each of the disease-associated or disease-unrelated pathogen genomes. The similarity index may depend on the proportion of variation sites at which at least one of the sequence reads mapped to the pathogen reference genome carries the same nucleotide variation as the disease-associated or disease-unrelated pathogen genome. Based on the similarity index obtained for each of the pathogen genomes compared against the sequence reads, a bin score can be calculated, for example, from the level of similarity reflected by the similarity indices. In one example, the bin score may depend on the proportion of similarity indices above a predetermined cutoff value. The cutoff for the similarity index may be set at, for example, about 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, or 0.95. A similarity index above the cutoff value can indicate that the sequence reads are "similar" to the pathogen genome to which they are compared. Based on the analysis described above, pattern analysis can then be performed on a larger scale, across a pathogen genome or a portion of a pathogen genome, using the calculated similarity indices or bin scores. Clustering analysis or phylogenetic analysis similar to that described above can follow the block-based analysis to predict the risk of developing a pathogen-associated disease, such as EBV-associated nasopharyngeal carcinoma.
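The block-based analysis above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the disclosure: the representation of a sample or genome as a dictionary from variant-site position to observed base, and the function names `similarity_index` and `bin_scores`, are hypothetical, and the similarity index is taken here to be the fraction of shared variant sites in a bin at which the sample and the compared genome carry the same base.

```python
from typing import Dict, List, Optional

# Hypothetical data model: a sample or a panel genome is a dict that maps
# a variant-site position (reference coordinate) to the base observed there.
Variants = Dict[int, str]


def similarity_index(sample: Variants, genome: Variants,
                     start: int, end: int) -> Optional[float]:
    """Fraction of variant sites in [start, end) where the sample carries
    the same base as the compared genome; None if no site is comparable."""
    sites = [p for p in genome if start <= p < end and p in sample]
    if not sites:
        return None
    matches = sum(1 for p in sites if sample[p] == genome[p])
    return matches / len(sites)


def bin_scores(sample: Variants, panel: List[Variants], genome_length: int,
               bin_size: int = 500, cutoff: float = 0.9) -> List[Optional[float]]:
    """For each bin, the proportion of panel genomes whose similarity
    index exceeds the cutoff (None if no genome is comparable)."""
    scores: List[Optional[float]] = []
    for start in range(0, genome_length, bin_size):
        end = min(start + bin_size, genome_length)
        indices = [similarity_index(sample, g, start, end) for g in panel]
        usable = [s for s in indices if s is not None]
        if usable:
            scores.append(sum(s > cutoff for s in usable) / len(usable))
        else:
            scores.append(None)
    return scores
```

The resulting list of bin scores (one value per bin) is the kind of vector that could then feed a clustering step or a classifier; bins without comparable sites are reported as `None` rather than being scored as zero, which is a design choice of this sketch.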

Risk score

Some aspects of the present disclosure relate to stratification of risk of a subject developing a pathogen-associated disease based on combined considerations of one or more characteristics of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject. In some cases, a risk score is generated that indicates the risk of the subject developing a pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma).

In certain instances, the present disclosure relates to stratification of the risk of a subject developing a pathogen-associated disease based on a combined consideration of one or more characteristics of a plurality of cell-free nucleic acid molecules from the pathogen in a biological sample from the subject and one or more of the following factors: age of the subject, smoking habits of the subject, family history of nasopharyngeal carcinoma of the subject, genotypic factors of the subject, dietary history of the subject, or ethnicity of the subject. In subjects in whom no nasopharyngeal carcinoma has been clinically detected, there may be a positive correlation between the positive rate of plasma EBV DNA detection and the age of the subject. Smoking increases a subject's risk of nasopharyngeal carcinoma. Subjects with a family history of nasopharyngeal carcinoma are at higher risk of developing nasopharyngeal carcinoma themselves. Genotypic factors, such as HLA status, may also be associated with the risk of nasopharyngeal carcinoma (see, e.g., Bei et al, Nat genet. journal, 2010; 599-; 1780-9, each of which is incorporated herein in its entirety). Furthermore, dietary history may be associated with the risk of nasopharyngeal carcinoma; for example, subjects who consume large quantities of salted fish are at relatively high risk. Certain ethnicities, such as Cantonese, may also be associated with a high risk of developing nasopharyngeal carcinoma.

In some cases, the methods and systems further comprise generating a report indicating the risk of the subject developing a pathogen-associated disease. Such a report may contain a numerical risk score or an explicit risk assessment. In some cases, the report contains a recommendation for the frequency of screening or a future time point for a subsequent screening assay. The report may be provided to the subject, to a medical institution or medical professional serving the subject, or to any relevant third party, such as a medical insurance company. The report may be reviewed, evaluated, or edited by a certified physician before or after its release. In some cases, the certified physician may provide additional opinions on the risk assessment, or participate in the final risk assessment, based on his or her medical opinion or on independent examinations.

In some cases, the present disclosure provides methods for stratifying the risk of developing a pathogen-associated disease, such as a pathogen-associated proliferative disease, e.g., EBV-associated nasopharyngeal carcinoma, by using a classifier. Such a classifier can take as input data one or more of the factors described herein and provide an output comprising a risk score, which can indicate a subject's risk of developing a pathogen-associated disease. The factors that may be input into the classifier may include one or more characteristics of a plurality of cell-free nucleic acid molecules from a pathogen in a biological sample from a subject, and one or more of the following factors: age of the subject, smoking habits of the subject, family history of nasopharyngeal carcinoma of the subject, genotypic factors of the subject, dietary history of the subject, and ethnicity of the subject. The risk score output by the classifier may indicate a risk that the subject currently has, or will in the future develop, a pathogen-associated disease. In some cases, the risk score indicates that the subject may currently have a pathogen-associated disease. In some cases, the risk score indicates the likelihood that the subject will develop the pathogen-associated disease within a future period of time (such as, but not limited to, 1 year, 2 years, 3 years, 4 years, 5 years, 10 years, or 15 years). In some cases, the output provided by the classifier includes a recommended screening frequency or a future time point for a subsequent screening analysis. Such output may be in the form of a clinical recommendation, or may be provided to the subject, a medical institution or medical professional, or any third party, such as a medical insurance company, in a report as described above.

As described herein, a classifier may refer to any algorithm that performs classification. In the present disclosure, the classifier can be a classification model constructed based on any suitable algorithm for predicting the risk of future development of a disease associated with a pathogen. Suitable algorithms may include machine learning algorithms and other mathematical/statistical models, such as, but not limited to, linear or kernel support vector machines (SVMs), naive Bayes, logistic regression, random forest, decision tree, gradient boosting tree, neural network, deep learning, linear/nonlinear regression, linear discriminant analysis, and the like. In some cases, the classifier is trained using a labeled data set containing a plurality of input-output pairs, for example, a data set generated from analysis of samples from a number of subjects who have been diagnosed as having or not having nasopharyngeal carcinoma. In these cases, each pair in the data set may comprise an input having one or more factors from the subject's plasma EBV DNA characteristics (e.g., pattern of variation, methylation status, detectability/copy number, or fragment size), age, family history, smoking habits, ethnicity, or dietary history, and a corresponding output indicating whether the corresponding subject has nasopharyngeal carcinoma. In an illustrative example, a classifier can be trained using a labeled data set containing a large number of input-output pairs (e.g., at least 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, or 20000 pairs).

In one example, a classification model is provided to predict the risk of future nasopharyngeal carcinoma development in subjects with detectable plasma EBV DNA using analysis of patterns of variation. The classification model may be a classifier constructed using a Support Vector Machine (SVM) algorithm as follows:

given a training data set containing n samples:

$$(M_1, Y_1), \ldots, (M_n, Y_n)$$

wherein $Y_i$ represents the nasopharyngeal carcinoma status of sample $i$ ($Y_i$ is 1 for samples from patients with nasopharyngeal carcinoma, or -1 for samples from subjects without nasopharyngeal carcinoma), and $M_i$ is a p-dimensional vector containing the virus variation pattern of sample $i$. For example, $M_i$ can be a series of variant sites (e.g., the 29 variant sites associated with nasopharyngeal carcinoma or the 661 variant sites associated with nasopharyngeal carcinoma listed in table 6). Alternatively, $M_i$ can be a series of block-based variant similarity scores (e.g., over non-overlapping windows of 500 bp) relative to multiple reference EBV variants present in subjects known to have nasopharyngeal carcinoma.

A "hyperplane" can be identified that separates the non-nasopharyngeal carcinoma group from the nasopharyngeal carcinoma group as accurately as possible in the training dataset, by finding a set of coefficients ($W$, a p-dimensional vector, and the intercept $b$) that satisfies the following conditions:

Criterion 1:

$$W \cdot M_i - b \geq 1$$ (for any subject in the nasopharyngeal carcinoma group)

and

Criterion 2:

$$W \cdot M_i - b \leq -1$$ (for any subject in the non-nasopharyngeal carcinoma group)

wherein $W$ is a p-dimensional vector of coefficients that determine the hyperplane; $M$ is a matrix (p x n dimensions) with p variables (variant sites or block-based similarity scores) and n samples, whose i-th column is $M_i$; and $b$ is the intercept.

These two criteria (i.e., criteria 1 and 2) can also be written as:

$$Y_i (W \cdot M_i - b) \geq 1$$ (Criterion 3)

wherein $Y_i$ is -1 (non-nasopharyngeal carcinoma) or 1 (nasopharyngeal carcinoma).

The boundary distance (D) between criteria 1 and 2 is:

$$D = \frac{2}{\lVert W \rVert}$$

where $\lVert W \rVert$ is the Euclidean norm of $W$, and the distance is calculated using the point-to-plane equation.

Subject to criterion 3, D is maximized by minimizing $\lVert W \rVert$.

Based on this principle, the parameters (W and b) of the classifier can be determined. A classifier run with the trained parameters (W and b) can then be used to calculate nasopharyngeal carcinoma risk scores for a plurality of test samples.
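As a hedged illustration of how trained parameters might be applied, the following Python sketch evaluates the linear decision value W·M − b for a sample, checks the combined criterion on a training set, and computes the boundary distance 2/‖W‖. The function names and the toy numbers are hypothetical; the sketch assumes W and b have already been obtained by some SVM training procedure and does not perform the training itself.

```python
import math
from typing import Sequence


def svm_decision_value(w: Sequence[float], m: Sequence[float], b: float) -> float:
    """Return W . M - b for one sample; a positive value falls on the
    nasopharyngeal carcinoma side of the hyperplane."""
    if len(w) != len(m):
        raise ValueError("W and M must have the same dimension p")
    return sum(wi * mi for wi, mi in zip(w, m)) - b


def satisfies_criterion_3(w: Sequence[float], b: float,
                          samples: Sequence[Sequence[float]],
                          labels: Sequence[int]) -> bool:
    """Check Y_i * (W . M_i - b) >= 1 for every training sample."""
    return all(y * svm_decision_value(w, m, b) >= 1
               for m, y in zip(samples, labels))


def boundary_distance(w: Sequence[float]) -> float:
    """Distance between the two margin hyperplanes, D = 2 / ||W||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

In this sketch the decision value itself could serve as a continuous risk score for a test sample, with its sign giving the predicted group; how the score is scaled or thresholded in practice is left open.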

In one example, the nasopharyngeal carcinoma risk score is calculated as a weighted sum of EBV genotypes at a fixed set of SNV sites across the viral genome, with the genotypes serving as explanatory variables in a binary logistic regression model. In this example, a set of nasopharyngeal carcinoma-associated SNVs is identified by analyzing differences between the EBV SNV profiles of nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples in a training set. The association of each variation in the EBV genome with nasopharyngeal carcinoma cases can be analyzed, for example, using Fisher's exact test. A fixed set of significant SNVs can then be obtained, e.g., with the false discovery rate (FDR) controlled at 5%. The nasopharyngeal carcinoma risk score of a test sample can be determined from its EBV genotypes at the set of significant SNV sites identified from a training set containing sequencing data from plasma DNA samples of subjects known to have or not to have nasopharyngeal carcinoma. In some cases, the concentration of EBV DNA molecules in plasma may be low, and thus the entire EBV genome may not be completely covered by the sequenced EBV DNA reads. The score can then be determined from the genotype pattern of those SNV sites that are covered by plasma EBV DNA reads (i.e., those with available genotype information). To derive the nasopharyngeal carcinoma risk score, the subset of significant SNV sites covered by plasma EBV DNA reads in the sample can first be identified, and then the weight (effect size) of the genotype at each site within that subset can be determined. A logistic regression model can be constructed to provide information on the effect size of the risk genotype at each SNV site on nasopharyngeal carcinoma:

$$\operatorname{logit}(P) = \ln\frac{P}{1 - P} = \beta_0 + \sum_{k=1}^{n} \beta_k X_k$$

It can be rewritten as:

$$P = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{k=1}^{n} \beta_k X_k\right)}}$$

wherein $n$ is the number of significant SNV sites; $\beta_0$ and $\beta_k$ are coefficients that can be determined by a maximum likelihood estimator; $P$ is the probability of nasopharyngeal carcinoma in an EBV-positive patient; and the variable $X_k$ represents the SNV site at genomic position $k$. If the sample carries the same variant as the EBV reference genome at the site, $X_k$ can be encoded as -1. If the sample carries an alternative variant at the site, $X_k$ can be encoded as 1. If the sample does not cover the variant site to be analyzed, $X_k$ can be encoded as 0. The coefficients $\beta_0$ and $\beta_k$ can be estimated, for example, using the 'Logistic Regression' function in Python, by analyzing the genotype pattern at each site in the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples in the training dataset. Thus, the nasopharyngeal carcinoma risk score of a test sample can be derived from the genotypes it carries at the SNV sites, weighted by the corresponding coefficients $\beta_0$ and $\beta_k$ derived from the trained model.
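The scoring step can be illustrated with a short Python sketch. It assumes the coefficients have already been estimated from a training set; the function names, the site positions, and the coefficient values used in any call are hypothetical and chosen only to show the -1 / 1 / 0 genotype encoding and the logistic transformation described above.

```python
import math
from typing import Dict, Optional


def encode_site(observed: Optional[str], reference: str) -> int:
    """Encode one SNV site: 0 if the site is not covered by any read,
    -1 if the sample matches the reference base, 1 otherwise."""
    if observed is None:
        return 0
    return -1 if observed == reference else 1


def npc_risk_score(sample: Dict[int, str], reference: Dict[int, str],
                   beta0: float, betas: Dict[int, float]) -> float:
    """Probability-scale risk score from a logistic model.

    sample    : observed base at each covered SNV site (position -> base)
    reference : reference base at each significant SNV site
    beta0     : intercept (assumed to come from a trained model)
    betas     : per-site coefficients (assumed to come from a trained model)
    """
    linear = beta0
    for position, beta_k in betas.items():
        x_k = encode_site(sample.get(position), reference[position])
        linear += beta_k * x_k
    return 1.0 / (1.0 + math.exp(-linear))
```

Because uncovered sites are encoded as 0, they contribute nothing to the weighted sum, so a sample with sparse EBV read coverage is scored only on the sites for which genotype information is available.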

Biological sample

Biological samples used in the various methods provided herein can include any tissue or material obtained from a living or dead subject. The biological sample may be a cell-free sample. The biological sample may comprise nucleic acids (e.g., DNA or RNA) or fragments thereof. The nucleic acid in the sample may be cell-free nucleic acid. The sample may be a liquid sample or a solid sample (e.g., a cell or tissue sample). The biological sample may be a bodily fluid, such as blood, plasma, serum, urine, mouth wash, a nasal brush sample, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal wash, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, nipple discharge, aspirates from various parts of the body (e.g., thyroid, breast), etc. Fecal samples may also be used. In various examples, a majority of the DNA in a biological sample that has been enriched for cell-free DNA (e.g., a plasma sample obtained by a centrifugation protocol) can be cell-free (e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free). Biological samples can be processed to physically disrupt tissue or cellular structures (e.g., by centrifugation and/or cell lysis) to release intracellular components into a solution that can further include enzymes, buffers, salts, detergents, and the like used to prepare the sample for analysis.

The various methods and systems provided herein can be used to analyze nucleic acid molecules in a biological sample. The nucleic acid molecules can be cellular nucleic acid molecules, cell-free nucleic acid molecules, or both. The cell-free nucleic acids used by the various methods provided herein can be extracellular nucleic acid molecules in a biological sample. Cell-free nucleic acid molecules can be present in various body fluids, such as blood, saliva, semen, and urine. Cell-free DNA molecules can be produced as a result of cell death in various tissues due to health conditions and/or diseases such as viral infections and tumor growth. The plurality of cell-free nucleic acid molecules can comprise a sequence resulting from a pathogen integration event.

The plurality of cell-free nucleic acid molecules, e.g., cell-free DNA, used in the methods described herein can be present in plasma, urine, saliva, or serum. Cell-free DNA can occur naturally in the form of short fragments. Cell-free DNA fragmentation refers to the process by which high molecular weight DNA (e.g., DNA in the nucleus of a cell) is cut, fragmented, or digested into short fragments when cell-free DNA molecules are produced or released. The various methods and systems provided herein can be used to analyze cellular nucleic acid molecules in certain circumstances, for example, cellular DNA from tumor tissue, or from leukocytes when a patient has leukemia, lymphoma, or myeloma. According to some examples of the present disclosure, various analyses (assays and analyses) may be performed on samples taken from tumor tissue.

Subject

The various methods and systems provided herein can be used to analyze a sample from a subject (e.g., an organism, such as a host organism). The subject can be any human patient, such as a cancer patient, a patient at risk of cancer, or a patient with a family or personal history of cancer. In some cases, the subject is at a particular stage of cancer treatment. In certain instances, the subject may have or be suspected of having cancer. In some cases, it is unknown whether the subject has cancer.

In certain instances, the subject receives or does not receive treatment for a pathogen-associated disease based on the results of the screening assays provided herein. In one example, although the first screening analysis shows a positive result, indicating a high risk of the subject developing a pathogen-associated disease, a subsequent diagnostic examination finds that the subject does not have the pathogen-associated disease (e.g., EBV-associated nasopharyngeal carcinoma). In this case, the subject does not receive medical treatment, such as, but not limited to, treatment with a therapeutic agent (e.g., chemotherapy), radiation therapy, surgery, or any combination thereof. In another example, the subject is identified by screening as being at high risk of developing a pathogen-associated disease (e.g., HPV-associated cervical cancer) and is further diagnosed as having the disease. As a result, the subject may receive medical treatment for the disease, such as, but not limited to, surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, or any combination thereof.

Pathogen-associated diseases to which the various methods and systems provided herein are applicable may include proliferative diseases, such as cancer. These diseases may be associated with or caused by pathogens such as viruses, bacteria, or fungi. Viruses that may be associated with the diseases described herein may include EBV, Kaposi's sarcoma-associated herpesvirus (KSHV), HPV (such as, but not limited to, HPV 16, 18, 31, 33, 34, 35, 39, 45, 51, 52, 56, 58, 59, 66, 68, and 70) (Burd et al, Clin Microbiol Rev, 2003; 16:1-17), Merkel cell polyomavirus (MCPV), HBV, HCV, and human T-lymphotropic virus-1 (HTLV-1). Suitable pathogen-associated cancers may include Burkitt's lymphoma, Hodgkin's lymphoma, immunosuppression-related lymphoma, T cell and NK cell lymphoma, nasopharyngeal carcinoma, or gastric cancer, which may be associated with EBV. Suitable pathogen-associated cancers may include primary effusion lymphoma or Kaposi's sarcoma, which may be associated with KSHV. Suitable pathogen-associated cancers may include cervical cancer, head and neck cancer, or anogenital tract cancer, which may be associated with HPV. Suitable pathogen-associated cancers may include Merkel cell carcinoma, which may be associated with MCPV. Suitable pathogen-associated cancers may include hepatocellular carcinoma (HCC), which may be associated with HBV or HCV. Suitable pathogen-associated cancers may include adult T-cell leukemia/lymphoma, which may be associated with HTLV-1.

The subject may have, or be at risk of having, any type of cancer or tumor. For example, the subject may have nasopharyngeal carcinoma or a cancer of the nasal cavity. In another example, the subject may have oropharyngeal cancer or cancer of the oral cavity. Non-limiting examples of cancer include adrenal cancer, anal cancer, basal cell carcinoma, cholangiocarcinoma, bladder cancer, bone cancer, brain tumor, breast cancer, bronchial cancer, cardiovascular cancer, cervical cancer, colon cancer, colorectal cancer, digestive system cancer, endocrine system cancer, endometrial cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal tumor, hepatocellular carcinoma, kidney cancer, hematopoietic malignancy, laryngeal cancer, leukemia, liver cancer, lung cancer, lymphoma, melanoma, mesothelioma, muscular system cancer, myelodysplastic syndrome (MDS), myeloma, nasal cavity cancer, nasopharyngeal cancer, nervous system cancer, lymphatic system cancer, oral cavity cancer, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, penile cancer, pituitary tumor, prostate cancer, rectal cancer, renal pelvis cancer, reproductive system cancer, respiratory system cancer, sarcoma, salivary gland cancer, skeletal system cancer, skin cancer, small intestine cancer, gastric cancer, testicular cancer, thymus cancer, thyroid cancer, urinary system cancer, uterine cancer, vaginal cancer, or vulvar cancer. The lymphoma can be any type of lymphoma, including a B cell lymphoma (e.g., diffuse large B cell lymphoma, follicular lymphoma, small lymphocytic lymphoma, mantle cell lymphoma, marginal zone B cell lymphoma, Burkitt's lymphoma, lymphoplasmacytic lymphoma, hairy cell leukemia, or primary central nervous system lymphoma) or a T cell lymphoma (e.g., precursor T lymphocytic lymphoma or peripheral T cell lymphoma). 
The leukemia may be any type of leukemia, including acute leukemia or chronic leukemia. Types of leukemia include acute myelogenous leukemia, chronic myelogenous leukemia, acute lymphocytic leukemia, acute undifferentiated leukemia, and chronic lymphocytic leukemia. In some cases, a cancer patient does not have a particular type of cancer. For example, in some cases, a patient may have a cancer that is not breast cancer.

Examples of cancer include cancers that cause solid tumors and cancers that do not. Furthermore, any cancer referred to herein may be a primary cancer (e.g., a cancer named after the part of the body in which it initially begins to grow) or a secondary or metastatic cancer (e.g., a cancer originating from another part of the body).

A subject diagnosed by any of the methods described herein can be of any age, and can be an adult, an infant, or a child. In some cases, the subject is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 78, 80, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99 years old, or within an age range (e.g., 2 to 20, 40, or 40 years). One particular class of patients that may benefit is those over the age of 40. Another particular class of patients that may benefit is pediatric patients. Further, a subject diagnosed by any of the methods or combinations described herein can be male or female.

In some embodiments, the methods of the present disclosure can detect a tumor or cancer in a subject, wherein the tumor or cancer has a geographic pattern of disease. In one example, a subject may have an EBV-associated cancer (e.g., nasopharyngeal carcinoma), which may be prevalent in southern China (e.g., Hong Kong). In another example, the subject may have an HPV-associated cancer (e.g., oropharyngeal cancer), which may be prevalent in the United States and Western Europe. In another example, a subject may have an HTLV-1-associated cancer (e.g., adult T-cell leukemia/lymphoma), which may be prevalent in southern Japan, the Caribbean, Central Africa, parts of South America, and some immigrant groups in the southeastern United States.

Any of the methods disclosed herein can also be performed on a non-human subject (e.g., a laboratory or farm animal) or a cell sample derived from an organism disclosed herein. Non-limiting examples of such non-human subjects include dogs, goats, guinea pigs, hamsters, mice, pigs, non-human primates (e.g., gorillas, apes, orangutans, lemurs, or baboons), sheep, cows, and zebrafish.

Computer system

Any of the methods disclosed herein can be performed and/or controlled by one or more computer systems. In some examples, any of the steps of the various methods disclosed herein can be performed and/or controlled entirely, separately or sequentially by one or more computer systems. Any computer system mentioned herein may utilize any suitable number of subsystems. In some embodiments, the computer system comprises a single computer device, wherein the subsystem may be multiple components of the computer device. In other embodiments, a computer system may contain multiple computer devices, each being a subsystem having multiple internal components. Computer systems may include desktop and laptop computers, tablet computers, mobile phones, and other mobile devices.

The subsystems may be interconnected by a system bus. Other subsystems include a printer, a keyboard, a storage device, and a display coupled to a display adapter. Peripheral devices and input/output (I/O) devices coupled to an I/O controller may be connected to the computer system by any number of connections known in the art, such as input/output (I/O) ports (e.g., USB). For example, an I/O port or an external interface (e.g., Ethernet, Wi-Fi, etc.) can be used to connect the computer system to a wide area network, such as the Internet, to a mouse input device, or to a scanner. The interconnection via the system bus allows the central processor to communicate with each subsystem and to control the execution of instructions from a system memory or storage device, such as a fixed disk (e.g., a hard drive or optical disk), as well as the exchange of information between the subsystems. The system memory and/or storage device may be implemented as computer-readable media. Another subsystem is a data collection device such as a camera, microphone, accelerometer, etc. Any data mentioned herein may be output from one component to another component, and may be output to a user.

The computer system may contain a number of identical components or subsystems, connected together, for example, through an external interface or an internal interface. In some embodiments, computer systems, subsystems, or devices may communicate over a network. In this case, one computer may be considered a client and another computer may be considered a server, where each computer may be part of the same computer system. The client and server may each contain multiple systems, subsystems, or components.

The present disclosure provides a computer control system programmed to carry out the method of the present disclosure for stratifying risk of pathogen-associated disease. Fig. 21 illustrates a computer system 1101 that is programmed or otherwise configured to analyze a plurality of cell-free nucleic acid molecules or sequence reads thereof, analyze other factors associated with risk of disease, assess risk, or generate reports indicative of risk as described herein. The computer system 1101 can perform and/or regulate various aspects of the methods provided in the present disclosure, for example, controlling sequencing of nucleic acid molecules from a biological sample, performing various steps of bioinformatic analysis of sequencing data as described herein, integrating data collection, analysis and result reporting, and data management. The computer system 1101 may be the user's electronic device or a computer system remotely located from the electronic device. The electronic device may be a mobile electronic device.

The computer system 1101 contains a central processing unit (CPU, also referred to herein as "processor" and "computer processor") 1105, which may be a single-core or multi-core processor, or a plurality of processors for parallel processing. Computer system 1101 also contains memory or memory locations 1110 (e.g., random access memory, read only memory, flash memory), an electronic storage unit 1115 (e.g., hard disk), a communication interface 1120 (e.g., a network adapter) for communicating with one or more other systems, and peripheral devices 1125 such as a cache, other memory, data storage, and/or an electronic display adapter. Memory 1110, storage unit 1115, interface 1120, and peripheral devices 1125 communicate with CPU 1105 through a communication bus (solid lines), such as a motherboard. The storage unit 1115 may be a data storage unit (or data store) for storing data. Computer system 1101 may be operatively coupled to a computer network ("network") 1130 by way of the communication interface 1120. The network 1130 may be the Internet, an intranet and/or extranet, or an intranet and/or extranet in communication with the Internet. In some cases, network 1130 is a telecommunications and/or data network. The network 1130 may include one or more computer servers, which may perform distributed computing, such as cloud computing. In some cases, the network 1130, with the aid of the computer system 1101, may implement a peer-to-peer network, which may enable devices coupled to the computer system 1101 to act as clients or servers.

The CPU 1105 may execute a sequence of machine-readable instructions, which may be embodied in a program or software. The instructions may be stored in a memory location, such as memory 1110. These instructions may be directed to the CPU 1105, which may then program or otherwise configure the CPU 1105 to perform the methods of the present disclosure. Examples of operations performed by the CPU 1105 may include fetch, decode, execute, and write-back.

The CPU 1105 may be part of a circuit, such as an integrated circuit. One or more other components of the system 1101 may be included in a circuit. In some cases, the circuit is an Application Specific Integrated Circuit (ASIC).

The storage unit 1115 may store files such as drivers, libraries, and saved programs. The storage unit 1115 may store user data, such as user preferences and user programs. In some cases, the computer system 1101 may contain one or more additional data storage units external to the computer system 1101, for example, located on a remote server that communicates with the computer system 1101 over an intranet or the internet.

The computer system 1101 may communicate with one or more remote computer systems over the network 1130. For example, the computer system 1101 may communicate with a remote computer system of a user (e.g., a smartphone equipped with an application that receives and displays the sample analysis results sent from the computer system 1101). Examples of remote computer systems include personal computers (e.g., a portable PC), slate or tablet PCs, telephones, smartphones (e.g., an Android device), and personal digital assistants. A user may access computer system 1101 via network 1130.

The various methods described herein may be carried out by machine (e.g., computer processor) executable code stored on an electronic storage location (e.g., memory 1110 or electronic storage unit 1115) of the computer system 1101. The machine executable or machine readable code may be provided in the form of software. During use, the code may be executed by the processor 1105. In some cases, code may be retrieved from storage 1115 and stored in memory 1110 for ready access by processor 1105. In some cases, electronic storage unit 1115 may be eliminated, and the machine-executable instructions stored on memory 1110.

The code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or may be compiled at runtime. The code may be provided in a programming language that can be selected to enable the code to be executed in either a pre-compiled or an as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 1101, may be embodied in programming. Various aspects of the technology may be considered "products" or "articles of manufacture," typically in the form of machine (or processor) executable code and/or associated data carried on or embodied in a machine-readable medium. The machine executable code may be stored on an electronic storage unit, such as a memory (e.g., read only memory, random access memory, flash memory) or a hard disk. "Storage" type media may include any or all of the tangible memory of a computer, processor, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives, and the like, which may provide non-transitory storage for the software programming at any time. All or portions of the software may at times be communicated over the Internet or various other telecommunications networks. Such communication may, for example, enable loading of the software from one computer or processor into another, such as from a management server or host computer into the computer platform of an application server. Thus, another type of media that may carry the software elements includes optical, electrical, and electromagnetic waves, such as those used across physical interfaces between local devices, through wired and optical landline networks, and over various air-links. Physical elements that carry such waves, such as wired or wireless links, optical links, and the like, may also be considered media carrying the software. As used herein, unless restricted to non-transitory, tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.

Thus, a machine-readable medium, such as computer executable code, may take many forms, including but not limited to tangible storage media, carrier wave media, or physical transmission media. Non-volatile storage media include, for example, optical or magnetic disks, any storage device in any computer, etc., such as may be used to implement the databases and the like shown in the figures. Volatile storage media include dynamic memory, such as the main memory of a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus in a computer system. Carrier-wave transmission media can take the form of electrical or electromagnetic signals, or acoustic or light waves, such as those generated during Radio Frequency (RF) and Infrared (IR) data communications. Thus, common forms of computer-readable media include, for example: floppy disks (flexible disks), hard disks, magnetic tape, any other magnetic medium, CD-ROMs, DVD or DVD-ROMs, any other optical medium, punch cards, any other physical storage medium with patterns of holes, RAMs, ROMs, PROMs, and EPROMs, flash-EPROMs, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such carrier waves, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 1101 may include or be in communication with an electronic display 1135 that includes a user interface (UI) 1140 for providing, for example, sample analysis results such as, but not limited to, a graphical display of pathogen integration profiles, genomic locations of pathogen integration breakpoints, disease classifications (e.g., disease or cancer type and cancer level), and treatment or preventative action recommendations based on the disease classifications. Examples of UIs include, but are not limited to, graphical user interfaces (GUIs) and web-based user interfaces.

The various methods and systems of the present disclosure may be carried out by one or more algorithms. The algorithms may be implemented in software when executed by the central processing unit 1105. For example, the algorithm can control sequencing of nucleic acid molecules from a sample, collect sequencing data directly, analyze sequencing data, perform block-based analysis of variation patterns, assess risk, or generate reports indicative of risk.

In some cases, as shown in Fig. 22, a sample 1202 may be obtained from a subject 1201 (e.g., a human subject). The sample 1202 may be subjected to one or more methods as described herein, e.g., performing an analysis. In some cases, the analysis may comprise hybridization, amplification, sequencing, labeling, epigenetically modified bases, or any combination thereof. One or more results from the methods may be input into the processor 1204. One or more input parameters (e.g., sample identification, subject identification, sample type, reference, or other information) may be input into the processor 1204. One or more metrics from the analysis may be input into the processor 1204 so that the processor may generate results, such as a classification of a condition (e.g., a diagnosis) or a treatment recommendation. The processor may send the results, input parameters, metrics, references, or any combination thereof to a display 1205, such as a visual display or graphical user interface. The processor 1204 may: (i) send the results, input parameters, metrics, or any combination thereof to the server 1207, (ii) receive the results, input parameters, metrics, or any combination thereof from the server 1207, or (iii) a combination thereof.

Aspects of the present disclosure may be implemented in the form of control logic using hardware (e.g., an application specific integrated circuit or a field programmable gate array) and/or using computer software with a general programmable processor in a modular or integrated manner. As used herein, a processor includes a single-core processor, a multi-core processor on the same integrated chip, or multiple processing units on a single circuit board or network. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the various embodiments described herein using hardware and a combination of hardware and software.

Any software component or function described in this application may be implemented as software code to be executed by a processor using any suitable computer language, such as Java, C++, C, Objective-C, Swift, or a scripting language (e.g., Perl or Python), using, for example, conventional or object-oriented techniques. The software code may be stored on a computer readable medium as a series of instructions or commands for storage and/or transmission. Suitable non-transitory computer readable media may include random access memory (RAM), read only memory (ROM), magnetic media such as a hard drive or floppy disk, optical media such as a compact disc (CD) or DVD (digital versatile disc), flash memory, and the like. The computer readable medium may be any combination of such storage or transmission devices.

Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the internet. Accordingly, a computer readable medium may be created using a data signal encoded with such a program. A computer readable medium encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard disk, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. The computer system may include a monitor, printer, or other suitable display for providing any of the results described herein to a user.

Any of the methods described herein may be performed in whole or in part using a computer system comprising one or more processors, which may be configured to perform the various steps. Accordingly, embodiments may be directed to a computer system configured to perform the steps of any of the methods described herein, wherein different components perform the respective steps or respective groups of steps. Although presented as numbered steps, the method steps herein may be performed simultaneously or in a different order. Further, some of these steps may be used with some of the other steps from other methods. Further, all or part of the steps may be optional. Additionally, any steps of any method may be performed using modules, units, circuits, or other methods for performing the steps.

Other embodiments

The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter presented.

It is to be understood that the methods described herein are not limited to the particular methods, protocols, objects, and sequencing techniques described herein, and thus may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the various methods and combinations described herein, which will be limited only by the appended claims. While some embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.

For purposes of illustration, several aspects are described with reference to an example application. Any embodiment may be combined with any other embodiment unless otherwise indicated. It should be understood that numerous specific details, relationships, and methods are set forth to provide a full understanding of the features described herein. One skilled in the relevant art will readily recognize, however, that the features described herein can be practiced without one or more of the specific details, or with other methods. The features described herein are not limited by the illustrated ordering of acts or events, as some acts may occur in different orders and/or concurrently with other acts or events. Moreover, not all illustrated acts or events are required to be implemented as a method in accordance with the features described herein.

Examples of the invention

The following examples are provided to further illustrate some embodiments of the present disclosure, but are not intended to limit the scope of the present disclosure; it will be appreciated by way of example that other procedures, methods or techniques known to those skilled in the art may alternatively be used.

Example 1 Screening of more than 20000 subjects for nasopharyngeal carcinoma over 4 years

This example describes a large-scale screening study that followed a cohort of more than 20000 subjects over a period of approximately 4 years. Figure 1 shows the design of the study. In the first round of screening, 20000 men aged between 40 and 62 years were screened for nasopharyngeal carcinoma using plasma EBV DNA analysis. Subjects with detectable plasma EBV DNA were retested with a second blood sample after a median of 4 weeks. The purpose of this arrangement was to distinguish patients with nasopharyngeal carcinoma from subjects without nasopharyngeal carcinoma in whom plasma EBV DNA was detectable. A previous study showed that the presence of plasma EBV DNA is often a transient phenomenon in subjects without nasopharyngeal carcinoma. In two thirds of these individuals, plasma EBV DNA became undetectable after two weeks. Subjects with persistently positive plasma EBV DNA results were further investigated using nasal endoscopy and magnetic resonance imaging (MRI) of the nasopharynx to confirm or rule out the presence of nasopharyngeal carcinoma. With this arrangement, 34 cases of nasopharyngeal carcinoma were identified.

Subsequently, another round (the second round) of nasopharyngeal carcinoma screening was performed on the cohort at a median of 4 years after the first round of screening. In the second round, as in the first round, subjects with a positive test result were tested again after about 4 weeks. Subjects who were positive on both of the two consecutive tests taken 4 weeks apart received further examination by nasal endoscopy and MRI. The second round of screening began in 2017. By September 15, 2018, a total of 8335 subjects had completed the second round of screening. 784 (9.4%) subjects were positive for plasma EBV DNA. Plasma EBV DNA was still detectable in 230 (2.7%) subjects at the 4-week retest. Table 1 summarizes the results of the two rounds of nasopharyngeal carcinoma screening.

TABLE 1 plasma EBV DNA status in first and second nasopharyngeal carcinoma screening rounds

As shown in Table 1, the probability that plasma EBV DNA could be detected in the second round of nasopharyngeal carcinoma screening correlated with the plasma EBV DNA status in the first round of screening. In the initial analysis of the second round of screening, plasma EBV DNA was detectable in 8%, 21%, and 57% of subjects whose first-round plasma EBV DNA status was negative, transiently positive, and persistently positive, respectively. Furthermore, the probability of being persistently positive for plasma EBV DNA at the 4-week retest increased progressively from 2% to 25% across the three groups.

The stage distribution of the nasopharyngeal carcinoma patients identified by the screening described herein was much earlier than that of patients in a historical cohort that did not receive nasopharyngeal carcinoma screening. The percentage of early stage disease (stages I and II) was 70% in the screened cohort and 20% in the historical cohort. This shift in stage distribution resulted in a significant improvement in progression-free survival for the patients, with a hazard ratio of 0.1. Table 2 summarizes the staging of the nasopharyngeal carcinoma cases in the first and second rounds of screening. After 8335 subjects had been screened in the second round, 13 new cases of nasopharyngeal carcinoma were found. The percentage of patients with early stage disease in the first and second rounds of screening was 71% and 69%, respectively. These percentages did not differ significantly (P = 0.93, chi-square test).
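The comparison of early stage proportions between the two rounds can be checked with a short calculation. The sketch below assumes an uncorrected Pearson chi-square test on a 2x2 table of early (stage I and II) versus late (stage III and IV) cases, using the counts listed in Table 2; the source does not state which variant of the test was used, and the function name is illustrative only.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Early (stage I + II) and late (stage III + IV) case counts taken from Table 2.
first_round = (16 + 8, 8 + 2)   # 24 early, 10 late
second_round = (4 + 5, 4 + 0)   # 9 early, 4 late

stat, p_value = chi_square_2x2(*first_round, *second_round)
print(f"chi-square = {stat:.4f}, P = {p_value:.2f}")
```

Under these assumptions the calculation gives a P value of about 0.93, in line with the value reported above.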

TABLE 2 Staging of nasopharyngeal carcinoma patients in the two rounds of screening

Stage    First round of screening    Second round of screening
I        16 (47%)                    4 (31%)
II       8 (24%)                     5 (38%)
III      8 (24%)                     4 (31%)
IV       2 (6%)                      0 (0%)

As summarized in Table 3, in the second round of screening performed 4 years after the first round, subjects whose plasma EBV DNA was transiently or persistently detectable in the first round were at higher risk of being found to have nasopharyngeal carcinoma than subjects whose plasma EBV DNA was undetectable in the first round. The relative risks for these two groups were 7.2 and 19.7, respectively.

TABLE 3 number of nasopharyngeal carcinoma cases identified in the second round of screening (sorted by first round plasma EBV DNA status)

These results indicate that plasma EBV DNA analysis is useful not only for screening for current nasopharyngeal carcinoma, but also for predicting the future risk of developing clinically observable nasopharyngeal carcinoma. One practical application of this finding is to tailor the time interval between repeated screenings to the plasma EBV DNA status of the subject at the earlier screening. For example, subjects with detectable plasma EBV DNA at baseline but no identifiable nasopharyngeal carcinoma can be rescreened after a shorter time interval than subjects with undetectable plasma EBV DNA. As a further illustration, the interval between repeated screenings could be 4 years, 2 years, and 1 year for subjects with undetectable, transiently detectable, and persistently detectable plasma EBV DNA, respectively.
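The interval tailoring described above can be sketched as a simple lookup. The class, function, and variable names below are hypothetical and introduced only for illustration; the intervals of 4, 2, and 1 years are the illustrative values given in the text, not a prescribed schedule.

```python
from enum import Enum


class EbvStatus(Enum):
    UNDETECTABLE = "undetectable"
    TRANSIENT = "transiently detectable"
    PERSISTENT = "persistently detectable"


# Illustrative rescreening intervals (in years) taken from the text above.
RESCREEN_INTERVAL_YEARS = {
    EbvStatus.UNDETECTABLE: 4,
    EbvStatus.TRANSIENT: 2,
    EbvStatus.PERSISTENT: 1,
}


def classify_status(first_test_positive, retest_positive=None):
    """Classify plasma EBV DNA status from the initial test and the 4-week retest."""
    if not first_test_positive:
        return EbvStatus.UNDETECTABLE
    if retest_positive is None:
        raise ValueError("a retest result is required after a positive first test")
    return EbvStatus.PERSISTENT if retest_positive else EbvStatus.TRANSIENT


def next_screening_year(current_year, status):
    """Return the year of the next screening; higher risk gives a shorter interval."""
    return current_year + RESCREEN_INTERVAL_YEARS[status]


print(next_screening_year(2017, classify_status(True, retest_positive=False)))
```

The mapping keeps the interval inversely related to the risk group, so a real program could substitute other intervals without changing the logic.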

Example 2 nasopharyngeal carcinoma screening based on plasma EBV-DNA detectability

This example describes a nasopharyngeal carcinoma screening regimen designed for a subject based on the detectability of EBV DNA in the subject's plasma. Fig. 2 shows a schematic diagram of a scheme as described herein.

According to this protocol, subjects with no detectable plasma EBV DNA at an earlier screening are rescreened after 4 years, since subjects with no detectable EBV DNA are at relatively low risk of developing nasopharyngeal carcinoma in the following 4 years. If the subsequent screen is also negative for plasma EBV DNA, the interval to the next screen remains 4 years. However, when EBV DNA is detected in a subject at one screen but no nasopharyngeal carcinoma is found, the next screen is scheduled one year later. When plasma EBV DNA has remained persistently negative for 4 years, the screening interval returns to 4 years. The actual time interval used by a particular screening program can also be adjusted based on health economics (e.g., screening costs), subject preferences (e.g., more frequent screening may be more disruptive to the lifestyle of certain subjects), and other clinical parameters (e.g., personal genotype, family history of nasopharyngeal carcinoma, dietary history, and ethnic origin (e.g., Cantonese)).

EXAMPLE 3 analysis of variation patterns of cell-free EBV DNA molecules

In this example, targeted sequencing with capture enrichment was used to analyze circulating cell-free viral DNA molecules in nasopharyngeal carcinoma subjects, non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA, and pre-nasopharyngeal carcinoma subjects (see section below for details). Capture probes were designed to cover the entire EBV genome. Probes targeting about 3000 common human single nucleotide polymorphism (SNP) sites and human leukocyte antigen (HLA) SNPs were also included in the same assay.

In this example, plasma EBV DNA was analyzed in 13 nasopharyngeal carcinoma patients and 16 non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA. The 13 patients with nasopharyngeal carcinoma presented with symptoms and were recruited from the clinical oncology or otorhinolaryngology departments of the Prince of Wales Hospital. The 16 non-nasopharyngeal carcinoma subjects were drawn from the nasopharyngeal carcinoma screening cohort of more than 20000 subjects described in Example 1.

In this analysis, targeted sequencing with capture enrichment by specially designed capture probes was used. For each plasma sample analyzed, DNA was extracted from 4 ml of plasma using the QIAamp Circulating Nucleic Acid Kit. For each case, all extracted DNA was used for sequencing library preparation with the TruSeq Nano DNA library preparation kit (Illumina). Barcoding was performed using a dual-indexing system that incorporates unique molecular identifier (UMI) sequences (xGen Dual Index UMI Adapters, Integrated DNA Technologies). The adaptor-ligated samples were subjected to eight cycles of PCR amplification using the TruSeq Nano kit (Illumina). The amplification products were then captured using the myBaits custom capture panel system (Arbor Biosciences) with custom-designed probes covering the viral and human genomic regions described above. After target capture, the captured products were enriched by 14 cycles of PCR to generate DNA libraries. The DNA libraries were sequenced on the NextSeq platform (Illumina). For each sequencing run, ten samples with unique sample barcodes were sequenced in paired-end mode, with 71 nucleotides sequenced from each of the two ends of each DNA fragment. After sequencing, the sequence reads were mapped to an artificially combined reference sequence consisting of the entire human genome (hg19), the entire EBV genome (GenBank: AJ507799.2), the entire HBV genome, and the entire HPV genome. Alignment was performed using SOAP2 (Bioinformatics 2009; 25:1966-7), allowing a maximum of 2 mismatches per read and requiring both ends to align in the correct orientation with an insert size not exceeding 600 bp. Sequencing reads that mapped to a unique location in the combined genomic sequence were used for downstream analysis. All duplicated fragments with the same unique molecular identifier were filtered out.
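The read-selection rules in the preceding paragraph (at most 2 mismatches per read, correct orientation, insert size of at most 600 bp, unique mapping, and removal of UMI duplicates) can be sketched as a post-alignment filter. This is a minimal illustration and not the SOAP2 interface: the `AlignedPair` record and its fields are hypothetical, and treating fragments as duplicates only when they share both mapped coordinates and UMI is an assumption, since the source says only that fragments with the same UMI are filtered.

```python
from dataclasses import dataclass

MAX_MISMATCHES = 2     # per read, as stated in the text
MAX_INSERT_SIZE = 600  # bp, as stated in the text


@dataclass(frozen=True)
class AlignedPair:
    """Hypothetical record for one aligned paired-end fragment."""
    contig: str               # e.g. "EBV" or "chr1" in the combined reference
    start: int                # leftmost mapped coordinate (inclusive)
    end: int                  # rightmost mapped coordinate (exclusive)
    umi: str                  # unique molecular identifier sequence
    mismatches_read1: int
    mismatches_read2: int
    proper_orientation: bool  # both ends aligned in the expected orientation
    n_locations: int          # number of places the pair maps in the reference


def passes_filters(pair):
    """Apply the mismatch, orientation, insert-size, and unique-mapping rules."""
    insert_size = pair.end - pair.start
    return (
        pair.mismatches_read1 <= MAX_MISMATCHES
        and pair.mismatches_read2 <= MAX_MISMATCHES
        and pair.proper_orientation
        and 0 < insert_size <= MAX_INSERT_SIZE
        and pair.n_locations == 1
    )


def deduplicate(pairs):
    """Keep the first fragment for each (contig, start, end, UMI) combination."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.contig, pair.start, pair.end, pair.umi)
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


pairs = [
    AlignedPair("EBV", 100, 250, "AAGT", 0, 1, True, 1),    # kept
    AlignedPair("EBV", 100, 250, "AAGT", 1, 0, True, 1),    # UMI duplicate
    AlignedPair("EBV", 100, 250, "CCTA", 0, 0, True, 1),    # same site, new UMI
    AlignedPair("EBV", 300, 1000, "GGAT", 0, 0, True, 1),   # insert size > 600 bp
    AlignedPair("chr1", 500, 640, "TTAC", 3, 0, True, 1),   # too many mismatches
    AlignedPair("chr2", 900, 1040, "ACGT", 0, 0, True, 2),  # not uniquely mapped
]
usable = deduplicate([p for p in pairs if passes_filters(p)])
print([p.umi for p in usable])
```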

Based on the alignment results, nucleotide differences, including but not limited to single nucleotide variations (SNVs), were identified between the sequence reads and the EBV reference genome (GenBank: AJ507799.2). Across the 44 samples from the 13 nasopharyngeal carcinoma subjects, the 16 non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA, and the 4 pre-nasopharyngeal carcinoma subjects, a median of 1116 SNVs was identified (interquartile range (IQR): 902-). In these plasma samples, two different alleles were observed at certain nucleotide positions of the EBV genome. This observation may be due to sequencing errors or to tumor heterogeneity. Only a median of 26 positions (IQR: 20-35) had more than one allele in plasma EBV DNA.

In the phylogenetic tree analysis shown in Fig. 3, the nasopharyngeal carcinoma subjects clustered together and were separated from the non-nasopharyngeal carcinoma subjects. These results indicate that the EBV variant profiles of nasopharyngeal carcinoma subjects and non-nasopharyngeal carcinoma subjects are distinct. Thus, analysis of the EBV variant profile of plasma EBV DNA can be used to distinguish nasopharyngeal carcinoma subjects from non-nasopharyngeal carcinoma subjects in screening. Three non-nasopharyngeal carcinoma subjects (AC106, AP080 and FF159) each had two serial samples collected 4 weeks apart. The two samples from the same individual clustered together, indicating that they had very similar variants.

Phylogenetic tree analysis was also performed on the same cohort of 13 nasopharyngeal carcinoma patients and 16 non-nasopharyngeal carcinoma subjects with detectable plasma EBV DNA using the EBV variants other than the 29 variants reported in the study by Hui et al. (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049). As shown in Fig. 4, the nasopharyngeal carcinoma subjects again clustered together and were separated from the non-nasopharyngeal carcinoma subjects.

Four subjects who were persistently positive for plasma EBV DNA in the first round of screening (as described in Example 1), but in whom no nasopharyngeal carcinoma was detected on endoscopy and MRI, were subsequently diagnosed with nasopharyngeal carcinoma. All 4 subjects (BB096, DN054, FK015 and HB121) were diagnosed with nasopharyngeal carcinoma 3 years after the first round of screening. All 4 subjects had an additional plasma sample taken 1 year after the first round of screening during follow-up at the otolaryngology clinic. For each of these four subjects, EBV variant analysis was performed on the two samples, collected at the first round of screening and 1 year later. As shown in Fig. 5, the samples from the pre-nasopharyngeal carcinoma subjects clustered with the nasopharyngeal carcinoma samples, indicating that EBV variants associated with nasopharyngeal carcinoma were present before the clinical onset of the cancer. This indicates that individuals carrying EBV variants associated with nasopharyngeal carcinoma are at higher risk of developing nasopharyngeal carcinoma in the future. Phylogenetic tree analysis was also performed on the nasopharyngeal carcinoma, non-nasopharyngeal carcinoma, and pre-nasopharyngeal carcinoma subjects of the same cohort using the EBV variants other than the 29 variants reported in the study by Hui et al. (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049). As shown in Fig. 6, the samples from the pre-nasopharyngeal carcinoma subjects still clustered with the nasopharyngeal carcinoma samples, further indicating that analysis of EBV variants could predict the risk of future nasopharyngeal carcinoma.

Example 4 Block-based mutation Pattern analysis

This example describes the working principles of an exemplary block-based mutation pattern analysis method and its application in the EBV mutation pattern analysis of the samples described in example 3.

Fig. 7 illustrates the principle of block-based mutation pattern analysis. Block-based analysis assesses the similarity between the EBV DNA variation pattern obtained by plasma EBV DNA sequencing of a test sample and the variation patterns of a reference set. Nasopharyngeal carcinoma sequencing data available in public databases (Kwok et al., J Virol 2014; 88:10662-72; Li et al., Nat Commun 2017; 8:14121) were used herein as the reference. In the block-based analysis, the EBV genome was divided into bins of 500 bp (344 bins in total), and the pattern of variation in each bin was compared for similarity with each of the 24 nasopharyngeal carcinoma samples in the reference set. As an example, if there are 8 variant sites within a particular bin, the alleles of the test sample at those sites are compared with the alleles at the same sites in each of the 24 reference samples. A similarity index is derived from the proportion of sites at which the allele is identical to that of the reference sample. For example, if a test sample and a reference sample have identical alleles at 7 of the 8 variant sites, the similarity index of the bin relative to that reference sample is 7/8. Each bin of the test sample therefore has 24 similarity indices, one for each of the 24 reference samples. A bin score, representing the overall similarity of the variation pattern to the reference samples, is calculated from the 24 similarity indices of the bin. For example, if the cutoff value of the similarity index is set to 0.9, the bin score is the proportion of the similarity indices that are above the cutoff value. Thus, if only two of the 24 similarity indices are above 0.9, the bin score is 2/24. The higher the bin score, the more similar the variation pattern of the test sample is to that of the set of reference samples.
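The bin-level calculation described above can be sketched as follows. The function names, the string encoding of alleles (one character per variant site), and the toy reference panel are assumptions made for illustration; only the definitions of the similarity index and the bin score, and the cutoff of 0.9, come from the text.

```python
def similarity_index(test_alleles, ref_alleles):
    """Proportion of variant sites in one bin where test and reference alleles match."""
    if not test_alleles or len(test_alleles) != len(ref_alleles):
        raise ValueError("allele sequences must be non-empty and of equal length")
    matches = sum(t == r for t, r in zip(test_alleles, ref_alleles))
    return matches / len(test_alleles)


def bin_score(test_alleles, reference_panel, cutoff=0.9):
    """Proportion of reference samples whose similarity index exceeds the cutoff."""
    indices = [similarity_index(test_alleles, ref) for ref in reference_panel]
    return sum(index > cutoff for index in indices) / len(indices)


# Toy bin with 8 variant sites and a panel of 24 reference samples (hypothetical data):
# 2 references match the test sample at all 8 sites, 22 match at only 7 of 8 sites.
test_sample = "ACGTACGT"
reference_panel = ["ACGTACGT"] * 2 + ["ACGTACGA"] * 22

print(similarity_index(test_sample, "ACGTACGA"))  # 7 of 8 sites identical
print(bin_score(test_sample, reference_panel))    # 2 of 24 indices above 0.9
```

Repeating `bin_score` over all 344 bins would yield the per-sample feature vector used in the clustering and classification steps described in the later examples.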

Fig. 8 shows a block-based analysis of the EBV DNA variation patterns of 13 nasopharyngeal carcinoma samples, 16 non-nasopharyngeal carcinoma samples, and samples from 4 pre-nasopharyngeal carcinoma subjects. For each of the 4 pre-nasopharyngeal carcinoma subjects, samples from both time points were analyzed, giving a total of 8 pre-nasopharyngeal carcinoma samples. The bin scores of the 344 bins of the EBV genome were derived for each of these samples. Unsupervised cluster analysis was performed based on the bin scores of these samples. The nasopharyngeal carcinoma samples (black) clustered together and the non-nasopharyngeal carcinoma samples (marked with dots) clustered together. The EBV variant profiles of the pre-nasopharyngeal carcinoma subjects clustered with the EBV variant profiles of the nasopharyngeal carcinoma subjects. Notably, the variant profiles of these 4 pre-nasopharyngeal carcinoma subjects were obtained by analyzing baseline samples collected several years before the development of nasopharyngeal carcinoma.

Fig. 9 shows a block-based analysis of EBV DNA variants in the same group of 13 nasopharyngeal carcinoma subjects, 16 non-nasopharyngeal carcinoma subjects, and 4 pre-nasopharyngeal carcinoma subjects, excluding the 29 variants reported in the study by Hui et al. (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049). Clustering of the nasopharyngeal carcinoma samples (black) was again observed. In addition, the EBV variant profiles of the pre-nasopharyngeal carcinoma subjects clustered with the EBV variant profiles of the nasopharyngeal carcinoma subjects. The clustering of the pre-nasopharyngeal carcinoma samples with the nasopharyngeal carcinoma samples indicates that variant analysis could predict the future development of nasopharyngeal carcinoma. In summary, the data in Examples 3 and 4 show that subjects who did not have nasopharyngeal carcinoma at the time of enrollment but later developed the cancer had EBV variation patterns in their baseline blood samples similar to those of nasopharyngeal carcinoma patients.

Example 5 nasopharyngeal carcinoma Risk prediction Using a mathematical model

This example describes the construction of a classification model that uses analysis of variation patterns and the results of tests using the classification model to predict the risk of future development of nasopharyngeal carcinoma in subjects with detectable plasma EBV DNA.

Using a support vector machine (SVM) algorithm, a classifier was constructed with a training data set containing 18 subjects without nasopharyngeal carcinoma and 8 patients with nasopharyngeal carcinoma, as described in Example 4. The test data set contained 5 nasopharyngeal carcinoma patients, 5 subjects without nasopharyngeal carcinoma, and 8 samples taken from 4 subjects in whom no nasopharyngeal carcinoma was detected by endoscopy and MRI at the time of sample collection but who were subsequently diagnosed with nasopharyngeal carcinoma (labeled as pre-nasopharyngeal carcinoma), as described in Example 4.

The method of SVM analysis is described as follows:

Given a training data set containing n samples:

$(M_1, Y_1), \ldots, (M_n, Y_n)$

where $Y_i$ indicates the nasopharyngeal carcinoma status of sample i: $Y_i$ is 1 (nasopharyngeal carcinoma patient sample) or -1 (non-nasopharyngeal carcinoma subject sample); and $M_i$ is a p-dimensional vector containing the viral variation pattern of sample i. For example, $M_i$ can be a series of variant sites, such as the 29 variants associated with nasopharyngeal carcinoma. Alternatively, $M_i$ can be a series of block-based variant similarity scores (e.g., for non-overlapping windows of 500 bp) relative to multiple reference EBV variants present in subjects known to have nasopharyngeal carcinoma.

A "hyperplane" that separates the non-nasopharyngeal carcinoma group from the nasopharyngeal carcinoma group as accurately as possible in the training data set is identified by finding a set of coefficients (W, a p-dimensional vector, and b) that satisfies the following conditions:

Criterion 1:

$W \cdot M_i - b \geq 1$ (for any subject in the nasopharyngeal carcinoma group)

and

Criterion 2:

$W \cdot M_i - b \leq -1$ (for any subject in the non-nasopharyngeal carcinoma group)

where W is a p-dimensional vector of coefficients that determine the hyperplane; M is a matrix (p x n dimensions) with p variables (or block-based similarity scores) and n samples, $M_i$ being the column for sample i; and b is the intercept.

These two criteria (i.e., criteria 1 and 2) can also be written as:

$Y_i (W \cdot M_i - b) \geq 1$ (criterion 3)

where $Y_i$ is -1 (non-nasopharyngeal carcinoma) or 1 (nasopharyngeal carcinoma).

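The boundary distance (D) between criteria 1 and 2 is:

$D = \frac{2}{\lVert W \rVert}$

where $\lVert W \rVert$ is the Euclidean norm of W and the distance is calculated using the point-to-plane distance equation.

Subject to criterion 3, D is maximized by minimizing $\lVert W \rVert$.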

Based on this principle, the parameters (W and b) of the classifier are determined. The nasopharyngeal carcinoma risk score for each test sample was then calculated by using the training parameters (W and b).
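The roles of W and b can be illustrated with a small numeric sketch. The training vectors, W, and b below are hypothetical toy values chosen so that the separating criteria hold exactly; no optimizer is run here, and the source does not state how the classifier output was scaled into the reported risk scores.

```python
import math


def decision_value(w, m, b):
    """Signed score W . M - b for one sample; positive values fall on the NPC side."""
    return sum(wi * mi for wi, mi in zip(w, m)) - b


def satisfies_criterion_3(w, b, samples):
    """Check Y_i * (W . M_i - b) >= 1 for every (M_i, Y_i) training pair."""
    return all(y * decision_value(w, m, b) >= 1 for m, y in samples)


def boundary_distance(w):
    """Distance between the planes W . M - b = 1 and W . M - b = -1, i.e. 2 / ||W||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical two-feature training set: (M_i, Y_i), with Y_i = 1 for NPC and -1 otherwise.
training = [
    ((3.0, 2.0), 1),
    ((4.0, 3.0), 1),
    ((1.0, 2.0), -1),
    ((2.0, 1.0), -1),
]
W = (1.0, 1.0)  # hypothetical coefficients
B = 4.0         # hypothetical intercept

print(satisfies_criterion_3(W, B, training))
print(boundary_distance(W))
print(decision_value(W, (4.0, 4.0), B))  # score for a new, unlabeled sample
```

Among all (W, b) pairs that satisfy criterion 3, SVM training selects the one with the smallest norm of W, which is the one with the largest boundary distance.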

Fig. 10A shows nasopharyngeal carcinoma risk scores calculated using a classifier trained on all EBV variants using block-based variant analysis. For this analysis, the EBV genome was divided into 344 bins of 500 bp for the calculation of bin scores as described in Example 4. The bin scores were used as the features for machine learning. The nasopharyngeal carcinoma risk scores of the nasopharyngeal carcinoma samples were significantly higher than those of samples taken from subjects without nasopharyngeal carcinoma (mean nasopharyngeal carcinoma risk score: 0.53 vs. 0.15, P value < 0.01, Student's t-test). Similarly, samples taken from pre-nasopharyngeal carcinoma subjects had significantly higher nasopharyngeal carcinoma risk scores than samples from subjects without nasopharyngeal carcinoma (mean risk score: 0.58 vs. 0.15, P value < 0.01, Student's t-test). Using a cutoff value of 0.32, samples from nasopharyngeal carcinoma patients and pre-nasopharyngeal carcinoma subjects could be distinguished from samples from subjects without nasopharyngeal carcinoma with 100% sensitivity and 100% specificity.
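How a risk-score cutoff translates into sensitivity and specificity can be sketched as below. The scores and labels are hypothetical and are not the study data; the rule that a score above the cutoff counts as a positive call is also an assumption, since the source does not state how scores equal to the cutoff are handled.

```python
def sensitivity_specificity(scores, has_disease, cutoff):
    """Return (sensitivity, specificity) when scores above the cutoff are called positive."""
    tp = fn = tn = fp = 0
    for score, diseased in zip(scores, has_disease):
        called_positive = score > cutoff
        if diseased and called_positive:
            tp += 1
        elif diseased:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical risk scores: three NPC or pre-NPC samples, three non-NPC samples.
scores = [0.61, 0.48, 0.55, 0.12, 0.20, 0.09]
has_disease = [True, True, True, False, False, False]

print(sensitivity_specificity(scores, has_disease, cutoff=0.32))
print(sensitivity_specificity(scores, has_disease, cutoff=0.50))
```

Raising the cutoff in this toy data lowers sensitivity while specificity stays the same, which is the trade-off the choice of cutoff controls.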

Fig. 10B shows nasopharyngeal carcinoma risk scores calculated using a classifier trained on the 29 EBV variants reported in the study by Hui et al. (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049). The nasopharyngeal carcinoma risk scores of the nasopharyngeal carcinoma samples were significantly higher than those of samples taken from subjects without nasopharyngeal carcinoma (mean nasopharyngeal carcinoma risk score: 0.89 vs. 0.18, P value < 0.01, Student's t-test). Similarly, samples taken from pre-nasopharyngeal carcinoma subjects had significantly higher nasopharyngeal carcinoma risk scores than samples from subjects without nasopharyngeal carcinoma (mean risk score: 0.57 vs. 0.18, P value = 0.02, Student's t-test). Using a cutoff value of 0.6, samples from nasopharyngeal carcinoma patients and pre-nasopharyngeal carcinoma subjects could be distinguished from samples from subjects without nasopharyngeal carcinoma with a sensitivity of 74% and a specificity of 100%.

Fig. 10C shows nasopharyngeal carcinoma risk scores calculated using a classifier trained on all EBV variants using block-based variant analysis, but excluding the 29 variants associated with nasopharyngeal carcinoma previously reported by Hui et al. (Hui et al., Int J Cancer 2019, doi.org/10.1002/ijc.32049). The nasopharyngeal carcinoma risk scores of the nasopharyngeal carcinoma samples were significantly higher than those of samples taken from non-nasopharyngeal carcinoma subjects (mean nasopharyngeal carcinoma risk score: 0.58 vs. 0.15, P value < 0.01, Student's t-test). Similarly, samples taken from pre-nasopharyngeal carcinoma subjects had significantly higher nasopharyngeal carcinoma risk scores than samples from subjects without nasopharyngeal carcinoma (mean risk score: 0.53 vs. 0.15, P value < 0.01, Student's t-test). Using a cutoff value of 0.31, samples taken from patients with nasopharyngeal carcinoma and from those who subsequently developed nasopharyngeal carcinoma could be distinguished from samples from subjects without nasopharyngeal carcinoma with 100% sensitivity and 100% specificity. These results indicate that excluding the 29 previously reported EBV variants from the analysis did not adversely affect its accuracy.

EXAMPLE 6 Analysis of plasma EBV DNA methylation status by bisulfite sequencing

This example illustrates the use of bisulfite sequencing to distinguish, based on the methylation status of plasma EBV DNA, nasopharyngeal carcinoma patients from subjects who do not have nasopharyngeal carcinoma but have detectable plasma EBV DNA.

The methylation level of plasma EBV DNA was determined in nasopharyngeal carcinoma patients and subjects without nasopharyngeal carcinoma using bisulfite sequencing. Bisulfite conversion converts unmethylated cytosine to uracil, whereas methylated cytosine is not changed by bisulfite and remains cytosine. During sequencing, uracil is read as thymine. After sequencing, the methylation status of a cytosine in any CpG dinucleotide context can therefore be determined by examining whether the cytosine has been changed to thymine.
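The read-out logic described above can be sketched in Python as follows. This is a simplified illustration, not the pipeline used in the example: it assumes forward-strand reads that are already aligned to the reference without gaps, and the function names and toy sequences are hypothetical.

```python
def cpg_methylation_calls(reference, read):
    """Return (methylated, unmethylated) CpG counts for one forward-strand read.

    Assumes `read` is aligned to `reference` base-for-base (no indels).
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    methylated = unmethylated = 0
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            if read[i] == "C":      # C survived bisulfite treatment: methylated
                methylated += 1
            elif read[i] == "T":    # C -> U, read as T: unmethylated
                unmethylated += 1
    return methylated, unmethylated


def methylation_level(pairs):
    """Fraction of methylated CpG calls over all (reference, read) pairs."""
    m = u = 0
    for reference, read in pairs:
        dm, du = cpg_methylation_calls(reference, read)
        m += dm
        u += du
    return m / (m + u) if (m + u) else float("nan")


# Toy example: three CpG sites, the middle one unmethylated in this read.
print(cpg_methylation_calls("ACGTCGAACGT", "ACGTTGAACGT"))  # (2, 1)
```

A sample-level methylation percentage would then be the pooled fraction over all EBV reads, as `methylation_level` computes for the pairs it is given.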

The plasma EBV DNA methylation levels were determined for 10 nasopharyngeal carcinoma patients and 40 subjects who did not have cancer but who had detectable EBV DNA in their plasma (non-nasopharyngeal carcinoma subjects). For the 40 non-nasopharyngeal carcinoma subjects, another blood sample was taken 4 weeks later. Twenty of them had become negative for plasma EBV DNA and were labeled as having transiently positive plasma EBV DNA. The other 20 were still positive for plasma EBV DNA and were labeled as having persistently positive plasma EBV DNA.

As shown in FIG. 11, the nasopharyngeal carcinoma patients had significantly higher plasma EBV DNA methylation levels than cancer-free subjects with transiently positive plasma EBV DNA (p-value <0.01, Student's t-test) and cancer-free subjects with persistently positive plasma EBV DNA (p-value <0.01, Student's t-test). These results indicate that analysis of plasma EBV DNA methylation helps to distinguish patients with nasopharyngeal carcinoma from subjects without nasopharyngeal carcinoma in whom plasma EBV DNA is detectable.

EXAMPLE 7 Analysis of plasma EBV DNA methylation status using methylation-sensitive restriction enzymes

This example describes an in silico simulation demonstrating the use of methylation-sensitive restriction enzyme analysis of plasma EBV DNA to distinguish nasopharyngeal carcinoma patients from subjects without nasopharyngeal carcinoma in whom plasma EBV DNA is detectable.

Plasma DNA bisulfite sequencing was performed on samples from a non-nasopharyngeal cancer subject and a nasopharyngeal cancer patient. 347,516 and 627,1012 EBV DNA fragments were obtained in plasma DNA of the two subjects, respectively. Plasma EBV DNA methylation levels were 48.9% and 86.3%, respectively. It was determined that about half of the plasma EBV DNA molecules contained at least one "CCGG" motif.

To simulate the digestion of plasma EBV DNA by restriction enzymes, in silico digestion of plasma EBV DNA molecules was performed according to the methylation status in the context of the "CCGG" sequence deduced from the bisulfite sequencing results. Simulated size profiles of plasma EBV DNA with and without in silico digestion by the methylation-sensitive restriction enzyme HpaII were thus obtained, as shown in FIG. 14. Without enzymatic digestion, the size distribution of plasma EBV DNA of the non-nasopharyngeal carcinoma subject lay to the left of that of the nasopharyngeal carcinoma patient, indicating that plasma EBV DNA of the non-nasopharyngeal carcinoma subject was shorter. This difference in fragment size was also observed in the size profiles after enzymatic digestion. In the non-nasopharyngeal carcinoma subject, the abundance of short DNA of less than 50 bp was markedly increased after enzymatic digestion compared with no digestion. For the nasopharyngeal carcinoma patient, the proportions of <50 bp DNA molecules with and without enzymatic digestion were 5.87% and 0.84%, respectively. For the non-nasopharyngeal carcinoma subject, however, the proportions of <50 bp DNA molecules with and without enzymatic digestion were 22.24% and 4.99%, respectively. The proportion of <50 bp DNA after enzymatic digestion thus increased by 5.0 and 17.2 percentage points for the nasopharyngeal carcinoma patient and the non-nasopharyngeal carcinoma subject, respectively. FIG. 15 shows the cumulative size distributions of plasma EBV DNA before and after methylation-sensitive restriction enzyme digestion in the nasopharyngeal carcinoma patient and the non-nasopharyngeal carcinoma subject. The difference in the degree of enzymatic digestion can be appreciated more easily from the plots of cumulative frequency against size.
The gap between the two curves, with and without enzymatic digestion, reflects the degree of digestion. The larger the gap, the greater the degree of digestion of plasma EBV DNA by the enzyme, and hence the lower the methylation level of the plasma EBV DNA. As shown, the gap was larger in the non-nasopharyngeal carcinoma subject than in the nasopharyngeal carcinoma patient. The maximum distance between the curves without and with enzymatic digestion is 8.1 and 18.3 for the nasopharyngeal carcinoma patient and the non-nasopharyngeal carcinoma subject, respectively; the area between the two curves for the nasopharyngeal carcinoma patient and the non-nasopharyngeal carcinoma subject is 2395 and 942.9, respectively.
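The simulation and the curve comparison can be sketched in Python under simplifying assumptions. Each fragment is represented only by its length and a list of "CCGG" sites with their methylation status; HpaII is taken to cut unmethylated sites after the first cytosine (C^CGG); and the area between the curves is approximated as a sum of the vertical gaps at 1 bp steps. The function names, the 600 bp size limit and the toy fragments are hypothetical, and the example does not state how its area values were computed.

```python
def digest(length, ccgg_sites):
    """Cut one fragment at every unmethylated CCGG site.

    ccgg_sites: list of (offset, is_methylated); offset is the 0-based index
    of the first C of the motif, so the cut (C^CGG) falls at offset + 1.
    """
    cuts = sorted(off + 1 for off, methylated in ccgg_sites if not methylated)
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def cumulative_percent(sizes, max_size):
    """cum[s] = percentage of fragments with size <= s, for s = 0..max_size."""
    counts = [0] * (max_size + 1)
    for s in sizes:
        counts[min(s, max_size)] += 1
    cum, running = [], 0
    for c in counts:
        running += c
        cum.append(100.0 * running / len(sizes))
    return cum


def curve_gap(undigested, digested, max_size=600):
    """Return (maximum vertical distance, area) between the two cumulative curves."""
    before = cumulative_percent(undigested, max_size)
    after = cumulative_percent(digested, max_size)
    diffs = [a - b for a, b in zip(after, before)]
    return max(diffs), sum(diffs)


# Toy fragments: (length, [(CCGG offset, methylated?)]); not real data.
fragments = [(160, [(40, False)]), (150, [(70, True)]), (170, [])]
undigested = [length for length, _ in fragments]
digested = [piece for length, sites in fragments for piece in digest(length, sites)]
max_gap, area = curve_gap(undigested, digested)
print(digested, max_gap, area)
```

In this toy input only the unmethylated site is cut, so a heavily methylated sample would leave the two curves nearly superimposed, which is the behavior the gap is meant to capture.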

EXAMPLE 8 SNV profiling of cell-free EBV DNA molecules

The differences in EBV SNV profiles between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects were analyzed in a training data set containing plasma DNA sequencing data of 63 nasopharyngeal carcinoma subjects and 88 non-nasopharyngeal carcinoma subjects. Multiple differential SNVs across the EBV genome were identified. A nasopharyngeal carcinoma risk score derived from the genotype pattern at these SNV sites was proposed and then analyzed in a test set of 31 nasopharyngeal carcinoma samples and 40 non-nasopharyngeal carcinoma samples. In this example, a total of 661 significant SNVs across the EBV genome were identified from the training set (FIG. 16D). In the test set, nasopharyngeal carcinoma plasma samples showed higher nasopharyngeal carcinoma risk scores, suggesting that there may be a nasopharyngeal carcinoma-associated EBV SNV profile. In non-nasopharyngeal carcinoma samples, the nasopharyngeal carcinoma risk scores ranged widely, suggesting that non-nasopharyngeal carcinoma subjects may have different EBV SNV profiles.

Materials and methods.

Study participants and design.

This study analyzed a subset of the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma plasma sample sequencing data sets previously reported in Lam et al., Proc Natl Acad Sci USA 2018; 115: E5115-E5124 (as the training set), together with newly sequenced plasma DNA samples from nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects (as the test set).

The training data set comprised plasma samples of nasopharyngeal carcinoma patients and non-nasopharyngeal carcinoma subjects selected from the prospective nasopharyngeal carcinoma screening study previously described in Lam et al., Proc Natl Acad Sci USA 2018; 115: E5115-E5124. Plasma EBV DNA levels of these non-nasopharyngeal carcinoma subjects were tested by real-time PCR. The data set also contained samples from an independent cohort of symptomatic nasopharyngeal carcinoma patients. The genotype information of the EBV isolates from all of these samples was studied to establish a training model for nasopharyngeal carcinoma risk score prediction. In this study, plasma samples of an additional 31 patients with symptomatic nasopharyngeal carcinoma and 40 non-nasopharyngeal carcinoma subjects were subjected to targeted capture sequencing as a test set. These 31 symptomatic nasopharyngeal carcinoma patients were from the Department of Clinical Oncology, Prince of Wales Hospital, Hong Kong. The non-nasopharyngeal carcinoma subjects were randomly selected from the aforementioned nasopharyngeal carcinoma screening cohort (containing more than 20000 subjects). EBV genotype variation was analyzed for these nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples, and their nasopharyngeal carcinoma risk scores were derived based on the trained model. There was no overlap of nasopharyngeal carcinoma or non-nasopharyngeal carcinoma samples between the training set and the test set.

Targeted capture sequencing.

EBV DNA molecules in the plasma DNA libraries were enriched using a capture probe system (myBaits Custom Capture Panel, Arbor Biosciences), and the plasma samples were subjected to targeted capture sequencing. The EBV capture probes were designed to cover the entire viral genome. Probes targeting 3000 human single nucleotide polymorphism (SNP) sites were also included for reference. A probe mixture containing EBV probes and autosomal DNA probes at a molar ratio of 100:1 was used in each capture reaction. DNA libraries from 10 plasma samples were multiplexed in one capture reaction, using an equal amount of DNA library from each sample. Sequencing statistics for all cases (including the previously reported cases used as the current training set) are listed in Tables 4A and 4B.

TABLE 4A sequencing statistics of all nasopharyngeal and non-nasopharyngeal carcinoma cases in training set

Group 0 is non-nasopharyngeal carcinoma subjects, group 1 is nasopharyngeal carcinoma subjects (screening cohort), and group 2 is nasopharyngeal carcinoma (external cohort).

TABLE 4B sequencing statistics for all nasopharyngeal and non-nasopharyngeal carcinoma cases in the test set

Group 0 is non-nasopharyngeal carcinoma subjects, and group 1 is nasopharyngeal carcinoma subjects.

EBV mutation calling

Sequencing reads were aligned to the human (hg19) and EBV (AJ507799.2) reference genomes using the BWA aligner, described in Li H et al., Bioinformatics 2010; 26:589-95, which is incorporated herein by reference in its entirety. EBV single nucleotide variants (SNVs) were identified using SAMtools when alleles different from the reference viral genome were detected at EBV genomic sites, as described by Li H et al., Bioinformatics 2009; 2078-9, which is incorporated herein by reference in its entirety. In the subsequent analysis of nasopharyngeal carcinoma risk scores, SNV sites at which more than one type of allele was detected were filtered out (minor allele frequency cut-off set at 5%).
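One possible reading of the filtering step is sketched below in Python. It assumes that per-site allele read counts are already available (for example from a pileup), that the "minor allele frequency" is the fraction of reads not supporting the most common allele, and that a site is dropped when this fraction reaches the 5% cut-off. The function name and these conventions are assumptions, not details given in the example.

```python
def call_site(allele_counts, maf_cutoff=0.05):
    """Return the single allele observed at a site, or None if the site is unusable.

    allele_counts: dict mapping base -> number of supporting reads.
    A site is discarded when it is uncovered or when reads not supporting the
    major allele make up at least `maf_cutoff` of all reads (assumed rule).
    """
    total = sum(allele_counts.values())
    if total == 0:
        return None                      # no coverage at this site
    major = max(allele_counts, key=allele_counts.get)
    minor_fraction = (total - allele_counts[major]) / total
    if minor_fraction >= maf_cutoff:
        return None                      # more than one allele type detected
    return major


print(call_site({"A": 99, "G": 1}))   # A (minor fraction 1%, below the cut-off)
print(call_site({"A": 10, "G": 10}))  # None (mixed site, filtered out)
```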

Nasopharyngeal carcinoma risk score

In this example, the nasopharyngeal carcinoma risk score is a weighted sum of multiple EBV genotypes at a fixed set of SNV sites across the viral genome, the genotypes serving as explanatory variables in a binary logistic regression model. The set of nasopharyngeal carcinoma-associated SNVs was first identified by analyzing the differences in the EBV SNV profiles between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples in the training set. Fisher's exact test was used to analyze the association of each variant in the EBV genome with nasopharyngeal carcinoma cases. The fixed set of significant SNVs was then obtained with the false discovery rate (FDR) controlled at 5%.
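The site selection step can be illustrated with a standard-library Python sketch. It pairs a two-sided Fisher's exact test on a 2x2 table (variant or reference allele against nasopharyngeal carcinoma or non-nasopharyngeal carcinoma) with the Benjamini-Hochberg procedure; the example states that the FDR was controlled at 5% but does not name the procedure, so Benjamini-Hochberg is an assumption here, as are the function names and the toy counts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return the indices of the tests declared significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])


# Toy counts per site: (cases with variant, cases with reference,
#                       controls with variant, controls with reference).
tables = [(10, 0, 0, 10), (5, 5, 5, 5), (9, 1, 3, 7)]
pvals = [fisher_exact_two_sided(*t) for t in tables]
print(pvals, benjamini_hochberg(pvals))
```

Applied to every variant site in a training set, the indices returned by `benjamini_hochberg` would correspond to the fixed set of significant SNVs.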

The nasopharyngeal carcinoma risk score of a test sample can be determined from its EBV genotypes across the set of significant SNV sites identified from the training set. As previously described, EBV DNA sequencing may not cover the entire EBV genome because of the low concentration of plasma EBV DNA molecules. The score was therefore determined from the genotype pattern at the SNV sites covered by plasma EBV DNA reads (i.e., those with genotype information available) (FIGs. 16A, 16B, and 16C). To derive a nasopharyngeal carcinoma risk score, the subset of significant SNV sites covered by plasma EBV DNA reads in the test sample is first identified. Then, within this subset of significant SNV sites, the genotype weight (effect size) of each site is determined by analyzing the genotype pattern of each site in the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples of the training data set (FIG. 16B). On this basis, a logistic regression model is constructed to quantify the effect of the risk genotype at each SNV site on nasopharyngeal carcinoma. The logistic model is written as follows:

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \sum_{k=1}^{n} \beta_k X_k$$

This can be rewritten as:

$$P = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{k=1}^{n} \beta_k X_k\right)}}$$

where n is the number of significant SNV sites; β_0 and β_k are coefficients that can be determined by a maximum likelihood estimator; P is the probability of nasopharyngeal carcinoma in an EBV-positive subject; and the variable X_k represents the SNV site at genomic position k. If the sample carries the same allele as the EBV reference genome at the site, X_k is coded as -1. If the sample carries an alternative allele, X_k is coded as 1. If the sample does not cover the variant site to be analyzed, X_k is coded as 0. The "LogisticRegression" function in Python (penalty = 'l2', C = 1, solver = 'saga', max_iter = 5000, random_state = 0) was used to estimate the coefficients β_0 and β_k. This was done by analyzing the genotype pattern of each site in the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples of the training data set. A (c + d) × n matrix is input to Python, where c is the number of nasopharyngeal carcinoma samples and d is the number of non-nasopharyngeal carcinoma samples in the training set, and n is the number of genotype variables. Each row represents a sample (labeled 0 for subjects without nasopharyngeal carcinoma and 1 for patients with nasopharyngeal carcinoma), and each column represents a variable. The coefficients (β_0 and β_k) can then be derived. The nasopharyngeal carcinoma risk score of the test sample is then derived from its genotypes at the SNV sites, weighted by the corresponding coefficients β_0 and β_k derived from the training model (FIG. 16C).
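The scoring scheme can be sketched in standard-library Python. The genotype coding (-1, 1, 0) and the logistic formula follow the description above; the fitting routine, however, is a plain gradient descent on the L2-penalized log-loss, used here as a simple stand-in for the 'saga' solver named in the text, and its learning rate, the function names and the toy training matrix are assumptions. The parameter names quoted in the text match those of scikit-learn's `LogisticRegression`, which could be substituted for `fit_logistic` if that library is the one intended.

```python
import math


def encode_genotype(observed_allele, reference_allele):
    """-1 = reference allele, 1 = alternative allele, 0 = site not covered."""
    if observed_allele is None:
        return 0
    return -1 if observed_allele == reference_allele else 1


def risk_score(x, beta0, betas):
    """P = 1 / (1 + exp(-(beta0 + sum_k beta_k * X_k)))."""
    z = beta0 + sum(b * xk for b, xk in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, C=1.0, lr=0.1, n_iter=5000):
    """Gradient descent on log-loss + ||beta||^2 / (2C); intercept unpenalized."""
    n = len(X[0])
    beta0, betas = 0.0, [0.0] * n
    for _ in range(n_iter):
        g0, g = 0.0, [b / C for b in betas]   # start from the penalty gradient
        for xi, yi in zip(X, y):
            err = risk_score(xi, beta0, betas) - yi
            g0 += err
            for k in range(n):
                g[k] += err * xi[k]
        beta0 -= lr * g0 / len(X)
        betas = [b - lr * gk / len(X) for b, gk in zip(betas, g)]
    return beta0, betas


# Toy training matrix: rows are samples, columns are SNV sites (not real data).
X_train = [[1, 1, 0], [1, -1, 1], [-1, -1, 0], [-1, 1, -1]]
y_train = [1, 1, 0, 0]                      # 1 = carcinoma, 0 = no carcinoma
beta0, betas = fit_logistic(X_train, y_train)
print(risk_score([1, 0, 0], beta0, betas))  # only the first site is covered
```

Because an uncovered site is coded as 0, it contributes nothing to the weighted sum, which is how a sample with incomplete EBV genome coverage can still receive a score.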

Results

Construction of the nasopharyngeal carcinoma risk score training model.

As described above, previously reported plasma EBV DNA sequencing data of nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples were used to develop the nasopharyngeal carcinoma risk score training model. Targeted capture sequencing had been performed to enrich EBV DNA in the plasma samples. The SNV profiles of the EBV isolates in the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples were studied. From this data set, nasopharyngeal carcinoma and non-nasopharyngeal carcinoma cases in which the sequenced EBV DNA reads covered at least 30% of the EBV genome were selected. This cut-off was chosen because more than 95% of the nasopharyngeal carcinoma samples in the training data set had viral genome coverage greater than the cut-off (Tables 4A and 4B). The demographics of the selected nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects, including age and gender, and the cancer stage information of the nasopharyngeal carcinoma patients (AJCC, 8th edition) are detailed in Table 5. The sequencing statistics of the selected nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples are set forth in Tables 4A and 4B.

TABLE 5 Subject characteristics of all nasopharyngeal carcinoma and non-nasopharyngeal carcinoma cases in the training set

The EBV SNV profiles of 63 nasopharyngeal carcinoma and 88 non-nasopharyngeal carcinoma samples were analyzed. The median sequencing depth of the EBV genome for all samples was 2-fold (2×) (interquartile range (IQR), 1.0-fold to 9.2-fold). The average number of EBV SNVs identified in nasopharyngeal carcinoma samples was 800 (IQR, 662-958), while the average number of SNVs in non-nasopharyngeal carcinoma samples was 539 (range, 363-656). Overall, 5678 different SNVs were identified across all samples. The distribution of these SNVs in the EBV genome is shown in FIG. 16D.

In the training set, the association of each viral SNV with nasopharyngeal carcinoma was studied by Fisher's exact test. A total of 661 significant SNVs associated with nasopharyngeal carcinoma were identified using adjusted p-values, with the false discovery rate (FDR) controlled at 0.05. The genomic positions of these 661 SNVs are listed in Table 6. Subsequently, based on the genotype patterns at these 661 SNV sites, nasopharyngeal carcinoma risk scores were derived for the test set of plasma samples from nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects.

TABLE 6 EBV genomic locations of the 661 exemplified SNVs (relative to AJ507799.2)

Evaluation of nasopharyngeal carcinoma risk score training model

The training model was evaluated by analyzing the nasopharyngeal carcinoma risk scores of the samples in the training set using a leave-one-out approach. In the leave-one-out approach, the principles of establishing the training model and deriving the nasopharyngeal carcinoma risk score are the same as described in the methods above. All samples in the training set except one were used to construct the training model, and the sample left out was then analyzed for its nasopharyngeal carcinoma risk score. In the leave-one-out analysis, the median nasopharyngeal carcinoma risk score was 0.99 (IQR, 0.98-1.0) for the nasopharyngeal carcinoma cohort and 0.01 (IQR, 0.00-0.89) for the non-nasopharyngeal carcinoma cohort (FIG. 17A). Receiver operating characteristic (ROC) curve analysis was used to assess how well the nasopharyngeal carcinoma risk score differentiated nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples. The area under the curve was 0.91 (FIG. 17B).
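The leave-one-out evaluation and the area under the ROC curve can be sketched in Python as below. The `fit` and `predict` arguments are placeholders for whatever model is being evaluated; to keep the sketch short and self-contained, a simple centroid-difference scorer is used as a stand-in for the logistic regression model of this example. All names and the toy data are hypothetical.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for n in control_scores:
            wins += 1.0 if c > n else 0.5 if c == n else 0.0
    return wins / (len(case_scores) * len(control_scores))


def leave_one_out_scores(X, y, fit, predict):
    """Score each sample with a model trained on all the other samples."""
    scores = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        scores.append(predict(model, X[i]))
    return scores


def fit_centroid(X, y):
    """Stand-in model: per-site difference between case and control means."""
    cases = [x for x, label in zip(X, y) if label == 1]
    controls = [x for x, label in zip(X, y) if label == 0]
    return [sum(x[k] for x in cases) / len(cases)
            - sum(x[k] for x in controls) / len(controls)
            for k in range(len(X[0]))]


def predict_centroid(weights, x):
    return sum(w * xk for w, xk in zip(weights, x))


# Toy genotype matrix coded -1/0/1 (not real data).
X = [[1, 1], [1, -1], [1, 0], [-1, -1], [-1, 1], [-1, 0]]
y = [1, 1, 1, 0, 0, 0]
loo = leave_one_out_scores(X, y, fit_centroid, predict_centroid)
auc = roc_auc([s for s, label in zip(loo, y) if label == 1],
              [s for s, label in zip(loo, y) if label == 0])
print(loo, auc)
```

Because every sample is scored by a model that never saw it, the resulting AUC is less optimistic than one computed by scoring the training samples with a model fitted on all of them.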

Nasopharyngeal carcinoma risk score analysis in the test set

Plasma samples from an additional 31 nasopharyngeal carcinoma patients and 45 non-nasopharyngeal carcinoma subjects were subjected to targeted capture sequencing. Of these, all 31 nasopharyngeal carcinoma samples and 40 non-nasopharyngeal carcinoma samples had at least 30% of the EBV genome covered by the sequenced EBV DNA reads. Table 7 summarizes the clinical characteristics of these nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects. Sequencing statistics for this test set of samples are also given in Tables 4A and 4B.

TABLE 7 Subject characteristics of all nasopharyngeal carcinoma and non-nasopharyngeal carcinoma cases in the test set

Nasopharyngeal carcinoma risk scores were analyzed for the test set of 31 nasopharyngeal carcinoma samples and 40 non-nasopharyngeal carcinoma samples based on the training model developed. The nasopharyngeal carcinoma risk score of a sample can be determined from its pattern of variation over the 661 significant SNV positions determined in the training set. Since the EBV genome may not be completely covered, only SNV sites that are covered by sequenced EBV DNA reads, and thus have corresponding allelic information, are included in the nasopharyngeal carcinoma risk score analysis (FIGs. 16A, 16B and 16C).

The median nasopharyngeal carcinoma risk score was 0.999 (IQR, 0.996-0.999) in the nasopharyngeal carcinoma group and 0.557 (IQR, 0.000-0.996) in the non-nasopharyngeal carcinoma group (FIG. 18A). As in the training set, high nasopharyngeal carcinoma risk scores were found in these 31 nasopharyngeal carcinoma samples. The nasopharyngeal carcinoma samples in the test set may share a similar EBV SNV profile with the nasopharyngeal carcinoma samples in the training set. ROC curve analysis was used to evaluate how well the nasopharyngeal carcinoma risk score distinguished nasopharyngeal carcinoma samples from non-nasopharyngeal carcinoma samples. The area under the curve was 0.83 (FIG. 18B).

Genotype pattern analysis of high-risk variant sites in the test set.

High-risk nasopharyngeal carcinoma-associated EBV variants have been reported in the EBER (EBV-encoded small RNA) region. In the EBER region, Hui et al. reported 23 significant SNVs. A similar nasopharyngeal carcinoma risk prediction approach was applied to the test set of 31 nasopharyngeal carcinoma and 40 non-nasopharyngeal carcinoma samples, but based only on the genotype patterns of the 23 reported SNVs in the EBER region.

In the test set, 31 of the 71 nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples (44%) had EBV DNA reads covering all 23 SNV sites. As shown in Table 8, for each of these 23 SNV sites, only a proportion of the samples had reads covering the site and thus had genotype information available (i.e., not all 23 SNV sites were covered by plasma EBV DNA reads in every sample). The numbers of nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples analyzed in Table 8 refer to samples with available genotype information. The percentage of high-risk genotypes at each of the 23 SNV sites was between 86% and 97% in the nasopharyngeal carcinoma samples and between 35% and 52% in the non-nasopharyngeal carcinoma samples. The difference between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples was assessed by ROC curve analysis using the genotype patterns of only the 23 SNVs in the EBER region. The area under the curve was 0.72 (FIGs. 19A and 19B). This value is lower than that obtained from the genotype pattern analysis across the entire EBV genome (0.83). Analysis of the genotype pattern across the whole EBV genome thus allows better discrimination between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples than analysis restricted to a fixed viral genomic region.

TABLE 8 Genotype patterns at the 23 SNV sites of the EBER gene in nasopharyngeal carcinoma and non-nasopharyngeal carcinoma cases in the test set

Similarly, 3 high-risk SNVs in the BALF2 (BamHI A left frame-2) gene have also been reported (Xu et al., Nat Genet 2019; 51: 1131-6). In the test set, 55 of the 71 samples (78%) had EBV DNA reads covering all 3 SNVs. For each of these 3 SNV sites, only a subset of the samples in the test set had reads covering the site and had genotype information available (Table 9). The percentage of high-risk genotypes at each of the 3 SNV sites was between 86% and 93% in the nasopharyngeal carcinoma samples and between 47% and 65% in the non-nasopharyngeal carcinoma samples. In 4 cases (1 nasopharyngeal carcinoma and 3 non-nasopharyngeal carcinoma samples), no EBV DNA reads covered any of the 3 reported SNVs in the BALF2 gene, and these cases could not be analyzed. A similar nasopharyngeal carcinoma risk prediction approach was applied to the remaining 30 nasopharyngeal carcinoma and 37 non-nasopharyngeal carcinoma samples in the test set, analyzing only the genotype patterns of the 3 reported SNVs in the BALF2 region. The difference between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples was assessed by ROC curve analysis. The area under the curve was 0.77 (FIGs. 20A and 20B). This value is lower than that obtained from the genotype pattern analysis across the entire EBV genome (0.83). Analysis of the genotype pattern across the whole EBV genome thus allows better discrimination between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples than analysis restricted to a fixed viral genomic region.

TABLE 9 Genotype patterns at the 3 SNV sites of the BALF2 gene in nasopharyngeal carcinoma and non-nasopharyngeal carcinoma cases in the test set

The nasopharyngeal carcinoma risk score analysis described in this example allows nasopharyngeal carcinoma risk prediction based on genotype patterns over a variable ("floating") number of SNVs within the set of 661 significant SNVs across the EBV genome (Table 6). The number of SNV sites used for analysis of the nasopharyngeal carcinoma risk score is determined by whether the sequenced EBV DNA reads cover the SNV sites and provide the corresponding allelic information. The set of 661 significant SNVs was downsampled, and the nasopharyngeal carcinoma prediction performance on the test set was analyzed using the same approach, with a floating number of SNVs within the downsampled SNV set. For the downsampling analysis, a certain number of SNVs (e.g., 23, 25, 100, 200, or 500) were randomly selected from the 661 significant SNVs. Then, for each test sample, the SNV sites in the downsampled SNV set that were covered by EBV DNA sequence reads were identified. The nasopharyngeal carcinoma risk score training model was then obtained by training in the training set, using the genotype patterns of the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples at these covered, downsampled SNV sites. Through training, the genotype weight of each site was determined for the training model. The nasopharyngeal carcinoma risk score of the test sample was then derived by applying its own genotype pattern at these covered, downsampled SNV sites to the training model, which carries the weights for the same sites. Table 10 summarizes the prediction performance of the nasopharyngeal carcinoma risk score training models for different numbers of SNV sites. For each given number of SNV sites, random downsampling was performed 10 times, and the area under the curve in Table 10 is the average over the 10 random downsamplings. When the SNV set across the whole EBV genome was downsampled to 23 sites, the same number as the SNVs reported in the EBER region, and the difference between nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples was assessed by ROC curve analysis, the area under the curve was 0.78. This value is higher than that obtained from the analysis of the genotype patterns of the 23 reported SNVs in the EBER region (0.72).

TABLE 10 Nasopharyngeal carcinoma prediction performance based on different numbers of SNVs

Number of SNVs after downsampling | Area under the curve (AUC)
23 | 0.78
25 | 0.78
100 | 0.77
200 | 0.83
500 | 0.79
661 (all SNVs) | 0.83
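The downsampling procedure can be outlined in Python as follows. The sketch covers only the random selection of sites, the restriction to sites covered in a given sample, and the averaging over repeated draws; the `evaluate_auc` callback, which would retrain the model on a site subset and return the test-set AUC, is a hypothetical placeholder, as are the function names, the seed and the integer site identifiers.

```python
import random


def downsample_sites(significant_sites, n_sites, rng):
    """Randomly pick n_sites of the significant SNV sites."""
    return sorted(rng.sample(sorted(significant_sites), n_sites))


def usable_sites(downsampled_sites, covered_sites):
    """Keep the downsampled sites that are covered by reads in a given sample."""
    covered = set(covered_sites)
    return [s for s in downsampled_sites if s in covered]


def mean_auc_over_draws(significant_sites, n_sites, evaluate_auc, n_draws=10, seed=0):
    """Average the AUC returned by evaluate_auc over repeated random draws."""
    rng = random.Random(seed)
    aucs = [evaluate_auc(downsample_sites(significant_sites, n_sites, rng))
            for _ in range(n_draws)]
    return sum(aucs) / len(aucs)


# Placeholder identifiers 1..661 stand in for the genomic positions of Table 6.
sites = list(range(1, 662))
subset = downsample_sites(sites, 23, random.Random(0))
covered = usable_sites(subset, covered_sites=range(1, 300))
print(len(subset), covered)
```

Averaging over several draws, as done for Table 10, reduces the dependence of the reported AUC on any single random subset.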

This study reports the analysis of EBV genotype information by plasma DNA sequencing. Differential molecular features of plasma EBV DNA molecules, in terms of both quantity and size, had previously been identified by paired-end sequencing between nasopharyngeal carcinoma subjects and non-nasopharyngeal carcinoma subjects carrying plasma EBV DNA. Incorporating such count-based and size-based analysis of plasma EBV DNA can almost double the positive predictive value of the current PCR-based protocol, and can form the basis of a second-generation, sequencing-based screening assay. Sequencing plasma samples from nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects can additionally generate EBV genotype information, which can enhance the potential clinical value of such an assay.

The nasopharyngeal carcinoma risk score can be determined from viral genome-wide markers rather than from a single-gene marker. The risk score herein is derived from the pattern of variation at SNV sites across the EBV genome. Plasma sequencing for EBV genotype information may involve plasma samples with low concentrations of EBV DNA molecules, resulting in incomplete coverage of the EBV genome. In some cases, an informative SNV site may not be covered by any EBV DNA reads, and it is then not possible to determine from that site whether an individual carries a high-risk EBV strain. This is supported by the following result: for each of the 23 reported SNV sites in the EBER gene, only some of the 71 analyzed samples in the test set had reads covering the site. The nasopharyngeal carcinoma samples in the test set showed higher nasopharyngeal carcinoma risk scores, which may indicate the presence of a nasopharyngeal carcinoma-associated EBV SNV profile. A capture probe method is used herein to enrich EBV DNA molecules in plasma samples. Amplicon sequencing methods can also be used to enrich EBV DNA fragments from which genotype information for high-risk variant regions can be obtained.

The genotype patterns at the recently reported high-risk variant sites in the EBER gene and the BALF2 gene were analyzed in the nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples of the test set. The distribution of high-risk genotypes in nasopharyngeal carcinoma and non-nasopharyngeal carcinoma samples was consistent with the results of the two studies that analyzed cellular samples (i.e., nasopharyngeal carcinoma tumor tissue and saliva samples from normal controls). Since all three studies, including the present study, were performed in the same or adjacent regions of southern China, the distribution of EBV genotypes in the normal control groups may be similar. This provides evidence for the feasibility of EBV genotyping by sequencing plasma samples.

In the context of screening, analysis of EBV SNVs from plasma samples may have clinical utility. As previously described, approximately 5% of the screened population had EBV DNA in their plasma but did not have nasopharyngeal carcinoma (false positives). The data herein show that these non-nasopharyngeal carcinoma subjects have different nasopharyngeal carcinoma risk scores, possibly reflecting different EBV SNV profiles. They may form a heterogeneous group of individuals with varying risks of developing nasopharyngeal carcinoma in the future. Those carrying high-risk EBV strains may be at higher risk of developing nasopharyngeal carcinoma in the future. The nasopharyngeal carcinoma risk score can be used to classify non-nasopharyngeal carcinoma subjects into different risk groups based on the SNV profile of the viral genome. In one example, more frequent screening may be warranted for subjects with a high nasopharyngeal carcinoma risk score.

EBV genotype information was analyzed by sequencing plasma samples from nasopharyngeal carcinoma and non-nasopharyngeal carcinoma subjects. While previous studies focused on identifying high-risk variants associated with nasopharyngeal carcinoma at the population level, the present study provides insight into the clinical use of viral genotyping. Such an analysis can inform cancer risk on an individual basis by characterizing the EBV genotype.

While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.
