Genome-wide and targeted haplotype reconstruction

Document No.: 1094895 Publication date: 2020-09-25

Abstract: Genome-wide and targeted haplotype reconstruction, by B. Ren, S. Selvaraj, J. Dixon, and A. Schmitt, dated 2014-07-18. The present invention relates to methods for haplotype determination, and in particular haplotype determination at the whole genome level and targeted haplotype determination.

1. A method for whole chromosome haplotype analysis of an organism comprising:

providing a cell of the organism containing a genome having genomic DNA;

incubating the cells or nuclei thereof with a fixation agent for a period of time to allow the genomic DNA to crosslink in situ and thereby form crosslinked genomic DNA;

fragmenting the cross-linked genomic DNA and ligating adjacently positioned cross-linked and fragmented genomic DNA to form a proximally ligated complex having a first genomic DNA fragment and a second genomic DNA fragment;

cleaving the proximally ligated complex to form proximally ligated DNA fragments;

obtaining a plurality of said proximally ligated DNA fragments to form a library;

sequencing the plurality of proximally-ligated DNA fragments to obtain a plurality of sequence reads;

performing local conditional phasing; and

assembling the plurality of sequence reads to construct a haplotype of the chromosomal span of one or more chromosomes, wherein the cross-linked genomic DNA is not hybridized prior to the step of fragmenting,

wherein the method is not a diagnostic method.

2. The method of claim 1, wherein said local conditional phasing is neighborhood corrected phasing.

3. The method of claim 1, wherein the step of performing local conditional phasing uses population-scale sequencing data.

4. A method for targeted haplotype analysis of an organism comprising:

providing a cell of the organism containing a genome having genomic DNA;

incubating the cells or nuclei thereof with a fixation agent for a period of time to allow in situ cross-linking of the genomic DNA and thereby form cross-linked genomic DNA;

fragmenting the cross-linked genomic DNA and ligating adjacently positioned cross-linked and fragmented DNA to form a proximally ligated complex having a first genomic DNA fragment and a second genomic DNA fragment;

cleaving the proximally ligated complex to form proximally ligated DNA fragments;

contacting the proximally ligated DNA fragments with one or more oligonucleotides that hybridize to preselected regions of a subset of the proximally ligated fragments to provide a subset of proximally ligated fragments that hybridize to the oligonucleotides, separating the subset of proximally ligated fragments from the oligonucleotides;

sequencing a subset of the proximally ligated DNA fragments to obtain a plurality of sequence reads;

performing local conditional phasing; and

assembling the plurality of sequence reads to construct a targeted haplotype, wherein the cross-linked genomic DNA is not purified prior to the fragmenting step,

wherein the method is not a diagnostic method.

5. The method of claim 4, wherein said local conditional phasing is neighborhood corrected phasing.

6. The method of claim 4, wherein the step of performing local conditional phasing uses population-scale sequencing data.

7. The method of claim 4, wherein the oligonucleotide is immobilized onto a solid substrate.

8. The method of claim 1 or 4, further comprising isolating nuclei from said cells prior to the incubating step.

9. The method of claim 1 or 4, further comprising, after the fragmenting step,

labeling the first genomic DNA fragment or the second genomic DNA fragment with a marker;

ligating the first and second genomic DNA fragments such that the marker is between them to form a labeled chimeric DNA molecule; and

cleaving the labeled chimeric DNA molecule to form labeled, proximally linked DNA fragments.

10. The method of claim 1 or 4, wherein the fragmenting step is performed by digesting the ligated genomic DNA with a restriction enzyme to form a digested genomic DNA fragment.

11. The method of claim 1 or 4, wherein the fixation agent comprises formaldehyde, glutaraldehyde, or formalin.

12. The method of claim 9, wherein the labeling step is performed by filling in the ends of the first or second genomic DNA fragments with nucleotides labeled with the markers.

13. The method of claim 12, wherein the marker is biotin.

14. The method of claim 13, wherein said obtaining step is performed using streptavidin.

15. The method of claim 14, wherein the streptavidin is immobilized to a bead.

16. The method of claim 9, wherein the ligating step is performed by ligating the first genomic DNA fragment and the second genomic DNA fragment using a ligase.

17. The method of claim 16, wherein the ligating is performed in solution.

18. The method of claim 16, wherein the ligating is performed on a solid substrate.

19. The method of claim 1 or 4, wherein sequencing is performed using paired-end sequencing of paired-end sequencing fragments.

20. The method of claim 19, wherein each paired-end sequencing read is at least 20bp in length.

21. The method of claim 19, wherein each paired-end sequencing read is 20-150bp in length.

22. The method of claim 19, wherein each paired-end sequencing read has a length of 20,25,30,40,50,60,70,80,90,100,110,120,130,140, or 150 bp.

23. The method of claim 1 or 4, wherein the library contains at least 15x sequence coverage for each chromosome.

24. The method of claim 23, wherein the library contains a 25-30x sequence coverage for each chromosome.

25. The method of claim 21, wherein the first genomic DNA fragment and the second genomic DNA fragment are on the same chromosome.

26. The method of claim 25, wherein the first genomic DNA fragment and the second genomic DNA fragment are at least 100bp apart in situ.

27. The method of claim 26, wherein the first genomic DNA fragment and the second genomic DNA fragment are between 100bp and 100Mb apart in situ.

28. The method of claim 27, wherein the first genomic DNA fragment and the second genomic DNA fragment are separated in situ by 100bp,1kb,10kb,1Mb,10Mb,20Mb,30Mb,40Mb,50Mb,60Mb,70Mb,80Mb,90Mb, or 100 Mb.

29. The method of claim 1 or 4, wherein the organism is a eukaryote.

30. The method of claim 1 or 4, wherein the organism is a fungus.

31. The method of claim 1 or 4, wherein the organism is a plant.

32. The method of claim 1 or 4, wherein the organism is an animal.

33. The method of claim 1 or 4, wherein the organism is a mammal or a mammalian embryo.

34. The method of claim 1 or 4, wherein the organism is a human or a human embryo.

35. The method of claim 34, wherein the human is a donor or recipient of an organ.

36. The method of claim 35, wherein the organ is haplotyped prior to transplantation of the organ to a recipient having a matching haplotype.

37. The method of claim 1 or 4, wherein the cell is a diploid cell.

38. The method of claim 1 or 4, wherein the cell is an aneuploid cell.

39. The method of claim 1 or 4, wherein the cell is a cancerous cell.

40. A method for whole chromosome haplotype analysis of an organism comprising:

providing a cell of the organism containing a genome having genomic DNA;

incubating the cells or nuclei thereof with a fixation agent for a period of time to allow the genomic DNA to crosslink in situ and thereby form crosslinked genomic DNA;

cleaving the proximally ligated complex to form proximally ligated DNA fragments;

obtaining a plurality of said proximally ligated DNA fragments to form a library;

sequencing the plurality of proximally ligated DNA fragments to obtain a plurality of sequence reads,

and wherein the method further comprises, after the fragmenting step,

labeling the first genomic DNA fragment or the second genomic DNA fragment with a marker;

ligating the first and second genomic DNA fragments such that the marker is between them to form a labeled chimeric DNA molecule; and

cleaving the labeled chimeric DNA molecule to form labeled, proximally linked DNA fragments,

and wherein the labeling step is performed by filling in the ends of the first or second genomic DNA fragments with nucleotides labeled with the markers,

wherein the method is not a diagnostic method.

41. A method for targeted haplotype analysis of an organism comprising:

providing a cell of the organism containing a genome having genomic DNA;

incubating the cells or nuclei thereof with a fixation agent for a period of time to allow in situ cross-linking of the genomic DNA and thereby form cross-linked genomic DNA;

cleaving the proximally ligated complex to form proximally ligated DNA fragments;

sequencing a subset of the proximally-ligated DNA fragments to obtain a plurality of sequence reads, and assembling the plurality of sequence reads to construct a targeted haplotype, wherein the cross-linked genomic DNA is not hybridized prior to the step of fragmenting,

and wherein the method further comprises, after the fragmenting step,

labeling the first genomic DNA fragment or the second genomic DNA fragment with a marker;

ligating the first and second genomic DNA fragments such that the marker is between them to form a labeled chimeric DNA molecule; and

cleaving the labeled chimeric DNA molecule to form labeled, proximally linked DNA fragments,

and wherein the labeling step is performed by filling in the ends of the first or second genomic DNA fragments with nucleotides labeled with the markers,

wherein the method is not a diagnostic method.

42. The method of claim 40 or 41, wherein the marker is biotin.

43. The method of claim 42, wherein said obtaining step is performed using streptavidin.

44. The method of claim 43, wherein the streptavidin is immobilized to a bead.

Technical Field

The present invention relates to methods for haplotype determination, and in particular haplotype determination at the whole genome level, and targeted haplotype determination.

Background

Rapid progress in DNA shotgun sequencing technology has enabled the systematic identification of genetic variants in individuals (Wheeler et al, Nature 452, 872-876 (2008); Pushkarev et al, Nature Biotechnology 27, 847-850 (2009); Kitzman et al, Science Translational Medicine 4, 137ra176 (2012); and Levy et al, PLoS Biology 5, e254 (2007)). However, since the human genome consists of two homologous sets of chromosomes, understanding the true genetic makeup of an individual requires delineation of the maternal and paternal copies of the genetic material, or haplotypes. Obtaining the haplotypes of an individual is useful in several ways. First, haplotypes are clinically useful for predicting donor-recipient matching outcomes in organ transplantation (Crawford et al, Annual Review of Medicine 56, 303-320 (2005) and Petersdorf et al, PLoS Medicine 4, e8 (2007)) and are increasingly used to detect disease associations (Studies et al, Nature 447, 655-660 (2007); Cirulli et al, Nature Reviews Genetics 11, 415-425 (2010); and Ng et al, Nature Genetics 42, 30-35 (2010)). Second, in genes that exhibit compound heterozygosity, haplotypes provide information about whether two deleterious variants are located on the same or different alleles, which greatly affects the prediction of whether inheritance of these variants is deleterious (Musone et al, Nature Genetics 40, 1062-. In complex genomes (e.g., humans), compound heterozygosity may involve genetic or epigenetic variations at non-coding cis-regulatory sites that are located far from the genes they regulate (Sanyal et al, Nature 489, 109-113 (2012)), underscoring the importance of obtaining chromosome-span haplotypes. Third, haplotypes from groups of individuals provide information about population structure (International HapMap, C. et al, Nature 449, 851-861 (2007); Genome Project, C. et al, Nature 467, 1061-. Finally, the recently described widespread allelic imbalance in gene expression suggests that genetic or epigenetic differences between alleles may contribute to quantitative differences in expression (Gimelbrant et al, Science 318, 1136-. Therefore, understanding haplotype structure will be critical for delineating the mechanisms by which variants contribute to these allelic imbalances. Overall, knowledge of the complete haplotype structure of an individual is critical to the advancement of personalized medicine.

Recognizing the importance of haplotypes, several groups have sought to expand the understanding of haplotype structure at both the population and the individual level. Initiatives such as the International HapMap Project and the 1000 Genomes Project have attempted to systematically reconstruct haplotypes from sequencing data of unrelated individuals using linkage disequilibrium measurements, or by genotyping family trios. However, the average length of the correctly phased haplotypes generated using this approach is limited to about 300 kb (Fan et al, Nature Biotechnology 29, 51-57 (2011) and Brown et al, American Journal of Human Genetics 81, 1084-. A number of experimental approaches have also been developed to facilitate haplotype phasing of individuals, including LFR sequencing, mate-pair sequencing, fosmid sequencing, and dilution-based sequencing (Levy et al, PLoS Biology 5, e254 (2007); Bansal et al, Bioinformatics 24, i153-159 (2008); Kitzman et al, Nature Biotechnology 29, 59-63 (2011); Suk et al, Genome Research 21, 1672-. At best, these methods can reconstruct haplotypes ranging from a few kilobases to about one megabase, but none has achieved chromosome-span haplotypes. Whole chromosome haplotype phasing has been accomplished using fluorescence-assisted cell sorting (FACS) based sequencing, chromosome segregation followed by sequencing, and chromosome microdissection based sequencing (Fan et al, Nature Biotechnology 29, 51-57 (2011); Yang et al, Proceedings of the National Academy of Sciences of the United States of America 108, 12-17 (2011); and Ma et al, Nature Methods 7, 299-. However, these methods are low resolution, as they can phase only a fraction of the heterozygous variants in an individual, and more importantly, they are technically challenging to perform or require specialized instrumentation. Recently, whole genome haplotyping has been performed by genotyping sperm cells (Kirkness et al, Genome Research 23, 826-832 (2013)). Although this method can generate genome-spanning haplotypes at high resolution, it is not applicable to the general population and requires deconvolution of complex meiotic recombination patterns.

Along with whole genome haplotyping, targeted haplotyping is also important. In particular, targeted haplotype analysis of the HLA (human leukocyte antigen) locus can aid in recipient-donor matching for organ transplantation and help elucidate the role of cis-regulatory elements in gene activity.

Computational analysis has shown that an important factor in haplotype reconstruction from previously established shotgun DNA sequencing methods is the length of the sequenced genomic fragment (Tewhey et al, Nature Reviews Genetics 12, 215-. For example, longer haplotypes can be obtained by mate-pair sequencing (fragment or insert size of about 5 kb) than by conventional genomic sequencing (fragment or insert size of about 500 bp). However, there are technical limits on how long these fragments can be. For example, it is difficult to clone DNA fragments longer than those obtained by fosmid cloning. Thus, using existing shotgun sequencing methods, it is difficult to generate haplotype blocks of more than about one megabase, even at ultra-deep sequencing coverage.

Therefore, there is a need for methods of reconstructing haplotypes at the genome-wide level, as well as methods of targeted haplotype analysis.

Summary of The Invention

The present invention addresses the above unmet needs by providing methods for reconstructing haplotypes at the whole genome level and methods for reconstructing haplotypes at targeted regions of the genome.

Thus, the invention features a method for whole chromosome haplotype analysis of an organism. The method comprises providing cells of the organism containing a genome having genomic DNA (a set of chromosomes); incubating the cell or its nucleus with a fixation agent for a period of time and digesting the fixed DNA with a restriction enzyme, thereby allowing in situ proximity ligation of the genomic DNA to form ligated genomic DNA; fragmenting the ligated genomic DNA to form a proximally ligated complex having a first genomic DNA fragment and a second genomic DNA fragment; obtaining a plurality of proximally ligated DNA fragments to form a library; sequencing the plurality of proximally ligated DNA fragments to obtain a plurality of sequence reads; and assembling the plurality of sequence reads to construct a chromosome-span haplotype of one or more chromosomes.

The invention also provides methods for targeted haplotype analysis of an organism. The method includes providing a cell of the organism that contains a genome having genomic DNA; incubating the cell or its nucleus with a fixation agent for a period of time and digesting the fixed DNA with a restriction enzyme to allow in situ proximity ligation of the genomic DNA to form ligated genomic DNA; fragmenting the ligated genomic DNA to form a proximally ligated complex having a first genomic DNA fragment and a second genomic DNA fragment; contacting the proximally ligated DNA fragments with one or more oligonucleotides that hybridize to preselected regions of a subset of the proximally ligated fragments to provide a subset of proximally ligated fragments that hybridize to the oligonucleotides; separating the subset of proximally ligated fragments from the oligonucleotides; sequencing the subset of proximally ligated DNA fragments to obtain a plurality of sequence reads; and assembling the plurality of sequence reads to construct a targeted haplotype. In one embodiment, the oligonucleotide is immobilized.

In certain embodiments, the method further comprises isolating nuclei from the cells prior to the incubating step. Methods for isolating cell nuclei are known in the art. For example, methods for isolating nuclei from plant cells are described by Lee et al, (2007) The Plant Cell 19: 731-.

In some embodiments, the method further comprises purifying the ligated genomic DNA prior to the fragmenting step. In other embodiments, the method further comprises, after the fragmenting step, labeling the first or second genomic DNA fragments with a marker; ligating the first and second genomic DNA fragments such that the marker is located therebetween to form a labeled chimeric DNA molecule; and cleaving the labeled chimeric DNA molecule to form labeled, proximally linked DNA fragments.

In the above methods, the fragmenting step may be performed by various methods known in the art. For example, it may be performed by enzymatic cleavage, including cleavage mediated by restriction enzymes, DNases, or transposases. In one embodiment, this step is performed by digesting the ligated genomic DNA with restriction enzymes to form digested genomic DNA fragments. Any suitable restriction enzyme (e.g., BamHI, EcoRI, HindIII, NcoI, or XhoI) or a combination of two or more of these restriction enzymes may be used. The fixation agent may comprise formaldehyde, glutaraldehyde, or formalin. The labeling step may be performed by filling in the ends of the first or second genomic DNA fragments with nucleotides labeled with a marker (e.g., biotin). In this case, the obtaining step may be performed using streptavidin, which may be immobilized on beads. The ligating step may be performed by ligating the first genomic DNA fragment and the second genomic DNA fragment using a ligase, either in solution or on a solid substrate. Ligation on a solid substrate is referred to herein as "tethered chromosome capture". Sequencing may be performed using paired-end sequencing.

In one embodiment of the invention, each paired-end sequencing read may be at least 20 bp in length, for example 20-1000 bp in length or preferably 20-150 bp in length (e.g., 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, or 150 bp in length). For haplotype analysis of each chromosome, the library contains at least 15x sequence coverage, e.g., 25-30x sequence coverage. Preferably, the first and second genomic DNA fragments are on the same chromosome, i.e., in cis. Preferably, the first genomic DNA fragment and the second genomic DNA fragment are at least 100 bp apart in situ, such as 100 bp to 100 Mb apart (e.g., 100 bp, 1 kb, 10 kb, 1 Mb, 10 Mb, 20 Mb, 30 Mb, 40 Mb, 50 Mb, 60 Mb, 70 Mb, 80 Mb, 90 Mb, or 100 Mb).
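By way of illustration only, the relationship between read length, coverage, and sequencing depth described above can be sketched as a short calculation. The function name and the 60 Mb chromosome length below are hypothetical and are not taken from the specification; the sketch assumes that mean coverage equals total sequenced bases divided by chromosome length.

```python
import math

def read_pairs_for_coverage(target_coverage, chrom_length_bp, read_length_bp):
    """Number of paired-end read pairs needed to reach a target mean coverage.

    Assumes coverage = (pairs * 2 * read length) / chromosome length.
    """
    bases_needed = target_coverage * chrom_length_bp
    bases_per_pair = 2 * read_length_bp
    return math.ceil(bases_needed / bases_per_pair)

# Hypothetical example: 60 Mb chromosome, 100 bp paired-end reads, 15x coverage.
pairs_needed = read_pairs_for_coverage(15, 60_000_000, 100)
print(pairs_needed)  # 4500000
```

The same calculation can be repeated for any read length in the 20-150 bp range to compare sequencing requirements.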

The methods can be used in a variety of organisms, including prokaryotes and eukaryotes. Such organisms include fungi, plants, and animals. In a preferred embodiment, the organism is a plant. In another preferred embodiment, the organism is a mammal or a mammalian embryo, or a human or a human embryo. In one embodiment, the human is a donor or recipient of an organ. In this case, the organ may be haplotyped using the methods of the invention prior to transplantation into a recipient with a matching haplotype. The methods of the invention may be used with diploid, aneuploid, or polyploid cells, e.g., certain cancerous cells.

The details of one or more embodiments of the invention are set forth in the description below. Other features, objects, and advantages of the invention will be apparent from the description and from the claims.

The present invention includes the following embodiments:

Brief Description of Drawings

FIGS. 1a-c are a set of graphs showing a comparison of HaploSeq with other methods for reconstructing haplotypes of organisms: (a) the figures outline several methods for phasing haplotypes; (b) frequency distribution of insert sizes from conventional Whole Genome Sequencing (WGS), mate-pairs and Hi-C; (c) the graph shows the role of proximity ligation reads in the construction of haplotypes of chromosomal spans.

FIGS. 2a-c are a set of graphs showing that proximity-ligation products are predominantly intra-haplotype: (a) whole-genome interaction frequency heat map; (b) interaction frequency between any two fragments (log10 scale) as a function of linear distance; (c) probability of h-trans interaction as a function of insert size.

FIGS. 3a-d are a set of graphs showing that HaploSeq allows accurate, high-resolution reconstruction of chromosome-span haplotypes: (a) a map of Hi-C reads (upper and lower bars) derived from the 129 allele, spanning about 30 Mb of chromosome 18 and used to join variants into a single chromosome-span haplotype; (b) a table of results of Hi-C based haplotype phasing in the CASTxJ129 system; (c) comparison, by simulation, of the ability of haplotype phasing methods to generate complete haplotypes; (d) analysis of Adjusted Span (AS) for haplotype phasing.

FIGS. 4a-d are a set of graphs showing haplotype reconstruction in human GM12878 cells using HaploSeq: (a) the figure demonstrates the difference in variant frequency between mouse (CASTxJ129) and human (GM12878) at the Hoxd13/HOXD13 gene; (b) the table describes the completeness ("% chromosome spanned in MVP block"), resolution ("% variants phased in MVP block"), and accuracy ("% accuracy of variants phased in MVP block") of haplotype reconstruction by HaploSeq analysis under low variant density in the CASTxJ129 system; (c) table of results of HaploSeq-based haplotype reconstruction in GM12878 cells; (d) Hi-C generated seed haplotypes span the centromeres of metacentric chromosomes.

FIGS. 5a-d are a set of graphs showing that HaploSeq analysis combined with local conditional phasing allows high-resolution haplotype reconstruction in humans: (a) the figure depicts the ability to perform local conditional phasing; (b) the table demonstrates the resolution and overall accuracy of haplotype phasing using HaploSeq after local conditional phasing in GM12878 cells; (c) the figure demonstrates the ability to obtain a chromosome-span seed haplotype (MVP block) under different read length and coverage parameters; (d) the plot illustrates the ability of different combinations of read length and coverage to produce a high-resolution seed haplotype.

FIG. 6 is a graph showing the probability of h-trans interaction for each CASTxJ129 chromosome, plotted as a function of insert size.

FIGS. 7a-d are a set of graphs illustrating completeness, accuracy, and resolution in haplotype phasing: (a) nucleotide bases represent heterozygous SNPs, and "-" represents no variability; (b) haplotype phasing of the MVP block, indicating resolution; (c) the true haplotypes are known a priori, and this knowledge helps measure the accuracy of the predicted de novo haplotypes; inaccurately phased variants are shown at the grey box positions; (d) the different metrics.

FIGS. 8a-b are a set of graphs showing a constrained HapCUT model that allows only fragments up to a certain maximum insert size (maxIS), where at higher maxIS the resolution of the MVP block is higher (a) but the accuracy is lower (b).

FIG. 9 is a chart showing the capture-HiC protocol.

FIGS. 10a-b show the capture-HiC probe design: (a) UCSC Genome Browser view of the human HLA locus (hg19) and (b) zoomed-in UCSC Genome Browser view of the HLA-DQB1 gene, demonstrating the probe targeting method.

Detailed Description

Rapid advances in high-throughput DNA sequencing technologies have accelerated the pace of personalized medicine research. Although methods for variant discovery and genotyping from Whole Genome Sequencing (WGS) datasets are well established, linking variants along a chromosome into individual haplotypes remains a challenge.

Whole genome haplotype analysis and reconstruction

The present invention provides novel methods for haplotype analysis that combine proximity ligation and DNA sequencing techniques with probabilistic algorithms for haplotype assembly (Dekker et al, Science 295, 1306-. The method, referred to as "HaploSeq" (for "Haplotyping using Proximity-Ligation and Sequencing"), reconstructs complete or targeted haplotypes by proximity ligation and shotgun DNA sequencing. As disclosed herein, HaploSeq has been experimentally validated in a hybrid mouse embryonic stem cell line and a human lymphoblastoid cell line with known complete haplotypes. It is demonstrated herein that, using HaploSeq, chromosome-span haplotypes can be reconstructed in mice, linking over 95% of alleles with an accuracy of about 99.5%. In a human cell line, HaploSeq combined with local conditional phasing, using genome sequencing at only 17x coverage, yielded chromosome-span haplotypes with about 98% accuracy at about 81% resolution. These results establish the utility of proximity ligation and sequencing for haplotyping in human populations.

One embodiment of the HaploSeq method of the present invention is shown in FIG. 1. Briefly, FIG. 1a compares HaploSeq with other methods for reconstructing the haplotypes of an individual and outlines several methods for phasing haplotypes. Unlike previous methods, proximity ligation joins DNA fragments that are distant in linear sequence but spatially close. These ligated fragments are then isolated from the cells and sequenced.

FIG. 1b shows the frequency distribution of insert sizes from conventional WGS, mate-pair (Gnerre, S. et al, Proceedings of the National Academy of Sciences of the United States of America 108, 1513-1518 (2011)), and Hi-C libraries. The x-axis is in base pairs (log10 scale). The figure represents a random subset of data points taken from previous publications on GM12878 cells, across chromosomes 1-22. For fosmids (Kidd et al, Nature 453, 56-64 (2008)), the size distribution of the clones inferred after alignment is shown. Hi-C insert sizes were obtained from libraries generated in the inventors' laboratory. The size of the inserts and clones correlates with the ability to reconstruct longer haplotypes. Among these methods, only proximity-ligation-based Hi-C generates abundant long fragments.

FIG. 1c shows the role of proximity-ligation reads in establishing chromosome-span haplotypes. The top and bottom sequences represent regions of two homologous chromosomes, where "-" represents no variability and nucleotides represent heterozygous SNPs. Heterozygous SNPs and indels can be used to distinguish the homologous chromosomes. (i) Local haplotype blocks ("block 1" and "block 2") can be established from short-insert sequencing reads, similar to what occurs in conventional WGS or mate-pair sequencing. Given the distance between the variants, these small haplotype blocks remain unphased relative to each other. (ii) Regions located far apart in linear sequence can be brought into close proximity in situ. These contacts are preserved by proximity ligation. (iii) The large insert size of the proximity-ligation sequencing reads helps to merge the smaller haplotype blocks into a single chromosome-span haplotype.
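By way of illustration only, the merging of two local haplotype blocks by long-range reads, as described for FIG. 1c, can be sketched as follows. This is a simplified majority-vote sketch, not the probabilistic haplotype assembly algorithm referred to herein; the function names, the 0/1 allele coding, and the example positions are hypothetical.

```python
def orient_block(block1, block2, linking_reads):
    """Return block2 oriented so that its first haplotype matches block1's.

    block1, block2: dict mapping variant position -> allele (0 or 1) carried
                    by the block's first haplotype; the second haplotype
                    carries the other allele.
    linking_reads:  list of ((pos1, allele1), (pos2, allele2)), one entry per
                    read pair that covers a variant in each block.
    """
    votes = 0
    for (pos1, allele1), (pos2, allele2) in linking_reads:
        on_first_in_block1 = block1[pos1] == allele1
        on_first_in_block2 = block2[pos2] == allele2
        # A read pair supports the current orientation when it falls on the
        # same-numbered haplotype in both blocks.
        votes += 1 if on_first_in_block1 == on_first_in_block2 else -1
    if votes == 0:
        raise ValueError("linking reads do not favor either orientation")
    if votes > 0:
        return dict(block2)
    return {pos: 1 - allele for pos, allele in block2.items()}

def merge_blocks(block1, block2, linking_reads):
    """Merge two local blocks into one block using long-range read pairs."""
    merged = dict(block1)
    merged.update(orient_block(block1, block2, linking_reads))
    return merged

# Hypothetical example: two blocks about 5 Mb apart, joined by three read pairs.
block_1 = {100: 0, 250: 1}
block_2 = {5_000_000: 1, 5_000_300: 0}
reads = [((100, 0), (5_000_000, 0)),
         ((250, 0), (5_000_300, 0)),
         ((100, 1), (5_000_000, 1))]
merged_block = merge_blocks(block_1, block_2, reads)
```

In this example all three read pairs indicate that block 2 is recorded in the opposite orientation, so its alleles are swapped before merging.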

Hi-C technology is known in the art, and related protocols can be found in US20130096009 and Lieberman-Aiden et al, Science 326, 289-293 (2009), the contents of which are incorporated herein by reference. In one embodiment, the Hi-C method comprises purification of the ligation products followed by massively parallel sequencing. In one embodiment, the Hi-C method allows unbiased identification of chromatin interactions across the entire genome. In one embodiment, the method may include, but is not limited to, the steps of cross-linking the cells using formaldehyde; digesting the DNA with a restriction enzyme that leaves 5' overhangs; filling in the 5' overhangs with nucleotides that include a biotinylated residue; and ligating the blunt-ended fragments under dilute conditions that favor ligation events between cross-linked DNA fragments. In one embodiment, the method can produce a DNA sample containing ligation products consisting of fragments that were originally in close spatial proximity in the nucleus, marked with a biotin residue at the junction. In one embodiment, the method further comprises creating a library (for example, a Hi-C library). In one embodiment, the library is created by shearing the DNA and selecting biotin-containing fragments using streptavidin beads. In one embodiment, the library is then analyzed using massively parallel DNA sequencing, generating a catalog of interacting fragments. See FIG. 1a.
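By way of illustration only, the restriction digestion step can be sketched in silico. The sketch assumes HindIII, one of the enzymes named above, which recognizes AAGCTT and cuts after the first A to leave a 5' overhang; it models the top strand only, and the function name and example sequence are hypothetical.

```python
def digest(sequence, site="AAGCTT", cut_offset=1):
    """Split a top-strand sequence at every occurrence of a restriction site.

    cut_offset is the number of bases of the site that stay on the upstream
    fragment (1 for HindIII, which cuts A^AGCTT).
    """
    fragments = []
    fragment_start = 0
    site_index = sequence.find(site)
    while site_index != -1:
        cut_position = site_index + cut_offset
        fragments.append(sequence[fragment_start:cut_position])
        fragment_start = cut_position
        site_index = sequence.find(site, site_index + 1)
    fragments.append(sequence[fragment_start:])
    return fragments

# Hypothetical 20 bp sequence containing two HindIII sites.
example_fragments = digest("GGGAAGCTTCCCAAGCTTTT")
print(example_fragments)  # ['GGGA', 'AGCTTCCCA', 'AGCTTTT']
```

Each downstream fragment begins with AGCT, the single-stranded 5' overhang that would be filled in with labeled nucleotides before blunt-end ligation.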

As disclosed herein and shown in FIG. 2, the proximity-ligation products obtained by the method of the present invention are predominantly intra-haplotype. FIG. 2a shows a whole-genome heat map of interaction frequencies. Hi-C reads originating from the CAST ("C") or J129 ("J") genome are distinguished based on the known haplotype structure of the parental strains. The interaction frequency between each allele of each chromosome was calculated using a 10 Mb bin size. The CAST and J129 alleles of each chromosome interact mainly in cis, confirming that the individual alleles occupy the chromosome territories seen in Hi-C data. The inset shows an enlarged view of the CAST and J129 alleles of chromosomes 12 to 16. Furthermore, FIG. 2b shows the interaction frequency between any two fragments (log10 scale) as a function of linear distance. Read pairs are classified as cis (top) or h-trans (bottom) interactions based on prior haplotype information. Cis interactions are several orders of magnitude more frequent than h-trans interactions. Of note, at large genomic distances (>100 Mbp), the frequency of cis interaction approaches that of h-trans interaction, and h-trans interactions account for <2% of all interactions. The plot was generated using data from chromosomes 1-19 in the CASTxJ129 system. Finally, FIG. 2c shows the probability of h-trans interaction as a function of insert size. The plot was generated using data from chromosomes 1-19 in the CASTxJ129 system. The LOWESS fit was performed with 2% smoothing. For insert sizes below 30 Mb, the probability that a read pair is an h-trans interaction is < 5% (dashed line). This cut-off value was therefore used as the maximum insert size for further analysis.
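By way of illustration only, the choice of a maximum insert size from the h-trans probability can be sketched as follows. The sketch replaces the LOWESS fit with plain binning by insert size, takes "h-trans" to mean a read pair joining the two homologous copies of a chromosome, and uses invented toy counts; the function names and the data are hypothetical.

```python
from collections import defaultdict

def h_trans_probability_by_bin(read_pairs, bin_size_bp):
    """Fraction of h-trans read pairs in each insert-size bin.

    read_pairs: iterable of (insert_size_bp, is_h_trans) tuples.
    Returns a dict mapping bin start (bp) -> h-trans fraction.
    """
    totals = defaultdict(int)
    h_trans_counts = defaultdict(int)
    for insert_size, is_h_trans in read_pairs:
        bin_start = (insert_size // bin_size_bp) * bin_size_bp
        totals[bin_start] += 1
        if is_h_trans:
            h_trans_counts[bin_start] += 1
    return {b: h_trans_counts[b] / totals[b] for b in sorted(totals)}

def max_insert_size(prob_by_bin, bin_size_bp, threshold=0.05):
    """Upper edge of the last consecutive bin whose h-trans fraction is below threshold."""
    limit = 0
    for bin_start in sorted(prob_by_bin):
        if prob_by_bin[bin_start] >= threshold:
            break
        limit = bin_start + bin_size_bp
    return limit

# Toy data (invented): h-trans fraction rises with insert size.
toy_pairs = (
    [(1_000, False)] * 99 + [(1_000, True)] * 1
    + [(15_000_000, False)] * 97 + [(15_000_000, True)] * 3
    + [(25_000_000, False)] * 96 + [(25_000_000, True)] * 4
    + [(35_000_000, False)] * 90 + [(35_000_000, True)] * 10
)
probs = h_trans_probability_by_bin(toy_pairs, bin_size_bp=10_000_000)
cutoff = max_insert_size(probs, bin_size_bp=10_000_000, threshold=0.05)
print(cutoff)  # 30000000
```

With these toy counts the first bin at or above the 5% threshold starts at 30 Mb, so read pairs with larger inserts would be excluded from haplotype assembly.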

The HaploSeq method of the present invention allows accurate, high-resolution, chromosome-span haplotype reconstruction. Fig. 3a shows a map of Hi-C reads generated from the J129 allele that span a total of about 30 Mb of chromosome 18 and are used to link variants into a single chromosome-span haplotype. The sequence of each Hi-C read is shown in black text, with variant positions shown in red and underlined. The sequence of the reference genome is shown in grey. At the variant positions, the a priori CAST and J129 haplotypes for each genotype are shown together with the haplotype predicted from the Hi-C data. At these four bases, the Hi-C prediction matches the known haplotype structure perfectly. HapCUT can then use the heterozygous variants as nodes and the overlapping reads as edges to form a graph structure.
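
The graph structure described above can be sketched as follows. This is only an illustration of the data structure (heterozygous variants as nodes, reads covering two or more variants as edges) and of how connected components delimit candidate haplotype blocks. It does not implement the phasing optimization itself, and the function names and input layout are hypothetical.

```python
from collections import defaultdict


def build_variant_graph(fragments):
    """Build an undirected graph of heterozygous variants.

    fragments: list of dicts {variant_position: observed_allele}, one per
    read pair (hypothetical layout). Variants seen on the same read pair
    are connected by an edge.
    """
    graph = defaultdict(set)
    for frag in fragments:
        positions = sorted(frag)
        for pos in positions:
            graph[pos]  # make sure isolated variants still appear as nodes
        for a, b in zip(positions, positions[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components as sorted lists of positions."""
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        seen.add(node)
        stack = [node]
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components
```

Because a single Hi-C read pair can connect variants that are megabases apart, a few such edges are enough to merge distant variants into one component, which is the property the chromosome-span haplotype relies on.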

The table in Fig. 3b shows the results of Hi-C-based haplotype phasing in the CASTxJ129 system. The "Phasable span of chromosome" column lists the number of phasable bases (the base pair distance between the first and last heterozygous variants). The "Variants spanned in MVP block" column lists the total number of heterozygous variants spanned by the MVP block of each chromosome; it is used as the denominator in estimating resolution. The "% chromosome spanned in MVP block" column lists the percentage of phasable bases spanned by the predicted haplotype. The "% variants phased in MVP block" column lists the percentage of the heterozygous variants spanned by the MVP block that were phased. The last column lists the accuracy of the phased heterozygous variants. For each chromosome, the inventors generated complete (spanning > 99.9% of bases), high-resolution (> 95% of heterozygous variants phased), and accurate (> 99.5% of heterozygous variants correctly phased) haplotypes.
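
The three quantities reported in the table (completeness, resolution, accuracy) can be computed as in the sketch below. The definitions follow the column descriptions above; the function name, the 0/1 haplotype encoding, and the decision to score accuracy under the better of the two possible label orientations are assumptions made for illustration.

```python
def phasing_metrics(het_positions, mvp_phase, true_phase):
    """Compute completeness, resolution and accuracy for one chromosome.

    het_positions: sorted positions of all heterozygous variants.
    mvp_phase: {position: 0 or 1} for variants phased in the MVP block.
    true_phase: {position: 0 or 1} known haplotype assignment.
    """
    phasable_span = het_positions[-1] - het_positions[0]
    phased = sorted(mvp_phase)
    block_start, block_end = phased[0], phased[-1]

    # Heterozygous variants spanned by the block: the resolution denominator.
    spanned = [p for p in het_positions if block_start <= p <= block_end]

    completeness = (block_end - block_start) / phasable_span
    resolution = len(phased) / len(spanned)

    # Haplotype labels are arbitrary, so score both orientations (assumption).
    compared = [p for p in phased if p in true_phase]
    agree = sum(1 for p in compared if mvp_phase[p] == true_phase[p])
    accuracy = max(agree, len(compared) - agree) / len(compared)

    return {
        "completeness": completeness,
        "resolution": resolution,
        "accuracy": accuracy,
    }
```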

Fig. 3c further compares, by simulation, the ability of different haplotype phasing approaches to generate complete haplotypes. The inventors simulated 75 base pair paired-end sequencing data (chromosome 19) at 20x coverage, mimicking conventional shotgun sequencing (mean 400, sd 100), mate-pair (mean 4500, sd 200), and fosmid (mean 35000, sd 2500) libraries. The first read of each pair was placed randomly in the genome, and the second read was placed according to the normal distribution parameters given above. The inventors subsampled the CASTxJ129 data to generate 20x Hi-C fragments, which were used for HaploSeq analysis. The y-axis represents the span of the MVP block as a function of the phasable span of chromosome 19. The MVP block in HaploSeq spans the entire chromosome, while in the other approaches the MVP block spans only a portion of the chromosome. The inventors also combined the 20x sequencing coverage of each method with 20x conventional WGS data, for a total of 40x coverage, to compare the methods at higher coverage.
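
The read-pair simulation described above can be sketched as follows. The sketch assumes that the quoted mean and sd describe the outer insert size in base pairs and that pairs falling off the chromosome are simply redrawn; both points, along with the function names, are assumptions and not details given in the source.

```python
import random


def pairs_for_coverage(chrom_length, coverage, read_length=75):
    """Number of read pairs needed to reach the requested coverage."""
    return int(coverage * chrom_length / (2 * read_length))


def simulate_read_pairs(chrom_length, n_pairs, mean_insert, sd_insert,
                        read_length=75, seed=0):
    """Place the first read uniformly at random and the second read at a
    normally distributed insert size from it.

    Returns a list of (start_of_read_1, start_of_read_2) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        start1 = rng.randrange(0, chrom_length - read_length + 1)
        insert = int(round(rng.gauss(mean_insert, sd_insert)))
        start2 = start1 + insert - read_length
        # Redraw pairs whose reads would overlap backwards or leave the chromosome.
        if insert < read_length or start2 + read_length > chrom_length:
            continue
        pairs.append((start1, start2))
    return pairs


# Example: shotgun-like pairs (mean 400, sd 100) on a toy 1 Mb chromosome.
shotgun_pairs = simulate_read_pairs(1_000_000, 1000, 400, 100)
```

Calling the same function with (4500, 200) or (35000, 2500) would mimic the mate-pair and fosmid settings quoted in the text.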

Fig. 3d shows an analysis of the Adjusted Span (AS) for haplotype phasing. The AS is defined as the product of the span and the fraction of heterozygous variants phased in the block. Haplotype blocks were ranked by the number of heterozygous variants phased in each block (rank on the x-axis), and the cumulative AS across the entire chromosome is shown on the y-axis. In the case of HaploSeq, a single MVP block spans 100% of the chromosome and contains 90% of the phased variants. In the other approaches, the cumulative percentage phased increases as non-MVP blocks are included. The dashed lines represent the increased 40x coverage obtained by combining with WGS data, as discussed above.
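
The Adjusted Span and its cumulative curve can be written down directly from the definition above. The sketch assumes that "fraction of heterozygous variants phased in the block" means phased variants divided by the heterozygous variants the block spans, and that the curve is normalized by the phasable span of the chromosome; the tuple layout and function names are hypothetical.

```python
def adjusted_span(block_start, block_end, n_phased, n_spanned):
    """AS = block span x fraction of spanned heterozygous variants phased."""
    return (block_end - block_start) * (n_phased / n_spanned)


def cumulative_adjusted_span(blocks, phasable_span):
    """Cumulative AS as a fraction of the phasable span.

    blocks: list of (start, end, n_phased, n_spanned) tuples.
    Blocks are ranked by the number of variants phased, largest first,
    so the first entry corresponds to the MVP block.
    """
    ranked = sorted(blocks, key=lambda block: block[2], reverse=True)
    total = 0.0
    curve = []
    for start, end, n_phased, n_spanned in ranked:
        total += adjusted_span(start, end, n_phased, n_spanned)
        curve.append(total / phasable_span)
    return curve
```

For a single block spanning the whole chromosome with 90% of its variants phased, the curve has one point at 0.9, which matches the HaploSeq case described in the text.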

The HaploSeq method of the invention also allows haplotype reconstruction in human cells (e.g., GM12878 cells). To this end, Fig. 4a illustrates the difference in variant frequency between mouse (CASTxJ129) and human (GM12878) at the Hoxd13/HOXD13 gene. Hi-C read coverage (log10 scale) at these loci is also shown. At high SNP density (mouse), a Hi-C read is more likely to contain variants (shown as "read over SNP"). This in turn allows these variants to be more easily connected to the MVP block. At low variant density (human) this is not the case, so there are "gaps" where variants remain unphased relative to the MVP block.

Furthermore, the table in Fig. 4b shows the completeness ("% chromosome spanned in MVP block"), resolution ("% variants phased in MVP block"), and accuracy ("% accuracy of variants phased in MVP block") of haplotype reconstruction by HaploSeq analysis at low variant density in the CASTxJ129 system. Variants in the CASTxJ129 genome were subsampled to 1 heterozygous variant per 1500 bases and phased as described above. The inventors still produced haplotypes that were both complete (> 99% of the chromosome span) and accurate (> 99% accuracy). However, at low variant density the resolution of the phased variants is reduced (about 32%). Numbers are rounded to three decimal places.
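
Subsampling variants to a target density can be sketched as below. The source does not state how the subsampling was performed; uniform random sampling without replacement is assumed here, and the function name and parameters are hypothetical.

```python
import random


def subsample_variants(positions, chrom_length, bases_per_variant=1500, seed=0):
    """Randomly keep about one heterozygous variant per `bases_per_variant` bases.

    positions: list of variant positions on one chromosome.
    Returns the retained positions in sorted order.
    """
    target = chrom_length // bases_per_variant
    if target >= len(positions):
        # Already at or below the requested density: keep everything.
        return sorted(positions)
    rng = random.Random(seed)
    return sorted(rng.sample(positions, target))
```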

In addition, the table in Fig. 4c summarizes the results of HaploSeq-based haplotype reconstruction in GM12878 cells. The results show completeness ("% chromosome spanned in MVP block") and resolution ("% variants phased in MVP block"). The inventors were able to generate chromosome-span haplotypes (> 99%), although at a lower resolution (about 22%). In GM12878 cells, the inventors generated about 17x coverage, compared with about 30x in the CASTxJ129 system. Consequently, the inventors observed a lower resolution (22%) than in the low-density CASTxJ129 case (32%). Numbers are rounded to three decimal places.

As shown in Fig. 4d, the method of the present invention allows the generation of seed haplotypes that span the centromere of a metacentric chromosome. Two regions on either side of the chromosome 2 centromere are shown. The two Hi-C-generated seed haplotypes are arbitrarily designated "A" and "B". The actual haplotypes of the GM12878 individual, known from trio sequencing, are shown below and are likewise arbitrarily designated "A" and "B". The Hi-C-generated seed haplotypes match the actual haplotypes on both sides of the centromere. Notably, some variants in the actual haplotypes remain unphased in the seed haplotypes, which gives rise to the "gaps" in the seed haplotypes. In addition, the actual haplotypes do not contain all variants, because the trio sequencing was performed at low depth; the seed haplotypes therefore contain some phased variants that are absent from the actual haplotypes (see, e.g., the third variant in the AAK1 region).

The HaploSeq analysis can be used in conjunction with other techniques, such as local conditional phasing, to allow high-resolution haplotype reconstruction in humans. Fig. 5a shows the performance of local conditional phasing. The x-axis is the resolution of the chromosome-span seed haplotype generated by simulation. The top panel shows the error rate of local conditional phasing, both uncorrected (top) and with neighborhood correction (bottom, window size 3). Because of the neighborhood correction, some variants cannot be inferred locally. The bottom panel shows the percentage of variants that remain unphased because of the neighborhood correction, as a function of resolution. All simulations were done on chromosome 1 of GM12878.
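
One possible reading of neighborhood correction is sketched below. The passage does not spell out the exact rule, so the following is an assumption: a locally inferred phase for an unphased variant is accepted only if the local phasing of up to `window` seed-phased variants on each side is fully consistent with the seed haplotype, in either orientation. Otherwise the variant stays unphased. The function name and the 0/1 encoding are hypothetical.

```python
from bisect import bisect_left


def neighborhood_corrected_phase(seed_phase, local_phase, window=3):
    """Extend a seed haplotype with locally inferred phases.

    seed_phase: {position: 0 or 1} from the chromosome-span seed haplotype.
    local_phase: {position: 0 or 1} from a local (e.g. population-based)
        phasing, whose 0/1 labels may be flipped relative to the seed.
    Returns a new dict containing the seed variants plus every variant
    whose local phase passed the neighborhood check.
    """
    seed_positions = sorted(seed_phase)
    result = dict(seed_phase)
    for pos in sorted(local_phase):
        if pos in seed_phase:
            continue
        i = bisect_left(seed_positions, pos)
        neighbors = seed_positions[max(0, i - window):i] + seed_positions[i:i + window]
        neighbors = [n for n in neighbors if n in local_phase]
        if not neighbors:
            continue
        matches = [seed_phase[n] == local_phase[n] for n in neighbors]
        if all(matches):
            result[pos] = local_phase[pos]        # same orientation as the seed
        elif not any(matches):
            result[pos] = 1 - local_phase[pos]    # consistently flipped orientation
        # Mixed agreement: leave the variant unphased.
    return result
```

Under this rule a stricter window trades resolution for accuracy, which is consistent with the text's remark that some variants remain unphased because of the correction.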

The table in Fig. 5b shows the resolution and overall accuracy of haplotype phasing by HaploSeq after local conditional phasing in GM12878 cells. Using local conditional phasing, the inventors increased the resolution from about 22% to about 81% on average. The table also shows the loss of resolution due to neighborhood correction (NC), which averages only about 3%. The inventors used a window size of 3 seed-haplotype-phased variants to check the performance of local phasing. In addition to improved resolution, the inventors obtained accurate haplotypes, with an overall accuracy of about 98%. The accuracy here reflects both the errors in the MVP block from the initial HaploSeq analysis and the errors from local conditional phasing. For some chromosomes, the accuracy was lower because of lower coverage (see Table 1 below).

The graph in Fig. 5c shows the ability to obtain chromosome-span seed haplotypes (MVP blocks) at different read length and coverage parameters. In all cases, a chromosome-span seed haplotype can be obtained with about 15x usable coverage. All simulations were done on chromosome 1 of GM12878. Similarly, the graph in Fig. 5d illustrates the ability of different combinations of read length and coverage to produce a high-resolution seed haplotype. In this case, a longer read length contributes to a higher resolution of the Hi-C-generated seed haplotype. All simulations were done on chromosome 1 of GM12878.

The inventors describe herein a new strategy for reconstructing chromosome-span haplotypes of an organism. In contrast to other haplotype analysis methods that reconstruct a complete haplotype from shotgun sequencing reads, the methods disclosed herein can generate chromosome-spanning haplotypes (Fan et al, Nature Biotechnology 29, 51-57 (2011); Yang et al, Proceedings of the National Academy of Sciences of the United States of America 108, 12-17 (2011); and Ma et al, Nature Methods 7, 299-. This method is well suited for use in clinical and laboratory settings because the reagents and instrumentation required by HaploSeq are readily available. Furthermore, the method is more broadly applicable than methods based on genotyping of sperm cells (Kirkness et al, Genome Research 23, 826-832 (2013)) because it allows the generation of whole-genome haplotypes from intact cells of any individual or cell line. Thus, HaploSeq has excellent utility in personalized medicine. Determining the haplotypes of individuals can identify new haplotype-disease associations, some of which have been identified on a smaller scale (He et al, American Journal of Human Genetics 92, 667-680 (2013); Zeng et al, Genetic Epidemiology 28, 70-82 (2005); and Chapman et al, Human Heredity 56, 18-31 (2003)). In addition, complete haplotypes will be crucial for understanding allelic bias in gene expression, which will contribute to the understanding of genetic and epigenetic polymorphisms in the population and their phenotypic consequences at the molecular level (Gimelbrant et al, Science 318, 1136-1140 (2007); Kong et al, Nature 462, 868-. In addition, HaploSeq can be used to identify genetic polymorphisms in cancer cells that lead to, or are markers of, resistance to cancer therapeutic drugs.
Finally, although the methods are exemplified with diploid cells in the examples below, experimental and computational improvements may allow haplotype reconstruction in cells with higher ploidy, such as cancer cells. This may help in understanding the consequences of the genetic alterations that are often seen in tumorigenesis.

Previously, proximity-ligation was used to study the spatial organization of chromosomes (Lieberman-Aiden et al, Science 326, 289-293 (2009)), rather than for genome-wide haplotype determination. As disclosed herein, it is also a valuable tool for studying the genetic makeup of individuals. Proximity-ligation-based approaches can tell not only which cis-regulatory element physically interacts with which target gene, but also which alleles of these are linked on the same chromosome. Proximity-ligation data can also be used for genotyping in the same manner as WGS. Although variants that are far from restriction enzyme cleavage sites are less likely to be genotyped because of the bias of proximity-ligation methods such as Hi-C, population-based imputation of ungenotyped variants (Browning et al, American Journal of Human Genetics 81, 1084-1097 (2007)) can be used to increase the number of genotype calls. Since all of this can be done in a single experiment, HaploSeq can be used as a general tool for whole-genome analysis.

Targeted haplotype analysis and reconstruction

HaploSeq can also be used for targeted haplotype analysis of different regions. Once the ligation step has been performed and a library of proximity-ligated fragments has been obtained, custom-designed oligonucleotides (which can be immobilized onto a solid surface) are introduced into the library in solution. These oligonucleotides "target" and hybridize to specific proximity-ligated fragments. The proximity-ligated fragments that hybridize to these oligonucleotides are isolated to provide a new library. This library now contains the subset of proximity-ligated fragments that could be captured by the custom oligonucleotides. These fragments are sequenced and assembled to generate targeted haplotypes. This method is useful for targeted haplotype analysis of different regions. For example, targeted haplotype analysis of the HLA region (also known as the human major histocompatibility complex locus or the human leukocyte antigen locus), which is about 3.5 Mb, can be performed by this method. Such targeted haplotype analysis of the HLA region is useful in predicting the outcome of donor-recipient matching in organ transplantation.

A schematic example of this targeted haplotype analysis is shown in Fig. 9. First, the cells are cross-linked and fixed, thereby capturing spatially adjacent DNA elements (top left). Next, the cells are digested with, for example, HindIII, and the fragmented ends are filled in with biotinylated nucleotides, followed by religation of the digested ends (top middle), as performed in the Hi-C protocol. After PCR amplification of the Hi-C fragments, the final Hi-C library consists of Hi-C ditags, which can be targeted by biotinylated RNA probes designed to capture specific Hi-C fragments (top right). Next, solution hybridization of the RNA probes and the Hi-C library can be performed using Oligonucleotide Capture Technology (OCT). Here, some Hi-C fragments are targeted by two RNA probes, other fragments are targeted by only one, and non-targeted sequences are not bound by any RNA probe (bottom right). Next, streptavidin-coated beads are used to bind the biotinylated RNA:DNA duplexes (bottom center), thereby extracting the targeted Hi-C fragments from the Hi-C library and creating a capture-HiC library. The bead-bound Hi-C library is then PCR amplified, purified, and subjected to next-generation sequencing (bottom left).

In the examples below, the above method was used for haplotype analysis of the human HLA region, which is about 3.5 Mb. The capture-HiC probe design used in this example is shown in Fig. 10. Probe sequences were first generated computationally using the SureDesign software suite (Agilent). A UCSC Genome Browser view of the human HLA locus (hg19) is shown in Fig. 10a. Fig. 10b shows a zoomed-in UCSC Genome Browser view of the HLA-DQB1 gene to illustrate the probe targeting approach. The inventors targeted +/-400 bp around the cleavage sites of the restriction enzyme used to prepare the Hi-C library, in this case HindIII ("targeting region" track). For the targeted regions, probes were designed at 4X tiling density, with the goal of having each nucleotide of the targeted sequence covered by up to 4 probe sequences. Note also that the probes themselves do not overlap the HindIII cleavage sites ("HLA probe" track). The inventors also chose not to target any sequence within the targeted regions that RepeatMasker identifies as containing repetitive sequence ("missing region" and "RepeatMasker" tracks).
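
The selection of target regions around restriction sites can be sketched as follows. The sketch locates HindIII recognition sites (AAGCTT) in a sequence and returns the flanking intervals of up to 400 bp on each side. Excluding the entire recognition site from the intervals is an assumption made so that probes cannot overlap the site, and repeat masking and probe tiling are not implemented. The function names are hypothetical and unrelated to the SureDesign software.

```python
def hindiii_sites(sequence, motif="AAGCTT"):
    """Return the 0-based start positions of every HindIII recognition site."""
    seq = sequence.upper()
    sites = []
    i = seq.find(motif)
    while i != -1:
        sites.append(i)
        i = seq.find(motif, i + 1)
    return sites


def target_regions(sequence, flank=400, motif="AAGCTT"):
    """Return half-open (start, end) intervals flanking each site.

    Each site contributes up to `flank` bases on the left and on the right,
    clipped to the sequence ends; the recognition site itself is excluded.
    """
    regions = []
    for site in hindiii_sites(sequence, motif):
        left = (max(0, site - flank), site)
        right = (site + len(motif), min(len(sequence), site + len(motif) + flank))
        for start, end in (left, right):
            if end > start:
                regions.append((start, end))
    return regions
```

Intervals from neighboring sites may overlap when sites are closer than twice the flank; a full design step would presumably merge them and then remove repeat-masked bases before tiling probes.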

The targeted haplotype analysis methods discussed herein, e.g., the capture-HiC method, provide the opportunity to phase the entire HLA locus into a single haplotype block, enabling better prediction of HLA type matches in cell and organ transplantation protocols. Several studies have revealed a number of disease-associated non-coding variants that are associated with specific HLA genes or alleles (Trowsdale et al, Annual Review of Genomics and Human Genetics 14, 301-. Thus, by delineating the single haplotype structure of HLA, the role of genetic variation in HLA-associated diseases and phenotypes can be systematically deconvolved.

As demonstrated herein, the capture-HiC method generally retains the chromatin interaction measurements detected by conventional Hi-C experiments. Thus, capture-HiC can be used as a means to obtain long-range interactions at a particular locus. For example, the haplotype-resolved long-range interaction mechanisms behind genomic imprinting can be revealed using capture-HiC. Although several groups currently use 4C and 5C techniques to study targeted chromatin interactions (Simonis et al, Nature Genetics 38, 1348-1354 (2006), and Dostie et al, Genome Research 16, 1299-1309 (2006)), capture-HiC provides a more flexible methodology. Specifically, 4C is limited to the analysis of interactions from a single viewpoint, while 5C is limited by complex primer design, limited throughput, and analysis of only contiguous genomic regions. In contrast, capture-HiC can be applied to detect interactions from thousands of viewpoints in a single experiment and can retrieve regional and customized 3D interaction frequencies in an unbiased way. In particular, capture-HiC can be adapted to capture randomly scattered genomic elements, provided that the elements are relatively close to restriction enzyme cleavage sites. For example, by applying capture-HiC to genome-wide promoters or other genomic elements, genome-wide 3D regulatory interaction maps can be generated at unprecedented resolution and relatively low cost.

The Hi-C protocol has recently proven useful in de novo assembly of genomes (Burton et al, Nat Biotechnol 31, 1119-1125 (2013), and Kaplan et al, Nat Biotechnol 31, 1143-1147 (2013)). Since capture-HiC yields a high-quality chromatin interaction dataset similar to Hi-C, this methodology can be used to generate diploid assemblies of complex regions of the human or other large genomes, such as the T cell receptor beta (Tcrb) locus (Spicuglia et al, Seminars in Immunology 22, 330-336 (2010)). In addition, diploid assembly of the highly heterozygous HLA locus at population scale can allow detection of new structural variants, enable accurate delineation of human migration patterns, and support association studies to discover personalized medicine for multiple disease states. Similarly, Hi-C has recently also been used in metagenomics studies to resolve the species present in complex microbial mixtures (Beitel et al, PeerJ, doi:10.7287/PeerJ.preprints.260v1 (2014), and Burton et al, Species-Level Deconvolution of Metagenome Assemblies with Hi-C-Based Contact Probability Maps. G3, doi:10.1534/g3.114.011825 (2014)). With the advent of capture-HiC, different loci can be captured that are informative and discriminative enough to delineate species mixtures based on the captured Hi-C fragments. In general, the capture-HiC method disclosed herein and its application to targeted phasing, as well as other applications, enable new approaches in personalized clinical genomics and biomedical research.

The term "marker" or "junction marker", as used herein, refers to any compound or chemical moiety that is capable of being incorporated into a nucleic acid and that can provide a basis for selective purification. For example, a junction marker may include, but is not limited to, a labeled nucleotide linker, a labeled and/or modified nucleotide, nick translation, a primer linker, or a tagged linker. The term "labeled nucleotide linker" refers to a class of junction markers comprising any nucleic acid sequence that comprises a label and is incorporated (e.g., ligated) into another nucleic acid sequence. For example, the tag can be used to selectively purify the nucleic acid sequence (i.e., for example, by affinity chromatography). Such tags may include, but are not limited to, a biotin tag, a histidine tag (i.e., 6His), or a FLAG tag.

The term "labeled nucleotide", "labeled base", or "modified base" refers to a junction marker comprising any nucleotide base attached to a marker, wherein the marker comprises a specific moiety having a unique affinity for a ligand. Alternatively, a binding partner may have affinity for the junction marker. In some examples, the marker includes, but is not limited to, a biotin tag, a histidine tag (i.e., 6His), or a FLAG tag. For example, dATP-biotin can be considered a labeled nucleotide. In some examples, fragmented nucleic acid sequences may be blunt-ended using labeled nucleotides, followed by blunt-end ligation.

The term "label" or "detectable label" as used herein refers to any composition that can be detected by spectroscopic, photochemical, biochemical, immunochemical, electrical, optical or chemical means. Such labels include biotin for staining with labeled streptavidin conjugates, magnetic beads (e.g., Dynabeads™), fluorescent dyes (e.g., fluorescein, Texas red, rhodamine, green fluorescent protein, etc.), radioactive labels (e.g., ³H, ¹²⁵I, ³⁵S, ¹⁴C, or ³²P), enzymes (e.g., horseradish peroxidase, alkaline phosphatase, and other enzymes commonly used in ELISA), and colorimetric labels such as colloidal gold or colored glass or plastic (e.g., polystyrene, polypropylene, latex, etc.) beads. The labels contemplated in the present invention can be detected by a number of methods. For example, radioactive labels may be detected using photographic film or scintillation counters, and fluorescent markers may be detected using a photodetector to detect the emitted light. Enzymatic labels are typically detected by providing the enzyme with a substrate and detecting the reaction product produced by the action of the enzyme on the substrate, and colorimetric labels are detected by simply visualizing the colored label.

The term "fragment" refers to any nucleic acid sequence that is shorter than the sequence from which it is derived. Fragments may be of any size, ranging from several megabases and/or kilobases to only a few nucleotides long. Experimental conditions, including but not limited to restriction enzyme digestion, sonication, acid incubation, base incubation, microfluidization, and the like, may determine the expected fragment size.

The term "chromosome", as used herein, refers to a naturally occurring nucleic acid sequence comprising a series of functional regions, known as genes, which typically encode proteins. Other functional regions may include microRNAs or long non-coding RNAs, or other regulatory elements. These proteins may have biological functions or they may interact directly with the same or other chromosomes (i.e., e.g., regulatory chromosomes).

The term "genomic region" or "region" refers to any defined length of a genome and/or chromosome. For example, a genomic region may refer to an association (i.e., for example, an interaction) between more than one chromosome. Alternatively, a genomic region may refer to a complete chromosome or a partial chromosome. In addition, a genomic region may refer to a particular nucleic acid sequence (i.e., for example, an open reading frame and/or a regulatory gene) on a chromosome.

The term "fragmentation" refers to any process or method by which a compound or composition is separated into smaller units. For example, the separation may include, but is not limited to, enzymatic cleavage (i.e., for example, transposase-mediated fragmentation, restriction enzymes acting on nucleic acids, or proteases acting on proteins), alkaline hydrolysis, acid hydrolysis, or heat-induced thermal destabilization.

The term "heat map" refers to any graphical representation of data in which the values taken by a variable in a two-dimensional map are represented as colors. Heat maps have been widely used to represent the expression levels of many genes across a number of comparable samples (e.g., cells in different states, samples from different patients), as obtained from DNA microarrays.

The term "genome" refers to any set of chromosomes and the genes they contain. For example, a genome can include, but is not limited to, a eukaryotic cell genome and a prokaryotic cell genome.

The terms "fix", "fixation" or "fixed" refer to any method or process that immobilizes any and all cellular processes. Thus, a fixed cell accurately maintains the spatial relationships between intracellular components at the time of fixation. Many chemicals are capable of providing fixation, including, but not limited to, formaldehyde, formalin, or glutaraldehyde.

The term "cross-linking" refers to any stable chemical association between two compounds such that they can be further processed as a unit. This stability may be based on covalent and/or non-covalent bonding. For example, nucleic acids and/or proteins can be cross-linked by chemical agents (i.e., for example, fixatives) such that they maintain their spatial relationships during conventional laboratory procedures (i.e., for example, extraction, washing, centrifugation, etc.).

The term "junction" refers to a unique linkage of two nucleic acid sequences joined by a junction marker. Such junctions may be created by processes that include, but are not limited to, fragmentation, filling in with labeled nucleotides, and blunt-end ligation. Such a junction reflects the proximity of the two genomic regions, thereby providing evidence of a functional interaction. To facilitate sequencing analysis, a junction comprising a junction marker may be selectively purified.

The term "ligated" as used herein refers to any linkage between two nucleic acid sequences, which typically comprises a phosphodiester bond. The linkage is typically facilitated by a catalytic enzyme (i.e., for example, a ligase) in the presence of cofactor reagents and an energy source (i.e., for example, adenosine triphosphate (ATP)).

The term "restriction enzyme" refers to any protein that cleaves nucleic acid at a specific base pair sequence.

The term "selective purification" refers to any process or method by which a particular compound and/or complex can be removed from a mixture or composition. For example, such a process may be based on affinity chromatography, wherein the particular compound to be removed has a higher affinity for the chromatography substrate than the remainder of the mixture or composition. For example, nucleic acids labeled with biotin can be selectively purified from a mixture comprising nucleic acids not labeled with biotin by passing the mixture through a column comprising streptavidin.

The term "purified" or "isolated" refers to a nucleic acid composition that has been subjected to a treatment (e.g., fractionation) to remove various other components, and which substantially retains its expressed biological activity. Where the term "substantially purified" is used, this designation will refer to a composition in which the nucleic acid forms the major component of the composition, e.g., constitutes about 50%, about 60%, about 70%, about 80%, about 90%, about 95% or more (i.e., for example, weight/weight and/or weight/volume) of the composition. The term "purified to homogeneity" is used to include compositions that have been purified to "apparent homogeneity" such that a single nucleic acid sequence is present (i.e., for example, based on SDS-PAGE or HPLC analysis). A purified composition is not intended to mean that no trace impurities remain. The term "substantially purified" also refers to molecules (nucleic acid or amino acid sequences) that are removed, isolated or separated from their natural environment and are at least 60% free, preferably 75% free, and more preferably 90% free of components with which they are naturally associated. Thus, an "isolated polynucleotide" refers to a substantially purified polynucleotide.

"nucleic acid sequence" or "nucleotide sequence" refers to an oligonucleotide or polynucleotide, as well as fragments or portions thereof, and refers to DNA or RNA of genomic or synthetic origin, which may be single-stranded or double-stranded, and represents the sense or antisense strand.

The term "isolated nucleic acid" refers to any nucleic acid molecule that has been removed from its native state (e.g., removed from a cell and, in preferred embodiments, free of other genomic nucleic acids).

The term "variant" of a nucleotide refers to a new nucleotide sequence that differs from the reference oligonucleotide by having deletions, insertions, and substitutions. These can be detected using a variety of methods (e.g., sequencing, hybridization assays, etc.). A "deletion" is defined as a change in the nucleotide or amino acid sequence in which one or more nucleotides or amino acid residues, respectively, are absent. An "insertion" or "addition" is a change in a nucleotide or amino acid sequence that has resulted in the addition of one or more nucleotides or amino acid residues. "substitution" results from the replacement of one or more nucleotides or amino acids by a different nucleotide or amino acid, respectively.

The term "homology" or "homologous", as used herein in reference to a nucleotide sequence, refers to the degree of complementarity to other nucleotide sequences. There may be partial homology or complete homology (i.e., identity). A nucleotide sequence that is partially complementary, i.e., "substantially homologous," to a nucleic acid sequence is a sequence that at least partially inhibits hybridization of a fully complementary sequence to the target nucleic acid sequence. Inhibition of hybridization of a fully complementary sequence to a target sequence can be examined under low stringency conditions using hybridization assays (Southern or Northern blots, solution hybridization, etc.). Under low stringency conditions, a substantially homologous sequence or probe will compete for and inhibit the binding (i.e., hybridization) of a fully homologous sequence to the target sequence.

The term "cancer treatment drug" is used herein to refer to all chemotherapeutic agents to which cancer cells can acquire chemoresistance over time. Examples include JAK/STAT inhibitors, PI3 kinase inhibitors, mTOR inhibitors, ErbB inhibitors, topoisomerase inhibitors, and the like.

Embodiment 1. A method for whole chromosome haplotype analysis of an organism comprising:

providing a cell of the organism containing a genome having genomic DNA;

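incubating the cells or nuclei thereof with a fixation agent for a period of time to allow the genomic DNA to crosslink in situ and thereby form crosslinked genomic DNA;

fragmenting the cross-linked genomic DNA and ligating adjacently positioned cross-linked and fragmented genomic DNA to form a proximally ligated complex having a first genomic DNA fragment and a second genomic DNA fragment;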

cleaving the proximally ligated complex to form proximally ligated DNA fragments;

obtaining a plurality of said proximally ligated DNA fragments to form a library;

sequencing the plurality of proximally ligated DNA fragments to obtain a plurality of sequence reads; and

assembling the plurality of sequence reads to construct a haplotype of the chromosomal span of one or more chromosomes.
