Trace reconstruction by reads with indeterminate errors

文档序号:723210 发布日期:2021-04-16 浏览:27次 中文

阅读说明:本技术 通过具有不定误差的读段的追踪重构 (Trace reconstruction by reads with indeterminate errors ) 是由 S·M·耶卡尼 M·Z·拉奇 于 2019-06-24 设计创作,主要内容包括:多核苷酸测序生成多核苷酸分子的多个读段。读段中的许多或所有包含误差。追踪重构需要由多核苷酸测序仪生成的多个读段,并使用这些多个读段来准确地重构多核苷酸分子的核苷酸序列。一些读段可以包含无法被校正的误差。因此,可能存在可以在其整个长度上被使用的读段以及具有无法被校正的不定误差的其他读段。当不定误差被发现时,读段具有误差的部分被跳过,并且读段在误差之后的序列被用于重构追踪,而不是丢弃整个读段。跳过的读段的量由误差后与其他读段的共有序列相匹配的子序列的定位确定。分析在由匹配的定位确定的定位处继续。(Polynucleotide sequencing generates multiple reads of a polynucleotide molecule. Many or all of the reads contain errors. The trace reconstruction requires multiple reads generated by a polynucleotide sequencer and uses these multiple reads to accurately reconstruct the nucleotide sequence of the polynucleotide molecule. Some reads may contain errors that cannot be corrected. Thus, there may be reads that can be used over their entire length as well as other reads with indeterminate errors that cannot be corrected. When an indeterminate error is found, the portion of the read with the error is skipped and the sequence of the read after the error is used to reconstruct the trace, rather than discarding the entire read. The amount of reads skipped is determined by the location of the subsequences that match the consensus sequence of other reads after an error. The analysis continues at the location determined by the matching location.)

1. A method of generating a consensus sequence from a plurality of reads of deoxyribonucleic acid (DNA) strands storing digital data, the plurality of reads generated with less than 30 x coverage by a sequencing technique that introduces a bursty error to reads of the plurality of reads, the method comprising:

omitting from the generation of a consensus output sequence a portion of the reads that contain the bursty error; and

generating the consensus output sequence using portions of the read from either side of the portion of the read containing the bursty error.

2. The method of claim 1, wherein the plurality of reads are generated with less than 25 x coverage.

3. The method of claim 1 or 2, further comprising: identifying an end of the portion of the read that contains the bursty error by identifying a location in the read that is flanked by at least one of: a backward matching region that matches the consensus sequence, or a forward matching region that matches a sequence generated by majority voting from sequences of at least two other reads of the plurality of reads.

4. The method of claim 1 or 2, wherein the consensus sequence is generated by: comparing values for the plurality of reads at the compared locations while aligning the plurality of reads to one another based on the insertions, deletions, and substitutions.

5. A method, comprising:

identifying an indeterminate error in a read of a plurality of sequence reads from a polynucleotide sequencer relative to a consensus output sequence of the plurality of sequence reads, the indeterminate error being located at a first position;

defining a search window comprising a second location in the read, the second location being located at least one distance delayed beyond the first location;

calculating edit distances for subsequences in the search window to comparison sequences, the comparison sequences including a backward matching sequence that is a common output sequence of corresponding positions in at least two of the plurality of sequence reads;

identifying a subsequence in the search window having an edit distance less than a threshold;

selecting a candidate position within the subsequence based on the length of the backward matching sequence;

setting a position of a comparison with at least two other reads from the plurality of sequence reads as a third position in the read, the third position being one position beyond the candidate position; and

determining the consensus output sequence from the plurality of sequence reads at the third location.

6. The method of claim 5, wherein the consensus output sequence of the plurality of sequence reads is determined by: identifying a majority vote for a considered position while accounting for insertion, deletion, and substitution errors, and the consensus output sequence of the plurality of sequence reads proceeds sequentially from a currently considered base call position to an adjacent base call position.

7. The method of claim 5 or 6, further comprising: a second sub-sequence in the search window having an edit distance less than the threshold is identified and one of the sub-sequence or the second sub-sequence that is closest to the candidate position is selected.

8. The method of claim 5 or 6, wherein the compared sequences further comprise a forward matched sequence that is a majority consensus vote for base calls from the at least two other reads from the plurality of sequence reads.

9. The method of claim 8, wherein the delay is between 4 and 10 positions in length, the search window is between 9 and 20 positions in length, the backward matching sequence is between 1 and 11 positions in length, and the forward matching sequence is between 1 and 10 positions in length.

10. The method of claim 9, wherein a sum of the length of the backward matching sequence and the length of the forward matching sequence is between 5 and 15 positions.

11. The method of claim 8, wherein the length of the delay, the length of the search window, the length of the backward matching sequence, and the length of the forward matching sequence are each experimentally determined by: different values for each length are tested and a combination of values that results in the maximum number of read clusters being recovered from the plurality of sequence reads is selected.

12. A system for alignment of sequence reads, comprising:

one or more processing units;

a memory coupled to the one or more processing units;

a compare anchor module stored in the memory and implemented by the one or more processing units to:

identifying an aligned anchor sequence having a predetermined sequence present in the first sequence read and the second sequence read,

dividing the first sequence read into a first sub-read extending from the beginning of the first sequence read to the alignment anchor sequence and a second sub-read extending from the alignment anchor sequence to the end of the first sequence read,

dividing the second sequence read into a third sub-read extending from the beginning of the second sequence read to the alignment anchor sequence and a fourth sub-read extending from the alignment anchor sequence to the end of the second sequence read; and

a read alignment module stored in the memory and implemented on the one or more processing units to:

aligning the first sub-read with the third sub-read based on the beginning of the first sequence read, the beginning of the second sequence read, and the alignment anchor sequence, and

aligning the second sub-read with the fourth sub-read based on the alignment anchor sequence, the end of the first sequence read, and the end of the second sequence read.

13. The system of claim 12, wherein the alignment anchor sequence has a sequence that is not aligned to itself.

14. The system of claim 12, further comprising an oligonucleotide synthesizer configured to synthesize a first polynucleotide molecule with the first sequence read comprising only one instance of the aligned anchor sequences and synthesize a second polynucleotide sequence, wherein the second polynucleotide sequence comprises only one instance of the aligned anchor sequences.

15. The system of any of claims 12 to 14, further comprising a consensus output sequence generator module stored in the memory and implemented on the one or more processing units to:

generating a first forward consensus sequence from the first sub-read and the third sub-read;

generating a first reverse consensus sequence from the first sub-read and the third sub-read;

combining the first forward consensus sequence and the first reverse consensus sequence into a first combined consensus sequence;

generating a second forward consensus sequence from the second sub-read and the fourth sub-read;

generating a second reverse consensus sequence from the second sub-read and the fourth sub-read;

combining the second forward consensus sequence and the second reverse consensus sequence into a second combined consensus sequence;

attaching the second combination consensus sequence to the end of the first combination consensus sequence by alignment at the alignment anchor sequence; and

deleting the aligned anchor sequences.

Background

Much of the data in the world today is stored on magnetic and optical media. Tape technology has recently found a significant density increase for individual tape cartridges storing 185TB and is the most dense form of storage available commercially today, on the order of 10GB/mm3. Recent studies have reported the feasibility of optical discs capable of storing 1PB, resulting in about 100GB/mm3The density of (c). Despite this improvement, Zernike bytes (2) are stored70Bytes or gigabytes) of data would still take millions of units and use a large amount of physical space. Storage density is but one aspect of storage media; durability is also important. The lifetime of the rotating disc is 3 to 5 years, and the lifetime of the magnetic tape is 10 to 30 years. Long-term archival storage requires data refreshing to replace failed cells and refresh techniques.

The demand for data storage has grown exponentially, but the capacity of existing storage media has not kept pace. Polymers of deoxyribonucleic acid (DNA) are capable of storing information at high densities. Theoretical density limit of 1 EByte/mm3(109GB/mm3). Less than 100 grams of DNA can store all artificial data in the world today. DNA is also very durable, and under certain storage conditions, a half-life of over 500 years is observed. Therefore, DNA is attractive as an information storage technology due to its high information density and long life. Yet another advantage of DNA as a storage medium is its continued relevance. The operating system and standards of the storage medium will change, potentially making data on the old storage system inaccessible. But DNA-based storage has the benefit of a persistent association: there are strong reasons to maintain a technology that can read and manipulate DNA as long as there is a life based on DNA.

Although DNA storage systems have advantages, it must overcome several challenges. For example, DNA synthesis, degradation during storage, and sequencing are potential sources of error. Thus, the DNA sequence output by the sequencer may differ from the DNA sequence originally supplied to the oligonucleotide synthesizer.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Binary data currently used by computers to store text files, audio files, video files, software and the like can be represented as a series of nucleic acids (i.e., DNA or ribonucleic acid (RNA)) in a polynucleotide. There are various techniques for representing 0 and 1 of binary data as a series of nucleotides. A variety of techniques are known to those of ordinary skill in the art. The polynucleotide sequence is designed to store binary data and then synthesized using an oligonucleotide synthesizer. The synthesized polynucleotides are placed in a storage device and eventually read by a polynucleotide sequencer. Data generated by the polynucleotide sequencer is decoded to recover the stored binary data. The machines that write and read polynucleotide sequences are not 100% accurate and introduce errors. Some types of errors (such as insertions, deletions or substitutions of nucleotides) may be identified and corrected. Other types of errors (particularly "bursty" errors, where there are multiple errors in local "bursts" that are adjacent or near each other) may be difficult or impossible to correct. Thus, with some decoding techniques, sequence reads (reads) that include bursty errors may not be available. The present disclosure provides techniques for extracting useful information from reads of a polynucleotide sequence that includes a sudden error or other type of indeterminate error (indeterminant error).

Polynucleotide sequencing of a batch of polynucleotides can produce reads that have a total length many times greater than the total length of all polynucleotides in the sequencing batch. This is called "depth of coverage". A depth of coverage greater than 1 (e.g., 10 ×, 20 ×, 30 ×, etc.) indicates that on average, the number of times each polynucleotide is provided to the sequencer is indicated by the depth of coverage. Because this is an average, some polynucleotides may not be sequenced at all, while others are sequenced many times more than the coverage level. For a single sequencing run, the depth of coverage can be calculated by dividing the total number of base pairs sequenced by the total length of the polynucleotide provided to the sequencing machine. Each read in a plurality of sequencing reads of the same polynucleotide may have a slightly different sequence. Reads that are similar to each other and therefore likely all generated by sequencing the same polynucleotide may be grouped together in clusters for further analysis. On average, the number of reads in a cluster is the same as the depth of coverage.

An analysis of the reads in the cluster is performed to identify a single common output sequence from the plurality of reads in the cluster. This type of analysis may be referred to as "trace reconstruction". The consensus output sequence is more likely to represent the actual nucleotide sequence in the source polynucleotide molecule than any single read. Techniques for creating consensus sequences from multiple reads are known to those of skill in the art. Some techniques identify errors in a single location of a read and correct the errors based on the sequence of other reads at the same location. Errors that may be identified and corrected include insertions, deletions, and substitutions. However, when multiple errors are close together, such as an insertion followed by a substitution, followed by two deletions, current techniques for creating a consensus output sequence fail to correctly identify the errors, and the entire read may be ignored in determining the consensus output sequence. Techniques discussed in this disclosure include identifying the location of indeterminate errors, skipping portions of reads that contain indeterminate errors, and determining a consensus output sequence using subsequent base calls in the reads, rather than ignoring or discarding data from reads with indeterminate errors.

The indeterminate error may be identified at a location in the read where the consensus sequence derived from other reads in the same cluster do not match and the type of error cannot be determined. For example, if a given base call in a read differs from the base calls of all other reads at the same position, but the difference cannot be identified as an insertion, deletion, or substitution error, then the base call may be characterized as an indeterminate error. A set number of positions may be skipped and the search window of reads down is evaluated to determine if there are matching subsequences between reads with indeterminate errors and other reads in the cluster. Each sub-sequence of a given length within the search window may be evaluated to determine whether it matches a sequence from other reads. Matches can be determined in part by matches between base calls in reads with varying errors and base calls of the consensus sequence. Additionally, matches can be determined by comparing the base calls of reads with varying errors to simple majority votes of other reads. Simple majority voting will actually take into account the most common base calls at a given position, without taking into account errors such as insertions, deletions and substitutions.

If a match is found, then a single position within the matching subsequence is taken as a candidate position, and subsequent positions adjacent to the candidate position are used as positions for comparison to continue using the reads for determining a common output sequence. Thus, the position in the read before the indeterminate error can be used with some or all of the other reads in the cluster to determine a portion of the consensus output sequence. The portions of the reads with indeterminate errors and some number of locations after the indeterminate errors are ignored, and therefore the common output sequence for these locations is determined by the other reads in the cluster. After the candidate location is located, the next location will again be used with other reads in the cluster to determine a consensus output sequence. Thus, rather than ignoring all data provided by a read that includes indeterminate errors, only the portions of the read near the indeterminate errors are ignored, and the portions of the read before and after the errors may be used for determination of the consensus output sequence.

The alignment of the multiple reads in the cluster (alignment) may be based on the first position in the read or the last position in the read. As the generation of the consensus output sequence proceeds from location to location (e.g., "front" to "back" or "back" to "front"), errors and uncertainties may accumulate as insertions and deletions cause reads to be out of phase with respect to each other. Known sequences intentionally inserted into all reads at the time of synthesis serve as alignment anchors and provide a reference location (location) other than the end to realign multiple reads. Thus, multiple reads in a cluster may be aligned relative to each other based on the alignment anchor and based on the first and last locations of the reads. Reads may be designed to include one or more alignment anchors. Because the alignment anchor includes alternate alignment points in addition to the beginning and end of the read, the read may be split into shorter segments at the alignment anchor, and each of the shorter segments may be individually evaluated to determine a common output sequence. The consensus output sequences for each of the shorter segments may be re-joined to create a single consensus output sequence for the full length of the original read.

Particular implementations and examples in this disclosure may refer to DNA only; however, it is to be understood that the present disclosure is equally applicable to any polynucleotide including DNA, RNA, DNA-RNA hybrids, and polynucleotides including synthetic or non-natural bases.

Drawings

The detailed description is set forth with reference to the accompanying drawings. In the drawings, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears. The use of the same reference symbols in different drawings indicates similar or identical items.

FIG. 1 shows an illustrative architecture for tracking the operation of a reconstruction system.

FIG. 2 is an illustrative schematic showing the use of a tracking reconstruction system.

FIG. 3 shows an illustrative representation of a substitution error identified in accordance with the techniques of this disclosure.

Fig. 4 shows an illustrative representation of a missing error identified in accordance with the techniques of this disclosure.

Fig. 5 shows an illustrative representation of insertion errors identified in accordance with the techniques of this disclosure.

FIG. 6 shows an illustrative representation of indeterminate error and inactivity tracking in accordance with the techniques of this disclosure.

Fig. 7 shows an illustrative representation of introducing a delay into an inactive track and locating candidate locations within the inactive track.

FIG. 8 shows placement of alignment anchors within a read.

FIG. 9 shows the generation of partial consensus sequences by segmentation of reads partitioned by alignment anchors.

FIG. 10 shows a block diagram of an illustrative tracking reconstruction system.

FIGS. 11A and 11B show an illustrative process for determining a consensus output sequence from multiple reads.

Fig. 12A and 12B show an illustrative process for generating binary data from reads received from a polynucleotide sequencer.

FIG. 13 shows an illustrative process for generating a consensus output sequence by omitting a portion of a read that contains a bursty error.

FIG. 14 shows an illustrative process for analyzing reads containing indeterminate errors in determining whether there are located subsequences that exceed the indeterminate errors that match other reads.

Detailed Description

As mentioned above, DNA has great potential as a storage medium for digital information. However, dealing with errors that may corrupt data is one of the challenges of storing digital data using DNA. There are many steps involved in converting digital data into synthetic DNA molecules and then recovering the digital data from the synthetic DNA molecules. The techniques described in this disclosure provide techniques for using some of the information in a DNA read when the read includes an indeterminate error.

The term "DNA strand" or simply "strand" refers to a DNA molecule. As used herein, "DNA reads," "sequence reads," or simply "reads" refer to data strings generated by a polynucleotide sequencer when the polynucleotide sequencer reads the sequence of a DNA strand. Because reads are strings of data, they may also be referred to as "strings". However, reads generated by polynucleotide sequencers often contain errors and thus do not represent the structure of a DNA strand with 100% accuracy. Fortunately, most DNA sequencing techniques are capable of generating multiple reads of a DNA strand. The reads are referred to as "noisy reads" because each read may contain one or more errors, the distribution of which is substantially random. A given read may also be error free, but unless referenced to each other, it may not be known which reads are error free and which contain errors. The techniques of this disclosure use multiple different noise reads for a single DNA strand to create a common output sequence that is likely to represent the true sequence of the DNA strand. The consensus output sequence is a data string similar to any read, but the consensus output sequence is generated by analysis of the reads, rather than being output directly from the polynucleotide sequencer. The process of changing from many noisy reads to a presumably accurate, common output sequence is called "chase reconstruction".

Naturally occurring DNA strands are composed of four types of nucleotides: adenine (a), cytosine (C), guanine (G) and thymine (T). A DNA strand or polynucleotide is a linear sequence of these nucleotides. The two ends of a DNA strand (referred to as the 5 'and 3' ends) are chemically different. DNA sequences are conventionally represented starting from the 5' nucleotide end. The interaction between different strands is predictable based on sequence: the two single strands may bind to each other and form a double helix (if they are complementary): a in one strand aligns with T in the other strand, and C and G align identically. Two strands in a duplex have opposite directionality (5 'end attached to the 3' end of the other strand), and thus, the two sequences are "reverse complements" of each other. The two strands need not be perfectly complementary to each other to be able to bind to each other. Ribonucleic acid (RNA) has a similar structure to DNA, and naturally occurring RNA consists of four nucleotides A, C, G and uracil (U), rather than T. For the sake of brevity and readability, the discussion in this disclosure refers only to DNA, but RNA can be used instead of or in combination with DNA.

The tracking reconstruction problem can be stated using the following mathematical notation. Let Σ denote a finite alphabet, e.g., Σ, { a, C, G, T }. Let X be an element of sigmanIt may be arbitrary or random for the sequence of interest. The goal is to accurately reconstruct the sequence of DNA strand X through a collection of noise reads.

Some sequencing techniques, such as sequencing by synthesis (discussed more below), have error distributions in which noise is distributed independently, at least to some extent. Thus, it is possible to provideThe noise can be modeled as being distributed independently and identically throughout the chain (i.i.d.). This type of noise may be created using the synthesized test data. Doing so, let Y1、Y2、…、YmIs an i.i.d. sequence obtained from X in the following manner. Let p bed、piAnd psRespectively represent deletion, insertion and substitution probabilities such that p ═ pd+pi+ps∈[0,1]. To obtain a noise read Y, the following is done starting with an empty string and for the compared position j 1, 2.

(error free) with a probability of 1-p, copying X [ j ] to the end of Y, and increasing j by 1;

(absence) probability of pdJ is increased by 1;

(insert) probability piIs to be compared with X [ j]Copying to the end of Y, adding a random symbol at the end of Y, and increasing j by 1;

(substitution) probability of psA random symbol is added at the end of Y and j is incremented by 1.

Other sequencing techniques, such as nanopore (nanopore) sequencing (discussed more below), have different error distributions, where the errors are not distributed independently but tend to be clustered such that if one error is present, the other error is likely to be nearby. Nanopore sequencing has more errors than sequencing by synthesis. Compared to about 1-2%, about 12%. By creating localized high error regions, a "burst" of errors can be created in the string. In addition to single position errors, nanopore sequencing results may also have large errors, such as the exchange of adjacent base blocks.

Identifying a single consensus sequence from multiple noise reads-trace reconstruction-output estimation readsWhich is an estimate of the true sequence X of the DNA strand. The estimate is created from the number m noisy reads Y. Therefore, the temperature of the molten metal is controlled,the goal is to reconstruct X accurately, i.e. to minimizeMinimizing when considering or ignoring errors in noisy readsA different probability than X can be accomplished by using as many portions of the noise read Y.

Fig. 1 shows an illustrative architecture 100 for implementing a tracking reconstruction system 102. In short, digital information intended for DNA molecule storage is converted into information representing a string of nucleotides. Information indicating a nucleotide string (i.e., an alphabetical string indicating the base order of nucleotides) is used as a DNA synthesis template, which instructs the oligonucleotide synthesizer 104 to synthesize DNA molecule nucleotides by nucleotide chemistry. Artificially synthesized DNA allows the creation of synthetic DNA molecules with an arbitrary series of bases, where the individual monomers of the bases are assembled together as a polymer of nucleotides. The oligonucleotide synthesizer 104 can be any oligonucleotide synthesizer using any recognized DNA synthesis technique. The term "oligonucleotide" as used herein is defined as a molecule comprising two or more nucleotides.

The coupling (coupling) efficiency of the synthesis process is the probability that a nucleotide will bind to an existing partial strand at each step of the process. Although the coupling efficiency of each step can be higher than 99%, this small error still leads to an exponential decrease in product yield with increasing length and limits the size of oligonucleotides that can currently be efficiently synthesized to around 200 nucleotides. Thus, the length of the stored DNA strand is about 100 to 200 base pairs (bp). This length will increase with the development of oligonucleotide synthesis technology.

The synthetic DNA produced by the oligonucleotide synthesizer 104 may be transferred to a DNA repository 106. There are many possible ways to construct the DNA repository 106. In addition to structures at the molecular level by attaching identification sequences to DNA strands, the DNA repository 106 may be constructed by physically separating DNA strands into one or more pools of DNA 108. Here, the DNA pool 108 is shown as a flip-top tube, which represents a physical container for multiple DNA strands. When DNA is stored in a liquid solution, DNA strands are generally most easily manipulated by biotechnology. Thus, the DNA pool 108 may be implemented as a liquid-filled chamber, in many implementations water, and there may be thousands, millions, or more individual DNA molecules in the DNA pool 108.

In addition to being in a liquid suspension, the DNA strands in the DNA repository 106 may also be in a glassy (or vitreous) state, as a lyophilized product, as part of a salt, adsorbed on the surface of the nanoparticles or in another format. The structure of the DNA pool 108 may be implemented as any type of mechanical, biological, or chemical arrangement that maintains a volume of liquid comprising DNA in physical location. The storage device may also be in a non-liquid form, such as a solid bead or by encapsulation. For example, a single flat surface on which a droplet is present is one implementation of the DNA pool 108, and the droplet is held in part by the surface tension of the liquid, even if not completely enclosed within the container. DNA pool 108 may include single stranded DNA (ssdna), double stranded DNA (dsdna), single stranded RNA (ssrna), double stranded RNA (dsrna), DNA-RNA hybrid strands, or any combination including the use of non-natural bases.

DNA strands deleted from the DNA repository 106 may be sequenced with the polynucleotide sequencer 110. In some implementations, DNA strands can be prepared to be sequenced by amplification using the Polymerase Chain Reaction (PCR) to create a large number of DNA strands that are identical copies of each other. The need for PCR amplification prior to sequencing may depend on the particular sequencing technique used. Although the level of PCR is much lower than current sequencing techniques, PCR itself can be a source of error. Currently, PCR techniques typically introduce about one error per 10,000 bases. Thus, on average, there is one error in the PCR results for every 100 reads of 100 bases. The errors introduced by PCR are typically randomly distributed, so the tracking reconstruction system can correct for some PCR-induced errors.

As mentioned above, the polynucleotide sequencer 110 reads the order of nucleotide bases in a DNA strand and generates one or more reads from the strand. The polynucleotide sequencer 110 uses a variety of techniques to interpret molecular information and can introduce errors into the data in a systematic and random manner. Errors can generally be classified as substitution errors, where the true nucleotide is replaced by an incorrect base call (e.g., a is exchanged for G), an insertion or deletion, where a random base call is inserted (e.g., AGT to AGCT) or a deletion (e.g., AGTA to ATA). Transposition of exchanging longer DNA regions is another type of error. Each position in a read is a single base call determined by the polynucleotide sequencer 110 based on properties sensed by components of the polynucleotide sequencer 110. The various properties sensed by the polynucleotide sequencer 110 vary depending on the particular sequencing technology used. Base calling refers to determining which of the four nucleotide bases (A, G, C and T (or U)) in a DNA (or RNA) strand is present at a given position in the strand. Sometimes, base calls are erroneous, and this is a source of error introduced by sequencing. Polynucleotide sequencing includes any method or technique used to generate base calls from a DNA or RNA strand.

A sequencing technique that may be used is sequencing by synthesis (S.) (Sequencing). Synthetic sequencing is based on the amplification of DNA on a solid surface using a fold-back PCR and anchor primers. The DNA is fragmented and adapters are added to the 5 'and 3' ends of the fragments. DNA fragments attached to the surface of the flow cell channel are extended and bridge amplified. The fragments become double stranded and the double stranded molecules are denatured. Multiple cycles of denaturing solid phase amplification can then be performed to create millions of clusters of about 1,000 copies of the same template single-stranded DNA molecule in each channel of the flow cell. Primers, DNA polymerase and four fluorophore-labeled reversible terminator nucleotides were used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophore, and an image is captured and the identity of the first base is recorded. The 3' terminator and fluorophore from each incorporated base are deleted and the incorporation, detection and identification steps are repeated.

Another example of a sequencing technique that can be used is nanopore sequencing. Nanopores are small pores of about 1 nanometer in diameter. Immersing the nanopore in a conducting fluid and applying an electrical potential across the nanopore results in a small current generated due to conduction of ions through the nanopore. The amount of current flowing through the nanopore is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule blocks the nanopore to a different degree. Thus, as a DNA molecule passes through a nanopore, a change in the current passing through the nanopore represents a reading of the DNA sequence.

Other sequencing technologies currently known include single molecule real-time (SMRT) from Pacific BiosciencesTM) Technique, Helicos true Single molecule sequencing (tSMS), SOLID available from Applied BiosystemsTMTechniques, the use of chemically sensitive field effect transistor (chemFET) arrays, and the use of electron microscopy to read nucleotide base sequences.

All techniques used to sequence DNA are associated with some degree of error, and the type and frequency of error varies from sequencing technique to sequencing technique. For example, sequencing-by-synthesis creates errors in about 2% of base calls, and the errors tend to be distributed independently and uniformly (i.i.d.). These errors are mostly substitution errors. The error rate for nanopore sequencing is higher, approximately 15% to 40%, and most of the errors caused by this sequencing technique are deletions. Errors in the sequence generated by nanopore sequencing tend to occur in clusters, so if one location is in error, then there is a high probability that an adjacent location will also be in error. Thus, the error may be considered a "burst". The error distribution for a particular sequencing technique may describe the overall frequency of the errors, the nature of the distribution of the errors, and the relative frequencies of various types of errors.

In some implementations, the polynucleotide sequencer 110 provides quality information that indicates a confidence level for the accuracy of a given base call. The quality information may indicate that there is a high confidence level or a low confidence level in a particular base call. For example, the quality information may be expressed as a percentage of the accuracy of the base call, such as an 80% confidence. Additionally, the quality information may be expressed as a confidence level that each of the four bases is the correct base call for a given position in the DNA strand. For example, the quality information may indicate an 80% confidence that a base call is T, an 18% confidence that a base call is a, a 1% confidence that a base call is G, and a 1% confidence that a base call is C. Thus, the result of this base call will be T, since the confidence that this nucleotide is the correct base call is higher than any other nucleotide. The quality information does not identify the source of the error, but merely suggests which base calls are more or less likely to be accurate.

The polynucleotide sequencer 110 provides output, in electronic format, to the tracking reconstruction system 102, a plurality of noise reads (typically a plurality of DNA strands). The output may include quality information as metadata that is otherwise associated with the reads generated by the polynucleotide sequencer 110.

The tracking reconstruction system 102 may be implemented as an integral part of the polynucleotide sequencer 110. The polynucleotide sequencer 110 may include an on-board computer that implements the tracking reconstruction system 102. Alternatively, the tracking reconstruction system 102 may be implemented as part of a separate computing device 112, the computing device 112 being directly connected to the polynucleotide sequencer 110 through a wired or wireless connection that does not cross a network. For example, the computing device 112 may be a desktop computer or a notebook computer that is used to receive data from the polynucleotide sequencer 110 and/or control the polynucleotide sequencer 110. The wired connection may include one or more wires or cables that physically connect the computing device 112 to the polynucleotide sequencer 110. The wired connection may be created through a headset cable, a telephone cable, a SCSI cable, a USB cable, an ethernet cable, FireWire, or the like. The wireless connection may be created by radio waves (e.g., bluetooth, ANT, any version of Wi-Fi IEEE 802.11, etc.), infrared light, etc. The tracking reconstruction system 102 can also be implemented as part of a cloud or network-based system using one or more servers 114 in communication with the polynucleotide sequencer 110 via a network 116. The network 116 may be implemented as any type of communications network, such as a local area network, a wide area network, a mesh network, an ad hoc network, a peer-to-peer network, the internet, a cable network, a telephone network, and so forth. Additionally, the tracking reconstruction system 102 may be implemented in part by any combination of the polynucleotide sequencer 110, the computing device 112, and the server 114.

FIG. 2 illustrates the use of the tracking reconstruction system 102 as part of the process of decoding information stored in a synthetic DNA strand 200. The synthetic DNA strand 200 is a molecule with a specific nucleotide base sequence. As shown in fig. 1, a synthetic DNA strand 200 may be stored in DNA pool 108. The synthetic DNA strand 200 may be present in the DNA pool 108 as a single-stranded molecule, or may be hybridized with a complementary ssDNA molecule to form dsDNA. The polynucleotide sequencer 110 generates the output of multiple noise reads 202 from a single synthetic DNA strand 200. The number of reads correlated with the depth of coverage of sequencing. The greater the depth of sequencing, the more average sequences are generated from the DNA strands. Due to sequencing errors, many of the multiple reads of the same DNA strand may have different sequences. Each of the reads has a length (n), which in this example is nine (corresponding to nine bases in the synthetic DNA strand 200). In actual sequencer data, noisy reads may have arbitrary lengths that are not all equal to each other. Deletions and insertions are one cause of read length variation. The length of a given read may be denoted as n, but n is not necessarily the same for all reads. In practical implementations, the length of the reads may be between 100 and 200 due to current limitations on the maximum length of DNA strands that can be artificially synthesized. The location on the read may be referred to as a "position," such as from position one to position nine in this example. As used herein, "base" refers to the location of a given monomer in a DNA molecule, while "position" refers to the location along a data string such as a read. Thus, assuming no error, the third base in the synthetic DNA strand 200 corresponds to the third position in the reads generated by the polynucleotide sequencer 110.

In this example, the number (m) of noisy reads 202 provided to the tracking reconstruction system 102 is 5. However, any number may be used. In some implementations, the number of noisy reads 202 provided to the tracking reconstruction system 102 may be 10, 20, or 100. The number of noisy reads 202 provided to the tracking reconstruction system 102 may be less than the total number of reads generated by the polynucleotide sequencer 110. A subset of the total number of reads generated by the polynucleotide sequencer 110 may be randomly selected or heuristically used for analysis by the tracking reconstruction system 102. In addition to random selection, other techniques may be used to select which subset of reads to pass to the trace reconstruction system 102. For example, the quality information may be used to identify the m reads with the highest confidence in the base call from all reads generated by the polynucleotide sequencer 110. In some implementations, only reads having certain lengths are selected.

The tracking reconstruction system 102 analyzes the noise reads 202 and generates a common output sequence 204 in accordance with the techniques of this disclosure. Consensus output sequence 204 represents the nucleotide sequence in synthetic DNA strand 200 with less error than any of the individual noise reads 202 and ideally no error.

The converter 206 converts the consensus output sequence 204 into binary data 208, thereby retrieving the digital information stored in the DNA repository 106. Converter 206 may use additional error correction techniques to correct any errors that may remain in common output sequence 204. Thus, the tracking reconstruction system 102 does not have to correct all types of errors, as there are other error correction techniques that can be used to recover the binary data 208.

Although the implementations discussed herein involve obtaining binary data 208 from reads of synthetic DNA strands 200, the tracking reconstruction system 102 works equally well on reads of natural DNA strands. The output from the polynucleotide sequencer 110 is a plurality of noise reads 202 for both synthetic and natural DNA. Thus, in implementations that do not involve the use of synthetic DNA to store binary data 208, the tracking reconstruction system 102 may be used to remove errors from reads generated by the polynucleotide sequencer 110.

Fig. 3 illustrates a technique for identifying a substitution error. Reads may be aligned at the starting location or any other location, such as an alignment anchor discussed in more detail below. The starting position may correspond to the 5' end of the DNA strand from which the reads were generated. In the drawings of the present disclosure, the 5' end is oriented to the left. Location of comparison across reads300 are represented by solid rectangular boxes. As each location in the read is analyzed in turn, the compared locations 300 may move from left to right along the read. Following the compared position 300 is a look-ahead window 302 represented by two dashed rectangular boxes. Look-ahead window 302 "looks forward" to the right of compared position 300 or toward the 3' end. I.e. if the read is represented as YjAnd the compared position 300 is denoted as p [ j ]]Then the look-ahead window 302 of length w is formed by Yj[p[j]+1],...Yj[p[j]+w]The constituent substrings. As the compared location 300 moves, the look-ahead window 302 may move along the read. In this example, the length of look-ahead window 302 is two locations, but it may be longer, such as three, four, or more.

The majority of the consensus bases 304 are the most frequent base calls at the position 300 of the comparison. Here, four of the five reads have a G at that location, and one read has a T. Since G is the most numerous base calls, most of the common bases 304 are G. In some implementations, the majority of the common bases 304 can be determined by considering quality information of the corresponding base calls at the compared positions 300. Each base call at the compared position 300 may be weighted based on the associated quality information. For example, if a given base call is G with 80% confidence, then 0.8G would be counted when determining the majority consensus base 304, while a 30% confidence that a given base call is C would be counted when determining the majority consensus base 304 as 0.3C. Thus, the confidence of a single base call can be considered when identifying the majority of common bases 304 for a given compared position 300. Additionally or alternatively, all base calls having quality information indicating that the confidence of the base call is less than a threshold level (e.g., 15%) may be omitted from the determination of the multi-base call 304. The majority of the common bases 304 may be randomly selected from two or more different base calls if they are present at equal frequency.

Reads having a base call at position 300 of comparison that differs from the majority of the common bases 304 are referred to as variant reads. Thus, for variant reads YkBase call Yk[p[k]]Is shared by a pluralityBase 304 does not match. In this example, the variant read is the third chain 308. When analyzed at position 300 for a given comparison, there may be zero, one, or more than one variant read in any read packet.

The look-ahead window consensus 306 is determined from the look-ahead window 302 in a similar manner as the majority of the consensus bases 304. The determination of the look-ahead window sharing 306 may also be affected by the quality information. The look-ahead window consensus 306 may be based on base calls weighted by their respective confidence levels and/or by omitting base calls having confidence levels below a threshold. The look-ahead window total 306 is determined by considering reads at the compared position 300 that are not variant reads. Thus, here, look-ahead window 302 is shown to cover non-variant reads but not variant reads 308. In this example, the most common base call in the first position of look-ahead window 302 is C, and the most common base call in the second position of look-ahead window 302 is T. Thus, the look-ahead window has 306 strings of two positions that are base calls: and (6) CT.

Next, base calls in the look-ahead window 310 of variant reads (CT) are compared to the look-ahead window consensus 306 (CT). Because they match, the mismatch at the second location in the third read 308 is classified as a substitution. In the mathematical notation, if Yk[p[k]+1],...Yk[p[k]+w]In common with the look-ahead window, then Y will bekA mismatch in (b) is classified as a substitution. Most of the consensus base 304 is used as the base call for that position in consensus output sequence 204.

After the error types of the variant reads are classified, the compared locations 300 are moved to continue analyzing the reads. For each read that is not a variant read, the compared position 300 is shifted one position to the right. In this example, these are a first read, a second read, a fourth read, and a fifth read. For variant reads where the error type is classified as replacement (here, the third read 308), the compared position 300 is also shifted one position to the right. Thus, as shown in the lower portion of FIG. 3, the compared position 300 is shifted one position to the right for all reads. The analysis repeats at location 312 of this new comparison, and in this iteration, a second read 314 is identified as a variant read.

Fig. 4 illustrates a technique for identifying missing errors. The compared positions 400 are analyzed again to determine the most frequent base calls at that position. In this example, three of the five strands have base calls T, one strand has base calls G, and one strand has base calls C. Thus, the most common base call is T, and the majority of the common bases 402 for that position in the read are T. The first chain 410 and the fourth chain are identified as variant reads.

Base calls in a look-ahead window 404 of the strands of non-variant reads are compared to determine a look-ahead window total 406. In this example, the values of the two base calls in the look-ahead window 404 are GA, and TG for the three non-variant reads. Thus, the most common base call series is GA, and this becomes the look-ahead window total 406.

The value of the base calls (AG) in the look-ahead window 408 of the first strand 410 is different from the look-ahead window consensus 406 (GA). Thus, the type of error responsible for the mismatch in the first chain 410 is not classified as a substitution. However, the base calls in the compared positions 400 and all positions except the final position of the look-ahead window 404(GA) match the look-ahead window consensus 406 (GA). Thus, the error type for that position of the first chain 410 is classified as missing. In this example, the look-ahead window 404 is 2 in length (w), so all positions except the final position of the look-ahead window 404 are w-1 or the first base of the look-ahead window 404. If the length (w) of the look-ahead window 404 is 3, then the first two bases (3-1) of the look-ahead window 404 will be considered when determining whether the type of error in the variant reads is a deletion. In the mathematical notation, if Yk[p[k]],...Yk[p[k]+w-1]In common with the look-ahead window, then Y will bekA mismatch in (b) is classified as missing.

After the error type of the variant read is classified, the position of comparison 400 is shifted one position to the right for each of the reads that are not variant reads and each of the reads for which the error is classified as replacement. For the first chain 410 classified as missing, the compared location 400 is not moved. It remains at the same G located in the fifth position of the first chain 410. This lack becomes apparent after realigning the differentially shifted chains to the newly compared position 412 for the first chain 410 (i.e., zero) and the other chains (i.e., 1), as shown in the lower portion of fig. 4. This realignment, which results from the error-type-based classification moving the newly compared location 412 by different amounts, keeps the chain in phase, further improving the analysis along the chain. After the newly compared location 412 is moved (or not depending on the type of error), the analysis may repeat, here to identify a mismatch in the fifth chain 414.

Fig. 5 illustrates a technique for identifying insertion errors. As discussed above, the three possible types of errors are substitutions, deletions and insertions. As with the identification of substitution and deletion errors, the identification of insertion errors begins with analyzing base calls in the compared positions 500 to determine a majority of common bases 502, and analyzing base calls in a look-ahead window 504 to identify a look-ahead window common 506. In this example, the majority of the common bases 502 are T. Fifth read 510 is a variant read in that it has an A at position 500 of the comparison instead of a T. The base call with a look-ahead window of 506 is GA. The base calls in the look-ahead window 508 of the fifth read 510 do not match the look-ahead window total 506, and therefore the error type is not classified as a substitution. The base call in position 500 and the first base call in the look-ahead window 508 for a variant read (AT) do not match the look-ahead window consensus 506(GA), so the error type is not classified as missing.

However, the base calls in the look-ahead window 508 of the fifth read 510 match the base calls of the majority of the common bases 502 and all base calls except the final base call (i.e., the w-1 position) of the look-ahead window total 506 (i.e., both are TG). Therefore, the error is classified as inserting a at the 5 th position of the fifth read 510. In the mathematical notation, if Yk[p[k]+1]Matches the majority of common bases 502, then Yk[p[k]+2],...Yk[p[k]+2],.. coincides with the first w-1 coordinate shared by the look-ahead window 506, then Y iskA mismatch in (b) is classified as an insertion. At the position to be comparedAfter the chain after the differential shift of 500 realigns to the position 512 of the new comparison, as shown in the lower part of fig. 5, the insertion error becomes apparent. The compared location 500 is advanced two locations for the read with the insert (here, the fifth read 510). For the other chains, position 500 of the comparison is advanced by one position for reads that are non-variant reads, by one position for reads that have a substitution, and by zero positions for reads that have a miss error.

The examples shown in fig. 3 to 5 each illustrate the analysis of only one type of error. However, the techniques of this disclosure are equally applicable to groups of reads having multiple errors in one compared location. There may also be multiple types of errors across the location of a single comparison, for example, in a total of 20 reads (m-20), three reads may have substitutions, one read may have deletions, and one read may have insertions.

Fig. 6 illustrates a case where the error type cannot be identified. Reads may have base calls in the compared positions that do not match most of the common base calls, and thus are variant reads, but base calls in the compared positions and the look-ahead window may not exhibit any relationship classified as substitution, deletion, or insertion. This is an error that cannot be classified according to the above technique and is therefore referred to as "indeterminate error" 604. Even if other techniques than those described in this disclosure are used to identify error types, errors that fail to match most common base calls that cannot be resolved to a particular type of error may be classified as indeterminate errors.

The base calls at the locations of the uncertainty errors 604 in the fifth reads 602 cannot be classified as substitutions, insertions, or deletions using the techniques shown in FIGS. 3-5. The compared position 600 immediately preceding the location of the indeterminate error 604 is the last position preceding the indeterminate error 604.

One way to handle reads with indeterminate errors is to discard the reads from further processing. Thus, if a read has an error and it cannot be resolved to a single error type, the read will be omitted from further analysis. If the read 602 is discarded, it may be indicated as "inactive trace" because the read 602 does not contribute any information to trace reconstruction. Once the reads 602 are identified as inactive tracks, a consensus output sequence is generated from the remaining active tracks 606. The active trace 606 may be all other reads from the same cluster, or may be less than all other reads if some are discarded due to inclusion of indeterminate errors, low confidence levels, or other reasons.

However, discarding the read entirely also discards useful information that may be contained in portions of the read that are free of indeterminate errors. For some sequencing techniques, such as nanopore sequencing, discarding each read with varying errors can significantly reduce the number of clusters recovered, as this may leave even few reads that cannot perform trace reconstruction.

Another way to handle ambiguity errors is to use bias or resolution in order to push the classification. The bias can be based on an error distribution of a polynucleotide sequencer used to generate the reads. For example, if the polynucleotide sequencer knows that substitution errors are generated much more frequently than deletion or insertion errors, all ambiguous errors can be classified as substitutions. If errors can be identified as one of two possible types of errors, the relative frequencies of those error types of polynucleotide sequencer techniques can be used to select between them. For example, if an error has been identified as a deletion or insertion (but not a substitution), and the polynucleotide sequencer has occurred for 80% substitution error, 15% deletion error, and 5% insertion error, the error may be classified as a deletion error, as deletion errors are more likely to occur than insertion errors in this example. However, this may reduce the reliability of the consensus output sequence, as potentially incorrect base calls may be included in the calculation of the consensus base call.

Additionally or alternatively, quality information of individual base calls can be used to classify ambiguity errors. In one implementation, when errors cannot be resolved to a single error type, all base calls in the compared locations and the look-ahead window with quality information indicating that the base call confidence is less than a threshold level may be omitted from the plurality of common base and look-ahead window commonalities. Thus, the common base call for the relevant position is determined at the most reliable base call of the plurality of reads. Ignoring low confidence base calls may result in the above technique being able to resolve errors into a single error type. However, if the confidence level of most base calls is low, or even if the confidence levels are consistent, then selecting a particular base call based on the confidence levels may not yield accurate results.

Fig. 7 illustrates a technique for skipping a portion of inactive trace 700 containing indeterminate errors to search for a match with active trace 702 at a later location further along inactive trace 700. Inactive tracking 700 may be labeled as inactive tracking due to uncertainty in the determination, as shown in fig. 6. The string of base calls represented by each of the reads in the cluster can be represented as Y1,Y2,...,Ym.. The value of m may vary based on sequencing techniques and depth of coverage. For example, using nanopore sequencing, m may be about 10 to 100. Inactive tracking 700 may be represented as Yj

A consensus output sequence 704 may be generated from portions of active traces 702 and inactive traces 700. Consensus output sequence 704 can be expressed as the best estimate of sequence X of a polynucleotide provided to an oligonucleotide sequencerAs previously discussed, consensus output sequence 704 identifies multiple consensus base calls while accounting for insertions, deletions, and substitutions to maintain alignment of reads relative to one another.

Location 706 is the location in inactive trace 700 that is estimated when the last read becomes inactive. This may be the comparative position 600 shown in fig. 6. This position may be denoted as i*. Then p*[j]Is YjIs the bit of the comparison when the last read becomes inactiveAnd (4) placing. A delay 708 of several positions (e.g., 5 to 10) is introduced in the inactive trace 700 before continuing the evaluation of the base calls in the inactive trace 700. Delay 708 is part of inactivity tracking 700, which will be maintained as inactive regardless of whether there are any potential matches for the region. The last "good" location in inactive tracking 700 is identified at location 706, but it is not known how many locations are affected by the indeterminate error. In sequences that include sudden errors (such as sequences from nanopore sequencing), there are likely to be several positions of mismatch that are adjacent or close to each other, which cannot be classified as an insertion, deletion, or substitution. Introducing delay 708 reduces the likelihood that inactive trace 700 will return to the active state at locations where there is a mismatch.

The schematic representation of inactive tracking 700 at the bottom of fig. 7 shows one illustrative arrangement of how inactive tracking 700 may be evaluated to identify matching sequences. When indeterminate errors are identified, portions of the inactive trace 700 may have been over analyzed and used to develop the consensus output sequence 704. This portion of the inactivity chase 700 is indicated as analysis sequence 710. This is followed by a delay 708, then a delay 708 that is not part of the active trace 700, which has a sequence 712 to be analyzed. Once how far the elapsed delay 708 is identified, the inactive trace 700 may be reactivated and trace reconstruction may continue.

After the delay 708 there is a location 714 of the comparison, which may be denoted as i. To determine whether the compared location 714 is within the delay 708 region, i is compared to i*And (3) comparison: if i-i*≦ delay, then Inactive Trace 700 is kept inactive, but if i-i*>Delayed, then the inactivity track 700 is evaluated at position i as described below.

A consensus look-ahead sequence 724 can be created from activity trace 702 for alignments of reads for which consensus output sequence 704 has not been calculated. By majority-sharing voting on activity trace 702 at the locations of the comparisons, a common look-ahead sequence 724 may be created in the same manner as the look-ahead window described earlier. This technique uses only the majority of the consensus base calls at each position as base calls for the consensus sequence. If there are an equal number of base calls (e.g., 5T and 5C), the linkage is broken arbitrarily. However, unlike the look-ahead window, the common look-ahead sequence 724 is not limited to two or three position windows.

Once the delay 708 area has passed, candidate location 716 is found. Candidate position 716 is a single position in inactive trace 700 that may be aligned with active trace 702 and used to continue determining consensus output sequence 704. Candidate location 716 may be denoted as k. Candidate location 716 may follow delay 708 at compared location 714 or may be located further after the end of delay 708.

If insertions and deletions occur with equal frequency from position 706 to candidate position 716, then the position at which inactive trace 700 can again be aligned with active trace 702 will be at or near compared position 714. The location 714 of the comparison may be denoted as k0. By combining candidate position i and position i*The difference i-i between*And the position of the comparison when the last read becomes inactive. Thus, k0:=p*[j]+(i-i*).. However, the number of insertions and deletions from position 706 to candidate position 716 may not be equal. Thus, search window 718 is used to evaluate multiple possible locations to achieve at k0Matching between subsequences of inactive traces 700 and active traces 702 in the surrounding region.

Search window 718 may be included at k0Preceding backward search window (sbw), over k0And k is the forward search window (sfw)0Itself. Thus, the search window 718 is adjacent k0K is a range of possible positions. The various possibilities for k can be expressed as k e k0-sbw,k0-sbw+1,...,k0-1,k0,k0+1,...,k0+sfw-1,k0+ sfw }. Each likelihood of k may be evaluated to determine if there is a subsequence of inactive tracks 700 that matches active tracks 702. The larger search window 718 resulting from the larger values of sbw and sfw makes it even largerIt is possible to find a match but also increase the likelihood of returning a false positive result. The size of the search window 718 may be determined experimentally. The total length of the search window 718 is sbw + sfw + 1. Thus, if sbw and sfw were both 5 positions long, the length of search window 718 would be 11 positions. sbw and sfw may be the same length as each other or different lengths.

The subsequences of the inactive trace 700 within search window 718 are compared to the compare sequence Z to determine if there is a match. The comparison sequence Z comprisesAnd may include one or both of a backward matching (mb)720 region and a forward matching (mf)722 region.

The backward match 720 region precedes position i and is aligned with a portion of consensus output sequence 704. The length of the backward matching 720 sequence may be determined experimentally and may range from 0 to 10 positions, for example. Thus, the backward matching 720 sequence may be omitted. The backward matching 720 sequence may also overlap the delay 708. At the position represented by delay 708, consensus output sequence 704 is determined by active trace 702, while base calls from inactive trace 700 are not used at the position of delay 708.

The first mb +1 positions of Z areThis therefore represents the same portion of the comparison sequence Z as the consensus output sequence 704 at the evaluated position. Even if the length of mb is 0,is also included in Z.

The forward match 722 sequence extends over position i, so there is no common output sequence 704 at the corresponding position. Thus, the consensus look-ahead sequence 724 sequence from active chase 702 was used for purposes of comparison with sequences in inactive chase 700. The length of the forward match 722 sequence may be, for example, 0 to 10 positions. Therefore, if mf is 0,then the forward matching 722 sequence is not used. While the backward matching 720 and forward matching 722 sequences are compared to the consensus sequence generated from the activity trace 702, each is compared to a different consensus sequence. Backward matching 720 sequence andthe consensus output sequence 704 is compared and the forward match 722 sequence is compared to a consensus look-ahead sequence 724.

The subsequence of inactive trace 700 compared to Z has a length mb + mf +1, which is the same as Z. Thus, matching is performed between two character strings of equal length. Recall that inactive tracking 700 may be represented as YjThus, the subsequence of inactive tracking 700 compared to Z can be represented asSince its location is based on the location of k. In particular, the amount of the solvent to be used,can be defined as Yj[k–mb]、Yj[k–mb+1]、…、Yj[k-1]、Yj[k]、Yj[k+1]、…、Yj[k+mf–1]、Yj[k+mf]。

By using the sum of Z andcan be defined for each k e k0-sbw,k0-sbw+1,...,k0-1,k0,k0+1,...,k0+sfw-1,k0+ sfw to determine if there are any candidate locations 716 representing a match. A match may not exceed a distance threshold (dt) in the edit distance between two strings. If dt is 0, then only an exact match is not considered a match. The number of comparisons may be the same as the length of the search window 718. Thus, if the search window is 12 positions long, a sliding window of 12 positions is moved along the search window 718 and sequences with the corresponding base calls in ZColumns are compared.

If none of the candidate locations 716 within the search window 718 are associated with a match to the corresponding Z subsequenceSubsequence, then inactivity track 700 may be kept inactive. The entire read may be discarded or only a portion of the read preceding the indeterminate error (i.e., the analysis sequence 710) may be used.

If there is an association within the search window that matches the corresponding Z subsequenceCandidate position 716 of the sub-sequence, the position of the comparison used for tracking the reconstruction may be set to the position immediately after candidate position 716. The location of the compared positions may be calculated as pj]: k + 1. Inactive trace 700 may again be designated as an active trace and trace reconstruction including the read may proceed forward from the compared location.

There may be multiple candidate locations 716 (multiple locations of k) where a match is found. If so, one of the candidate locations 716 is selected for use as a position fix to restart the tracking reconstruction. One way is to select the closest k0The position of (a). Alternatively, the distance k0The farthest position may be selected or the selection may be random.

A given read may have more than one indeterminate error and the process of identifying the indeterminate error and then finding the candidate location 716 to restart tracking reconstruction may be repeated multiple times in a single read. Further, a read cluster, which may include tens of sequences, may have multiple reads that include indeterminate errors. Thus, the technique may be used for more than one read (including all reads) in a cluster. Because at multiple locations, the set of reads that make up activity trace 702 may be different at different alignment locations for i different reads in the multiple reads, which may be active or inactive.

Figure 8 is a schematic of several DNA strands 800, 802, 804 indicating the location of alignment anchors. The alignment anchors provide reference positions in addition to the beginning and end of the reads, which can align multiple noisy reads to aid in the construction of a consensus output sequence.

In general, the accuracy of the consensus output sequence is more accurate towards the beginning or end of the read, with a higher confidence that there is a correct alignment with each other in the chain. Without being bound by theory, any errors may accumulate and cause other errors when analyzing subsequent locations along the read. For example, if a deletion error is incorrectly identified as a substitution error, the remaining base calls in the read may be out of phase and negatively impact the accuracy of subsequent common base calls. One way to minimize this effect is to use the first half of the "forward" analysis and the first half of the "reverse" analysis. For example, assume that the length of the read to be analyzed is 200 locations. A left-to-right analysis may be performed that provides a common output sequence for positions 1 through 100. A right-to-left analysis may also be performed which provides a common output for locations 101 to 200. Both analyses can be performed in parallel. The resulting consensus output sequence provides base calls for all positions 1 to 200 by combining left-to-right analysis and right-to-left analysis. This is called a combined consensus output sequence.

Because the combined consensus output sequence is the least reliable from the end of the read, one arrangement is to add at the alignment anchor 806 at the center of the DNA strand 800. This provides for realigning the locations of the reads at the positions least reliable relative to each other. The markers provide "anchors" for common alignments to reset and restart without being affected by previously accumulated errors.

Alignment anchors serve as markers for localization and should be identifiable in each read even in the presence of noise and errors created by sequencing. The alignment anchor may be a relatively short sequence of, for example, 4 to 8 positions. The choice of base sequence in the alignment anchor and the length of the alignment anchor can be arbitrary. Any sequence that is identical on multiple DNA strands that are clustered together can be used as an alignment anchor. Because the alignment anchor uses some number of bases, this comes at the cost of reducing the number of bases in the DNA strand that can be used to store digital information. However, including alignment anchors in the DNA strand may make it possible to use for trace reconstruction to generate accurate consensus output sequences with lower coverage levels, thereby reducing the total amount of sequencing required to recover the information stored in the DNA.

In implementations, sequences of alignment anchors are created to avoid duplicate bases and to avoid internal alignments. Because the alignment anchor is used to align multiple reads, it may be beneficial to reduce the likelihood of a misalignment between the alignment anchor sequences in two different reads. For example, ACGT is a sequence without any repeated bases and without any internal alignment. However, if the first ACG in one read is aligned with the last ACG in a different read, ACGACG can form a partial alignment with itself. For example, if there is an error in the base calls for a position within the alignment anchor, or the read sequence adjacent to the alignment anchor is identical to a portion of the alignment anchor, then a misalignment may occur.

One or more alignment anchors may be included in the polynucleotide sequence of a DNA strand. There are many possible arrangements of aligned anchors within a DNA strand. One arrangement is to place the alignment anchor 806 centered in the middle of the DNA strand 800. Placing the alignment anchor 806 in the center of the DNA strand 800 will of course result in the alignment anchor sequence being located in the middle or near the corresponding read. For example, the alignment anchor 806 may be located between 40% to 60%, 45% to 55%, 49% to 51%, or 50% of the distance from the beginning of the DNA strand 800 to the end of the DNA strand 800.

DNA strands 802 and 804 show examples of multiple aligned anchors. If multiple alignment anchors are used, they can be arranged in various ways. DNA strand 802 illustrates an example of an alignment equidistant to anchors 808, 810, and 812. Thus, the number of positions between the start of the DNA strand 802 and the alignment anchor 808, between the alignment anchor 808 and the alignment anchor 810, between the alignment anchor 810 and the alignment anchor 812, and between the alignment anchor 812 and the end of the DNA strand are all substantially the same. To align multiple reads, the alignment anchor can serve as an additional "end" of the DNA strand 802. Thus, if the DNA strand 802 is 200bp long, instead of creating a common output sequence of 100 positions from right to left and a common output sequence of 100 positions from left to right, the alignment anchors 806, 810, and 812 would divide the DNA strand 802 into four regions, each of about 50 bp. Thus, the sequencing length before the alignment point is reached is shortened from 200bp to about 50bp, and the chance of placing one of the reads as a wrong base call out of phase with the other is also reduced.

In a different arrangement, illustrated by DNA strand 804, the spacing of the alignment anchors may be biased toward the center of the DNA strand. Thus, alignment anchors 814, 816, and 818 near the center of DNA strand 804 are positioned closer together than alignment anchors 820 and 822 that are further from the center. Also, the anchors 820 and 822 may be further from the respective ends of the DNA strand 804 than the anchors. Each alignment marker used together should be unique so that it can be distinguished from other markers. Thus, anchors 814-822 will each have a different sequence. The length of each anchor may also be different.

FIG. 9 uses a schematic of reads to illustrate how partial consensus sequences generated from the read portions partitioned by alignment anchors can be joined together to create a complete consensus output sequence. The reads 900 illustrated in FIG. 9 may be generated, for example, from the DNA strands 800 illustrated in FIG. 8. Alignment anchor 902 may be located in the middle of read 900, approximately 50% of the way between the left and right ends of read 900. Even if the alignment anchor 902 is located exactly 50% along the length of the DNA strand, insertions and/or deletions introduced during sequencing may shift the position of the alignment anchor 902 in the reads, so that it is no longer exactly in the middle. Thus, alignment anchor 902 divides read 900 into two portions of approximately equal length.

Because the sequence of alignment anchor 902 is known (assuming no errors in this portion of read 900), a search for this known sequence (e.g., AGCT) can identify one or more matches in the entire read 900. The placement of alignment anchor 902 is also known, so any matches that are not within a threshold distance from the middle of read 900 can be ignored. Thus, the positioning of alignment anchor 902 can be identified in read 900 and other reads in the same cluster.

If read 900 is 180 positions long and alignment anchor 902 is four positions long and is located exactly in the middle, then the left subsequence of read 900 located to the left of alignment anchor 902 will be 88 positions long and the right subsequence located to the right of alignment anchor 902 will also be 88 positions long. The left-hand subsequence can be analyzed by forward and backward analysis to generate a left-to-right consensus output sequence 904 and a right-to-left consensus output sequence 906. Similarly, the right subsequence can be analyzed to generate consensus output sequences 908 and 910.

As discussed above, only the first half of each of the four consensus output sequences 904-910 may be used to minimize errors introduced by cumulative phase shifts caused by sequencing-induced insertions and deletions. In this implementation, only 44 positions will be analyzed before switching to a common output sequence running in the opposite direction. Alternatively, more than half of each of the respective consensus output sequences may be used, such that there is some degree of overlap near the middle of the sub-sequences. In the overlapping region, base calls from left-to-right and right-to-left consensus output sequences can be used to determine the consensus output base call to be assigned to a position.

The combination of left-to-right consensus output 904 and right-to-left consensus output 906 can be the left half of read 900 to form partial consensus 912. Similarly, consensus output sequences 908 and 910 can be combined in any suitable manner to generate partial consensus sequence 914 for the right half of read 900. Partial consensus 912 can be aligned with partial consensus 914 by reference to alignment anchor 902.

Once aligned, partial consensus 912 and partial consensus 914 can be combined using alignment anchor 902 to create a single combined consensus 916. Joint consensus 916 represents the consensus output of read 900, which was developed from forward and reverse consensus outputs 904-910. This is in contrast to implementations that do not use alignment anchors, where the consensus output sequence of read 900 is created from only one forward and one reverse consensus output sequence.

Because the alignment anchor 902 does not encode numerical information as does the rest of the read 900, the base calls corresponding to the alignment anchor 902 can be deleted to create a consensus output sequence without the alignment anchor 918. The consensus output sequence without alignment anchor 918 represents a base call string that can be decoded to recover the numerical information contained in the read 900.

FIG. 10 shows an illustrative block diagram 1000 of the tracking reconstruction system 102 shown in FIG. 1. Recall that the tracking reconstruction system 102 can be implemented in whole or in part in any of the computing device 112, the polynucleotide sequencer 110, and the server 114. Thus, the tracking reconstruction system 102 may be implemented in a system comprising one or more processing unit(s) 1002 and memory 1004, both of which may be distributed across one or more physical or logical locations. Processing unit(s) 1002 may include any combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a single-core processor, a multi-core processor, an Application Specific Integrated Circuit (ASIC), a programmable circuit such as a Field Programmable Gate Array (FPGA), or the like. In addition to hardware implementations, one or more of the processing unit(s) 1002 can be implemented in software and/or firmware. Software or firmware implementations of the processing unit(s) 1002 can include computer or machine executable instructions written in any suitable programming language to perform the various functions described. Software implementations of the processing unit(s) 1002 can be stored in whole or in part in the memory 1004.

Alternatively or additionally, the functionality of the trace reconstruction system 102 may be performed at least in part by one or more hardware logic components. By way of example, and not limitation, illustrative types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like. Implementations as hardware logic components may be particularly suitable for tracking portions of the reconstitution system 102 that are included as onboard portions of the polynucleotide sequencer 110.

The tracking reconstruction system 102 may include one or more input/output devices 1006, such as a keyboard, pointing device, touch screen, microphone, camera, display, speakers, printer, and so forth.

The memory 1004 of the tracking reconstruction system 102 may include removable, non-removable, local and/or remote memory devices to provide storage of computer readable instructions, data structures, program modules and other data. The memory 1004 may be implemented as a computer-readable medium. Computer-readable media include at least two types of media: computer-readable storage media and communication media. Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.

In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. As defined herein, computer-readable storage media and communication media are mutually exclusive.

The tracking reconstruction system 102 can be connected to one or more polynucleotide sequencers 110 through a direct connection and/or a network connection through the sequence data interface 1008. The direct connection may be implemented as a wired connection, a wireless connection, or both. The network connection may traverse the network 116. The sequence data interface 1008 receives one or more reads from the polynucleotide sequencer 110.

The trace reconstruction system 102 includes a number of modules that may be implemented as instructions stored in the memory 1004 for execution by the processing unit(s) 1002 and/or implemented in whole or in part by one or more hardware logic components or firmware.

The randomization module 1010 randomizes the input digital data prior to encoding the input digital data into DNA with the oligonucleotide synthesizer 104. The randomization module 1010 can create a random, more accurate pseudo-random string from the input digital data by performing an exclusive or (XOR) of the input string and the random string. The random string may be generated using a seeded pseudo-random generator based on a function and a seed. This randomization of the input digital data increases the randomness of the synthesized DNA strand 200, resulting in a noisy read 202 from the polynucleotide sequencer 110 itself with A, G, C and T pseudorandom sequences. Randomness supports coding (i.e., clustering and trace reconstruction).

The clustering module 1012 clusters the subset of the plurality of reads based on a likelihood that the subset of the plurality of reads are derived from the same DNA strand. The data received from the polynucleotide sequencer 110 at the sequence data interface 1008 can comprise a set of reads generated from each DNA strand provided to the polynucleotide sequencer 110. Although errors may exist in many reads, reads from the same DNA strand are generally more similar to each other than reads from another DNA strand. If the set of reads to be analyzed includes reads of different DNA strands, further analysis will be hindered. Thus, clustering may be performed to limit the data for further analysis to a subset of reads that are considered to represent the same DNA strand. Poorly formed clusters may be "poorly" formed by including too much or too little. A cluster that includes too many bad formations is one cluster that groups together reads of more than one chain in a single cluster. A poorly formed cluster that includes deficiencies is one of a plurality of clusters that should be grouped into a single large cluster but divided into a plurality of smaller clusters. The clustering module 1012 can use any suitable clustering technique, such as connectivity-based clustering (e.g., hierarchical clustering), centroid-based clustering (e.g., k-means clustering), distribution-based clustering (e.g., gaussian mixture model), density-based clustering (e.g., density-based spatial clustering with noisy applications (DBSCAN)), and so forth. The tracking reconstruction system 102 can analyze one or more, including all, of the clusters derived from the data output by the polynucleotide sequencer 110. As mentioned above, the number of reads in a cluster may depend on the depth of sequencing. With Nanopore sequencing, there may be approximately 10 to 100 reads per cluster.

The read alignment module 1014 aligns the plurality of reads at the location of the comparison across the plurality of reads. Initially, the left ends of the reads (corresponding to the 5' ends of the DNA strands) can be aligned. This first position in the read may be used as the position for the initial comparison. As the analysis proceeds, the read alignment module 1014 moves the compared locations along each read by a number of locations based on the identified error type.

The read alignment module 1014 advances the compared position by one position for reads having a majority common base call at the compared position, advances the variant read by one position if the error type is classified as replacement, advances the variant read by zero positions if the error type is classified as deletion, and advances the variant read by two positions if the error type is classified as insertion.

The read alignment module 1014 can also generate a "reverse" alignment that begins with reads aligned at the right end (corresponding to the 3' end of the DNA strand). Then, the analysis is performed in the same manner except that the movement to "right" is changed to the movement to left. When analyzed from left-to-right as compared to right-to-left, the common output sequence 204 may be different for the same set of reads.

In general, the accuracy of consensus output sequence 204 is more accurate towards the beginning of reads, where a higher confidence in the alignment is correct. Without being bound by theory, any errors may accumulate and cause other errors when analyzing subsequent locations along the read. For example, if a missing error is incorrectly identified as a substitution error, the remaining base calls in that read may be out of phase and negatively impact the accuracy of subsequent error identification. One way to minimize this effect is to use the first half of the "forward" analysis and the first half of the "reverse" analysis. For example, assume that the length of the read to be analyzed is 100 locations. A left-to-right analysis may be performed that provides a common output sequence 204 for the first 50 positions. A right-to-left analysis may be performed that provides a common output sequence 204 for the last 50 positions. Both analyses can be performed in parallel. The resulting consensus output sequence 204 is a combination of 50 base pairs identified by a left-to-right analysis and 50 base pairs identified by a right-to-left analysis. This is called a combined consensus output sequence.

The alignment anchor module 1016 functions by identifying alignment locations at locations other than read ends when alignment anchor sequences are present. When alignment anchors are present, the alignment anchor module 1016 can function with the read alignment module 1014 to create an alignment of the fragments of the read between one of the ends of the read and the alignment anchor and between the two alignment anchors (if present). The alignment anchor module 1016 can identify alignment anchor sequences present in the reads. The alignment anchor sequence is a predetermined sequence, such as illustrated in fig. 8 and 9. The alignment anchor sequence may be a 3 to 8bp sequence. In some implementations, the alignment anchor sequence can be 4 to 6bp long. The alignment anchor sequence may be any sequence included by design in the DNA strand synthesized by the oligonucleotide synthesizer 104. Thus, some nucleotide bases in a DNA strand are not used to encode numerical information, but are used to insert known sequence-alignment anchor sequences. To reduce the probability of a non-alignment between two aligned anchor sequences, the sequence used as the aligned anchor sequence can be designed without the need to repeat nucleotides so that it does not align with itself. For example, the sequence "AGCT" may be a suitable sequence for the alignment anchor, while the sequences "ACACAC" and "TTGG" will more likely not be aligned.

Once the alignment anchor is identified and the reads are aligned at the alignment anchor, the alignment anchor module 1016 may also divide the reads into a plurality of sub-reads, such as a first sub-read and a second sub-read. If there is only one alignment anchor sequence in the read, a first sub-read can extend from the beginning (e.g., the 5 'end) of the read to the alignment anchor sequence, and a second sub-read can extend from the alignment anchor sequence to the end (e.g., the 3' end) of the read. The alignment anchor module 1016 may do this for each of the reads in the analysis, creating a third sub-read, a fourth sub-read, and so on.

The read alignment module 1014 can then align the sub-reads in a similar manner as aligning the entire read, except that the alignment is based not only on the ends of the reads but also on the alignment anchors that divide the reads. Thus, the read alignment module 1014 can align a first subsequence from the first half of the first read with a third subsequence from the first half of the second read. The second half of the first read represented by the second subsequence can also be aligned with a fourth subsequence representing the second half of the second read. Thus, instead of aligning all of the first and second reads, subsequences of the reads are aligned. Of course, in many implementations, more than two reads will be aligned.

Because the subsequences are shorter than the entire read, the effect of accumulated errors in the generation of the consensus output sequence is reduced. There are a small number of locations where errors may cause one of the reads to be out of phase with the other reads. If errors do accumulate, or a read is out of phase with other reads in the same cluster, the alignment will "reset" at the alignment anchor sequence. The sub-reads may be aligned in the forward and reverse orientations, as described above for the entire read and as shown in FIG. 9. Thus, each alignment of subsequences (e.g., the first, second, third and fourth subsequences from the above example) can be analyzed in the same manner as an alignment of the entire sequence.

Variant read identification module 1018 determines a majority of the common base calls at the compared positions and labels reads having different base calls at the compared positions as variant reads. In some implementations, variant read identification module 1018 can use an error profile associated with the polynucleotide sequencer 110 discussed above to determine a majority of the common base calls. Variant read identification module 1018 can identify or otherwise identify each read that is a variant read for a given compared location. The tag then identifies the read as the read that should be identified for determining the error type. With each movement of the compared locations, the identity of which reads are variant reads and which are not.

The error classification module 1020 classifies the type of error for variant reads as substitution, deletion, or insertion. If the error cannot be uniquely classified, the error classification module 1020 may indicate that the type of error is uncertain or that the type of error is limited to one of two possibilities. The classification of errors by the error classification module 1020 may be based in part on a comparison of common strings of base calls (i.e., non-variant reads) and base calls in variant reads in a look-ahead window of a subset of the plurality of reads having a majority of common base calls at positions of the comparison. In determining the type of error, variant reads other than those analyzed in a given iteration are not used. If the error classification module 1020 classifies the error in the read as an indeterminate error, the variant read identification module 1018 may mark the read as an inactive trace.

As described above in fig. 3, the error type is classified as a substitution based on the common string of base calls in the look-ahead window that matches the base call string in the variant read after the compared position. As described above in fig. 4, the error type is classified as a deletion based on the consensus of the look-ahead window matching the compared positions in the variant reads and the base calls at one or more following positions.

As described above in fig. 5, the error type is classified as insertion. It is classified as an insertion because (1) the base calls in variant reads after the position of comparison (i.e., the first base call in the look-ahead window of variant reads) match the majority of the common base calls, and (2) the common string of base calls in the look-ahead window that match the string of base calls in variant reads is equal in length to the look-ahead window and begins two positions after the position of comparison. As described above in fig. 6, if an error type cannot be classified as a substitution, a deletion, or an insertion, it is classified as an indeterminate error.

Consensus output sequence generator 1022 determines consensus output sequence 204 based at least in part on the plurality of consensus bases and the type of error identified along the aligned reads. In view of the aligned alignment of reads due to error type classification, each position in consensus output sequence 204 is a plurality of consensus bases at that position. Error distribution of the polynucleotide sequencer 110 and/or quality information of individual base calls can also be used to determine the consensus output sequence 204 by affecting the identity of multiple consensus bases and error types.

Consensus output sequence generator 1022 can also generate and assemble consensus output sequences for alignment of subsequences defined by the alignment anchor sequences. Assembling a consensus output sequence for two or more subsequences of the read cluster can include generating a forward consensus sequence and a reverse consensus sequence for each of the subsequences shown in FIG. 9. The forward consensus sequence and the reverse consensus output sequence may be combined as described above to create multiple combined consensus sequences for each read cluster. The sequence of the alignment anchor(s) is used to append the combined consensus sequence together. For example, a first combined consensus sequence representing the first half of a read and a second combined consensus sequence representing the second half of a read can be appended to each other by alignment at the alignment anchor sequences. If multiple alignment anchors are present, multiple combined consensus sequences can be appended in the correct order by alignment at the respective alignment anchor sequences. After appending two or more combined consensus sequences together, the alignment anchor sequence(s) can be deleted to leave a string of combined consensus sequences as reads without alignment anchor sequences. A combined consensus sequence assembled from multiple subsequences may be more accurate than a combined consensus sequence made from the entire read length without alignment anchors, because the number of positions traversed by any forward or reverse consensus sequence (and the number of read opportunities that are out of phase with each other) is fewer. Reducing the number of cumulative errors may also reduce the coverage or sequencing depth required to form clusters that can produce accurate consensus sequences.

The error correction module 1026 may apply additional error correction techniques to decode the common output sequence 204. In some implementations, error correction module 1026 decodes common output sequence 204 using a non-binary error correction code based on the redundant data encoded into the chain. An example of such error correction is Reed-Solomon error correction. In an example implementation, a Reed-Solomon outer code may be added to the starting binary data and, when stored, ultimately distributed over many DNA strands (e.g., 10,000 and 100,000 strands). If the threshold number of errors is exceeded, it is possible that Reed-Solomon error correction may not be able to decode the common output sequence 204. If this happens, the tracking reconstruction may be repeated with one of the changed parameters. Changing the parameters may result in different common output sequences 204 that Reed-Solomon error correction is able to decode. The length (w) of the look-ahead window is one parameter that can be changed. A look-ahead window of length 3 may be used instead of a look-ahead window of length 2 (or vice versa). By making the threshold wider or tighter, the cut-off threshold used to label reads as inactive tracks, the length of the delay, acceptance of base calls based on quality information, and the error type classification of bias ambiguity errors can be varied. After changing one or more parameters, it may be determined whether the consensus output sequence 204 is different from the previous consensus output sequence 204, and if so, Reed-Solomon error correction may be applied to the new consensus output sequence 204 to see if it is able to decode the sequence.

The conversion module 1028 converts the common output sequence into binary data 208 representing at least a portion of the digital file. The conversion from a series of base calls to a string of binary data 208 is performed by reversing the operation originally used to encode the binary data 208 as a series of base calls. These operations are known to the entity operating the DNA repository 106. In some implementations, the converter 206 introduced in fig. 2 may include the same functionality as the conversion module 1028 and the error correction module 1026, as well as possibly other modules. Binary data 208 may be used in the same manner as any other type of binary data. If various error correction techniques are sufficient, the binary data 208 will represent a perfect reproduction of the original binary data.

Illustrative Process

For ease of understanding, the processes discussed in this disclosure are delineated as separate operations represented as separate blocks. However, these individually delineated operations should not be construed as necessarily order dependent in their performance. The order in which processes are described is not intended to be construed as a limitation, and any number of the described process blocks can be combined in any order to implement the process, or an alternate process. Also, one or more of the operations provided may also be modified or omitted.

Fig. 11A and 11B illustrate a process 1100 for correcting insertion, deletion, and substitution errors in sequence data generated by a polynucleotide sequencer. The process 1100 may be implemented by the tracking reconstruction system 102 shown in fig. 1, 2, and 10.

In 1102, binary data to be encoded into one or more DNA strands is reversibly randomized. This randomization occurs prior to DNA strand synthesis and, in some implementations, is present in all reads. The reads may be received via a sequence data interface 1008 as shown in figure 10. The reads may be randomized by the randomization module 1010 shown in fig. 10.

At 1104, the sequence data generated by the polynucleotide sequencer is clustered using a clustering technique. Any suitable clustering technique may be used, and one of ordinary skill in the art will be able to identify a suitable clustering technique. Clustering creates groups of reads derived from the same source DNA strand. Clustering may be performed by the clustering module 1012 shown in fig. 10. In some implementations, performing clustering on the randomized data improves the ability of the clustering technique to accurately separate multiple reads into different groups. A poorly formed cluster is one that contains reads derived from different DNA strands. Techniques such as discarding reads that deviate from the common sequence by more than a threshold amount may prevent poorly formed clusters from affecting the final common output sequence. However, discarding the entire read loses any useful information that may be obtained from the read.

In 1106, a plurality of reads classified as representing DNA strands are received for further analysis. Based on the clustering performed in 1104, the reads may be classified as representing the same DNA strand. Multiple reads can also be classified as representing the same DNA strand due to the sequencing technique that uses the input of a polynucleotide sequencer as only a single DNA strand (or substantially identical copies produced by PCR). In some implementations, multiple reads may be received via the sequence data interface 1008 shown in fig. 10. In other implementations, the plurality of reads may be received after the clustering performed by the clustering module 1012.

In 1108, the location of the comparison across multiple reads is identified. The compared position may be similar to the compared position 300 shown in fig. 3, the compared position 400 shown in fig. 4, the compared position 500 shown in fig. 5, or the compared position 600 shown in fig. 6. In an implementation, the location of the comparison may be identified by the read alignment module 1014 shown in FIG. 10.

At 1110, the majority of the consensus base calls at the compared positions are determined by identifying the most common base call at that position. The link can be arbitrarily broken. As described above, the most common base calls can be identified in part by considering the quality information of the base calls present at the compared positions. In an implementation, the majority of the consensus base calls can be determined by variant read identification module 1018 shown in fig. 10.

In 1112, it is determined whether the base call at the compared position is the same as the majority of the common base calls. If so, then the read being analyzed has the expected base call at that location, not a variant read relative to that location, and the process 1100 follows the "YES" path to 1114.

At 1114, process 1100 advances to the next location along the read. However, if the base call at the compared position does not match the majority of the consensus base calls, the process 1100 follows the "no" path from 1112 to 1116.

In 1116, reads from a plurality of reads in which the base calls in the compared positions differ from the majority of the common base calls are identified as variant reads. In an implementation, this identification may be performed by variant read identification module 1018 shown in FIG. 10.

Moving to FIG. 11B, at 1118, the common string of base calls in the look-ahead window adjacent to the compared position is compared to the base calls in the variant reads. The common string of base calls in the look-ahead window may be limited to base calls from a subset of reads having a majority of common base calls at the positions compared. For example, in a collection of 10 or 20 reads, there may be more than one variant read because the base calls at the compared positions do not match the majority of the common base calls. When the comparison is made for one of the variant reads, the base calls in the look-ahead window of the other variant reads are not considered. This is because other variant reads may have deletion or insertion errors, which will cause the base calls in the look-ahead window to be out of phase and possibly incorrect. Based at least in part on the comparison, a type of error may be determined for the variant reads. In an implementation, this comparison may be performed by the error classification module 1020 shown in fig. 10. In one implementation, the look-ahead window may be 2 to 4 positions in length.

In 1120, the error type of the variant reads is determined to be a substitution based on the common string of base calls in the look-ahead window being the same as the string of base calls in the look-ahead window after the compared position of the variant reads. Thus, the look-ahead window of variant reads matches the look-ahead window of non-variant reads. This relationship is shown, for example, in fig. 3.

In 1122, the position of the comparison of variant reads is advanced by one position.

In an alternative, in 1124, the common string based on the base calls in the look-ahead window is the same as the base call string in the variant reads that include the base call in the compared position and the adjacent base call, and the error type of the variant read is determined to be missing. The length of the base call string in the variant reads is equal in length to the length of the look-ahead window. Thus, for example, if the look-ahead window is three positions in length, the base call string in the variant read includes the base call in the compared position and the next two base calls. This relationship is shown, for example, in fig. 4.

In 1126, the compared positions of variant reads are advanced by zero positions. Because there are deletions, not advancing the position of the comparison of variant reads will re-compare the chains, so that the chains will be in phase for subsequent analysis.

As yet another alternative, in 1128, the type of error for the variant read is determined as an insertion based on the base calls that match the two particular patterns. First, the base calls in variant reads after the position of comparison are identical to most common base calls, and second, the common string of base calls in the look-ahead window is identical to the base call string in the variant read sequence. The base call string in the variant read sequence is equal in length to the look-ahead window, and the starting position of the base call string is two positions after the compared position. This relationship is shown, for example, in fig. 5.

In 1130, the compared positions of variant reads are advanced by two positions. The position of the comparison is advanced by one position to account for the insertion and by a second position, since the position of the comparison is advanced by one position for all non-variant chains. This maintains alignment between the chains for subsequent analysis.

In each of 1120, 1124, and 1128, a type of error for the variant read at the location of interest can be determined based at least in part on an error profile associated with the polynucleotide sequencer. Considering the error distribution of a polynucleotide sequencer can alter one or both of the majority of common base calls and the determination of the common string of base calls in the look-ahead window. In an implementation, the consideration of the error distribution may be performed by the common output sequence generator 1024.

In 1132, it is determined whether the variant read is less than a threshold reliability level. The threshold level may be the number of errors in the variant reads; the number of errors in variants that cannot be uniquely classified; a minimum, median, or pattern of confidence levels for base calls in variant reads; or other factor(s). The threshold number may be a number of locations in the variant reads from one location to the total number of locations. The threshold number may also be a percentage from 1% to 100%. If the variant read is less than the threshold reliability level, process 1100 proceeds along the YES path to 1134.

In 1134, variant reads are omitted and a single consensus output sequence from multiple reads is used without the variant reads. After omitting variant reads, process 1100 proceeds to 1136. Alternatively, if the variant read is not less than the threshold reliability level (i.e., deemed reliable), the variant read is used for further analysis and process 1100 proceeds along the no path to 1136. If a variant read cannot be classified as an insertion, deletion or substitution due to the presence of indeterminate errors, the variant read is classified as such, rather than being discarded, which can be further analyzed as shown in FIG. 13 and/or FIG. 14 below.

At 1136, a determination is made as to whether there are additional unanalyzed locations in the read. Thus, it is determined whether the "end" of the read has been reached, and the majority of the common base calls have been identified for the position of the read. The end of a read may also be identified by comparing anchors if used to break the read into multiple smaller segments. If the analysis has not reached the end, process 1100 proceeds along the YES path to 1138.

In 1138, the positions of the subset of the plurality of reads having the most common base calls at the compared positions (i.e., the non-variant reads) are advanced by one. The position of the new comparison of variant reads is advanced by an amount in 1122, 1126, or 1130 based on the identified type of error. The new compared locations may be similar to the new compared locations 312, 412, 512, and 608, as shown in fig. 3-6. The process 1100 then returns to 1108 and analysis continues.

At 1136, if there are no unanalyzed locations in the read, process 1100 proceeds along the "no" path to 1140.

In 1140, a single consensus output sequence is determined based in part on the majority of the consensus base calls and the type of error. The single common output sequence may be determined by common output sequence generator 1024 shown in fig. 10.

Fig. 12A and 12B illustrate a process 1200 for recovering binary data encoded in a synthetic DNA strand by accounting for insertion, deletion, and/or substitution errors. The process 1200 may be implemented by the tracking reconstruction system 102 shown in fig. 1, 2, and 10.

In 1202, binary data to be encoded into DNA is reversibly randomized by taking the exclusive OR (XOR) of the binary data and a random sequence generated by a seed sum function. This operation affects the DNA strands which, when read, produce reads that also have randomized properties. In an implementation, the randomization may be performed by the randomization module 1010 shown in fig. 10.

In 1204, a plurality of reads is received from a polynucleotide sequencer. In an implementation, multiple reads may be received by the sequence data interface 1008 shown in FIG. 10.

In 1206, the plurality of reads are clustered into a plurality of clusters by sequence similarity. Similar sequences are likely to result from sequencing of the same DNA strand (the sequences are not identical due to errors introduced by the polynucleotide sequencer). Thus, one cluster should represent all reads from the same DNA strand. Recall that a polynucleotide sequencer can sequence multiple different DNA strands simultaneously, and thus the raw output of sequence data from a polynucleotide sequencer can include reads corresponding to multiple different DNA strands. In an implementation, clustering may be performed by the clustering module 1012 shown in FIG. 10.

In 1208, a cluster is selected from the plurality of clusters. The cluster contains a clustered set of reads. If the clustering is accurate, all reads in the clustered set of reads are from sequencing of the same DNA strand. At this point, clusters are identified only by the nature of having reads clustered together before additional analysis is performed. Thus, in some implementations, each cluster is analyzed in turn, and the order in which individual clusters are selected may be arbitrary. Multiple ones of the clusters may also be analyzed in parallel. In an implementation, the selection of the clusters may be performed by the clustering module 1012 shown in FIG. 10. The analysis may continue until trace reconstruction is performed on all clusters from the plurality of clusters.

In 1210, the cluster sets of reads are aligned at the locations of the comparisons of the cluster sets across the reads. In an implementation, the location of the comparison may be the first location shared between the cluster sets of reads. Thus, the original alignment may define the location of the first comparison. The first position may be the leftmost position (corresponding to the 5 'end) or alternatively may be the rightmost position (corresponding to the 3' end). In an implementation, the alignment may be performed by the read alignment module 1014 shown in FIG. 10.

At 1212, the majority of the consensus base calls at the positions of the first comparison are determined. The majority of the consensus base calls are based at least in part on the most common base calls across the collection of clusters of reads. The majority of consensus base calls may also be based in part on an error profile associated with the polynucleotide sequencer. That is, base calls can be weighted based on the associated error distribution (e.g., more of certain base calls are more valuable and fewer of certain base calls have less impact on determining the majority of common base calls).

At 1214, variant reads from the clustered set of reads are identified. Variant reads have a base call at the position of comparison that is different from the majority of common base calls. In an implementation, variant reads may be identified by variant read identification module 1018 shown in FIG. 10.

Moving now to FIG. 12B, in 1216, the common strings of base calls in the look-ahead window are identified. The consensus string is based on base calls of a subset of the clustered set of reads that have most of the consensus base calls (i.e., non-variant reads) at the positions compared. The look-ahead window is adjacent to the location of the comparison. In some implementations, the look-ahead window may be two or three locations long.

In 1218, the type of error for the variant reads at the location of the comparison is determined to be a substitution based at least in part on the base call in a look-ahead window of variant reads that match the common string of base calls. An example of this relationship is shown in fig. 3. In an implementation, the error type may be determined by the error classification module 1020.

At 1220, the compared positions of the variant chains are moved forward by one position.

In 1222, based at least in part on the series of base calls in the variant read (including the base call at the compared position) and one or more base calls following the compared position that match the common string of base calls, it is determined that the error type of the variant read at the compared position is a deletion. An example of this relationship is shown in fig. 4. In an implementation, the error type may be determined by the error classification module 1020.

In 1224, the compared positions of the variant chain are moved forward by zero positions.

In 1226, the type of error for the variant read at the position of the comparison is determined to be an insertion. Insertion errors are identified based at least in part on base calls in variant reads that follow the compared position that matches the majority of the consensus base calls and a series of base calls in variant reads that begin two positions after the compared position that matches the consensus string of base calls. An example of this relationship is shown in fig. 5. In an implementation, the error type may be determined by the error classification module 1020.

In 1228, the compared positions of the variant chain are moved forward by two positions.

In 1230, the compared locations of the reads in the subset of the clustered set of reads (i.e., non-variant reads) are forward advanced by one location.

In 1232, a single consensus output sequence is determined by a clustered set of reads. In an implementation, a single common output sequence is generated by common output sequence generator 1024 shown in FIG. 10.

In 1234, the single common output sequence is converted to binary data. This may be the final manipulation of information derived from the DNA strand before being used again as a digital computer file. In an implementation, the change from sequence data to binary data may be performed by the translation module 1028 shown in FIG. 10.

FIG. 13 illustrates a process 1300 for identifying a portion of a read that contains a bursty error and generating a consensus output sequence using the portion of the read other than the portion containing the bursty error. As an alternative to discarding the entire read, process 1300 may be performed along with process 1100 or 1200. As described above, reads may be generated from DNA strands storing digital data.

In 1302, a start of a portion of a read containing a bursty error is identified. The start of the portion of the read containing the burst error may be identified by detecting the error itself. For example, in reads that do not match the common output sequence (variant reads), locations that are not classified as insertions, deletions, or substitutions may be interpreted as the beginning of the portion of the read that contains the bursty error. An example of identifying uncertainty errors is shown in fig. 6.

In 1304, the end of the position fix containing the bursty error is identified. Identifying the end of a location containing a bursty error will find a location in the read that exceeds the location of the bursty error, so that further analysis of the read down will find a sequence that can be aligned with other reads in the cluster. The end of the burst error is identified by a candidate location in the read. The candidate locations flank one or both of the backward matching region (mb) or the forward matching region (mf). As described above, the backward matching region matches a consensus sequence generated from other reads in the plurality of reads in the same cluster, and the forward matching sequence matches a sequence generated by majority voting from sequences of other reads in the cluster.

At 1306, the portion of the read that contains the bursty error is omitted from the generation of the consensus output sequence. Thus, the corresponding position of the read between the position identified as the start of the bursty error and the position identified as the end of the bursty error does not contribute to determining the common output sequence. For these positions in the consensus output sequence, base calls are determined based on some or all of the other reads in the cluster using techniques such as those illustrated in fig. 11 and 12.

At 1308, a consensus output sequence is generated using portions of the read from either side of the portion containing the bursty error. Thus, even if a portion of the reads are not used to generate the common output sequence, other portions of the reads are used. By comparing the values of the multiple reads at the locations of the comparison, while aligning the multiple reads relative to one another based on insertions, deletions, and substitutions, the portions of the reads on either side of the portion with the bursty error are used to generate a consensus output sequence. The common output sequence may be generated by common output sequence generator 1024.

FIG. 14 illustrates a process 1400 for determining whether a read with an indeterminate error can "bring back" by finding a matching sequence of reads down. As an alternative to discarding the entire read if the read includes an indeterminate error, process 1400 may be performed along with process 1100 or 1200. The indeterminate error may be a bursty error that is not classified as any of an insertion, deletion, or substitution.

At 1402, an indeterminate error is identified in a read that is one of a plurality of reads of a polynucleotide sequencer. The plurality of reads may be a group of reads in the same cluster as formed, for example, by the clustering module 1012. The indeterminate error is the error relative to the common output sequence of the multiple reads. The common output sequence may be generated by common output sequence generator 1024.

The uncertainty error may be identified as shown in fig. 6. Additionally, the identification of errors as indeterminate errors may be performed by the error classification module 1020. The indeterminate error is located at a first position in the read. For example, the indeterminate error may be located at position 100 in a read having a total length of 200 base pairs. Thus, a common output sequence may be identified for positions 1 through 99 before an indeterminate error is determined to exist at position 100.

At 1404, a search window is defined that includes a second location in the read that is located at least one distance delayed beyond the first location. In an implementation, the second position is represented by k0The position of the representation. In an implementation, the length of the delay may be between 4 and 10 positions. For example, if the delay distance is 8 positions, then the search window is the window that includes position 108 in this example read. The search window may contain positions whose base calls are error free.

Limiting the search to the search window rather than the entire remainder of the read focuses the search for matching sequences on the near vicinity where matching sequences (if present) should be found. Shorter search windows reduce the likelihood that a matching sequence will be identified, but longer search windows increase the probability of false positive matches. In an implementation, the length of the search window may be between 9 and 20 positions. As described above, the search window may include two parts: for k0Backward search window of previous positionPort (bsw) and for k0Forward search window (fsw) of the following position. bsw and fsw may be the same length as each other. For example, both bsw and fsw may be about 5 to 8 positions long. Thus, the search window includes bsw, k0And fsw.

At 1406, an edit distance is calculated for the subsequence in the search window determined from the comparison sequence. The comparison sequence may comprise a backward matching sequence (mb), which is a common output sequence at corresponding positions in some or all of the other reads in the plurality of sequence reads. The length of the backward matching sequence may be about 1 to 11 positions.

A consensus output sequence of multiple sequence reads may be determined by majority voting to identify the considered positions, taking into account insertion, deletion, and substitution errors, and proceeding sequentially from the currently considered base call position to an adjacent base call position. Thus, all other reads from the same cluster (or all other reads that also have no indeterminate error at that location) may be used to generate a common output sequence as described above.

The comparison sequence may also include a forward matching sequence (mf). As described above, the forward matching sequence is adjacent to the backward matching sequence and is positioned beyond the position of the backward matching sequence (i.e., away from the position of the indeterminate error). The forward match sequence corresponds to a portion of the aligned reads for which a consensus output sequence has not been determined. Because the consensus output sequence has not been determined for the positions included in the forward matching sequence, the majority consensus vote from base calls of other reads in the cluster is used to determine whether there is a match. Most consensus voting simply identifies the most frequent base calls at each position without attempting to account for errors such as insertions, deletions, and substitutions. If a forward matching sequence is included, the length of the forward matching sequence is 1 to 10 positions.

In an implementation, a sliding window may be moved over a search window, and sequences within the sliding window may be used as subsequences to be compared to a common output sequence. The edit distance may be calculated for each of the positions of the sliding window. Depending on its position, the sliding window may include relatively more or relatively fewer backward matching sequences and forward matching sequences.

In 1408, it is determined whether a match is found between any of the subsequences at the corresponding position and the common output sequence. A match is identified by the edit distance being less than a threshold. For example, if the threshold is 0, then only a perfect match is considered a match. If the threshold for edit distance is 1, then two sequences with a single base call can be classified as a match.

If a match is not found, process 1400 proceeds along the NO path to 1410 and the read is deleted from consideration. The lack of a match indicates that there are no subsequences within the search window that match the common output sequence/majority of the common votes. Thus, the read may have a large number of errors, and the most accurate common output sequence may be obtained by ignoring the read.

However, if a match is found, process 1400 proceeds along the YES path to 1412. In 1412, it is determined whether multiple matches are found. Within the search window, there may be more than one subsequence matching the corresponding base calls of the common output sequence. If this is the case, process 1400 proceeds along the YES path to 1414.

In 1414, one of the matching subsequences is selected. Among the two or more matching subsequences, one subsequence may be selected by any of a variety of techniques, such as random selection, selection of the one of the subsequences closest to the beginning of the read (e.g., the 5' -end), selection of the one of the subsequences furthest from the manufacture of the read, or by another criterion.

In 1416, individual sub-sequences in the search window having an edit distance less than a threshold are identified. This is a matching subsequence, indicating that after the localization of the indeterminate error, there are portions of the reads that can be aligned again with other reads in the cluster.

In 1418, candidate locations within the matching subsequence are selected. The length of the matching subsequence may be about 5 to 15 positions, and one of these positions is selected to be used as a candidate position. The candidate position may be the same as the candidate position (k) shown in fig. 7. In an implementation, the candidate position may be the position in the backward matching sequence that is farthest from the position where the indeterminate error starts. Thus, if there is no forward matching sequence, the candidate position may be a position within the matching subsequence at the end of the matching subsequence. If a forward matching sequence is present, the candidate position may be a position in the backward matching sequence that is immediately adjacent to the forward matching sequence.

In 1420, the compared position is set to one position beyond the candidate position. The compared position may be the same as the compared position 600 shown in fig. 6. Thus, the locations and candidate positions of the matching subsequences are used to determine how to realign the read with other reads after an indeterminate error. The position of the comparison is located at a third position in the read, which position can be indicated as being located at k +1 using the notation introduced above.

At 1422, a common output sequence from the multiple reads is determined at a third location. The location of this comparison is used to compare reads in the cluster to other reads to determine a consensus output sequence as described above. The analysis of the reads may continue until the end of the read is reached, and any variations in the reads from the output sequence may be handled as described earlier in this disclosure.

Many of the parameters used to identify matching subsequences and ultimately determine the position of the comparison (such as delay, bsw, fsw, mb, and mf) have lengths that can vary. The length of a given set used to analyze sequence data can be determined experimentally by testing different values for each of delay, search window, mb, and mf. Each may vary over a range of values, and different combinations may be tested.

For each combination of lengths, the number of clusters that can be generated from reads produced by the polynucleotide sequencer is determined. Generating a large number of clusters indicates that more reads can be used, even with variable errors, and that more data output by the polynucleotide sequencer helps generate a consensus output sequence. Thus, the combination of values that results in the maximum number of read clusters from the plurality of sequence reads may be used as the length of the parameter.

Examples of the invention

The technology described in this disclosure was tested on three different files generated by nanopore sequencing, with sizes of 32kB, 115kB, and 1500 kB. Each of these files contains digital information encoded in the sequence of DNA strands. This is not synthetic data, but an actual computer file encoded in DNA. Thus, the qualitative testing of this technique is, in part, a test of the ability to successfully recover a file from a DNA strand. The quantitative measure is the number of clusters correctly recovered from the output of the polynucleotide synthesizer and the depth of coverage of the sequencing.

This novel technique of ignoring portions of a read that contain indeterminate errors and using portions that do not contain indeterminate errors ("the new technique") compares with earlier techniques that attempted to classify errors as insertions, deletions, or substitutions and discard the entire read from further consideration if the error could not be classified as one of these three types ("the old technique"). For each of the two smaller files, the new technique yields better results for recovery and coverage measurement by the cluster.

The number of correctly recovered clusters indicates the number of unique DNA strands for which the analysis can generate a consensus sequence. The total number of clusters formed includes clusters that do not produce consensus sequences. Thus, restoring a greater number of clusters indicates that the sequence is restored from a greater number of DNA strands. The comparison is presented in table 1 below.

TABLE 1. Cluster recovery comparison between two different techniques for trace reconstruction from multiple noise reads.

Thus, the new technique correctly restores 2% to 4% more clusters. The largest file of 1500kB cannot be decoded using the old technique (and therefore there are no compared cluster recovery values), but has been successfully decoded using the new technique. By trying various values for delay, search window forward (swf), search window backward (swb), backward match (mb), and forward match (mf), approximately 5000 different combinations of parameters are tested to see which combination would generate the maximum number of correct clusters. For all test results shown in table 1, the distance threshold (dt) is set to zero, so an exact match is required. swf and swb are set to the same values as indicated in the sw (f/b) column. The parameters that yield the maximum number of correct clusters for the 32kB file and the 115kB file are shown in table 2 below. Four different parameter combinations for successfully decoding a 1500kB file are also shown in table 2.

File size (kilobyte) Delay Sw(f/b) mb mf
32 8 8 7 0
115 6 5 1 8
1500 5 5 5 5
1500 8 8 5 5
1500 8 8 8 0
1500 8 8 0 8

Table 2 comparison of parameter values for the new technique to successfully decode the file and maximize the number of clusters correctly recovered.

The delay length is 5 to 8 for the file under test, and the search window (forward and backward) is also 5 to 8. As indicated by the example of length 0, the forward or backward matching regions may be omitted. However, for all the examples shown in table 2, the combined length of the backward matching region and the forward matching region is 7 to 10.

In addition, the new technology can recover files with lower coverage levels. For a 32kB file, it was successfully decoded with 22 × coverage using the new technique, but the old technique failed to decode it with 38 × coverage. The 115kB file was successfully decoded with 27 x coverage, but 32 x coverage was insufficient using the old technique. For a large file of 1500kB, it was successfully decoded with 27.2 × coverage. Tests performed with random subsamples of 1500kB files determined that it could be recovered with coverage as low as 23.1 x. With this new technique, all test samples can be restored with coverage of 27 x or less, whereas old techniques require coverage depths well in excess of 30 x when a file can be successfully restored. The reduced coverage allows the decoding of digital information stored in DNA with less sequencing and less expense than higher coverage levels.

Illustrative embodiments

The following clauses describe a number of possible embodiments for implementing the features described in this disclosure. The various embodiments described herein are not limiting and are not every feature from any given embodiment that is claimed to be present in another embodiment. Any two or more embodiments may be combined together unless the context clearly dictates otherwise. As used herein, in this document, "or" means and/or. For example, "A or B" refers to A without B, B without A, or A and B. As used herein, "comprising" means including all of the listed features, and may include the addition of other features not listed. "consisting essentially of means including the listed features as well as those additional features that do not materially affect the basic and novel characteristics of the listed features. "consisting of …" merely means the listed features and excludes any features that are not listed.

Clause 1. a method of generating a consensus sequence from a plurality of reads of a deoxyribonucleic acid (DNA) strand storing digital data, the plurality of reads generated with less than 30 x coverage by a sequencing technique that introduces a sudden error into a read of the plurality of reads, the method comprising: omitting a portion of the reads that contain a burst error from the generation of the common output sequence; and generating a consensus output sequence using portions of the read from either side of the portion of the read containing the bursty error.

Clause 2. the method of clause 1, wherein the sequencing technique is nanopore sequencing.

Clause 3. the method of any one of clauses 1 to 2, wherein the plurality of reads are generated with a coverage of less than 25 x.

Clause 4. the method of any one of clauses 1 to 3, further comprising: the beginning of the portion of the read that contains the bursty error is identified by identifying a location in the read that does not match the common sequence and is not classified as an insertion, deletion, or substitution.

Clause 5. the method of any one of clauses 1 to 4, further comprising: identifying an end of a portion of the read that contains the bursty error by identifying a location in the read that is flanked by at least one of: a backward matching region that matches a consensus sequence, or a forward matching region that matches a sequence generated by majority voting from sequences of at least two other reads of the plurality of reads.

Clause 6. the method of any one of clauses 1 to 5, wherein the consensus sequence is generated by: the values for the multiple reads at the compared locations are compared while the multiple reads are aligned to each other based on insertions, deletions, and substitutions.

Clause 7. a computer-readable medium encoding instructions that, when executed by a processing unit, cause a computing device to perform the method of any of clauses 1-6.

Clause 8. a system comprising a processing unit configured to implement the method of any of clauses 1-6 and a memory.

Clause 9. a method, comprising: identifying an indeterminate error in a read of a plurality of sequence reads from a polynucleotide sequencer relative to a consensus output sequence of the plurality of sequence reads, the indeterminate error being located at a first position; defining a search window comprising a second location in the read, the second location being located at least one distance delayed beyond the first location; calculating an edit distance for the subsequence in the search window and a comparison sequence, the comparison sequence including a backward matching sequence that is a common output sequence of corresponding positions in at least two of the plurality of sequence reads; identifying a subsequence in the search window having an edit distance less than a threshold; selecting candidate locations within the subsequence based on the length of the backward matching sequence; setting a position of the comparison with at least two other reads from the plurality of sequence reads as a third position in the read, the third position being one position beyond the candidate position; and determining a consensus output sequence from the plurality of sequence reads at the third position.

Clause 10. the method of any one of clauses 9, wherein the indeterminate error is a sudden error that is not classified as any one of an insertion, a deletion, or a substitution.

Clause 11. the method of any one of clauses 9 to 10, wherein the consensus output sequence of the plurality of sequence reads is determined by: the majority vote for the considered position is identified while accounting for insertion, deletion, and substitution errors, and the consensus output sequence of the plurality of sequence reads proceeds sequentially from the currently considered base call position to an adjacent base call position.

Clause 12. the method of any one of clauses 9 to 11, further comprising: a second sub-sequence in the search window having an edit distance less than a threshold is identified and the sub-sequence or one of the second sub-sequences that is closest to the candidate position is selected.

Clause 13. the method of any one of clauses 9 to 12, wherein the threshold value for the edit distance is 0.

Clause 14. the method of any one of clauses 9 to 13, wherein comparing the sequences further comprises a forward matched sequence that is a majority consensus vote for a base call from at least two other reads from the plurality of sequence reads.

Clause 15. the method of clause 14, wherein the length of the delay is between 4 and 10 positions, the length of the search window is between 9 and 20 positions, the length of the backward matching sequence is between 1 and 11 positions, and the length of the forward matching sequence is between 1 and 10 positions.

Clause 16. the method of clause 15, wherein the sum of the length of the backward matching sequence and the length of the forward matching sequence is between 5 and 15 positions.

Clause 17. the method of clause 15 or 16, wherein the length of the delay, the length of the search window, the length of the backward matching sequence, and the length of the forward matching sequence are each experimentally determined by: different values for each length are tested and a combination of values is selected that results in the maximum number of read clusters being recovered from the plurality of sequence reads.

Clause 18. a computer-readable medium encoding instructions that, when executed by a processing unit, cause a computing device to perform the method of any of clauses 9-17.

Clause 19. a system comprising a processing unit configured to implement the method of any of clauses 9 to 17 and a memory.

Clause 20. a system for alignment of sequence reads, comprising: one or more processing units; a memory coupled to the one or more processing units; a comparison anchor module stored in the memory and implemented by the one or more processing units to: identifying an alignment anchor sequence having a predetermined sequence, the predetermined sequence being present in a first sequence read and a second sequence read, dividing the first sequence read into a first sub-read and a second sub-read, the first sub-read extending from a beginning of the first sequence read to the alignment anchor sequence, the second sub-read extending from the alignment anchor sequence to an end of the first sequence read, dividing the second sequence read into a third sub-read and a fourth sub-read, the third sub-read extending from the beginning of the second sequence read to the alignment anchor sequence, the fourth sub-read extending from the alignment anchor sequence to an end of the second sequence read; and a read alignment module stored in the memory and implemented on the one or more processing units to: the first sub-read is aligned with the third sub-read based on the start of the first sequence read, the start of the second sequence read, and the alignment anchor sequence, and the second sub-read is aligned with the fourth sub-read based on the alignment anchor sequence, the end of the first sequence read, and the end of the second sequence read.

Clause 21. the system of clause 20, wherein the alignment anchor sequence has a sequence that is not aligned to itself.

The system of any of clauses 20-21, wherein the alignment anchor sequence is located between 45% and 55% of the distance from the beginning of the first sequence read to the end of the first sequence read and between 45% and 55% of the distance from the beginning of the second sequence read to the end of the second sequence read.

Clause 23. the system of any one of clauses 20 to 22, further comprising an oligonucleotide synthesizer configured to read the first polynucleotide molecule with a first sequence comprising only one instance of the aligned anchor sequences and synthesize a second polynucleotide sequence, wherein the second polynucleotide sequence comprises only one instance of the aligned anchor sequences.

Clause 24. the system of any one of clauses 20 to 23, further comprising a common output sequence generator module stored in the memory and implemented on the one or more processing units to: generating a first forward consensus sequence from the first sub-read and the third sub-read; generating a first reverse consensus sequence from the first sub-read and the third sub-read; combining the first forward consensus sequence and the first reverse consensus sequence into a first combined consensus sequence; generating a second forward consensus sequence from the second sub-read and the fourth sub-read; generating a second reverse consensus sequence from the second sub-read and the fourth sub-read; combining the second forward consensus sequence and the second reverse consensus sequence into a second combined consensus sequence; appending a second combined consensus sequence to the end of the first combined consensus sequence by alignment at the alignment anchor sequence; and deleting the aligned anchor sequences.

Clause 25. a system for alignment of sequence reads, comprising: one or more components for processing; a memory coupled to one or more components for processing; means for identifying an alignment anchor sequence having a predetermined sequence present in a first sequence read and a second sequence read, means for dividing the first sequence read into a first sub-read extending from the beginning of the first sequence read to the alignment anchor sequence and a second sub-read extending from the alignment anchor sequence to the end of the first polynucleotide sequence, means for dividing the second sequence read into a third sub-read extending from the beginning of the second sequence read to the alignment anchor sequence and a fourth sub-read extending from the alignment anchor sequence to the end of the second polynucleotide sequence; means for aligning the first sub-read with the third sub-read based on the start of the first sequence read, the start of the second sequence read, and the alignment anchor sequence, and means for aligning the second sub-read with the fourth sub-read based on the alignment anchor sequence, the end of the first sequence read, and the end of the second sequence read.

Conclusion

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts are disclosed as example forms of implementing the claims.

52页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:健康管理辅助装置、健康管理辅助方法、健康管理辅助终端和程序

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!