Predicting quality of sequencing results using deep neural networks

Document No.: 1525434    Publication date: 2020-02-11

Description: This technology, "Predicting quality of sequencing results using deep neural networks," was created by A. Dutta and A. Kia on 2019-01-04. Its main content is as follows: The disclosed technology predicts base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and at least twice as many post-prediction base recognition process cycles. A plurality of time series from the pre-prediction base recognition cycles is provided as input to a trained convolutional neural network. The convolutional neural network determines, from the pre-prediction base recognition process cycles, the likely total base recognition quality expected after the post-prediction base recognition process cycles. When the base recognition process includes paired reads of a sequence, the total base recognition quality time series of the first read is also provided as an additional input to the convolutional neural network, to determine the likely total base recognition quality after the post-prediction cycles of the second read.

1. A computer-implemented method for early prediction of base recognition quality during an extended optical base recognition process comprising a pre-prediction base recognition process cycle and a post-prediction base recognition process cycle that is at least twice the pre-prediction cycle, wherein each base recognition process cycle comprises: (a) chemical processing of target nucleotide strands with additional complementary nucleotides added to millions of positions on a substrate, (b) camera localization and image registration on a patch of the substrate, and (c) image acquisition on the patch, the method comprising:

inputting a plurality of time series from the pre-prediction base recognition process cycle into a trained convolutional neural network, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series;

wherein the trained convolutional neural network is trained using base recognition quality experience comprising a plurality of time series of the pre-prediction base recognition process cycle and a post-prediction total base recognition quality time series;

the trained convolutional neural network determining a likely total base recognition quality expected after a post-prediction base recognition process cycle that is at least twice the pre-prediction cycle, based on the pre-prediction base recognition process cycle; and

outputting the likely total base recognition quality for evaluation by an operator.

2. The computer-implemented method of claim 1, wherein chemical processing performance is represented in the chemical processing subsystem performance time series by estimates of phasing and prephasing.

3. The computer-implemented method of any of claims 1 to 2, wherein image registration performance is represented in the image registration subsystem performance time series by a report of post-image capture x and y image offset adjustments.

4. The computer-implemented method of any of claims 1 to 3, wherein image acquisition performance is represented in the image acquisition subsystem performance time series by focus and contrast reports.

5. The computer-implemented method of claim 4, wherein the focus is represented by a narrowness of a full width at half maximum of each cluster in the cluster image.

6. The computer-implemented method of claim 4, wherein the contrast comprises a minimum contrast calculated as the 10th percentile for each channel of a set of images.

7. The computer-implemented method of claim 4, wherein the contrast comprises a maximum contrast calculated as the 99.5th percentile for each channel of a set of images.

8. The computer-implemented method of claim 4, wherein the image acquisition performance further comprises cluster-intensity image acquisition subsystem performance time series reporting.

9. The computer-implemented method of claim 8, wherein the cluster intensity is reported as the 90th percentile of the intensity of the imaged clusters.

10. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 3 to 25 times as many post-prediction base recognition process cycles as pre-prediction cycles.

11. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 2 to 50 times as many post-prediction base recognition process cycles as pre-prediction cycles.

12. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 20 to 50 pre-prediction base recognition process cycles.

13. The computer-implemented method of any one of claims 1 to 9, wherein the base recognition process comprises 100 to 500 post-prediction base recognition process cycles.

14. The computer-implemented method of claim 1, further comprising determining likely total base recognition qualities for at least five intermediate cycle counts from the pre-prediction base recognition process cycles during the post-prediction base recognition process cycles, and outputting the intermediate likely total base recognition quality determinations.

15. A computer-implemented method for early prediction of base recognition quality during an extended optical base recognition process, the extended optical base recognition process comprising sequences read in pairs, each read comprising a pre-prediction base recognition process cycle and a post-prediction base recognition process cycle that is at least twice the pre-prediction cycle, each base recognition process cycle comprising: (a) chemical processing of target nucleotide strands with additional complementary nucleotides added to millions of positions on the substrate, (b) camera localization and image registration on a patch of the substrate, and (c) image acquisition on the patch, the method comprising:

inputting into the trained convolutional neural network:

a plurality of time series of the pre-prediction base recognition process cycle from a second read, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series, and

a total base recognition quality time series of the first read;

wherein the trained convolutional neural network is trained using base recognition quality experience comprising a plurality of time series of the pre-prediction base recognition process cycle of the second read, a post-prediction total base recognition quality time series of the second read, and a total base recognition quality time series of the first read;

the trained convolutional neural network determining, from the pre-prediction base recognition process cycle of the second read and the total base recognition quality time series of the first read, a likely total base recognition quality of the second read expected after a post-prediction base recognition process cycle that is at least twice the pre-prediction cycle; and

outputting the likely total base recognition quality of the second read for evaluation by an operator.

16. A system comprising one or more processors coupled to a memory loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process comprising a pre-prediction base recognition process cycle and a post-prediction base recognition process cycle that is at least twice as large as the pre-prediction cycle, wherein each base recognition process cycle comprises: (a) chemical processing of target nucleotide strands with additional complementary nucleotides attached to millions of positions on the substrate, (b) camera localization and image registration on a patch of the substrate, and (c) image acquisition on the patch; the instructions, when executed on the processor, perform operations comprising:

inputting a plurality of time series from the pre-prediction base recognition process cycle into a trained convolutional neural network, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series;

wherein the trained convolutional neural network is trained using base recognition quality experience comprising a plurality of time series of the pre-prediction base recognition process cycle and a post-prediction total base recognition quality time series;

the trained convolutional neural network determining a likely total base recognition quality expected after a post-prediction base recognition process cycle that is at least twice the pre-prediction cycle, based on the pre-prediction base recognition process cycle; and

outputting the likely total base recognition quality for evaluation by an operator.

17. The system of claim 16, wherein chemical processing performance is represented in the chemical processing subsystem performance time series by estimates of phasing and prephasing.

18. A non-transitory computer-readable medium having computer-executable instructions for implementing the neural network-based early prediction of base recognition quality as claimed in any one of claims 1 to 15.

19. A computer system running on a number of parallel processors adapted to perform the computer-implemented method of any of claims 1 to 15.

Technical Field

The disclosed technology relates to artificial intelligence type computers and digital data processing systems, and corresponding data processing methods and products for intelligent simulation, including machine learning systems and artificial neural networks. In particular, the disclosed techniques relate to analyzing ordered data using deep learning and deep convolutional neural networks.

Background

The subject matter discussed in the background section should not be admitted to be prior art merely as a result of its mention in the background section. Similarly, the problems mentioned in the background section or related to the subject matter of the background section should not be considered as having been previously acknowledged in the prior art. The subject matter in the background section merely represents different approaches that may themselves correspond to implementations of the claimed technology.

Various protocols in biological or chemical research involve performing a large number of controlled reaction cycles. Some DNA sequencing protocols, such as sequencing-by-synthesis (SBS), detect light emissions from a series of reaction sites. In SBS, a plurality of fluorescently labeled nucleotides are used to sequence nucleic acids of a large number of amplified DNA clusters (or clonal populations) located on the surface of a substrate. For example, the surface may define a channel in the flow channel. The nucleic acid sequences in the different clusters are determined by running through hundreds of cycles in which fluorescently labeled nucleotides are added to the clusters, which are then excited by a light source to provide light emission.

Although SBS is an effective technique for determining nucleic acid sequences, an SBS run may take three days or more to complete. Some runs fail due to quality issues. Reliably predicting the final quality of a sequencing run within a few cycles would be beneficial to users of sequencing instruments, allowing them to stop failing runs after half a day or less. At present, the operator of a sequencing instrument cannot predict the final quality of a sequencing run in advance.

Fortunately, a large amount of subsystem performance data has been collected for performing troubleshooting. This subsystem data can be combined and used to predict the total base identification quality at the end of a sequencing read or run, and at intervals during the read. By using subsystem performance indicators reported early in the run, the trained deep neural network can predict the likely total base recognition quality.

Drawings

The drawings are included for illustrative purposes and are used only to provide examples of possible structures and process operations of one or more embodiments of the present disclosure. These drawings in no way limit any changes in form and detail that may be made to the disclosure by one skilled in the art without departing from the spirit and scope of the disclosure. A more complete understanding of the subject matter may be derived by referring to the detailed description and claims when considered in conjunction with the following figures, wherein like reference numbers refer to similar elements throughout the figures.

Fig. 1 shows an architecture level schematic of a system in which a machine learning system including a quality prediction convolutional neural network predicts the overall base recognition quality of sequencing data generated by a sequencing system.

Figure 2 shows the subsystem performance and total base recognition quality data stored per cycle in the sequencing quality database of figure 1.

FIG. 3 illustrates the processing of an input having one channel by different layers of the quality prediction convolutional neural network of FIG. 1.

FIG. 4 illustrates the processing of an input having four channels by different layers of the quality prediction convolutional neural network of FIG. 1.

FIG. 5 shows an example of subsystem performance data and total base recognition quality data stored in the sequencing quality database of FIG. 1.

Figure 6 shows a graphical representation of total base recognition quality data for two reads of an example sequencing run.

Fig. 7 shows total base identification quality data for two reads of two example sequencing runs, indicating the predicted total base identification quality in different target cycles.

FIG. 8 shows example data for predicted and true total base identification quality data within a target cycle, and a graph of a comparison of validation data and test data within an intermediate target cycle.

Fig. 9 shows an example of an architecture level schematic of the quality prediction convolutional neural network of fig. 1 in training and production.

Fig. 10 is a simplified block diagram of a computer system that may be used to implement the machine learning system of fig. 1.

Detailed Description

The following detailed description is made with reference to the accompanying drawings. Example embodiments are described to illustrate the disclosed technology and not to limit its scope (as defined by the claims). Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows.

Introduction

The quality of base recognition is a measure of the success or failure of nucleotide sequencing of a DNA or RNA molecule. Sequencing-by-synthesis (SBS) is a sequencing technique that adds complementary nucleotides, one at a time, to nucleotide sequence fragments of the DNA to be sequenced. Optical platforms using SBS can sequence billions of clusters of nucleotide sequence fragments (sometimes referred to as molecules) arranged on a slide or flow cell in multiple lanes, each lane divided into tiles. Molecular clusters represent clones of a molecule. The cloned molecules amplify the signal generated during SBS.

Sequencing the nucleotides in a molecule requires hundreds of cycles. The clonal clusters are prepared for the SBS process before the cycles begin. Each cycle comprises chemical operations, image capture operations, and image processing operations. The chemical operation is designed to add one dye-labeled complementary nucleotide per molecule in each cluster per cycle. When a molecule falls behind or runs ahead of the SBS chemistry relative to other molecules within its cluster, it goes out of phase, which is referred to as phasing (lagging) or pre-phasing (leading). The image capture operation involves aligning the camera with a tile in a lane, illuminating the tile, and capturing one to four images. Image processing results in base recognition, which means that the complementary nucleotide added to the molecules in a cluster is identified in that cycle. Dye chemistry, illumination, camera design, and the number of images captured vary across sequencing platforms. The sequencing instrument may provide subsystem performance metrics for chemistry, camera positioning or registration, image capture or acquisition, and total base recognition quality.

Sequencing a molecule of 350 nucleotides by SBS can involve 300 or more processing cycles in a run. The run is divided into two reads starting from the 3' and 5' ends of the same sequence fragment. When the number of cycles is less than the length of the molecule, an unsequenced region remains in the middle of the molecule after reading from the 3' and 5' ends is complete.

Sequencing the human genome requires parallel sequencing of many molecules of DNA fragments, since the human genome comprises approximately 3 billion base pairs. These base pairs are organized into 23 pairs of human chromosomes that are replicated in each cell. The 300 cycles, and the subsequent processing that combines partial sequences into a whole genome, may take 3 days or more to complete. Some runs fail due to quality issues. Reliably predicting the final quality of a sequencing run within a few cycles would be beneficial to users of sequencing instruments, allowing them to stop failing runs after half a day or less.

The operator of the sequencing instrument cannot predict in advance the final quality of the sequencing run. Fortunately, a large amount of subsystem performance data has been collected for performing troubleshooting. This subsystem data can be combined and used to predict the total base identification quality at the end of a sequencing read or run, and at intervals during the read. By using subsystem performance indicators reported early in the run, the trained deep neural network can predict the likely total base recognition quality.

When the run includes two reads from both ends of the molecule, a similar, even earlier prediction can be made for the second read. Since the second read immediately follows the first read, data from late in the first read can be combined with data from early in the second read. This can significantly reduce the number of second-read cycles required before a prediction. For example, if subsystem performance data from 25 cycles is used during the first read, it is sufficient to combine only 5 cycles of second-read data with 20 cycles of first-read data. Separate predictions of the quality of the first and second reads can be made.
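The cycle-combining arithmetic above (20 first-read cycles plus 5 second-read cycles giving a 25-cycle window) can be sketched as follows. The function and array names are illustrative assumptions, not part of the disclosed system, and the quality values are synthetic:

```python
import numpy as np

def build_read2_input(read1_quality, read2_metrics, window=25, read2_cycles=5):
    """Assemble a fixed-length input window for the second-read prediction.

    Hypothetical helper: concatenates the last (window - read2_cycles) values
    of the read-1 total quality series with the first read2_cycles values of
    the read-2 series, as described in the text (e.g. 20 + 5 = 25 cycles).
    """
    tail = read1_quality[-(window - read2_cycles):]  # late read-1 cycles
    head = read2_metrics[:read2_cycles]              # early read-2 cycles
    return np.concatenate([tail, head])

# Synthetic example: 150 read-1 quality values and 5 early read-2 values.
r1 = np.linspace(38.0, 34.0, 150)
r2 = np.array([37.9, 37.8, 37.8, 37.7, 37.6])
window = build_read2_input(r1, r2)
print(window.shape)  # (25,)
```

In practice each per-cycle entry would carry several subsystem metrics rather than a single number; the one-dimensional series here only illustrates the windowing.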

Environment

We describe a system for early prediction of base recognition quality in an extended optical base recognition process. There are four nucleotides in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T). Base recognition refers to the process of determining the nucleotide base (A, C, G, T) of each cluster of DNA molecules in one cycle of a sequencing run. The system is described with reference to fig. 1, which shows an architecture level schematic of the system according to one embodiment. Since fig. 1 is an architectural diagram, certain details are intentionally omitted to improve clarity of the description. The discussion for fig. 1 is arranged as follows. First, elements of the figure are described, followed by a description of their interconnections. The use of elements in the system is then described in more detail.

Fig. 1 includes a system 100. The system 100 includes a sequencing system 111, a sequencing quality database 115, a machine learning system 151 in training mode, a machine learning system 159 in production mode, and an operator 165 who monitors subsystem performance and base recognition quality. The disclosed techniques are applicable to a variety of sequencing systems 111, also known as sequencing instruments or sequencing platforms. Some examples of sequencing system 111 include Illumina's HiSeqX™, HiSeq3000™, HiSeq4000™, NovaSeq 6000™, and MiSeqDx™. These sequencing systems are configured to perform base recognition using sequencing-by-synthesis (SBS) techniques.

In SBS, a laser illuminates the dye-labeled complementary nucleotides attached to each molecule in each cluster in each cycle. The camera takes an image of the patch, which is then processed to identify the nucleotide (A, C, G, T) attached to the molecules in each cluster. Some sequencing systems use four channels to identify the four nucleotides (A, C, G, T) attached to the molecules in each cycle. In such a system, four images are generated, each containing signals of a single, different color. The four colors correspond to the four possible nucleotides present at a particular position. In other sequencing systems, two channels are used to identify the four nucleotides (A, C, G, T). In such a system, two images are taken per cycle. A first nucleotide type is detected in the first channel, a second nucleotide type is detected in the second channel, a third nucleotide type is detected in both channels, and a fourth nucleotide type, which lacks a dye label, is not detected (or is minimally detected) in either channel.
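The two-channel scheme just described can be sketched as a small decoding function. The assignment of particular bases to particular channel patterns below is an assumption for illustration only; the text does not specify which nucleotide maps to which channel:

```python
def decode_two_channel(ch1_on: bool, ch2_on: bool) -> str:
    """Decode a base from two-channel signals, per the scheme above.

    The base-to-pattern mapping is assumed: one base lights each single
    channel, one lights both, and the unlabeled base lights neither.
    """
    if ch1_on and ch2_on:
        return "A"  # detected in both channels (assumed mapping)
    if ch1_on:
        return "C"  # first channel only (assumed)
    if ch2_on:
        return "T"  # second channel only (assumed)
    return "G"      # dark in both channels (assumed)

patterns = [(True, True), (True, False), (False, True), (False, False)]
print([decode_two_channel(a, b) for a, b in patterns])  # ['A', 'C', 'T', 'G']
```

Real instruments decode from per-cluster intensities in the two images rather than clean booleans, so this only illustrates the combinatorial idea of encoding four bases in two channels.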

The sequencing quality database 115 stores subsystem performance and total base recognition quality data for each cycle. In one embodiment, the sequencing quality database 115 stores chemical processing subsystem performance data 116, image registration subsystem performance data 117, image acquisition subsystem performance data 118, and total base recognition quality data 119. The sequencing quality database 115 stores these data for each cycle of a particular sequencing system. The sequencing system 111 sequences the molecules in a sequencing run. As described above, sequencing molecules of 350 nucleotides using SBS involves 300 or more processing cycles in the sequencing run. The sequencing run is sometimes divided into two reads starting at the 3' end and the 5' end of the same sequenced molecule (also called a fragment or insert). This is also referred to as paired-end reading. In one sequencer, each of the two reads from the two ends of the molecule involves 150 base recognition cycles. The number of cycles per read varies with the sequencer type and can be as many as 150. The disclosed technique divides the total cycles of each read into pre-prediction cycles and post-prediction cycles. In one embodiment of system 100, the first 25 cycles of read 1 are pre-prediction cycles, and the next 125 cycles are post-prediction cycles. Read 2 has fewer pre-prediction cycles. In one embodiment, the first 5 cycles of read 2 are pre-prediction cycles, and the subsequent 145 cycles are post-prediction cycles. It will be appreciated that fewer or more cycles may be used in each read as the pre-prediction and post-prediction cycles, respectively.
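A minimal sketch of the pre/post split just described (25 + 125 cycles for read 1, 5 + 145 for read 2), with illustrative names:

```python
def split_cycles(series, pre=25):
    """Split a per-cycle metric series into the pre-prediction segment
    (network input) and the post-prediction segment (to be predicted)."""
    return series[:pre], series[pre:]

read1 = list(range(1, 151))                 # 150 cycles in one read
pre, post = split_cycles(read1, pre=25)
print(len(pre), len(post))                  # 25 125

# Read 2 uses fewer pre-prediction cycles in the described embodiment.
read2_pre, read2_post = split_cycles(read1, pre=5)
print(len(read2_pre), len(read2_post))      # 5 145
```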

Machine learning system 151 includes a database containing training data 161, validation data 162, and test data 163. These three data sets contain subsystem performance and total base recognition quality data from the sequencing quality database 115, collected from previous sequencing runs of the sequencing system 111. The data are organized per cycle, indicating subsystem performance in the pre-prediction base recognition process cycles and the total base recognition quality for all cycles of the sequencing run. These data sets are used to train and test the performance of the quality prediction convolutional neural network 171. Each quality prediction convolutional neural network includes one or more convolutional neural networks (CNNs) and a fully connected (FC) network. In the training mode of the machine learning system 151, both the forward pass and backpropagation are used, whereas the production mode of the machine learning system 159 performs only the forward pass. In the forward pass, the machine learning system predicts the likely total base recognition quality expected at the target post-prediction base recognition process cycle. In backpropagation, the machine learning system computes the gradients of one or more cost functions and propagates the gradients to the CNN and FC networks during training.
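The forward pass of such a network can be sketched in NumPy. The shapes follow the four input time series over 25 pre-prediction cycles, but the filter count, kernel width, and random weights below are placeholders, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Valid 1-D convolution: x is (channels, cycles), kernels is
    (filters, channels, width). Returns (filters, cycles - width + 1)."""
    f, c, w = kernels.shape
    out_len = x.shape[1] - w + 1
    out = np.zeros((f, out_len))
    for i in range(out_len):
        out[:, i] = np.tensordot(kernels, x[:, i:i + w], axes=([1, 2], [0, 1]))
    return out

# Four input channels (chemistry, registration, acquisition, total quality)
# over 25 pre-prediction cycles; values are random stand-ins.
x = rng.standard_normal((4, 25))
conv_w = rng.standard_normal((8, 4, 5))       # 8 filters of width 5 (assumed)
h = np.maximum(conv1d(x, conv_w), 0.0)        # ReLU activation, shape (8, 21)
fc_w = rng.standard_normal(h.size)            # fully connected output layer
quality_prediction = float(fc_w @ h.ravel())  # single predicted quality score
print(h.shape)  # (8, 21)
```

Training (backpropagation of a cost-function gradient through the FC and convolutional layers) would normally be handled by a framework such as PyTorch or TensorFlow rather than written by hand.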

In one embodiment of the system 100, the machine learning system 151 includes a quality prediction convolutional neural network 171. The quality prediction convolutional neural network 171 predicts the likely total base recognition quality expected at the end of the post-prediction base recognition process cycles of a read. In one embodiment of the system 100, the number of post-prediction cycles in a read is at least twice the number of pre-prediction base recognition process cycles. In another embodiment of the system 100, the quality prediction convolutional neural network 171 predicts the likely total base recognition quality for at least five intermediate cycle counts during the post-prediction base recognition process cycles. The quality prediction convolutional neural network 171 outputs an intermediate likely total base recognition quality determination for each of the five intermediate cycles. In another embodiment of the system 100, the quality prediction convolutional neural network 171 predicts the likely total base recognition quality expected at each post-prediction cycle of the read.

In another embodiment of the system 100, the machine learning system 151 includes a plurality of quality prediction convolutional neural networks 171. Each quality prediction convolutional neural network is trained separately using training data 161 comprising the subsystem performance time series and total base recognition quality time series for the pre-prediction base recognition process cycles, and the post-prediction total base recognition quality for its target cycle. In this embodiment of the system 100, a particular trained convolutional neural network determines the likely total base recognition quality expected at its target cycle from the pre-prediction base recognition process cycles. The target cycle may be the last cycle of the read or any intermediate cycle among the post-prediction cycles. In fig. 1, the example machine learning system 151 includes quality prediction convolutional neural networks for target cycles in increments of 5 or 10 cycles, up to the last cycle, e.g., the 100th or 150th base recognition cycle in a read.

A trained quality prediction convolutional neural network 179 is deployed in production mode and is shown in fig. 1 as part of the machine learning system 159. The machine learning system 159 also includes a production database 169. The production database 169 contains the subsystem performance data 116, 117, 118 and total base recognition quality data 119 for each pre-prediction cycle of the sequencing system. The trained quality prediction convolutional neural network 179 determines, from the pre-prediction base recognition process cycles, the likely total base recognition quality expected after post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. As described above, in one embodiment of the system 100, a single trained quality prediction convolutional neural network 179 can be used to predict the likely total base recognition quality at multiple target cycles among the post-prediction cycles of one read of sequencing data. In another embodiment of the system 100, as shown in FIG. 1, a separate trained quality prediction convolutional neural network is used for each target cycle. The machine learning systems in training 151 and production 159 may run on various hardware processors, such as graphics processing units (GPUs). Neural network-based models involve computationally intensive methods, such as convolution and matrix-based operations, for which GPUs are well suited. Specialized hardware is also being developed to train neural network models efficiently.

Sequencing quality data

Fig. 2 shows the sequencing quality indicators 213 stored in the sequencing quality database 115 for each cycle of a sequencing run of a particular sequencing system 200. Fig. 2 lists some example subsystem performance indicators at a higher level of abstraction. These indicators include chemical processing subsystem performance data 116, image registration subsystem performance data 117, and image acquisition subsystem performance data 118. The total base recognition quality data 119 for each cycle is also provided as input to the machine learning system. In fig. 2, for illustrative purposes, subsystem performance data 116, 117, and 118 are shown for "n" sequencing cycles of a read of a sequencing run. The total number of cycles in the read is shown as "3n". The first "n" sequencing cycles are the pre-prediction cycles, and the subsequent "2n" cycles are the post-prediction cycles in the read of the sequencing run.
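One per-cycle record in the sequencing quality database might be organized as below. The field names and values are illustrative assumptions for grouping the four metric families of Fig. 2, not the instrument's actual log schema:

```python
# Illustrative per-cycle record combining the four metric groups of Fig. 2.
cycle_record = {
    "cycle": 12,
    "chemistry": {"phasing": 0.0012, "prephasing": 0.0008},   # data 116
    "registration": {"x_offset": 0.31, "y_offset": -0.22},    # data 117
    "acquisition": {"focus_fwhm": 1.9,                        # data 118
                    "min_contrast": 410, "max_contrast": 890,
                    "cluster_intensity_p90": 152.0},
    "total_quality": 37.4,                                    # data 119
}

# The first "n" such records form the pre-prediction time series
# that is fed to the quality prediction network.
print(sorted(cycle_record))
```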

Sequencing quality data 219 for each cycle shows the subsystem performance indicators at a lower level of abstraction. Chemical processing subsystem performance data 116 includes two metrics, shown as a phasing metric C_n1 and a prephasing metric C_n2 for the first "n" cycles of a sequencing run. In each cycle of sequencing-by-synthesis (SBS) technology, a chemical process attaches a complementary nucleotide to a target nucleotide strand (or molecule) at millions of locations on a substrate. The term "phasing" describes the situation where a molecule in a molecular cluster lags the other molecules in the same cluster by at least one base during the sequencing process. This is due to incomplete chemical reactions. The sequences of these molecules are out of phase with the rest of the cluster. More specifically, these molecules lag one cycle behind the other molecules in the cluster. The effect is cumulative: once a molecule falls behind, it cannot catch up with the other molecules in the cluster, and in each subsequent cycle more molecules may fall behind.
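The cumulative effect can be illustrated numerically. Assuming, for simplicity, an independent per-cycle phasing probability p, the fraction of molecules in a cluster that remain in phase after n cycles is (1 - p)^n. The 0.1% per-cycle figure below is a made-up example, not a number from this disclosure:

```python
def in_phase_fraction(p_phasing: float, cycles: int) -> float:
    """Fraction of a cluster's molecules still in phase after some cycles,
    assuming each molecule independently falls behind with probability
    p_phasing in every cycle (a simplification of the effect above)."""
    return (1.0 - p_phasing) ** cycles

# With an assumed 0.1% phasing rate per cycle, after 150 cycles:
print(in_phase_fraction(0.001, 150))  # roughly 0.86 of molecules in phase
```

This simple decay shows why even sub-percent per-cycle phasing erodes signal purity, and hence base recognition quality, over a long read.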

The term "prephasing" refers to the situation where a molecule is at least one base ahead of the other molecules in the same cluster. One cause of prephasing is the incorporation of a non-terminated nucleotide followed by a second nucleotide in the same sequencing cycle. The sequencing quality data 219 includes a phasing metric and a prephasing metric for the pre-prediction cycles of a read. In an embodiment of system 100, there are two reads in one sequencing run; for read 1 of the sequencing run the value of "n" is 25, and for read 2 the value of "n" is 5. It is understood that in read 1 and read 2, different numbers of cycles may be used as pre-prediction cycles.

In embodiments, the sequencing system 111 provides two types of cluster arrangements on the tiles of the flow cell, referred to as random and patterned. The sequencing system 111 uses a camera to capture images of the clusters on tiles of the flow cell during a sequencing cycle. The process of aligning a virtual image (also referred to as a template) with a given sequencing image is referred to as registration. For image registration with randomly arranged cluster positions on the flow cell, a template that identifies the cluster positions (x and y positions) on the flow cell is generated in the first few cycles (e.g., 5 cycles) of the sequencing run. Image registration subsystem performance data includes an "x" offset adjustment R_n1 and a "y" offset adjustment R_n2 for the cluster positions in the images of the first "n" cycles (also referred to as the pre-prediction cycles of a read of a sequencing run).

Alternatively, the second cluster formation technique used by sequencing system 111 is based on patterned flow cells. A patterned flow cell has an array of nanowells, allowing higher cluster density and unambiguous cluster identification. For flow cells with patterned cluster locations, the template generation process is replaced by a step that places a hexagonally packed lattice of clusters at the x, y locations of the region corresponding to the patch size. The virtual image (or template) is replaced with a virtual image of ring-shaped reference marks that is aligned with the portion of the sequencing image containing the actual reference marks. The image registration subsystem performance data in such a sequencing system is the same as described above for a sequencing system with randomly arranged cluster positions.

During each cycle of the SBS technique, four complementary nucleotides (A, C, G, T) are simultaneously delivered to the molecular clusters on the patches arranged in lanes on the flow cell. Each nucleotide has a spectrally distinct tag attached to it. A laser illuminates the dye-labeled complementary nucleotide attached to each molecule in each cluster in each cycle. The camera takes images of the patch, which are then processed to identify the nucleotide (A, C, G, T) attached to the molecules in each cluster. Some sequencing systems use four channels to identify the four types of nucleotides attached to the molecules in each cycle. In such a system, four images are generated per cycle, each acquired using a detection channel selective to one of four different labels, so that each image contains signals of a single color corresponding to one of the four possible nucleotides at a particular position. The identified tags are then used to call a base for each cluster. In this embodiment, the "x" offset adjustment R_n1 and the "y" offset adjustment R_n2 for cycle "n" are provided as inputs to the machine learning system, one value per channel.

In another type of sequencing system, two channels are used to identify the four complementary nucleotides (A, C, G, T) attached to the molecules. In such a system, two images are taken per cycle. A first nucleotide type is detected in the first channel, a second nucleotide type is detected in the second channel, a third nucleotide type is detected in both channels, and a fourth nucleotide type lacking a dye label is not detected, or is minimally detected, in either channel. As described above, the sequencing quality data 219 includes image registration subsystem performance data for the first "n" cycles (also referred to as the pre-prediction cycles of a read of the sequencing run).

Image acquisition subsystem performance data includes a focus score A_n1, a minimum contrast metric A_n2, a maximum contrast metric A_n3, and an intensity metric A_n4 for the first "n" cycles of a sequencing run. The focus score is defined as the mean full width at half maximum (FWHM) of the molecular clusters, representing their approximate size in pixels. The minimum and maximum contrast values are the 10th and 95.5th percentiles, respectively, of each channel of a selected column of the raw image. The selected column may be a particular patch or lane of the flow cell. The process of determining an intensity value for each cluster in the template of a given sequencing image is referred to as intensity extraction. To extract the intensity, the background around each cluster is computed using a portion of the image containing the cluster, and the background signal is subtracted from the cluster signal to determine the intensity. The intensity at the 90th percentile of the extracted data is stored in the sequencing quality data 219. Image acquisition subsystem performance data for the first "n" pre-prediction cycles of a read of the sequencing run is stored in the sequencing quality database 219. In one embodiment, each image acquisition subsystem performance data value includes four values corresponding to the four channels discussed above.

The total base call quality data 119 is given as input as Q30 values for all "3n" cycles in a read of the sequencing run. Quality scoring is a widely used technique in DNA sequencing to express confidence in the correctness of a base call by assigning a Phred quality score. For example, Illumina, Inc. uses a pre-trained, instrument-specific model to obtain the quality of base calls in each cycle of a sequencing system (also referred to as a sequencing instrument). The percentage of bases above Q30 (also referred to as %Q>30) indicates the percentage of base calls with a quality score of 30 or higher. A quality score of 30 indicates a base call error probability of 1 in 1,000, i.e., 99.9% accuracy. Similarly, a quality score of 20 means 99% base call accuracy, and a quality score of 40 indicates 99.99% accuracy. During a sequencing run, the %Q>30 indicator can be viewed at different levels (e.g., per patch per cycle, the average over all patches in a lane per cycle, the average over all patches per cycle, and the "total" average for the entire sequencing run).
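
The relationship between Phred score and error probability, Q = −10·log10(P_error), and the %Q>30 summary can be sketched as follows (the per-base scores in the example are hypothetical, not instrument output):

```python
import math

def phred_quality(error_prob):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(error_prob)

def percent_q30(quality_scores):
    """Percentage of base calls with a quality score of 30 or higher (%Q>30)."""
    passing = sum(1 for q in quality_scores if q >= 30)
    return 100.0 * passing / len(quality_scores)

# Q30 corresponds to a 1-in-1,000 error probability (99.9% accuracy):
print(round(phred_quality(0.001)))   # 30
print(round(phred_quality(0.01)))    # 20  (99% accuracy)
print(round(phred_quality(0.0001)))  # 40  (99.99% accuracy)

# Hypothetical per-base quality scores for one cycle of one patch:
scores = [35, 32, 28, 40, 31, 22, 37, 30]
print(percent_q30(scores))  # 75.0
```

The same `percent_q30` aggregation can be applied per patch, per lane, or over the whole run, matching the different reporting levels described above.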

Run quality can be measured by the %Q>30 value; a higher %Q>30 value represents a larger number of bases that can be reliably used in downstream data analysis. Each Illumina sequencing system has an expected %Q>30 specification. For example, for the HiSeqX™ system, on average greater than or equal to 75% of bases are expected to be above Q30 for 150-nucleotide (also referred to as base) paired-end reads. In one embodiment of the system 100, during training, the target value for each post-prediction sequencing cycle (Q30_{n+1} to Q30_{3n}) is an average over 10 cycles. For example, the total base call quality value for post-prediction cycle 50 is the average of the total base call quality values for cycles 45 through 54.

Subsystem performance data 116, 117, and 118 for the pre-prediction cycles ("n") and the total base call quality data 119 for all cycles ("3n") of a read are stored in the sequencing quality database 219. In one embodiment of the system 100, additional sequencing quality indicators are used as inputs to the machine learning system. Examples of such indicators include data reported by temperature sensors and laser power sensors in the flow cell. The data reported by sensors in the sequencing system 111 is used to monitor system performance during a sequencing run; sometimes the reported data may also include readings from before and after the run. Further examples of metrics that may be used as inputs to the machine learning system include per-cycle error metrics (including cycle error rate) and counts of perfect reads and of reads with one to four errors. It will be appreciated that additional metrics may be included as inputs to the machine learning system for predicting the total base call quality of a sequencing run.

Quality prediction convolutional neural network

Fig. 3 shows the layers of the quality prediction Convolutional Neural Network (CNN) 300 of fig. 1. FIG. 3 is an embodiment with two convolutional layers. The network may have one to five convolutional layers; in other embodiments, it may have more than five. One way to analyze the output of the convolutions is through a fully connected (FC) network. Thus, at the last layer of the quality prediction CNN, the output of the convolutional layers is provided to the FC network. The fully connected layers may be implemented as a multilayer perceptron having two to five layers. One output from the FC network can be used to predict the likely total base call quality expected at a particular target cycle in the post-prediction cycles of a read. In an embodiment of such a system, a separate machine learning system is trained to predict the likely total base call quality expected at each target cycle. In an alternative embodiment, multiple outputs from the FC network may be used to predict the total base call quality expected at multiple post-prediction target cycles.

In fig. 3, the dimensions of the input to each layer of the quality prediction CNN are shown in parentheses. As mentioned above, some inputs of the quality prediction CNN have one channel, while others may have four channels. The example quality prediction CNN shown in fig. 3 is for an input with one channel. The dimensions of the input time series indicate that there are 25 inputs, each comprising a one-dimensional value (311). This input can be thought of as a one-dimensional vector containing 25 real numbers. These 25 values correspond to a particular subsystem performance time series, for example the chemical processing subsystem performance time series or the total base call quality time series. As described above, both of these inputs have one channel per cycle. Each input is subjected to an independent convolution. The input first passes through a batch normalization layer at block 321.

In a Convolutional Neural Network (CNN), the distribution of the inputs to each layer changes during training and varies from layer to layer, which reduces the convergence speed of the optimization algorithm. Batch normalization (Ioffe and Szegedy 2015) is a technique to address this problem. Denoting the input to the batch normalization layer by x and the output by z, batch normalization applies the following transformation to x:

z = γ · (x − μ)/σ + β

Batch normalization applies mean-variance normalization to the input x using μ and σ, and linearly scales and shifts it using γ and β.
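
A minimal sketch of this transform, using batch statistics for μ and σ (the small epsilon added for numerical stability is a standard implementation detail assumed here, not stated in the text):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization: mean-variance normalize x with batch statistics
    (mu, sigma), then apply the learned linear scale gamma and shift beta."""
    mu = x.mean()
    sigma = np.sqrt(x.var() + eps)  # eps guards against division by zero
    return gamma * (x - mu) / sigma + beta

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = batch_norm(x)
print(z.mean())  # ~0: the normalized output has zero mean
print(z.std())   # ~1: and unit variance (with gamma=1, beta=0)
```

During training, gamma and beta are learned parameters; here they are left at their identity values to show the normalization itself.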

The output from batch normalization layer 321 is provided as input to convolutional layer 331. Batch normalization does not change the dimensions of the input. In the example convolutional layer shown in fig. 3, 64 filters of width 5 and height 1 are convolved over the input with two zeros of padding on each side. Zero padding is used to handle the edges during convolution. Zero-padding an H × W input with pad = 2 can be thought of as creating a zero matrix of size (H + 2·pad) × (W + 2·pad) and copying the input into the middle of that matrix. If the convolution filter is of size (2·pad + 1) × (2·pad + 1), the result of the convolution with the zero-padded input is H × W, exactly equal to the input size. Padding is typically used to keep the input and output of the convolution operation the same size.
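
The size-preserving effect of this "same" padding can be checked with a small one-dimensional sketch (the kernel values are arbitrary; only the shapes matter):

```python
import numpy as np

def conv1d_same(x, kernel, pad):
    """Convolve a 1-D input with `pad` zeros on each side ('same' convolution)."""
    xp = np.pad(x, pad)  # zeros on both edges
    k = len(kernel)
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(xp) - k + 1)])

x = np.ones(25)      # a 25-cycle input time series, as in fig. 3
kernel = np.ones(5)  # a width-5 filter, as in the first convolutional layer
y = conv1d_same(x, kernel, pad=2)
print(len(y))  # 25: with pad = (5 - 1) / 2 = 2, output length equals input length
```

With kernel width 2·pad + 1 = 5 and pad = 2, the output length is (25 + 2·2) − 5 + 1 = 25, matching the rule stated above.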

The output of the first convolutional layer 331 includes 25 values, each having 64 channels. The output of a convolution is also referred to as a feature map. The output is provided as input to max pooling layer 343. The goal of the pooling layer is to reduce the dimensionality of the feature map, which is why it is also called "downsampling". The factor by which downsampling is performed is called the "stride" or "downsampling factor", denoted "s". In one type of pooling, referred to as "max pooling", the maximum value is selected in each stride. For example, consider applying max pooling with s = 2 to a 12-dimensional vector x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2]. Max pooling x with stride s = 2 means selecting the larger of every two values starting at index 0, resulting in the 6-dimensional vector [10, 8, 6, 7, 5, 9]. The max pooling layer 343 uses a stride of s = 2 to reduce the dimensionality of the output 341 of the first convolution from 25 values to 12 values; the 25th value in output 341 is discarded.
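
The max pooling step above can be sketched directly, reproducing the worked example and the 25-to-12 reduction:

```python
def max_pool_1d(x, stride=2):
    """Stride-s max pooling: keep the maximum of each window of `stride` values.
    A trailing partial window (e.g., the 25th value of a length-25 input)
    is discarded, as described in the text."""
    n = len(x) // stride * stride
    return [max(x[i:i + stride]) for i in range(0, n, stride)]

x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2]
print(max_pool_1d(x))                     # [10, 8, 6, 7, 5, 9]
print(len(max_pool_1d(list(range(25)))))  # 12: a 25-value input reduces to 12
```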

The output of max pooling layer 343 is passed through batch normalization layer 347 before being provided as input to the next convolutional layer 351. In this convolutional layer, 64 kernels of size 5 by 1 (5 × 1) are convolved over the 64 channels, generating an output feature map of size 64 by 12 (64 × 12). A summation operation over the 64 channels generates a feature map of size 1 by 12 (1 × 12). This convolutional layer has 128 kernels, so the above operations are performed 128 times, generating an output feature map of dimensions 128 by 12 (128 × 12). As described above, the second convolutional layer also operates on the input with two zeros of padding per edge. The output of the second convolutional layer is shown as block 361. The max pooling layer 363 with stride s = 2 reduces the output of the convolution from 12 to 6 values of 128 channels each, which is passed through the third batch normalization layer at block 365. The output from batch normalization layer 365 is provided to a summation layer, followed by two fully connected (FC) networks. The details of these layers are shown in fig. 4.

Dropout is a simple and effective technique to prevent overfitting of a neural network. It works by randomly dropping a small fraction of neurons from the network in each training iteration: the outputs and gradients of the selected neurons are set to zero, so they have no effect on forward and backward propagation. In the example quality prediction convolutional neural network shown in fig. 3, dropout with probability 0.3 is applied before the second and third batch normalization layers 347 and 365, respectively.

Fig. 4 shows the architecture of an example quality prediction Convolutional Neural Network (CNN) 400, similar to that shown in fig. 3 but designed for an input having four channels. As described above, the image registration subsystem performance time series and the image acquisition subsystem performance time series consist of four channels of data. These channels may correspond to the four nucleotides (A, C, G, T). In one embodiment of the quality prediction CNN, the four channels of each input are combined before the quality prediction CNN processes the input. In another embodiment, the quality prediction CNN takes the four-channel input and produces a four-channel output corresponding to the input; the four channels of each output value are then added to obtain a single-channel value. In both embodiments, a summation operation adding the values of the four channels for each input value may be used. In the example network shown in fig. 4, the convolution filters convolve over an input having four channels.

Input 411 includes 25 values corresponding to the 25 pre-prediction cycles in a read of a sequencing run. Each of the 25 input values is of size 1 and has four channels. At block 421, batch normalization is performed on the input. At block 431, a padded convolution with two zeros of padding is performed. Four kernels of size 5 by 1 (5 × 1) are convolved over the four channels, generating a feature map of size 4 by 25 (4 × 25). A summation operation over the four channels generates a feature map of size 1 by 25 (1 × 25). This operation is performed 64 times, because there are 64 kernels, producing an output of dimensions 64 by 25 (64 × 25), as shown at block 443. Max pooling with stride s = 2 is performed at block 445, resulting in 64 feature maps of size 12. At block 449, the output of the max pooling layer is passed through a second batch normalization.

At block 451, a second convolution is performed using 128 filters of size 5. The second convolution convolves the filters over the input with two zeros of padding on each side. The output of the second convolution includes 128 feature maps of size 12, as shown in block 461. Max pooling with stride s = 2 reduces the dimensionality to 128 feature maps of size 6 at block 463. At block 465, a third batch normalization is performed. The convolved outputs of all inputs (465 and 365) are combined at summing layer 467. The input to summing layer 467 is 9 feature maps corresponding to the 9 input features, each of dimension 6 by 128 (6 × 128). The summing layer 467 sums the 9 feature maps, reducing the dimensionality to 768 values (6 × 128). The output of the summing layer 467 is flattened and passed to a first fully connected (FC) network 471. FC network 471 produces 64 outputs, which are provided as inputs to a second FC network 481, producing an output 491. The output 491 predicts, for the operator 165, the likely total base call quality at the target cycle.
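
The dimensions flowing through figs. 3 and 4 can be verified with a short shape trace (a sketch of the stated layer sizes only, not of the network's weights or training):

```python
def conv_same_len(n, pad, k):
    """Output length of a 1-D convolution with `pad` zeros per side, kernel width k."""
    return n + 2 * pad - k + 1

def max_pool_len(n, stride=2):
    """Output length after stride-s max pooling (partial windows discarded)."""
    return n // stride

n = 25                            # 25 pre-prediction cycles per input time series
n = conv_same_len(n, pad=2, k=5)  # first convolution  -> (64, 25)
assert n == 25
n = max_pool_len(n)               # max pooling, s = 2 -> (64, 12)
assert n == 12
n = conv_same_len(n, pad=2, k=5)  # second convolution -> (128, 12)
assert n == 12
n = max_pool_len(n)               # max pooling, s = 2 -> (128, 6)
assert n == 6
flat = n * 128                    # summing layer output (6 x 128), flattened
print(flat)  # 768 values entering the first fully connected network
```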

Examples of subsystem Performance data

Fig. 5 shows example data 500 for chemical processing subsystem performance 116, image registration subsystem performance 117, image acquisition subsystem performance 118, and total base call quality 119. The data is organized according to the performance metrics of each subsystem. For example, chemical processing subsystem performance data 116 includes the phasing and prephasing metrics. Similarly, the image registration subsystem performance data 117 includes the translation x and translation y metrics. The image acquisition subsystem performance data 118 includes the intensity, maximum contrast, minimum contrast, and focus score metrics. The total base call quality data 119 indicates the percentage of base calls above the Q30 quality metric. In one embodiment of the system, the above data from 23,000 sequencing runs of HiSeqX, HiSeq3000, and HiSeq4000 sequencing machines from Illumina, Inc. is used to train the quality prediction convolutional neural network 171.

Base recognition quality prediction result analysis

Fig. 6 includes a graph 600 illustrating average total base call quality results (Q30) for an example sequencing run in a sequencing system. The sequencing run includes paired-end reads: read 1 and read 2. Each read contains 150 base calls, corresponding to 150 sequencing cycles in which one complementary nucleotide is appended to the molecules arranged in clusters on the flow cell patches. The two reads are separated by an index read. In some sequencing runs, molecules from multiple source DNA samples are sequenced together; index reads are used to identify the sequencing data belonging to each unique source DNA sample.

In one embodiment, the first 25 cycles of read 1 and the first 5 cycles of read 2 are used as the pre-prediction base call process cycles. Subsystem performance data 116, 117, and 118 and total base call quality data 119 from the pre-prediction cycles are provided as inputs to the quality prediction Convolutional Neural Network (CNN). In another embodiment, the total base call quality scores of the last 20 cycles of read 1 are also provided as input to the quality prediction CNN for read 2. Graph 600 shows that the average Q30 score of the example sequencing run decreases as sequencing of the molecules progresses. Because the chemistry performed in a sequencing cycle is a stochastic process, errors in the chemical processing steps accumulate from cycle to cycle. As more sequencing cycles are performed, the errors of earlier cycles add up, creating the decay indicated by the read 1 and read 2 curves in graph 600.

Figure 7 shows total base call quality predictions 700 and confidence intervals for two example paired-end sequencing runs, shown in graphs 711 and 751. The actual average Q30 values of the two sequencing runs are shown as the "read 1" and "read 2" curves. The trained quality prediction Convolutional Neural Network (CNN) 179 outputs the likely total base call quality for cycle 150, the last cycle of read 1. The quality prediction CNN also predicts the base call quality of intermediate cycles, starting at cycle 30 and continuing at intervals of 10 sequencing cycles (e.g., cycles 30, 40, 50, 60) until cycle 150 at the end of the read. The predicted value and the confidence interval of each prediction are indicated by a box.

In an embodiment, an ensemble of three trained quality prediction Convolutional Neural Networks (CNN) 179 is used during production (also referred to as "inference") to predict the likely total base call quality at the target cycles. According to one embodiment, each of the three models is run 100 times, generating as many predicted values. The mean of the 300 predicted values generated by the three quality prediction CNNs is then used as the final prediction, and the standard deviation of the predicted values is used as the confidence interval. Reads with total base call quality values close to the training data may have lower uncertainty, i.e., shorter confidence intervals; predictions far from the training examples may have high uncertainty.
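
The mean-and-standard-deviation aggregation over the ensemble can be sketched as follows; the three "models" below are hypothetical stand-ins (random predictors) for the trained CNNs, used only to show the aggregation step:

```python
import random
import statistics

def ensemble_predict(models, runs_per_model=100):
    """Final prediction = mean over all stochastic predictions from all models;
    the standard deviation of the predictions serves as the confidence interval."""
    preds = [model() for model in models for _ in range(runs_per_model)]
    return statistics.mean(preds), statistics.stdev(preds)

# Three stand-in "trained quality prediction CNNs" (hypothetical): each call
# returns one stochastic %Q>30 prediction for a target cycle.
random.seed(0)
models = [
    lambda: random.gauss(80.0, 2.0),
    lambda: random.gauss(79.0, 2.0),
    lambda: random.gauss(81.0, 2.0),
]
mean_pred, confidence = ensemble_predict(models)  # statistics of 300 predictions
print(round(mean_pred, 1), "+/-", round(confidence, 2))
```

A wider spread among the 300 predictions yields a larger `confidence` value, which is how predictions far from the training distribution show up as longer confidence intervals.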

During training, according to one embodiment, the total base call quality data 119 for each cycle may be the average over the 10 sequencing cycles closest to it. For example, the total base call quality data for cycle 50 is the average of the total base call quality data for cycles 45 through 54. In such embodiments, during production, the quality prediction CNN predicts the average (over 10 cycles) total base call quality for each target cycle. This is because, due to single-cycle fluctuations, it may be difficult to assess the performance of the quality prediction CNN at one particular target cycle; for example, a particular cycle, say cycle 50, may be an outlier while the preceding and following cycles are not. The average over 10 cycles is therefore used to predict the total base call quality at a particular target cycle.
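
The 10-cycle averaging (cycle 50 → mean of cycles 45 through 54) can be sketched as:

```python
def smoothed_q30(q30_per_cycle, target_cycle, window=10):
    """Average the total base call quality over the `window` cycles around the
    target, e.g. target cycle 50 -> mean of cycles 45 through 54.
    Cycles are 1-indexed; q30_per_cycle is a per-cycle %Q>30 list."""
    start = target_cycle - window // 2            # 45 for target 50
    cycles = range(start, start + window)         # 45..54
    return sum(q30_per_cycle[c - 1] for c in cycles) / window

# Hypothetical per-cycle %Q>30 values: constant 90 except a dip at cycle 50.
q30 = [90.0] * 60
q30[49] = 80.0  # cycle 50 dips
print(smoothed_q30(q30, 50))  # 89.0: the single-cycle fluctuation is damped
```

The dip at a single cycle moves the smoothed value by only one tenth of its size, which is the point of using the window as the training target.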

Graph 711 shows an embodiment where the quality prediction CNN 179 predicts the average Q30 score with more confidence for earlier target cycles of read 1 than for later target cycles. Read 2 of this sequencing run has a lower average Q30 score. Although the likely total base call quality scores predicted by CNN 179 are higher than the actual read 2 results, the operator 165 is informed after the first five cycles of read 2 that the likely total base call quality of read 2 will be lower.

Graph 751 illustrates an embodiment where the quality prediction CNN 179 predicts the quality score of the target cycle with high confidence and accuracy.

Using graphs 711 and 751, the operator 165 can decide to continue or terminate a sequencing run after reviewing the quality prediction CNN 179 results early in the life cycles of read 1 and read 2. In one embodiment, the prediction scores and confidence values are presented to the operator 165 at the end of cycle 25 for read 1 and at the end of cycle 5 for read 2.

Fig. 8 shows, in graph 811, a comparison 800 of the true and predicted values of the percentage of base calls above the Q30 score (%Q>30) for cycle 100. Most of the data points lie along the dashed line 821, which means the predicted values are close to the true values. A few data points in the upper left corner of graph 811 indicate predictions that are higher than the true percentage of base calls above the Q30 score. As mentioned above, base calling is a stochastic process, with each cycle involving several chemical processing steps; for these few runs, the prediction of the quality prediction CNN 179 does not approach the true value. However, since the quality prediction CNN predicts Q30 scores for multiple target cycles, the operator 165 can use the predicted values at all target cycles to make decisions regarding the sequencing run.

Graphs 861 and 865 show the performance of the quality prediction convolutional neural network for read 1 and read 2, respectively, of an example sequencing run. The subsystem performance metrics and total base call quality of the first 25 cycles are used to predict the likely total base call quality of read 1 at target cycles 50, 70, 100, 120, and 150. Likewise, the inputs of the first five cycles of read 2 are used to predict the likely total base call quality of read 2 at the same target cycles.

The coefficient of determination, denoted R², is the proportion of the variance in the dependent variable that is predictable from the independent variable. It is a statistical measure of how close the predicted data is to the true data points. An R² of 1 indicates that the regression predictions fit the true data perfectly. Graphs 861 and 865 show how close the model's predictions of the likely total base call quality are to the true values in validation and testing.
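
R² can be computed from the residual and total sums of squares, 1 − SS_res/SS_tot (the %Q>30 values in the example are hypothetical):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)          # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residuals
    return 1 - ss_res / ss_tot

y_true = [75.0, 80.0, 85.0, 90.0]   # hypothetical true %Q>30 at four target cycles
print(r_squared(y_true, y_true))                        # 1.0: perfect fit
print(r_squared(y_true, [76.0, 79.0, 86.0, 89.0]))      # ~0.968: close fit
```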

Training and reasoning for quality-predictive convolutional neural networks

FIG. 9 shows schematic diagrams 911 and 961 of a training and production deployment 900 of the quality prediction Convolutional Neural Network (CNN), according to one embodiment. During training, subsystem performance data and total base call quality scores from the training database 161 are provided as inputs to the quality prediction CNN 171. Each quality prediction CNN comprises multiple layers, as shown in fig. 3 for an input with one channel and in fig. 4 for an input with four channels. In one embodiment, a different quality prediction CNN is trained for each input (i.e., each subsystem performance time series and the total base call quality time series). In another embodiment, a single quality prediction CNN is trained on all inputs. In one embodiment, the output of the quality prediction CNN is the likely total base call quality at a target cycle in a read of a sequencing run. The output is compared with the ground truth base call quality at the target cycle. In one embodiment, the ground truth value is the 10-cycle average base call quality of the read, as discussed above. The weights of the quality prediction CNN are updated using the prediction error computed between the output and the ground truth value, so that the output moves closer to the ground truth.

The trained quality prediction CNNs are deployed into a production environment, where they receive production data from the pre-prediction cycles of a read in a sequencing run of the sequencing instrument 111. During production (or inference), the quality prediction CNN generates a likely total base call quality score for a target cycle in the post-prediction base call process cycles. The operator 165 can then compare the likely total base call quality score of the read with the base call quality required for downstream data analysis. If the likely base call quality score of the post-prediction cycles is lower than the required base call quality, the system alerts the operator 165, who can abort the sequencing run.
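
The alert decision at the end of this pipeline can be sketched as a simple threshold check (the function name, message text, and the 75% threshold from the HiSeqX example above are illustrative, not part of the source system):

```python
def check_run_quality(predicted_q30, required_q30):
    """Compare the predicted %Q>30 for a target cycle against the quality
    required for downstream analysis; return an alert if it falls short,
    so the operator can decide to abort the sequencing run early."""
    if predicted_q30 < required_q30:
        return ("ALERT: predicted quality %.1f%% is below required %.1f%%"
                " - consider aborting the run" % (predicted_q30, required_q30))
    return "OK: predicted quality meets the requirement"

# Hypothetical predictions against a 75% specification:
print(check_run_quality(68.2, 75.0))  # triggers the alert
print(check_run_quality(82.5, 75.0))  # run may continue
```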

Computer system

FIG. 10 is a simplified block diagram of a computer system 1000, the computer system 1000 being usable to implement the machine learning system 151 of FIG. 1 to make early predictions of base recognition quality during an extended optical base recognition process. A similar computer system 1000 may implement a machine learning system 159 for production or reasoning. The computer system 1000 includes at least one Central Processing Unit (CPU)1072, which communicates with a number of peripheral devices via a bus subsystem 1055. These peripheral devices may include a storage subsystem 1010, storage subsystem 1010 including, for example, a memory device and file storage subsystem 1036, a user interface input device 1038, a user interface output device 1076, and a network interface subsystem 1074. Input and output devices allow a user to interact with computer system 1000. Network interface subsystem 1074 provides an interface to external networks, including an interface to corresponding interface devices in other computer systems.

In one embodiment, the machine learning system 151 of fig. 1 is communicatively connected to a storage subsystem 1010 and a user interface input device 1038.

The user interface input device 1038 may include: a keyboard; a pointing device such as a mouse, trackball, touchpad, or graphics board; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into computer system 1000.

User interface output devices 1076 may include a display subsystem, a printer, a facsimile machine, or a non-visual display such as an audio output device. The display subsystem may include an LED display, a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide a non-visual display, such as an audio output device. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 1000 to a user or to another machine or computer system.

Storage subsystem 1010 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are typically executed by the deep learning processor 1078.

The deep learning processor 1078 may be a Graphics Processing Unit (GPU) or a Field Programmable Gate Array (FPGA). The deep learning processor 1078 may be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, or Cirrascale™. Examples of deep learning processors 1078 include Google's Tensor Processing Unit (TPU)™, rackmount solutions such as the GX4 Rackmount Series™ and GX8 Rackmount Series™, NVIDIA's DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processing Unit (IPU)™, Qualcomm's Snapdragon processors™ with the Zeroth platform™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 module™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, and the like.

The memory subsystem 1022 used in the storage subsystem 1010 may include a number of memories including a main Random Access Memory (RAM)1032 for storing instructions and data during program execution and a Read Only Memory (ROM)1034 where fixed instructions are stored. File storage subsystem 1036 may provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. Modules implementing the functionality of certain embodiments may be stored by file storage subsystem 1036 in storage subsystem 1010, or in other machines accessible by the processor.

Bus subsystem 1055 provides a mechanism for allowing the various components and subsystems of computer system 1000 to communicate with one another as desired. Although bus subsystem 1055 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple buses.

The computer system 1000 itself may be of various types, including a personal computer, portable computer, workstation, computer terminal, network computer, television, mainframe, server farm, widely distributed group of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1000 depicted in FIG. 10 is intended only as a specific example for purposes of illustrating the preferred embodiments of the present invention. Many other configurations of computer system 1000 may have more or fewer components than the computer system shown in FIG. 10.

Description of the preferred embodiments

The disclosed technology relates to early prediction of base recognition quality during extended optical base recognition processes.

The disclosed technology may be implemented as a system, a method, or an article of manufacture. One or more features of an embodiment may be combined with a base embodiment. Embodiments that are not mutually exclusive are taught to be combinable; one or more features of one embodiment may be combined with other embodiments. The present disclosure periodically reminds the user of these options. Omission of a recitation of these options from some embodiments should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated by reference into each of the embodiments below.

A first system embodiment of the disclosed technology includes one or more processors coupled to a memory. The memory is loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Each base recognition process cycle includes (a) chemical processing that attaches an additional complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on image patches of the substrate, and (c) image acquisition on the image patches. When executed on the processors, the computer instructions provide a plurality of time series from the pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series.
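As an illustrative sketch only (not part of the claimed embodiments), the four per-cycle time series can be assembled into the channels-by-cycles array that a one-dimensional convolutional network typically consumes; the cycle count and the random metric values here are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical example: 25 pre-prediction cycles, four per-cycle metrics
# (chemistry performance, registration offsets, acquisition focus/contrast,
# and total base recognition quality).
num_pre_cycles = 25
rng = np.random.default_rng(0)
chemistry = rng.random(num_pre_cycles)
registration = rng.random(num_pre_cycles)
acquisition = rng.random(num_pre_cycles)
quality = rng.random(num_pre_cycles)

# Stack the series channel-wise: shape (channels, cycles), the layout a
# 1-D convolutional network typically expects for multivariate time series.
cnn_input = np.stack([chemistry, registration, acquisition, quality])
print(cnn_input.shape)  # → (4, 25)
```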

The system trains the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series. The trained convolutional neural network determines, from the pre-prediction base recognition process cycles, the likely total base recognition quality expected after post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Finally, the system outputs the likely total base recognition quality for evaluation by an operator.
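The core operation such a trained network applies along the cycle axis of each input series is a one-dimensional convolution. A minimal sketch follows, with a hypothetical three-cycle averaging kernel standing in for learned weights:

```python
import numpy as np

def conv1d(series, kernel):
    """Valid-mode 1-D convolution: slide the kernel along the cycle axis,
    the core operation a convolutional network applies to a time series."""
    k = len(kernel)
    return np.array([np.dot(series[i:i + k], kernel)
                     for i in range(len(series) - k + 1)])

cycles = np.arange(10, dtype=float)   # stand-in per-cycle quality series
smoother = np.ones(3) / 3.0           # hypothetical 3-cycle averaging kernel
print(conv1d(cycles, smoother))       # moving average over cycle windows
```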

Embodiments of this system and other systems disclosed optionally include one or more of the following features. The system may also include features described in connection with the disclosed computer-implemented methods. For the sake of brevity, alternative combinations of system features are not enumerated individually. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class's set of base features. The reader will understand how the features identified in this section can readily be combined with the base features in other statutory classes.

The system includes representing chemical processing performance by phasing and prephasing error estimates in the chemical processing subsystem performance time series. The system represents image registration performance by reports of x and y image offset adjustments after image capture in the image registration subsystem performance time series. The system also represents image acquisition performance by focus and contrast reports in the image acquisition subsystem performance time series. In such embodiments, the system represents focus by the narrowness of the full width at half maximum of individual clusters in a cluster image. In one such embodiment of the system, the contrast includes a minimum contrast calculated as the 10th percentile for each channel of a tile of images. In another such embodiment of the system, the contrast includes a maximum contrast calculated as the 99.5th percentile for each channel of a tile of images.
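Following the percentile definitions above, a per-channel contrast summary can be sketched as follows; the tile dimensions, the four-channel layout, and the intensity range are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical tile of images: 4 channels (e.g., one per nucleotide),
# 512 x 512 pixels, 12-bit intensities.
rng = np.random.default_rng(0)
tile = rng.integers(0, 4096, size=(4, 512, 512))

# Minimum contrast: 10th percentile per channel over all pixels.
min_contrast = np.percentile(tile, 10, axis=(1, 2))
# Maximum contrast: 99.5th percentile per channel over all pixels.
max_contrast = np.percentile(tile, 99.5, axis=(1, 2))
print(min_contrast.shape, max_contrast.shape)  # one value per channel
```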

In one embodiment of the system, the image acquisition performance further includes a cluster intensity report in the image acquisition subsystem performance time series. In such embodiments, the system reports cluster intensity as the 90th percentile of the intensities of the imaged clusters. In one embodiment of the system, the base recognition process includes post-prediction base recognition process cycles numbering 2 to 25 times the pre-prediction cycles. In one embodiment of the system, the base recognition process includes 20 to 50 pre-prediction base recognition process cycles. In one embodiment of the system, the base recognition process includes 100 to 500 post-prediction base recognition process cycles.

In one embodiment, during the post-prediction base recognition process cycles, the system determines likely total base recognition quality at five or more intermediate cycle counts from the pre-prediction base recognition process cycles. After each determination, the system outputs the intermediate likely total base recognition quality. In one embodiment of the system, the total base recognition quality is calculated as a Phred quality score. In another embodiment of the system, the total base recognition quality is calculated as a Sanger quality score.
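A Phred quality score relates a base-recognition error probability P to a score Q = -10 · log10(P), so Q30 corresponds to a 1-in-1000 chance of a wrong base identification. A minimal sketch:

```python
import math

def phred_quality(error_probability):
    """Phred quality score: Q = -10 * log10(P_error).
    Q20 = 1-in-100 error chance, Q30 = 1-in-1000."""
    return -10.0 * math.log10(error_probability)

print(phred_quality(0.001))  # → 30.0
print(phred_quality(0.01))   # → 20.0
```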

A second system embodiment of the disclosed technology includes one or more processors coupled to a memory. The memory is loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process that includes paired-read sequences, each read including pre-prediction base recognition process cycles and post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Each base recognition process cycle includes: (a) chemical processing that attaches an additional complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on image patches of the substrate, and (c) image acquisition on the image patches. The system includes providing a plurality of time series from the second read's pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series. The system also provides the first read's total base recognition quality time series to the trained convolutional neural network.

The system trains the convolutional neural network using base recognition quality experience that includes the plurality of time series for the second read's pre-prediction base recognition process cycles, the second read's post-prediction total base recognition quality time series, and the first read's total base recognition quality time series. The trained convolutional neural network uses the second read's pre-prediction base recognition process cycles and the first read's total base recognition quality time series to determine the likely total base recognition quality of the second read expected after post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Finally, the system outputs the likely total base recognition quality of the second read for evaluation by an operator. In this embodiment of the system, the first read precedes the second read and includes base recognition of the sequenced molecules in the forward direction. The second read includes base recognition of the sequenced molecules in the reverse direction.
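A sketch of the paired-read input assembly described above, concatenating the first read's quality series with the second read's four pre-prediction series along the channel axis; the shapes, alignment, and random values are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical paired-read input: the second read contributes four
# per-cycle time series over its pre-prediction cycles; the first read
# contributes its total quality series, here assumed already aligned
# (or resampled) to the same number of points.
pre_cycles = 25
rng = np.random.default_rng(0)
read2_series = rng.random((4, pre_cycles))   # four second-read metrics
read1_quality = rng.random(pre_cycles)       # first-read quality series

# Add the first read's series as a fifth input channel.
model_input = np.concatenate([read2_series, read1_quality[None, :]])
print(model_input.shape)  # → (5, 25)
```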

Each of the features discussed in this particular implementation section for the first system embodiment applies equally to the second system embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.

Other embodiments may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the functions of the system described above. Another embodiment may include a computer-implemented method of performing the functions of the system described above.

A first computer-implemented method embodiment of the disclosed technology includes early prediction of base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Each base recognition process cycle includes: (a) chemical processing that attaches an additional complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on image patches of the substrate, and (c) image acquisition on the image patches. The method includes providing a plurality of time series from the pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series.

The computer-implemented method further includes training the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series. The trained convolutional neural network determines, from the pre-prediction base recognition process cycles, the likely total base recognition quality expected after post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Finally, the method outputs the likely total base recognition quality for evaluation by an operator.

Each of the features discussed in this particular implementation section for the first system embodiment applies equally to this computer-implemented method embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.

Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the first computer-implemented method described above. Another embodiment may include a system comprising a memory and one or more processors operable to execute instructions stored in the memory to perform the first computer-implemented method described above.

Computer-readable medium (CRM) embodiments of the disclosed technology include a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed on a processor, implement the above-described computer-implemented method.

Each of the features discussed in this particular implementation section for the first system embodiment applies equally to the CRM embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.

A second computer-implemented method embodiment of the disclosed technology includes early prediction of base recognition quality during an extended optical base recognition process that includes paired-read sequences. Each read includes pre-prediction base recognition process cycles and post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Each base recognition process cycle includes: (a) chemical processing that attaches an additional complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on image patches of the substrate, and (c) image acquisition on the image patches. The method includes providing a plurality of time series from the second read's pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series. The method also includes providing the first read's total base recognition quality time series to the trained convolutional neural network.

The computer-implemented method includes training the convolutional neural network using base recognition quality experience that includes the plurality of time series for the second read's pre-prediction base recognition process cycles, the second read's post-prediction total base recognition quality time series, and the first read's total base recognition quality time series. The trained convolutional neural network uses the second read's pre-prediction base recognition process cycles and the first read's total base recognition quality time series to determine the likely total base recognition quality of the second read expected after post-prediction base recognition process cycles at least twice as numerous as the pre-prediction cycles. Finally, the method outputs the likely total base recognition quality of the second read for evaluation by an operator. In this second computer-implemented method embodiment, the first read precedes the second read and includes base recognition of the sequenced molecules in the forward direction. The second read includes base recognition of the sequenced molecules in the reverse direction.

Each of the features discussed in this particular implementation section for the first system embodiment applies equally to this method embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.

Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the computer-implemented method described above. Another embodiment may include a system comprising a memory and one or more processors operable to execute instructions stored in the memory to perform the computer-implemented method described above.

Computer-readable medium (CRM) embodiments of the disclosed technology include a non-transitory computer-readable storage medium storing computer program instructions that, when executed on a processor, implement the second computer-implemented method described above.

Each of the features discussed in this particular implementation section for the first system embodiment applies equally to the CRM embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.

The above description is provided to enable the manufacture and use of the disclosed technology. Various modifications to the disclosed embodiments will be readily apparent, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the disclosed technology is defined by the following claims.
