Predicting quality of sequencing results using deep neural networks
Reading note: this technology, Predicting quality of sequencing results using deep neural networks, was created by A. Dutta and A. Kia on 2019-01-04. Its main content is as follows. The disclosed technology predicts base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and at least two times as many post-prediction base recognition process cycles as pre-prediction cycles. A plurality of time series from the pre-prediction base recognition process cycles is provided as input to a trained convolutional neural network. From the pre-prediction cycles, the convolutional neural network determines the likely total base recognition quality expected after the post-prediction cycles. When the base recognition process includes paired reads of a sequence, the total base recognition quality time series of the first read is provided as an additional input to the convolutional neural network, to determine the likely total base recognition quality after the post-prediction cycles of the second read.
1. A computer-implemented method for early prediction of base recognition quality during an extended optical base recognition process comprising pre-prediction base recognition process cycles and at least two times as many post-prediction base recognition process cycles as pre-prediction cycles, wherein each base recognition process cycle comprises: (a) chemical processing that adds a further complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch, the method comprising:
inputting a plurality of time series from the pre-prediction base recognition process cycles into a trained convolutional neural network, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series;
wherein the trained convolutional neural network is trained using base recognition quality experience comprising the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series;
the trained convolutional neural network determining, from the pre-prediction base recognition process cycles, a likely total base recognition quality expected after the at least two times as many post-prediction base recognition process cycles; and
outputting the likely total base recognition quality for evaluation by an operator.
2. The computer-implemented method of claim 1, wherein chemical processing performance is represented in the chemical processing subsystem performance time series by estimates of phasing and prephasing errors.
3. The computer-implemented method of any of claims 1 to 2, wherein image registration performance is represented in the image registration subsystem performance time series by a report of post-image capture x and y image offset adjustments.
4. The computer-implemented method of any of claims 1 to 3, wherein image acquisition performance is represented in the image acquisition subsystem performance time series by focus and contrast reports.
5. The computer-implemented method of claim 4, wherein the focus is represented by a narrowness of a full width at half maximum of each cluster in the cluster image.
6. The computer-implemented method of claim 4, wherein the contrast comprises a minimum contrast calculated as the 10th percentile for each channel of a column of images.
7. The computer-implemented method of claim 4, wherein the contrast comprises a maximum contrast calculated as the 99.5th percentile for each channel of a column of images.
8. The computer-implemented method of claim 4, wherein the image acquisition subsystem performance time series further reports cluster intensity.
9. The computer-implemented method of claim 8, wherein the cluster intensity is reported at the 90th percentile of the intensities of the imaged clusters.
10. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 3 to 25 times as many post-prediction base recognition process cycles as pre-prediction cycles.
11. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 2 to 50 times as many post-prediction base recognition process cycles as pre-prediction cycles.
12. The computer-implemented method of any of claims 1 to 9, wherein the base recognition process comprises 20 to 50 pre-prediction base recognition process cycles.
13. The computer-implemented method of any one of claims 1 to 9, wherein the base recognition process comprises 100 to 500 post-prediction base recognition process cycles.
14. The computer-implemented method of claim 1, further comprising determining, from the pre-prediction base recognition process cycles, likely total base recognition qualities for at least five intermediate cycle counts during the post-prediction base recognition process cycles, and outputting the intermediate likely total base recognition quality determinations.
15. A computer-implemented method for early prediction of base recognition quality during an extended optical base recognition process, the extended optical base recognition process comprising paired reads of a sequence, each read comprising pre-prediction base recognition process cycles and at least two times as many post-prediction base recognition process cycles as pre-prediction cycles, each base recognition process cycle comprising: (a) chemical processing that adds a further complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch, the method comprising:
inputting into a trained convolutional neural network:
a plurality of time series from the pre-prediction base recognition process cycles of a second read, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series, and
a total base recognition quality time series of a first read;
wherein the trained convolutional neural network is trained using base recognition quality experience comprising the plurality of time series for the pre-prediction base recognition process cycles of the second read, a post-prediction total base recognition quality time series of the second read, and the total base recognition quality time series of the first read;
the trained convolutional neural network determining, from the pre-prediction base recognition process cycles of the second read and the total base recognition quality time series of the first read, a likely total base recognition quality of the second read expected after the at least two times as many post-prediction base recognition process cycles; and
outputting the likely total base recognition quality of the second read for evaluation by an operator.
16. A system comprising one or more processors coupled to a memory loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process comprising pre-prediction base recognition process cycles and at least two times as many post-prediction base recognition process cycles as pre-prediction cycles, wherein each base recognition process cycle comprises: (a) chemical processing that adds a further complementary nucleotide to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch; the instructions, when executed on the processors, perform operations comprising:
inputting a plurality of time series from the pre-prediction base recognition process cycles into a trained convolutional neural network, the plurality of time series comprising a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series;
wherein the trained convolutional neural network is trained using base recognition quality experience comprising the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series;
the trained convolutional neural network determining, from the pre-prediction base recognition process cycles, a likely total base recognition quality expected after the at least two times as many post-prediction base recognition process cycles; and
outputting the likely total base recognition quality for evaluation by an operator.
17. The system of claim 16, wherein chemical processing performance is represented in the chemical processing subsystem performance time series by estimates of phasing and prephasing errors.
18. A non-transitory computer readable medium having computer executable instructions for implementing the neural network-based early prediction of base recognition quality as claimed in any one of claims 1 to 15.
19. A computer system running on a number of parallel processors adapted to perform the computer-implemented method of any of claims 1 to 15.
Technical Field
The disclosed technology relates to artificial intelligence type computers and digital data processing systems, and to corresponding data processing methods and products for emulation of intelligence, including machine learning systems and artificial neural networks. In particular, the disclosed technology relates to using deep learning and deep convolutional neural networks to analyze ordered data.
Background
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
Various protocols in biological or chemical research involve performing a large number of controlled reaction cycles. Some DNA sequencing protocols, such as sequencing-by-synthesis (SBS), detect light emissions from an array of reaction sites. In SBS, a plurality of fluorescently labeled nucleotides are used to sequence the nucleic acids of a large number of amplified DNA clusters (or clonal populations) located on the surface of a substrate. For example, the surface may define a channel in a flow cell. The nucleic acid sequences in the different clusters are determined by running hundreds of cycles in which fluorescently labeled nucleotides are added to the clusters and then excited by a light source to produce light emissions.
Although SBS is an effective technique for determining nucleic acid sequences, an SBS run may take three days or more to complete. Some runs fail due to quality issues. Reliably predicting the final quality of a sequencing run after only a few cycles would benefit users of sequencing instruments, allowing them to stop a failing run after half a day or less. The operator of a sequencing instrument cannot predict the final quality of a sequencing run in advance.
Fortunately, a large amount of subsystem performance data has been collected for performing troubleshooting. This subsystem data can be combined and used to predict the total base identification quality at the end of a sequencing read or run, and at intervals during the read. By using subsystem performance indicators reported early in the run, the trained deep neural network can predict the likely total base recognition quality.
Drawings
The drawings are included for illustrative purposes and are used only to provide examples of possible structures and process operations of one or more embodiments of the present disclosure. These drawings in no way limit any changes in form and detail that may be made to the disclosure by one skilled in the art without departing from the spirit and scope of the disclosure. A more complete understanding of the subject matter may be derived by referring to the detailed description and claims when considered in conjunction with the following figures, wherein like reference numbers refer to similar elements throughout the figures.
Fig. 1 shows an architecture level schematic of a system in which a machine learning system including a quality prediction convolutional neural network predicts the total base recognition quality of sequencing data generated by a sequencing system.
Fig. 2 shows the subsystem performance data and total base recognition quality data stored per cycle in the sequencing quality database of Fig. 1.
Fig. 3 illustrates the processing of an input having one channel by different layers of the quality prediction convolutional neural network of Fig. 1.
Fig. 4 illustrates the processing of an input having four channels by different layers of the quality prediction convolutional neural network of Fig. 1.
Fig. 5 shows an example of subsystem performance data and total base recognition quality data stored in the sequencing quality database of Fig. 1.
Fig. 6 shows a graphical representation of total base recognition quality data for two reads of an example sequencing run.
Fig. 7 shows total base recognition quality data for two reads of two example sequencing runs, indicating the predicted total base recognition quality at different target cycles.
Fig. 8 shows example data for predicted and true total base recognition quality at a target cycle, and a graph comparing validation data and test data at intermediate target cycles.
Fig. 9 shows an example of an architecture level schematic of the quality prediction convolutional neural network of Fig. 1 in training and production.
Fig. 10 is a simplified block diagram of a computer system that may be used to implement the machine learning system of Fig. 1.
Detailed Description
The following detailed description is made with reference to the accompanying drawings. Example embodiments are described to illustrate the disclosed technology and not to limit its scope (as defined by the claims). Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows.
Introduction
The quality of base recognition is a measure of the success or failure of sequencing the nucleotides in a DNA or RNA molecule. Sequencing-by-synthesis (SBS) is a sequencing technique that adds complementary nucleotides, one at a time, to a fragment of the nucleotide sequence of the DNA to be sequenced. Optical platforms using SBS can sequence billions of clusters of nucleotide sequence fragments (sometimes referred to as molecules) arranged on a slide or flow cell in multiple lanes, each lane being divided into tiles (also referred to as patches). A molecular cluster consists of clones of one molecule. Cloning the molecule amplifies the signal generated during SBS.
Sequencing the nucleotides in a molecule requires hundreds of cycles. The clonal clusters are prepared for the SBS process before the cycles begin. Each cycle comprises chemical operations, image capture operations, and image processing operations. The chemical operations are designed to add one dye-labeled complementary nucleotide to each molecule in each cluster per cycle. When a molecule falls behind or runs ahead of the other molecules in its cluster, it is out of phase; falling behind is referred to as phasing and running ahead as prephasing. The image capture operations involve aligning the camera with a tile in a lane, illuminating the tile, and capturing one to four images. Image processing produces base recognition, that is, identification of the complementary nucleotide added to the molecules in a cluster during the cycle. Dye chemistry, illumination, camera design, and the number of images captured vary across sequencing platforms. The sequencing instrument may provide subsystem performance metrics for chemistry, camera positioning or registration, image capture or acquisition, and overall base recognition quality.
Sequencing a 350-nucleotide molecule by SBS can involve 300 or more processing cycles in a run. The run is divided into two reads that proceed from the 3' and 5' ends of the same sequence fragment. When the number of cycles is less than the length of the molecule, an unsequenced region remains in the middle of the molecule after the reads from the 3' and 5' ends are complete.
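The size of the unsequenced middle region follows from simple arithmetic. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes the 300 cycles are split evenly into two reads of 150 cycles, which the text does not state explicitly at this point.

```python
def unsequenced_gap(molecule_length: int, cycles_per_read: int) -> int:
    """Bases left unread in the middle after two reads from opposite ends.

    Assumes each cycle reads exactly one base and both reads have the
    same number of cycles (an assumption made for this illustration).
    """
    if molecule_length < 0 or cycles_per_read < 0:
        raise ValueError("lengths must be non-negative")
    return max(0, molecule_length - 2 * cycles_per_read)


# A 350-nucleotide molecule read for 150 cycles from each end.
print(unsequenced_gap(350, 150))
```

When the two reads together cover the whole molecule, the function returns zero rather than a negative number.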
Sequencing the human genome requires sequencing many DNA fragment molecules in parallel, since the human genome comprises approximately 3 billion base pairs. These base pairs are organized in 23 pairs of human chromosomes that are replicated in each cell. A run of 300 cycles, followed by the processing that combines partial sequences into a whole genome, may take three days or more to complete. Some runs fail due to quality issues. Reliably predicting the final quality of a sequencing run after only a few cycles would benefit users of sequencing instruments, allowing them to stop a failing run after half a day or less.
The operator of the sequencing instrument cannot predict in advance the final quality of the sequencing run. Fortunately, a large amount of subsystem performance data has been collected for performing troubleshooting. This subsystem data can be combined and used to predict the total base identification quality at the end of a sequencing read or run, and at intervals during the read. By using subsystem performance indicators reported early in the run, the trained deep neural network can predict the likely total base recognition quality.
When a run includes two reads from the two ends of the molecule, a similar and even earlier prediction can be made for the second read. Because the second read immediately follows the first, data from late in the first read can be combined with data from early in the second read. This can significantly reduce the number of second-read cycles required. For example, if subsystem performance data for 25 cycles is used during the first read, it is sufficient to combine only 5 cycles of data from the second read with 20 cycles of data from the first read. Separate predictions can be made for the quality of the first read and of the second read.
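One way to picture the 20 + 5 combination is as a fixed-length window that takes the tail of the first read and the head of the second. The following Python sketch is an illustration under assumptions: the function name, argument names, and the use of plain lists are hypothetical and are not taken from the disclosure.

```python
def second_read_input(first_read, second_read, window=25, second_cycles=5):
    """Build a fixed-length input window for predicting the second read.

    Takes the last (window - second_cycles) per-cycle values of the first
    read followed by the first second_cycles values of the second read.
    """
    if not 0 < second_cycles < window:
        raise ValueError("second_cycles must be between 1 and window - 1")
    from_first = window - second_cycles
    if len(first_read) < from_first or len(second_read) < second_cycles:
        raise ValueError("not enough cycles recorded yet")
    return list(first_read[-from_first:]) + list(second_read[:second_cycles])


# Toy data: cycle indices stand in for per-cycle metric values.
first = list(range(150))          # first read, cycles 0..149
second = list(range(1000, 1010))  # first 10 cycles of the second read
window = second_read_input(first, second)
print(len(window), window[0], window[19], window[20], window[-1])
```

With the defaults, the window holds 20 values from the end of the first read and 5 from the start of the second, matching the example in the text.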
Environment
We describe a system for early prediction of base recognition quality in an extended optical base recognition process. A DNA molecule contains four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Base recognition, commonly known as base calling, refers to the process of determining the nucleotide base (A, C, G, or T) of each cluster of DNA molecules in one cycle of a sequencing run. The system is described with reference to Fig. 1, which shows an architecture level schematic of the system according to one embodiment. Because Fig. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description. The discussion of Fig. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then, the use of the elements in the system is described in more detail.
Fig. 1 includes a
In SBS, a laser is used to illuminate the dye-labeled complementary nucleotides attached to each molecule in each cluster in each cycle. The camera takes an image of the patch, which is then processed to identify the nucleotide (A, C, G, T) attached to the molecules in each cluster. Some sequencing systems use four channels to identify the four types of nucleotides (A, C, G, T) attached to the molecules in each cycle. In such a system, four images are generated, each containing signals of a single, distinct color. The four colors correspond to the four possible nucleotides present at a particular position. In another sequencing system, two channels are used to identify the four types of nucleotides (A, C, G, T). In such a system, two images are taken per cycle. A first nucleotide type is detected in the first channel, a second nucleotide type is detected in the second channel, a third nucleotide type is detected in both the first and second channels, and a fourth nucleotide type, which lacks a dye label, is not detected, or is only minimally detected, in either channel.
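The two-channel scheme amounts to a lookup on which channels show a signal. The Python sketch below illustrates that idea only: the assignment of specific bases to channel patterns and the detection threshold are hypothetical, since the text speaks only of a first, second, third, and fourth nucleotide type.

```python
# Hypothetical channel-to-base assignment, chosen for illustration only.
TWO_CHANNEL_CODE = {
    (True, False): "A",   # first type: detected in channel 1 only
    (False, True): "C",   # second type: detected in channel 2 only
    (True, True): "T",    # third type: detected in both channels
    (False, False): "G",  # fourth type: unlabeled, dark in both channels
}


def call_base(channel1: float, channel2: float, threshold: float = 0.5) -> str:
    """Map two normalized channel intensities to a base.

    The threshold is an assumed cutoff between 'detected' and
    'not or minimally detected'.
    """
    return TWO_CHANNEL_CODE[(channel1 > threshold, channel2 > threshold)]


print(call_base(0.9, 0.1), call_base(0.1, 0.9), call_base(0.9, 0.9), call_base(0.1, 0.1))
```

A production base caller would work on cluster intensity distributions rather than a fixed cutoff; the table is the part that mirrors the description.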
The
In one embodiment of the
In another embodiment of the
A trained quality prediction convolutional
Sequencing quality data
Fig. 2 shows the sequencing quality indicators 213 stored in the
Sequencing quality data 219 for each cycle shows subsystem performance indicators at a lower abstraction level. Chemical processing
The term "prephasing" refers to the situation where one molecule is at least one base ahead of the other molecules in the same cluster. One cause of prephasing is the incorporation of an unterminated nucleotide, followed by the incorporation of a second nucleotide in the same sequencing cycle. The sequencing quality data 219 includes a phasing metric and a prephasing metric for the pre-prediction cycles of a read. In an embodiment of
In embodiments, the
Alternatively, the second cluster formation technique used by sequencing
During each cycle of the SBS technique, the four complementary nucleotides (A, C, G, T) are delivered simultaneously to the molecular clusters on the patches, which are arranged in lanes on the flow cell. Each nucleotide has a spectrally distinct label attached to it. A laser is used to illuminate the dye-labeled complementary nucleotides attached to each molecule in each cluster in each cycle. The camera takes an image of the patch, which is then processed to identify the nucleotide (A, C, G, T) attached to the molecules in each cluster. Some sequencing systems use four channels to identify the four types of nucleotides (A, C, G, T) attached to the molecules in each cycle. In such a system, four images are obtained, each using a detection channel that is selective for one of the four different labels, so that each image contains signals of a single, distinct color. The four colors correspond to the four possible nucleotides present at a particular position. The identified labels are then used to recognize the base for each cluster. In this embodiment, the "x" offset adjustment R_n1 and the "y" offset adjustment R_n2 for cycle "n" are provided as inputs to the machine learning system, one value per channel.
In another type of sequencing system, two channels are used to identify the four complementary nucleotides (A, C, G, T) attached to a molecule. In such a system, two images are taken per cycle. A first nucleotide type is detected in the first channel, a second nucleotide type is detected in the second channel, a third nucleotide type is detected in both the first and second channels, and a fourth nucleotide type, which lacks a dye label, is not detected, or is only minimally detected, in either channel. As described above, the sequencing quality data 219 includes image registration subsystem performance data for the first "n" cycles of a read of the sequencing run (also referred to as the pre-prediction cycles of the read).
Image acquisition subsystem performance data includes, for the first "n" cycles of a sequencing run, the focus score A_n1, the minimum contrast metric A_n2, the maximum contrast metric A_n3, and the intensity metric A_n4. The focus score is defined as the average full width at half maximum (FWHM) of the molecular clusters, which represents their approximate size in pixels. The minimum and maximum contrast values are the 10th and 99.5th percentiles, respectively, of each channel of a selected column of the raw image. The selected column may be a particular patch or lane of the flow cell. The process of determining an intensity value for each cluster in the template of a given sequencing image is referred to as intensity extraction. To extract the intensity, a background for the cluster is calculated using the portion of the image containing the cluster. The background signal is subtracted from the signal of the cluster to determine the intensity. The 90th percentile of the extracted intensities is stored in the sequencing quality data 219. Image acquisition subsystem performance data for the first "n" pre-prediction cycles of a read of the sequencing run is stored in the sequencing quality database 219. In one embodiment, each image acquisition subsystem performance data value includes four values corresponding to the four channels discussed above.
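The contrast and intensity metrics are all percentiles, so they can be illustrated with one helper. This is a minimal Python sketch under assumptions: the function names are hypothetical, the percentile uses linear interpolation between ranks (the text does not say which interpolation the instrument uses), and the maximum contrast default of 99.5 follows claim 7.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    rank = (len(data) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)


def contrast_metrics(channel_pixels, low_q=10, high_q=99.5):
    """Minimum and maximum contrast for one channel of an image column."""
    return percentile(channel_pixels, low_q), percentile(channel_pixels, high_q)


def intensity_metric(cluster_intensities, q=90):
    """Reported cluster intensity: the 90th percentile of extracted values."""
    return percentile(cluster_intensities, q)


pixels = list(range(201))  # toy pixel values 0..200
print(contrast_metrics(pixels), intensity_metric(pixels))
```

In a four-channel system these functions would be applied once per channel, giving four values per metric per cycle.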
The total base
Run quality can be measured by the %Q>30 value. A higher %Q>30 value represents a higher number of bases that can be reliably used for downstream data analysis. Each sequencing system of Illumina Inc. has an expected %Q>30 specification. For example, for the HiSeqX™ system, on average at least 75% of bases are expected to be above Q30 for sequencing reads that are 150-nucleotide (also referred to as 150-base) paired-end reads. In one embodiment of the
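For readers unfamiliar with the Q scale, a Phred quality score Q corresponds to an error probability p through Q = -10 log10(p), so Q30 means an estimated error rate of 1 in 1000. The Python sketch below computes a %Q30-style summary from a list of per-base scores. The function names are hypothetical, and counting scores at or above the threshold (rather than strictly above it) is an assumption.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred quality score for a given base recognition error probability."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def percent_q30(q_scores, threshold=30):
    """Percentage of bases whose quality score meets the threshold."""
    if not q_scores:
        raise ValueError("q_scores must not be empty")
    passing = sum(1 for q in q_scores if q >= threshold)
    return 100.0 * passing / len(q_scores)


print(round(phred_quality(0.001), 6))  # error rate of 1 in 1000
print(percent_q30([35, 32, 28, 30]))   # three of four bases pass
```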
Quality prediction convolutional neural network
Fig. 3 shows the layers of the quality prediction convolutional neural network (CNN) 300 of Fig. 1. Fig. 3 is an embodiment with two convolutional layers. The network may have one to five convolutional layers; in other embodiments, it may have more than five. One way to analyze the output of the convolutions is through a fully connected (FC) network. Accordingly, in the final layers of the quality prediction CNN, the output of the convolutional layers is provided to an FC network. The fully connected layers may be implemented as a multilayer perceptron having two to five layers. One output from the FC network can be used to predict the likely total base recognition quality expected at a particular target cycle among the post-prediction cycles of a read. In such an embodiment, a separate machine learning system is trained to predict the likely total base recognition quality expected at each target cycle. In an alternative embodiment of the machine learning system, multiple outputs from the FC network can be used to predict the total base recognition quality expected at multiple post-prediction target cycles.
In Fig. 3, the dimensions of the input to each layer of the quality prediction CNN are shown in parentheses. As mentioned above, some inputs to the quality prediction CNN have one channel, while others may have four channels. The example quality prediction CNN shown in Fig. 3 is for an input with one channel. The dimensions of the input time series indicate 25 input values, each of which is one-dimensional (311). This input can be thought of as a one-dimensional vector of 25 real numbers. The 25 values correspond to a particular subsystem performance time series, for example, the chemical processing subsystem performance time series or the total base recognition quality time series. As described above, both of these inputs have one channel per cycle. Each input is convolved independently. The input is then passed through a batch normalization layer at block 321.
In a convolutional neural network (CNN), the distribution of each layer's inputs changes during training and differs from layer to layer. This reduces the convergence speed of the optimization algorithm. Batch normalization (Ioffe and Szegedy, 2015) is a technique that addresses this problem. With the input to the batch normalization layer denoted by x and its output by z, batch normalization applies the following transformation to x:
z = γ · (x − μ) / σ + β
Batch normalization applies mean-variance normalization to the input x using μ and σ, then linearly scales and shifts it using γ and β.
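The transformation described here, normalizing x with the batch mean μ and standard deviation σ and then scaling by γ and shifting by β, can be sketched in a few lines of Python. The function name is hypothetical, and the small constant eps added to the variance for numerical stability is an assumption borrowed from common practice rather than from this text.

```python
import math


def batch_normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta.

    eps is an assumed stabilizer that keeps the division well defined
    when the batch variance is close to zero.
    """
    if not x:
        raise ValueError("x must not be empty")
    mu = sum(x) / len(x)
    variance = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(variance + eps)
    return [gamma * (v - mu) / sigma + beta for v in x]


z = batch_normalize([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in z])
```

In a trained network γ and β are learned parameters, and running averages of μ and σ replace the batch statistics at inference time.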
The output from batch normalization layer 321 is provided as input to convolutional layer 331. Batch normalization does not change the dimension of the input. In the example of the convolutional layer shown in fig. 3, 64 filters of width 5 and
The output of the first convolution layer 331 includes 25 values, each having 64 channels and a width. The output of a convolution is also referred to as a feature map. This output is provided as input to max pooling layer 343. The goal of a pooling layer is to reduce the dimensionality of the feature map, which is why pooling is also called "downsampling". The factor by which downsampling is performed is called the "stride" or "downsampling factor", and is denoted by "s". In one type of pooling, referred to as "max pooling", the maximum value is selected for each stride. For example, consider applying max pooling with s = 2 to the 12-dimensional vector x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2]. Max pooling the vector x with stride s = 2 means that we select the largest value out of every two values starting with
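The max pooling example on the 12-dimensional vector can be checked with a short Python sketch. The function name is hypothetical, and the sketch assumes non-overlapping windows whose size equals the stride, with any incomplete trailing window dropped.

```python
def max_pool_1d(values, stride=2):
    """Non-overlapping max pooling; the window size equals the stride."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [
        max(values[i:i + stride])
        for i in range(0, len(values) - stride + 1, stride)
    ]


x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2]
print(max_pool_1d(x, stride=2))  # [10, 8, 6, 7, 5, 9]
```

Applied to a feature map of length 25 with stride 2, the same rule yields 12 values, which is the reduction from 25 to 12 used in the network.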
The output of max pooling layer 343 is passed through batch normalization layer 347 before being provided as input to the next convolution layer 351. In this convolutional layer, 64 kernels of size 5 by 1 (5 × 1) are convolved over the 64 channels, respectively, to generate an output feature map of size 64 by 12 (64 × 12). A summation operation is performed over the 64 channels to generate a feature map of
Dropout is a simple and effective technique for preventing a neural network from overfitting. It works by randomly dropping a fraction of the neurons from the network in each training iteration. This means that the outputs and gradients of the selected neurons are set to zero, so they have no effect on the forward and backward passes. In the example quality prediction convolutional neural network shown in Fig. 3, dropout is applied with a probability of 0.3 before the second and third
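The forward-pass effect of dropout can be sketched as follows. This Python illustration uses the drop probability of 0.3 mentioned in the text; the function name is hypothetical, and rescaling the surviving activations by 1 / (1 - p) ("inverted dropout") is an assumption based on common practice, not something the text specifies.

```python
import random


def dropout(activations, p=0.3, rng=None, training=True):
    """Zero each activation with probability p during training.

    Survivors are rescaled by 1 / (1 - p) so the expected sum is unchanged
    (an assumed convention). Outside training the input passes through.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


out = dropout([1.0] * 10, p=0.3, rng=random.Random(0))
print(out)
```

Because dropped outputs are exactly zero, the corresponding gradients are also zero during backpropagation, which is the behavior the paragraph describes.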
Fig. 4 shows the architecture of an example quality prediction convolutional neural network (CNN) 400, which is similar to the network shown in Fig. 3 but is designed for an input having four channels. As described above, the image registration subsystem performance time series and the image acquisition subsystem performance time series consist of four channels of data. These channels may correspond to the four nucleotides (A, C, G, T). In one embodiment of the quality prediction CNN, the four channels of each input are combined before the quality prediction CNN processes the input. In another embodiment of the network, the quality prediction CNN takes a four-channel input and produces a corresponding four-channel output. The four channels of each output value from the quality prediction CNN are then summed to obtain a one-channel value. In both embodiments, a summation operation that adds the values in the four channels of each input value may be used. In the example network shown in Fig. 4, the convolution filters are convolved over an input having four channels.
Input 411 includes 25 values corresponding to the 25 pre-prediction cycles in a read of a sequencing run. Each of the 25 input values has size 1 and four channels. At block 421, batch normalization is performed on the input. At block 431, a padded convolution with two zeros of padding is performed. Four kernels of size 5 by 1 (5 × 1) are convolved over the four channels, generating a feature map of size 4 by 25 (4 × 25). A summation operation is performed over the four channels to generate a feature map of
At block 451, a second convolution is performed using 128 filters of size 5. The second convolution convolves the filters over the input with two zero-padding on each side. The output of the second convolution includes 128 feature maps of size 12, as shown in block 461. Max pooling with a stride of 2 reduces the dimension to 128 feature maps of size 6 at block 463. At
Examples of subsystem performance data
Fig. 5 shows example data 500 of chemical
Base recognition quality prediction result analysis
Fig. 6 includes a
In one embodiment, the first 25 cycles of
Figure 7 shows the total base
In an embodiment, an ensemble of three trained quality prediction convolutional neural networks (CNNs) 179 is used during production (also referred to as "inference") to predict the likely total base recognition quality of the target cycle. According to one embodiment, each of the three models is run 100 times to generate as many predicted values. The average of the 300 predicted values generated by the three quality prediction CNNs is then used as the final prediction result. The standard deviation of the predicted values is used as the confidence interval. Reads whose total base recognition quality values are close to the training data may have lower uncertainty, that is, a narrower confidence interval. Predictions that are far from the training examples may have high uncertainty.
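The averaging of 300 predictions described above can be sketched as follows. The source does not say why repeated runs of the same model differ, so this sketch assumes each model is stochastic at inference time (for example, because dropout stays active). The names `ensemble_predict` and `make_noisy_model`, and the noisy stand-in models themselves, are hypothetical.

```python
import numpy as np

def ensemble_predict(models, x, runs_per_model=100):
    """Return (mean, std, count) over all runs of all models.

    `models` is a list of callables; each call returns one scalar
    prediction and is ASSUMED to be stochastic between calls.
    """
    preds = np.array([m(x) for m in models for _ in range(runs_per_model)])
    return preds.mean(), preds.std(), preds.size

# Hypothetical stand-ins for three trained networks: a fixed value plus noise.
rng = np.random.default_rng(42)

def make_noisy_model(center):
    return lambda x: center + rng.normal(scale=0.5)

models = [make_noisy_model(c) for c in (90.0, 91.0, 92.0)]

# The mean is the final prediction; the std is the reported uncertainty.
mean, std, n = ensemble_predict(models, x=None)
```

A larger standard deviation signals that the individual predictions disagree more, which matches the text's use of it as a measure of uncertainty.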
During training, according to one embodiment, the total base call
Using
Fig. 8 shows in graph 811 a
The coefficient of determination, denoted R², is the proportion of the variance in the dependent variable that is predictable from the independent variable. It is a statistical measure of how close the predicted data are to the true data points. An R² of 1 indicates that the regression predictions perfectly fit the real data.
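As a worked illustration, R² (the coefficient of determination) can be computed from true and predicted values using the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative only and are not results from the source.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
```

A perfect fit gives 1, while a model that always predicts the mean of the true values gives 0.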
Training and inference for quality prediction convolutional neural networks
FIG. 9 shows schematic diagrams 911 and 961 of a training and
The trained quality prediction CNNs are deployed into a production environment where they receive production data for pre-prediction cycles read in a sequencing run of the
Computer system
FIG. 10 is a simplified block diagram of a
In one embodiment, the
The user
User
Storage subsystem 1010 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are typically executed by the
The
The memory subsystem 1022 used in the storage subsystem 1010 may include a number of memories, including a main random access memory (RAM) 1032 for storing instructions and data during program execution and a read-only memory (ROM) 1034 in which fixed instructions are stored.
Bus subsystem 1055 provides a mechanism for allowing the various components and subsystems of
The
Description of the preferred embodiments
The disclosed technology relates to early prediction of base recognition quality during extended optical base recognition processes.
The disclosed technology may be implemented as a system, method, or article of manufacture. One or more features of an embodiment may be combined with the base embodiment. Embodiments that are not mutually exclusive are taught to be combinable. One or more features of an embodiment may be combined with other embodiments. This disclosure periodically reminds the reader of these options. Omission from some embodiments of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated by reference into each of the embodiments below.
A first system embodiment of the disclosed technology includes one or more processors coupled to a memory. The memory is loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and at least twice as many post-prediction base recognition process cycles as pre-prediction cycles. Each base recognition process cycle includes (a) chemical processing that attaches additional complementary nucleotides to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch. When executed on the processors, the computer instructions provide a plurality of time series from the pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series.
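One way to picture the input described above is as a stack of equally long per-cycle series, one row per series. The sketch below is a minimal illustration, not the source's data layout. The function name `build_input`, the dictionary keys, and the placeholder values are assumptions, and a real subsystem may contribute more than one series. `N_PRE = 25` follows the 25-cycle example used elsewhere in the description.

```python
import numpy as np

N_PRE = 25  # pre-prediction cycles (example value from the description)

def build_input(series: dict) -> np.ndarray:
    """Stack equally long per-cycle series into a (n_series, n_cycles) array."""
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1:
        raise ValueError("all time series must cover the same cycles")
    return np.stack([np.asarray(v, dtype=float) for v in series.values()])

# Placeholder data only: one hypothetical series per subsystem.
cycles = np.arange(N_PRE)
example_series = {
    "chemical_processing": np.full(N_PRE, 0.1),
    "image_registration": np.zeros(N_PRE),
    "image_acquisition": np.ones(N_PRE),
    "total_quality": 95.0 - 0.05 * cycles,
}
x = build_input(example_series)  # rows: series, columns: cycles
```

Rejecting series of unequal length keeps every column aligned to the same cycle, which a convolution along the cycle axis relies on.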
The system trains the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series. From the pre-prediction base recognition process cycles, the trained convolutional neural network determines the likely total base recognition quality expected after the post-prediction base recognition process cycles, which number at least twice the pre-prediction cycles. Finally, the system outputs the likely total base recognition quality for evaluation by an operator.
This system embodiment and the other systems disclosed optionally include one or more of the following features. The system may also include features described in connection with the disclosed computer-implemented methods. For the sake of brevity, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for the base feature set of each statutory class. The reader will understand how the features identified in this section can be combined with the base features in other statutory classes.
The system represents chemical processing performance in the chemical processing subsystem performance time series by estimates of phasing and prephasing errors. The system represents image registration performance in the image registration subsystem performance time series by reports of x and y image offset adjustments after image capture. The system also represents image acquisition performance in the image acquisition subsystem performance time series by reports of focus and contrast. In such embodiments, the system represents focus by the narrowness of the full width at half maximum of a single cluster in the cluster image. In another such embodiment of the system, the contrast comprises a minimum contrast calculated as the 10th percentile for each channel of a list of images. In another such embodiment of the system, the contrast comprises a maximum contrast calculated as the 99.5th percentile for each channel of a list of images.
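The percentile-based contrast values described above can be sketched with NumPy. The source does not state exactly which values the percentiles are taken over, so this sketch assumes they are the pixel intensities of a single image channel. The function name `contrast_bounds` and the sample data are hypothetical.

```python
import numpy as np

def contrast_bounds(channel_pixels):
    """Return (min_contrast, max_contrast) for one image channel.

    ASSUMPTION: the percentiles are taken over pixel intensities.
    """
    pixels = np.asarray(channel_pixels, dtype=float).ravel()
    min_contrast = np.percentile(pixels, 10)     # 10th percentile
    max_contrast = np.percentile(pixels, 99.5)   # 99.5th percentile
    return min_contrast, max_contrast

# Synthetic intensities 0..200, used only to exercise the function.
lo, hi = contrast_bounds(np.arange(201))
```

Using percentiles rather than the raw minimum and maximum makes both values less sensitive to a few extreme pixels.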
In one embodiment of the system, the image acquisition performance further comprises reports of cluster intensity in the image acquisition subsystem performance time series. In such embodiments, the system reports the cluster intensity at the 90th percentile of the intensities of the imaged clusters. In one embodiment of the system, the base recognition process comprises 2 to 25 times as many post-prediction base recognition process cycles as pre-prediction cycles. In one embodiment of the system, the base recognition process comprises 20 to 50 pre-prediction base recognition process cycles. In one embodiment of the system, the base recognition process comprises 100 to 500 post-prediction base recognition process cycles.
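The cycle-count ranges above are stated for separate embodiments. As a small arithmetic illustration, the sketch below checks a run plan against all three ranges at once. Applying them jointly is an assumption made here for illustration, and the function name `within_described_ranges` is hypothetical.

```python
def within_described_ranges(pre_cycles: int, post_cycles: int) -> bool:
    """Check a cycle plan against the ranges quoted in the text.

    ASSUMPTION: the three ranges are applied together, although the
    text presents them as separate embodiments.
    """
    if pre_cycles <= 0:
        raise ValueError("pre_cycles must be positive")
    ratio = post_cycles / pre_cycles
    return (
        20 <= pre_cycles <= 50        # pre-prediction cycles
        and 100 <= post_cycles <= 500  # post-prediction cycles
        and 2 <= ratio <= 25           # post-to-pre ratio
    )

ok = within_described_ranges(25, 125)        # ratio 5, all ranges satisfied
too_short = within_described_ranges(25, 40)  # fewer than 100 post cycles
```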
In one embodiment, the system determines, from the pre-prediction base recognition process cycles, the likely total base recognition quality for at least five intermediate cycle counts during the post-prediction base recognition process cycles. After the determination, the system outputs the intermediate likely total base recognition quality determinations. In one embodiment of the system, the total base recognition quality is calculated as a Phred quality score. In another embodiment of the system, the total base recognition quality is calculated as a Sanger quality score.
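For reference, the Phred scale mentioned above relates a quality score Q to the probability p that a base call is wrong by Q = -10 · log10(p). The sketch below shows only this standard conversion. How the source aggregates per-base scores into a total quality is not specified here, and the function names are illustrative.

```python
import math

def phred_from_error_prob(p_error: float) -> float:
    """Phred quality score for a base-call error probability."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: error probability for a Phred quality score."""
    return 10.0 ** (-q / 10.0)

q30 = phred_from_error_prob(0.001)   # one expected error per 1000 calls
p20 = error_prob_from_phred(20.0)    # error probability at Q20
```

On this scale, every increase of 10 in Q corresponds to a tenfold drop in error probability.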
A second system embodiment of the disclosed technology includes one or more processors coupled to a memory. The memory is loaded with computer instructions to perform early prediction of base recognition quality during an extended optical base recognition process that includes paired reads, each read comprising pre-prediction base recognition process cycles and at least twice as many post-prediction base recognition process cycles as pre-prediction cycles. Each base recognition process cycle comprises: (a) chemical processing that attaches additional complementary nucleotides to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch. The system provides a plurality of time series from the pre-prediction base recognition process cycles of the second read to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series. The system also provides the total base recognition quality time series of the first read to the trained convolutional neural network.
The system trains the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles of the second read, the post-prediction total base recognition quality time series of the second read, and the total base recognition quality time series of the first read. The trained convolutional neural network uses the pre-prediction base recognition process cycles of the second read and the total base recognition quality time series of the first read to determine the likely total base recognition quality of the second read expected after the post-prediction base recognition process cycles, which number at least twice the pre-prediction cycles. Finally, the system outputs the likely total base recognition quality of the second read for evaluation by an operator. In this embodiment of the system, the first read precedes the second read and comprises base recognition of the sequenced molecule in the forward direction. The second read comprises base recognition of the sequenced molecule in the reverse direction.
Each of the features discussed in this section for the first system embodiment applies equally to the second system embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.
Other embodiments may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the functions of the system described above. Another embodiment may include a computer-implemented method of performing the functions of the system described above.
A first computer-implemented method embodiment of the disclosed technology includes early prediction of base recognition quality during an extended optical base recognition process. The base recognition process includes pre-prediction base recognition process cycles and at least twice as many post-prediction base recognition process cycles as pre-prediction cycles. Each base recognition process cycle comprises: (a) chemical processing that attaches additional complementary nucleotides to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch. The method includes providing a plurality of time series from the pre-prediction base recognition process cycles to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series.
The computer-implemented method further includes training the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles and a post-prediction total base recognition quality time series. From the pre-prediction base recognition process cycles, the trained convolutional neural network determines the likely total base recognition quality expected after the post-prediction base recognition process cycles, which number at least twice the pre-prediction cycles. Finally, the method outputs the likely total base recognition quality for evaluation by an operator.
Each of the features discussed in this section for the first system embodiment applies equally to this computer-implemented method embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.
Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the first computer-implemented method described above. Another embodiment may include a system comprising a memory and one or more processors operable to execute instructions stored in the memory to perform the first computer-implemented method described above.
Computer-readable medium (CRM) embodiments of the disclosed technology include a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed on a processor, implement the above-described computer-implemented method.
Each of the features discussed in this section for the first system embodiment applies equally to the CRM embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.
A second computer-implemented method embodiment of the disclosed technology includes early prediction of base recognition quality during an extended optical base recognition process that includes paired reads. Each read includes pre-prediction base recognition process cycles and at least twice as many post-prediction base recognition process cycles as pre-prediction cycles. Each base recognition process cycle comprises: (a) chemical processing that attaches additional complementary nucleotides to target nucleotide strands at millions of positions on a substrate, (b) camera positioning and image registration on a patch of the substrate, and (c) image acquisition on the patch. The method includes providing a plurality of time series from the pre-prediction base recognition process cycles of the second read to a trained convolutional neural network. The plurality of time series includes a chemical processing subsystem performance time series, an image registration subsystem performance time series, an image acquisition subsystem performance time series, and a total base recognition quality time series. The method also includes providing the total base recognition quality time series of the first read to the trained convolutional neural network.
The computer-implemented method includes training the convolutional neural network using base recognition quality experience that includes the plurality of time series for the pre-prediction base recognition process cycles of the second read, the post-prediction total base recognition quality time series of the second read, and the total base recognition quality time series of the first read. The trained convolutional neural network uses the pre-prediction base recognition process cycles of the second read and the total base recognition quality time series of the first read to determine the likely total base recognition quality of the second read expected after the post-prediction base recognition process cycles, which number at least twice the pre-prediction cycles. Finally, the method outputs the likely total base recognition quality of the second read for evaluation by an operator. In the second computer-implemented method embodiment, the first read precedes the second read and comprises base recognition of the sequenced molecule in the forward direction. The second read comprises base recognition of the sequenced molecule in the reverse direction.
Each of the features discussed in this section for the first system embodiment applies equally to this method embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.
Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the computer-implemented method described above. Another embodiment may include a system comprising a memory and one or more processors operable to execute instructions stored in the memory to perform the computer-implemented method described above.
Computer-readable medium (CRM) embodiments of the disclosed technology include a non-transitory computer-readable storage medium storing computer program instructions that, when executed on a processor, implement the second computer-implemented method described above.
Each of the features discussed in this section for the first system embodiment applies equally to the CRM embodiment. As noted above, not all system features are repeated here; they should be considered repeated by reference.
The above description is provided to enable the manufacture and use of the disclosed technology. Various modifications to the disclosed embodiments will be readily apparent, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the disclosed technology is defined by the following claims.