Sequencing result analysis method and system, computer readable storage medium and electronic equipment

Document No.: 193419 Publication date: 2021-11-02

Abstract: This technology, "Sequencing result analysis method and system, computer readable storage medium and electronic equipment", was designed by 金欢, 樊济才, 陈方 and 孙雷 on 2020-08-25. The invention provides a sequencing result analysis method. The sequencing result comprises first sequencing data and second sequencing data, wherein the first sequencing data and the second sequencing data are both composed of a plurality of reads, and at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data. The sequencing result analysis method comprises: (a) performing a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data to obtain final sequence information.

1. A method for analyzing a sequencing result,

the sequencing result comprises first sequencing data and second sequencing data, wherein the first sequencing data and the second sequencing data are both composed of a plurality of reads, at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data,

the sequencing result analysis method comprises the following steps:

(a) performing a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data to obtain final sequence information.

2. Method according to claim 1, characterized in that said mutual correction comprises the following steps:

selecting high quality reads and corresponding reads of the high quality reads in the first sequencing data and the second sequencing data, the reads not being less than a predetermined length, the reads having a sequencing quality not less than a predetermined quality threshold; and

comparing the high-quality reads with corresponding reads of the high-quality reads, and correcting sequence information based on the comparison result.

3. The method of claim 1, wherein step (a) further comprises:

(a-1) constructing a first set of reads based on the first sequencing data according to lengths of the reads, each read length in the first set of reads being no less than a first predetermined length;

(a-2) constructing a second read set and a third read set based on the first read set according to the lengths of the corresponding reads, wherein the length of the corresponding read of each read in the second read set is not lower than a second preset length, and the length of the corresponding read of each read in the third read set is within a preset length range;

(a-3) according to the sequencing quality of the reads in the second read set and the corresponding reads thereof, constructing a fourth read set and a fifth read set based on the second read set and the corresponding reads thereof, wherein the fourth read set and the fifth read set are respectively determined according to the following principles:

comparing the reads in the second set of reads to the sequencing quality of their corresponding reads,

selecting the one having the higher sequencing quality as an element of the fourth read set, and selecting the one having the lower sequencing quality as an element of the fifth read set,

for the case of the same sequencing quality, selecting the reads from the second set of reads as elements of the fourth set of reads, and selecting the corresponding reads as elements of the fifth set of reads;

(a-4) filtering the fourth set of reads using sequencing quality to construct a sixth set of reads, none of the reads in the sixth set of reads having a sequencing quality below a first predetermined quality threshold;

(a-5) selecting the reads corresponding to the reads in the sixth set of reads from the fifth set of reads using the sixth set of reads to construct a seventh set of reads;

(a-6) read aligning the sixth set of reads with the seventh set of reads and determining a first site of difference on the reads of the sixth set of reads; and

(a-7) correcting the first differential site to determine first sequence information using a predetermined sequencing error prediction model for determining a probability of an insertion or deletion of the differential site during sequencing.

4. The method of claim 3, further comprising:

(a-4a) filtering the third set of reads using sequencing quality to construct an eighth set of reads, wherein none of the reads in the eighth set of reads have sequencing quality below a second predetermined quality threshold;

(a-5a) selecting the reads from the second sequencing data that correspond to the reads in the seventh set of reads using the eighth set of reads to construct a ninth set of reads;

(a-6a) read aligning the eighth set of reads with the ninth set of reads and determining a second site of difference on the reads of the eighth set of reads;

(a-7a) correcting the second difference site using the sequencing error prediction model to determine second sequence information.

5. The method of claim 3 or 4, wherein the sequencing error prediction model is obtained by training a naive Bayes model based on the alignment of the first and second sequencing data to a reference genome.

6. The method according to claim 3 or 4, wherein for the first and second differential sites:

if a read from the sixth set of reads has a base at the difference site, a corresponding read from the seventh set of reads has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, retaining the base of the read of the sixth set of reads at the difference site as a final sequencing result;

if the reads from the sixth set of reads do not have a base at the difference site, the reads from the seventh set of reads have a base at the difference site, and the probability of insertion at the difference site is 50% or more, retaining the base of the reads of the sixth set of reads at the difference site as a final sequencing result; and

selecting bases of reads from the sixth set of reads at the differential site as final sequencing results if bases are present at the differential site for reads from the sixth set of reads and bases are also present at the differential site for reads from the seventh set of reads;

optionally, the first predetermined length and the second predetermined length are each independently not less than 20bp, preferably not less than 25 bp;

optionally, the predetermined length range is 10-25 bp;

optionally, the first predetermined quality threshold and the second predetermined quality threshold are each independently not lower than 50, preferably not lower than 60.

7. The method of any one of claims 1-6, wherein the sequencing result is obtained by:

(1) performing a first sequencing on a sequencing template on a chip surface to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being attached to the chip surface by a sequencing linker;

(2) subjecting at least a portion of the 3' -end of the first nascent sequencing strand to a first blocking treatment; and

(3) performing a second sequencing of the sequencing template to obtain second sequencing data by forming a second nascent sequencing strand;

optionally, step (2) comprises: removing the first nascent sequencing strand on the surface of the chip, and performing first blocking treatment on the 3' -end of the first nascent sequencing strand remained on the surface of the chip;

optionally, prior to step (1), comprising:

(1-a) hybridizing library molecules in the sequencing library with sequencing adaptors on the surface of the chip;

(1-b) forming the sequencing template by synthesizing complementary strands using the library molecules as an initial template;

(1-c) removing the initial template and performing a second blocking process on the 3' -end of the nucleic acid molecule on the surface of the chip;

optionally, before (1-c), further comprising:

(1-b-1) subjecting the 3' -end of the complementary strand incompletely extended in the step (1-b) to a third blocking treatment;

optionally, the first blocking treatment, the second blocking treatment and the third blocking treatment are each independently performed by linking a 3' -terminal hydroxyl group to an extension reaction blocker;

optionally, the extension reaction blocker is ddNTP or a derivative thereof;

optionally, the first blocking treatment, the second blocking treatment, and the third blocking treatment are each independently performed with at least one of a DNA polymerase and a terminal transferase;

optionally, the first blocking treatment and the third blocking treatment are each independently linked to the ddNTP or derivative thereof by a polymerase, and the second blocking treatment is linked to the ddNTP or derivative thereof by the terminal transferase.

8. A sequencing result analysis system, comprising:

a sequencing device adapted to obtain a sequencing result through a double sequencing method, wherein the sequencing result comprises first sequencing data and second sequencing data, the first sequencing data and the second sequencing data are both composed of a plurality of reads, and at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data;

an analysis device comprising a correction module adapted to perform a mutual correction based on at least a portion of each of the first and second sequencing data in order to obtain final sequence information;

optionally, the correction module is adapted for the steps of:

selecting high quality reads and corresponding reads of the high quality reads in the first sequencing data and the second sequencing data, the reads not being less than a predetermined length, the reads having a sequencing quality not less than a predetermined quality threshold; and

comparing the high-quality reads with corresponding reads of the high-quality reads, and correcting sequence information based on the comparison result;

optionally, the correction module further comprises:

a first read set determining unit, configured to construct a first read set based on the first sequencing data according to the lengths of the reads, where the length of each read in the first read set is not less than a first predetermined length;

a second read set and third read set determining unit, configured to construct a second read set and a third read set based on the first read set according to the lengths of the corresponding reads, where the length of the corresponding read of each read in the second read set is not less than a second predetermined length, and the length of the corresponding read of each read in the third read set is within a predetermined length range;

a fourth read set and fifth read set determining unit, configured to construct a fourth read set and a fifth read set based on the second read set and the corresponding reads thereof according to the sequencing quality of the reads and the corresponding reads thereof in the second read set, where the fourth read set and the fifth read set are determined according to the following principles:

comparing the reads in the second set of reads to the sequencing quality of their corresponding reads,

selecting the one having the higher sequencing quality as an element of the fourth read set, and selecting the one having the lower sequencing quality as an element of the fifth read set,

for the case of the same sequencing quality, selecting the reads from the second set of reads as elements of the fourth set of reads, and selecting the corresponding reads as elements of the fifth set of reads;

a sixth read set determining unit, configured to perform filtering processing on the fourth read set by using sequencing quality so as to construct a sixth read set, where the sequencing quality of the reads in the sixth read set is not lower than a first predetermined quality threshold;

a seventh read set determining unit that selects, from the fifth read set, the read corresponding to the read in the sixth read set, using the sixth read set, so as to construct a seventh read set;

a first differential site determining unit, configured to perform read comparison on the sixth read set and the seventh read set, and determine a first differential site on the reads of the sixth read set; and

a first sequence information determining unit, configured to correct the first differential site by using a predetermined sequencing error prediction model so as to determine first sequence information, where the sequencing error prediction model is used for determining the probability of insertion or deletion at the differential site in the sequencing process;

optionally, the sequencing result analysis system further comprises:

an eighth read set determining unit, configured to perform filtering processing on the third read set by using sequencing quality so as to construct an eighth read set, where the sequencing quality of the reads in the eighth read set is not lower than a second predetermined quality threshold;

a ninth read set determining unit that selects, from the second sequencing data, the reads corresponding to the reads in the seventh read set, using the eighth read set, so as to construct a ninth read set;

a second differential site determining unit, configured to perform read comparison on the eighth read set and the ninth read set, and determine a second differential site on the reads of the eighth read set;

a second sequence information determination unit for correcting the second differential site by using the sequencing error prediction model to determine second sequence information;

optionally, the first predetermined length and the second predetermined length are each independently not less than 20bp, preferably not less than 25 bp;

optionally, the predetermined length range is 10-25 bp;

optionally, the first predetermined quality threshold and the second predetermined quality threshold are each independently not lower than 50, preferably not lower than 60;

optionally, the sequencing result analysis system further comprises: a sequencing error prediction model construction module adapted to train a naive Bayes model based on comparison results of the first sequencing data and the second sequencing data with a reference genome so as to obtain the sequencing error prediction model;

optionally, for the first and second differential sites:

if a read from the sixth set of reads has a base at the difference site, a corresponding read from the seventh set of reads has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, retaining the base of the read of the sixth set of reads at the difference site as a final sequencing result;

if the reads from the sixth set of reads do not have a base at the difference site, the reads from the seventh set of reads have a base at the difference site, and the probability of insertion at the difference site is 50% or more, retaining the base of the reads of the sixth set of reads at the difference site as a final sequencing result; and

selecting bases of reads from the sixth set of reads at the differential site as final sequencing results if bases are present at the differential site for reads from the sixth set of reads and bases are also present at the differential site for reads from the seventh set of reads;

optionally, said first predetermined length and said second predetermined length are each independently not less than 20bp, preferably not less than 25bp,

the predetermined length range is 10-25 bp;

said first predetermined quality threshold and said second predetermined quality threshold are each independently not lower than 50, preferably not lower than 60.

9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

10. An electronic device, comprising:

the computer-readable storage medium recited in claim 9; and

one or more processors to execute the program in the computer-readable storage medium.

Technical Field

The present invention relates to the field of bioinformatics, and in particular, to a sequencing result analysis method, a sequencing result analysis system, a computer-readable storage medium, and an electronic device.

Background

The concept of single-molecule sequencing was proposed in the 1980s. In 2003, Dr. Stephen Quake, a professor in the Department of Bioengineering at Stanford University, successfully demonstrated the first single-molecule DNA sequencing experiment. In 2008, Helicos launched the first single-molecule sequencer (HeliScope). Korlach and Turner published an article in the journal Science introducing the principles of the PacBio single-molecule sequencing technology, and PacBio subsequently introduced the PacBio RS sequencing system in 2010. Oxford Nanopore presented the MinION sequencing system at AGBT (the Advances in Genome Biology and Technology meeting). The two systems became commercially available in 2011 and 2014, respectively. However, the single-pass sequencing (Single-Pass) error rate was reported to be high, up to 30%, for all of these platforms, i.e., Helicos, PacBio and MinION. Many studies show that the errors on these platforms are mainly insertions and deletions (InDels) that occur randomly, so the sequencing error rate can be reduced by reading the same template repeatedly.

It has been reported that PacBio can overcome the high error rate of its SMRT sequencing technology by using CCS (circular consensus sequencing). In addition, MinION can greatly improve sequencing accuracy through its 2D and 1D2 sequencing methods, reaching an accuracy of up to 97%.

It has also been reported that Helicos can reduce the deletion-type error rate in sequencing to less than 1% by using a double sequencing method (Two-Pass). However, the library used by that method has a specific adapter at the 5' end and a polyA adapter at the 3' end, which makes the workflow cumbersome; hot water is used in the step of denaturing and eluting the DNA strand, so elution may be incomplete and may interfere with subsequent sequencing primer hybridization and sequencing; and the DNA strands on the chip surface are left untreated, which is one source of errors in the subsequent sequencing.

It can be seen that further improvements are needed in the existing single molecule sequencing technology.

Disclosure of Invention

The inventors found that, when the human genome was sequenced on the GenoCare sequencing platform, the output data contained considerable noise and the sequencing accuracy was low: compared to the reference genome, the alignment rate (Mapped Rate) was 53.59% ± 9.14%, the unique alignment rate (Unique Mapped Rate) was 36.82% ± 8.71%, and the error rate (Error Rate) was 6.65% ± 1.04%.

The technical principle of the GenoCare sequencing platform is similar to that of Helicos. The inventors sequenced synthetic sequences using the Two-Pass sequencing method and analyzed the data according to the method described in the literature (Harris T D, Buzby P R, Babcock H, et al. Single-Molecule DNA Sequencing of a Viral Genome [J]. Science, 2008, 320(5872): 106-109.), and found that the error rate of double sequencing (Two-Pass) was only about 30% lower than that of single sequencing (Single-Pass).

The inventors also found that, on the GenoCare single-molecule sequencing platform, deletions are more likely to occur between certain base combinations or after certain sequences in the sequencing result; for example, the probability of a deletion is higher after consecutive G base reactions.

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, the invention provides an effective sequencing result analysis method.

In a first aspect of the invention, a sequencing result analysis method is provided. According to an embodiment of the present invention, the sequencing result includes first sequencing data and second sequencing data, wherein the first sequencing data and the second sequencing data are both composed of a plurality of reads, and at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data, and the sequencing result analysis method includes: (a) performing a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data to obtain final sequence information. According to the method of this embodiment, the first sequencing data and the second sequencing data are obtained by a double sequencing method (two-pass) on a single-molecule sequencing platform. A correction model is constructed from the first and second sequencing data to obtain the probabilities of deletion, insertion and mutation of the middle base under different combinations of preceding and following bases in a nucleic acid sequence. The first and second sequencing data are then mutually corrected: for each site at which the two sequencing data differ, the correction model is used to determine whether the base at that site is an insertion, a deletion or a mutation, and thus to judge the correct base at the site. It should be noted that a difference site between the two sequencing data may involve an insertion, deletion or mutation of a base, that the constructed correction model can predict each of these, and that the mutation may be between any bases.
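The idea of estimating how often the middle base is deleted, inserted or mutated under each combination of flanking bases can be illustrated with a simple frequency table. This is only a minimal sketch under stated assumptions: the input is a hypothetical list of gapped (reference, read) alignment strings, the function names are invented for illustration, and a plain frequency count stands in for the trained correction model described in the text.

```python
from collections import Counter, defaultdict


def nearest_base(chars):
    """Return the first non-gap character in `chars`, or None if there is none."""
    for c in chars:
        if c != "-":
            return c
    return None


def context_event_rates(alignments):
    """Estimate event frequencies per flanking-base context.

    `alignments` is an iterable of (ref_aln, read_aln) strings of equal
    length in which '-' marks a gap (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for ref_aln, read_aln in alignments:
        for i, (r, q) in enumerate(zip(ref_aln, read_aln)):
            # Context = nearest reference base on each side of column i.
            left = nearest_base(reversed(ref_aln[:i]))
            right = nearest_base(ref_aln[i + 1:])
            if left is None or right is None:
                continue  # skip columns without a base on both sides
            if r == "-":
                event = "insertion"
            elif q == "-":
                event = "deletion"
            elif r == q:
                event = "match"
            else:
                event = "mutation"
            counts[(left, right)][event] += 1
    return {
        ctx: {event: n / sum(c.values()) for event, n in c.items()}
        for ctx, c in counts.items()
    }


rates = context_event_rates([("ACGGT", "ACG-T"), ("ACGGT", "ACGGT")])
print(rates[("G", "T")])
```

In this toy input, the base between G and T is deleted in one of two reads, so the estimated deletion frequency for that context is 0.5.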

In a second aspect of the invention, the invention provides a method of obtaining the sequencing results mentioned in the first aspect of the invention. According to an embodiment of the invention, the sequencing result is obtained by: (1) performing a first sequencing on a sequencing template on a chip surface to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being attached to the chip surface by a sequencing linker; (2) subjecting at least a portion of the 3' -end of the first nascent sequencing strand to a first blocking treatment; and (3) performing a second sequencing on the sequencing template to obtain second sequencing data by forming a second nascent sequencing strand. With the method of this embodiment, two rounds of sequencing can be performed on the sequencing template, yielding two sets of sequencing data for the same template. The blocking treatment between the first sequencing and the second sequencing prevents any residual first nascent sequencing strand from continuing to extend during the second sequencing, which effectively safeguards the accuracy of the second sequencing.

In a third aspect of the invention, a sequencing result analysis system is provided. According to an embodiment of the present invention, the sequencing result analysis system includes: a sequencing device adapted to obtain a sequencing result through a double sequencing method, wherein the sequencing result comprises first sequencing data and second sequencing data, the first sequencing data and the second sequencing data are both composed of a plurality of reads, and at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data; and an analysis device comprising a correction module adapted to perform a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data in order to obtain final sequence information. The sequencing result analysis system of this embodiment can effectively implement the sequencing result analysis method provided by the first aspect of the invention, and improves the accuracy of the sequencing result by mutually correcting the results of the two rounds of sequencing. In addition, as described above, two rounds of sequencing are performed on the same template to obtain two sets of sequencing data, and the blocking treatment between the first sequencing and the second sequencing prevents any residual first nascent sequencing strand from continuing to extend during the second sequencing. This avoids interference signals during the second round of sequencing and effectively safeguards its accuracy.

Furthermore, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above, according to an embodiment of the present invention.

The present invention also provides an electronic device, comprising: the computer readable storage medium as described above; and one or more processors for executing the program in the computer-readable storage medium.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 is a schematic flow diagram of a sequencing result analysis method according to one embodiment of the present invention;

FIG. 2 is a schematic flow chart of a sequencing result analysis method according to yet another embodiment of the present invention;

FIG. 3 is a schematic flow chart of a sequencing result analysis method according to yet another embodiment of the present invention;

FIG. 4 is a schematic flow chart of an analysis method for obtaining Consensus Reads according to one embodiment of the present invention;

FIG. 5 is a schematic flow diagram of a sequencing method according to one embodiment of the present invention;

FIG. 6 is a schematic flow diagram of a sequencing method according to yet another embodiment of the invention;

FIG. 7 is a schematic flow diagram of a sequencing method according to yet another embodiment of the invention;

FIG. 8 is a schematic representation of the sequencing method to obtain Reads1 and Reads2 according to one embodiment of the present invention;

FIG. 9 is a schematic diagram of sequencing library construction according to one embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a sequencing result analysis system according to yet another embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a sequencing result analysis system according to yet another embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a sequencing result analysis system according to yet another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.

In a first aspect of the invention, a sequencing result analysis method is provided. The method of this embodiment can analyze the two rounds of sequencing results obtained by a double sequencing method (Two-Pass) on a single-molecule sequencing platform. It takes the characteristics of the single-molecule sequencing platform fully into account and optimizes the sequencing data with respect to the platform's shortcomings: it makes full use of the two rounds of sequencing data, whose coordinates correspond one to one, predicts the sites that are prone to mutation, insertion and deletion, and establishes a correction model, thereby greatly improving the accuracy of sequencing analysis and avoiding sequencing errors. It should be noted that the method for obtaining the first sequencing data and the second sequencing data is not limited to the double sequencing method, as long as two sets of sequencing data can be obtained for the same template and the two sets can be matched to each other one to one.

Referring to fig. 1 to 4, the sequencing result includes first sequencing data and second sequencing data, the first sequencing data and the second sequencing data are obtained by a double sequencing method, wherein each of the first sequencing data and the second sequencing data is composed of a plurality of reads, and at least a portion of the reads in the first sequencing data have corresponding reads in the second sequencing data, and the sequencing result analysis method includes: performing a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data to obtain final sequence information.

According to an embodiment of the invention, said mutual correction comprises the following steps: selecting high quality reads and corresponding reads of the high quality reads in the first sequencing data and the second sequencing data, the reads not being less than a predetermined length, the reads having a sequencing quality not less than a predetermined quality threshold; and comparing the high-quality reads with corresponding reads of the high-quality reads, and correcting sequence information based on the comparison result. According to an embodiment of the present invention, the predetermined length may be determined according to a threshold of read length in conventional sequencing, and in an embodiment of the present invention, the predetermined length is generally about 25bp, and the predetermined length is used for filtering noise sequences, so as to improve accuracy of sequencing data alignment.
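The selection of high-quality reads together with their corresponding reads can be sketched as follows. The data layout is an assumption made for illustration: each data set is a dictionary keyed by a shared read identifier (standing in for the one-to-one chip coordinates), `Read.quality` is a hypothetical per-read quality score, and the default thresholds of 25 bp and 60 are only example values. The sketch applies the length and quality filters to the first-round read and merely requires that a corresponding second-round read exists, which is one possible reading of the text.

```python
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    seq: str
    quality: float  # hypothetical per-read quality score


def select_high_quality_pairs(
    reads1: Dict[str, Read],
    reads2: Dict[str, Read],
    min_len: int = 25,          # example value for the predetermined length
    min_quality: float = 60.0,  # example value for the quality threshold
) -> List[Tuple[str, Read, Read]]:
    """Return (read_id, high_quality_read, corresponding_read) triples."""
    pairs = []
    for read_id, r1 in reads1.items():
        r2 = reads2.get(read_id)
        if r2 is None:
            continue  # no corresponding read in the second sequencing data
        if len(r1.seq) >= min_len and r1.quality >= min_quality:
            pairs.append((read_id, r1, r2))
    return pairs


reads1 = {
    "a": Read("A" * 30, 70),  # long enough, good quality, has a mate
    "b": Read("A" * 10, 90),  # too short
    "c": Read("A" * 30, 70),  # no corresponding read
    "d": Read("A" * 30, 40),  # quality too low
}
reads2 = {"a": Read("A" * 29, 55), "b": Read("A" * 30, 80), "d": Read("A" * 30, 80)}
selected = select_high_quality_pairs(reads1, reads2)
print([read_id for read_id, _, _ in selected])
```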

According to the embodiment of the invention, the accuracy of the sequencing result can be improved by mutually correcting the results of two rounds of sequencing. In addition, after the first round of sequencing, namely the first sequencing, the 3' -end of the nascent sequencing strand remaining on the chip surface is blocked, so that the generation of interference signals during the second round of sequencing, namely the second sequencing, can be effectively avoided. This can further improve the accuracy of the sequencing result.

Referring to fig. 2, the mutual correction comprises the following steps:

S100, constructing a first read set

In this step, a first read set is constructed based on the first sequencing data according to the lengths of the reads, and the length of each read in the first read set is not lower than a first predetermined length.
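Step S100 amounts to a length filter on the first sequencing data. A minimal sketch, assuming reads are stored as a dictionary from a hypothetical read identifier to its base sequence and using 25 bp as an example value for the first predetermined length:

```python
from typing import Dict


def build_first_read_set(reads1: Dict[str, str], first_min_len: int = 25) -> Dict[str, str]:
    """Keep only the first-round reads whose length is at least `first_min_len`."""
    return {rid: seq for rid, seq in reads1.items() if len(seq) >= first_min_len}


first_set = build_first_read_set({"a": "ACGT" * 7, "b": "ACGT" * 3, "c": "A" * 25})
print(sorted(first_set))
```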

S200, constructing a second read set and a third read set

In this step, according to the lengths of the corresponding reads, based on the first read set, a second read set and a third read set are constructed, where the length of the corresponding read of each read in the second read set is not lower than a second predetermined length, and the length of the corresponding read of each read in the third read set is within a predetermined length range.
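Step S200 can be pictured as partitioning the first read set by the length of each read's counterpart in the second sequencing data. The sketch below is illustrative only: the dictionary layout and function name are assumptions, the defaults (25 bp and 10-25 bp) are example values, and because the two length criteria can overlap at the boundary, the sketch assumes the second read set takes precedence.

```python
from typing import Dict, Tuple


def split_by_mate_length(
    first_set: Dict[str, str],
    reads2: Dict[str, str],
    second_min_len: int = 25,                 # example second predetermined length
    length_range: Tuple[int, int] = (10, 25), # example predetermined length range
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split reads of the first set by the length of their corresponding reads."""
    lo, hi = length_range
    second_set: Dict[str, str] = {}
    third_set: Dict[str, str] = {}
    for rid, seq in first_set.items():
        mate = reads2.get(rid)
        if mate is None:
            continue  # no corresponding read
        if len(mate) >= second_min_len:
            second_set[rid] = seq  # assumption: second set wins at the boundary
        elif lo <= len(mate) <= hi:
            third_set[rid] = seq
    return second_set, third_set


first_set = {"a": "A" * 30, "b": "C" * 30, "c": "G" * 30, "d": "T" * 30}
reads2 = {"a": "A" * 28, "b": "C" * 15, "c": "G" * 5}
second_set, third_set = split_by_mate_length(first_set, reads2)
print(sorted(second_set), sorted(third_set))
```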

S300, constructing a fourth read set and a fifth read set

In this step, according to the sequencing quality of the reads and the corresponding reads in the second read set, a fourth read set and a fifth read set are constructed based on the second read set and the corresponding reads.

According to an embodiment of the present invention, the fourth set of reads and the fifth set of reads are respectively determined according to the following principles:

comparing the reads in the second set of reads to the sequencing quality of their corresponding reads;

selecting the side with high sequencing quality as an element of the fourth read set, and selecting the side with low sequencing quality as an element of the fifth read set;

if the sequencing qualities are the same, the read from the second read set is selected as an element of the fourth read set, and its corresponding read is selected as an element of the fifth read set.

S400, constructing a sixth read set

In this step, the fourth set of reads is filtered with sequencing quality to construct a sixth set of reads, where the sequencing quality of the reads in the sixth set of reads is not lower than a first predetermined quality threshold.

S500 constructing a seventh read set

In this step, the reads corresponding to the reads in the sixth set of reads are selected from the fifth set of reads using the sixth set of reads to construct a seventh set of reads.

S600, comparing the sixth read set with the seventh read set to determine a first difference site

In this step, the sixth read set is read-aligned with the seventh read set, and a first difference site is determined on the reads of the sixth read set.

S700, correcting the first difference site

In this step, the first differential sites are corrected using a predetermined sequencing error prediction model for determining the probability of an insertion or deletion of a differential site during the sequencing process, in order to determine the first sequence information.

Referring to fig. 3, after obtaining the first sequence information, the method may further include:

S400a construction of an eighth read set

In this step, the third set of reads is filtered with sequencing quality to construct an eighth set of reads, wherein none of the reads in the eighth set of reads have sequencing quality below a second predetermined quality threshold.

S500a construction of a ninth read set

In this step, the eighth set of reads is used to select, from the second sequencing data, the reads corresponding to the reads in the eighth set of reads, so as to construct a ninth set of reads.

S600a, comparing the eighth read set with the ninth read set to determine a second difference site

In this step, read alignment is performed on the eighth read set and the ninth read set, and a second difference site is determined on the reads of the eighth read set.

S700a correcting the second difference site to determine the second sequence information

In this step, the second difference site is corrected using the sequencing error prediction model to determine second sequence information.

According to an embodiment of the present invention, the sequencing error prediction model is obtained by training a naive bayes model based on the comparison of the first sequencing data and the second sequencing data with a reference genome.

According to an embodiment of the invention, for the first and second differential sites:

if a read from the sixth set of reads has a base at the difference site, a corresponding read from the seventh set of reads has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, retaining the base of the read of the sixth set of reads at the difference site as a final sequencing result.

If the reads from the sixth set of reads do not have a base at the difference site, the reads from the seventh set of reads have a base at the difference site, and the probability of insertion at the difference site is 50% or more, retaining the base of the reads of the sixth set of reads at the difference site as a final sequencing result; and

selecting a base of a read from the sixth set of reads at the difference site as a final sequencing result if a base is present on the read from the sixth set of reads at the difference site and a base is also present on a read from the seventh set of reads at the difference site.
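The three rules above can be sketched as a single decision function. This is a minimal illustration and not the patented implementation: the function name, the use of `None` for "no base", and the behaviour when the insertion probability falls below the threshold are assumptions (the text only states the outcome for probabilities of 50% or more; discarding the base when the deletion probability is below the threshold follows the description of the correction model).

```python
from typing import Optional


def resolve_difference_site(
    primary_base: Optional[str],
    partner_base: Optional[str],
    p_deletion: float,
    p_insertion: float,
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the base to keep at one difference site (None means "no base").

    primary_base: base on the higher-quality read (sixth/eighth read set).
    partner_base: base on the corresponding read (seventh/ninth read set).
    p_deletion / p_insertion: probabilities supplied by the error model.
    """
    if primary_base is not None and partner_base is not None:
        # Both reads carry a base: the higher-quality read is taken as the standard.
        return primary_base
    if primary_base is not None:
        # Only the primary read has a base: keep it if the partner most
        # likely suffered a deletion, otherwise discard it.
        return primary_base if p_deletion >= threshold else None
    if partner_base is not None:
        # Only the partner has a base: keep the primary read's state (no base)
        # if the partner's base is most likely an insertion.
        if p_insertion >= threshold:
            return None
        # Assumption: the source does not state this case; adopt the partner base.
        return partner_base
    return None
```

Keeping the threshold as a parameter makes it easy to test values other than the 50% stated in the text.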

According to an embodiment of the present invention, the first predetermined length and the second predetermined length are each independently not less than 20bp, preferably not less than 25bp, and the predetermined length ranges from 10 to 25 bp; said first predetermined quality threshold and said second predetermined quality threshold are each independently not lower than 50, preferably not lower than 60.

Sequencing method

In a second aspect of the present invention, the present invention provides a sequencing method for reducing sequencing output sequence noise and error rate of a single molecule sequencing platform (e.g., GenoCare single molecule sequencing platform), and the sequencing method according to the embodiment of the present invention is described with reference to fig. 4 to 9.

According to an embodiment of the invention, the method comprises:

S10 first sequencing to obtain first sequencing data

In this step, a first sequencing is performed on a sequencing template on the surface of the chip, which is attached to the surface of the chip by a sequencing linker, to obtain first sequencing data by forming a first nascent sequencing strand.

The term "chip" as used herein refers to a sequencing chip used in a sequencing platform, and can be processed by the method of the present invention as long as sequencing is performed by the principle of sequencing by synthesis, wherein a single-molecule sequencing platform, such as a GenoCare single-molecule sequencing platform, is preferably used. Of course, it will be understood by those skilled in the art that other single molecule sequencing platforms may be used, and will not be described in detail herein.

Referring to fig. 6, before step S10, a chip that can be used for a single molecule sequencing platform can also be obtained by:

S10a: hybridizing library molecules in the sequencing library with sequencing linkers on the surface of the chip;

S10b: forming the sequencing template by synthesizing complementary strands using the library molecules as an initial template; and

S10c: removing the initial template, and performing a second blocking treatment on the 3'-end of the nucleic acid molecules on the chip surface.

Thus, the influence of the remaining active 3' -end on the subsequent reaction can be further removed by the second blocking treatment.

Referring to fig. 7, before proceeding to the step S10c, the method may further include the step S11 b: the third blocking treatment is performed on the 3' -end of the complementary strand incompletely extended in step S10 b. Thus, the accuracy of sequencing can be further improved, and the undesirable sequencing noise can be reduced.

S20 performing the first blocking process on the 3'-end of at least a part of the first nascent sequencing strand

In this step, the first blocking process is performed on the 3'-end of at least a part of the first nascent sequencing strand. The first blocking process can effectively increase the amount of valid data and reduce the interference of invalid data on information analysis.

According to one embodiment of the present invention, step S20 includes removing the first nascent sequencing strand from the chip surface and performing a first blocking process on the 3' -end of the first nascent sequencing strand remaining on the chip surface.

According to one embodiment of the present invention, step S20 includes performing a first blocking process on the 3' -end of the first nascent sequencing strand and removing the blocked first nascent sequencing strand.

S30 second sequencing to obtain second sequencing data

In this step, the sequencing template is subjected to a second sequencing to obtain second sequencing data by forming a second nascent sequencing strand.

According to the embodiment of the invention, by performing two rounds of sequencing, after the first round of sequencing, namely the first sequencing, and performing blocking treatment on the 3' -end of the nascent sequencing strand remained on the surface of the chip, the generation of interference signals in the second round of sequencing, namely the second sequencing process can be effectively avoided. Therefore, the accuracy of the sequencing result can be improved.

According to an embodiment of the present invention, the first blocking treatment, the second blocking treatment, and the third blocking treatment may be independently performed by attaching a 3' -terminal hydroxyl group to an extension reaction blocking agent, respectively. This further improves the blocking effect, which further improves the accuracy of sequencing and reduces undesirable sequencing noise.

According to an embodiment of the invention, the extension reaction blocker is ddNTP or a derivative thereof. This further improves the blocking effect, which further improves the accuracy of sequencing and reduces undesirable sequencing noise.

According to an embodiment of the present invention, the first blocking treatment, the second blocking treatment, and the third blocking treatment are each independently performed using at least one of a DNA polymerase and a terminal transferase. This further improves the blocking effect, which further improves the accuracy of sequencing and reduces undesirable sequencing noise.

According to an embodiment of the present invention, the first blocking treatment and the third blocking treatment are each independently to link the ddNTP or a derivative thereof by a polymerase, and the second blocking treatment is to link the ddNTP or a derivative thereof by the terminal transferase. This further improves the blocking effect, which further improves the accuracy of sequencing and reduces undesirable sequencing noise.

According to a specific embodiment of the invention, the linker for constructing a GenoCare single-molecule Two-Pass sequencing library is obtained by annealing an oligonucleotide chain D7-S1-T and D9-S2 with 5' phosphate group modification, and the sequencing primer is D7S1T-R2P. The sequence of D7-S1-T is SEQ ID NO: 1, the sequence of D9-S2 is SEQ ID NO: 2, and the sequence of D7S1T-R2P is SEQ ID NO: 3.

First, Two-Pass sequencing is performed on the GenoCare single molecule sequencing platform using the above linker and sequencing primer to obtain Reads1 and Reads2:

Step one: constructing a Two-Pass sequencing library. Using a library preparation kit (Universal DNA Library Prep Kit for Illumina V2 (ND606-01)), the annealed D7-S1-T/D9-S2 linker is ligated to the prepared fragmented human gDNA. After ligation, the target library is obtained by direct purification with a purification kit (VAHTS DNA Clean Beads (N411-01)), without PCR amplification.

Step two: hybridizing the library obtained in step one with the linkers on the surface of the sequencing chip.

Step three: synthesizing complementary strands from the initial template hybridized to the chip surface in step two.

Step four (optional): blocking the 3' OH of the nascent strands that were not fully extended in step three, so as to reduce their interference with the sequencing process.

Step five: denaturing and removing the initial template hybridized to the chip surface in step two.

Step six: blocking the 3' OH of the residual linkers on the chip surface, so as to reduce their interference with the sequencing process.

Step seven: hybridizing the sequencing primer D7S1T-R2P, using the complementary strand synthesized in step three as a template.

Step eight: performing Read1 sequencing, using the complementary strand synthesized in step three as a template and the sequencing primer D7S1T-R2P hybridized in step seven as a primer.

Step nine: denaturing to remove the nascent sequencing strand formed in step eight.

Step ten: blocking the 3' OH of any nascent sequencing strand from step eight that remains after step nine, to prevent it from being extended further during Read2 sequencing.

Step eleven: hybridizing the sequencing primer D7S1T-R2P, using the complementary strand synthesized in step three as a template.

Step twelve: performing Read2 sequencing, using the complementary strand synthesized in step three as a template and the sequencing primer D7S1T-R2P hybridized in step eleven as a primer.

Step thirteen: splitting the sequencing data obtained in step eight and step twelve to obtain two sets of sequences, Reads1 and Reads2, whose coordinates correspond one to one.

Further, an analysis method for analyzing the Reads1 and Reads2 obtained above to obtain Consensus Reads, comprising:

Step fourteen: constructing a correction model. Reads that have the same coordinates and a length of ≥ 25 bp in both Reads1 and Reads2 obtained in step thirteen are extracted and output as two files, T1 (Reads1) and T2 (Reads2). The Reads in T1 and T2 are each aligned to a reference genome, and the probability of a Deletion or Insertion occurring at the middle base under different combinations of preceding and following bases is calculated by a naive Bayes method. During prediction, for a middle base under a given combination of preceding and following bases, whether the base is retained is determined according to the probability of Deletion or Insertion in the model: if the probability of Deletion is more than 50%, the middle base is retained; otherwise, it is discarded.

Step fifteen: the Reads1 data obtained in step thirteen are filtered by read length, and the set of Read1 sequences with a read length ≥ 25 bp is named Fa1. Filtering out short reads with a 25 bp cutoff removes part of the noise sequences and improves the accuracy of sequencing data mapping.

Step sixteen: Fa1 obtained in step fifteen is split into two sets according to the read length of the corresponding Read2 obtained in step thirteen. The set of Reads in Fa1 whose corresponding Read2 is ≥ 25 bp is named Fa2, and the set of Reads in Fa1 whose corresponding Read2 is ≥ 10 bp and ≤ 25 bp is named Fa3. Splitting Fa1 into two parts for analysis reduces the loss of data throughput caused by length filtering while improving the accuracy of the Consensus Reads.
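As a rough illustration of steps fifteen and sixteen, the length-based partition can be written as follows. The dictionary layout (coordinate mapped to sequence), the function name and the parameter names are assumptions made for this sketch. A Read2 of exactly 25 bp is assigned to Fa2 here so that the two sets stay disjoint, which is also an assumption.

```python
from typing import Dict, Tuple

ReadSet = Dict[str, str]  # coordinate -> base sequence (hypothetical layout)


def partition_by_length(
    reads1: ReadSet,
    reads2: ReadSet,
    min_len: int = 25,
    short_min: int = 10,
) -> Tuple[ReadSet, ReadSet, ReadSet]:
    """Build Fa1, Fa2 and Fa3 from coordinate-matched Reads1 and Reads2."""
    # Fa1: Read1 sequences that pass the length filter.
    fa1 = {coord: seq for coord, seq in reads1.items() if len(seq) >= min_len}

    fa2: ReadSet = {}
    fa3: ReadSet = {}
    for coord, seq in fa1.items():
        partner = reads2.get(coord)
        if partner is None:
            continue  # no corresponding Read2 at this coordinate
        if len(partner) >= min_len:
            fa2[coord] = seq  # long partner
        elif len(partner) >= short_min:
            fa3[coord] = seq  # short but still usable partner
        # partners shorter than short_min are dropped
    return fa1, fa2, fa3


reads1 = {"c1": "A" * 30, "c2": "C" * 30, "c3": "G" * 30, "c4": "T" * 20}
reads2 = {"c1": "A" * 28, "c2": "C" * 12, "c3": "G" * 5, "c4": "T" * 30}
fa1, fa2, fa3 = partition_by_length(reads1, reads2)
```

In this toy input, `c4` is removed by the Read1 length filter and `c3` is dropped because its Read2 is shorter than 10 bp.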

Step seventeen: the Q values of the Reads in Fa2 obtained in step sixteen are compared one by one with those of the Reads in Reads2 (obtained in step thirteen) at the corresponding coordinates. The set of Reads with the higher Q value (the Reads in Fa2 when the Q values are equal) is named Fa4, and the set of Reads with the lower Q value (the Reads in Reads2 when the Q values are equal) is named Fa5. This step divides the sequences in Reads1 and Reads2 into two sets of relatively higher and relatively lower sequencing quality, ensuring that each sequence finally output in the Consensus Reads comes from the more accurate of the two reads.

Step eighteen: Fa4 and Fa5 obtained in step seventeen are further filtered. The set of Reads with a Q value ≥ 60 in Fa4 is named Fa6, and the set of Reads in Fa5 corresponding to the coordinates of the Reads in Fa6 is named Fa7.
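Steps seventeen and eighteen can be sketched in a few lines. The `Read` tuple with one Q value per read, the function names and the dictionary layout are assumptions for illustration only. The threshold is applied as Q ≥ 60, by analogy with the filter described for Fa3.

```python
from typing import Dict, NamedTuple, Tuple


class Read(NamedTuple):
    seq: str
    q: float  # one quality value per read (assumed representation)


ReadSet = Dict[str, Read]  # coordinate -> read


def split_by_quality(fa2: ReadSet, reads2: ReadSet) -> Tuple[ReadSet, ReadSet]:
    """Fa4 receives the higher-Q read of each pair, Fa5 the lower-Q read."""
    fa4: ReadSet = {}
    fa5: ReadSet = {}
    for coord, first in fa2.items():
        second = reads2[coord]
        if first.q >= second.q:  # ties go to the read from Fa2
            fa4[coord], fa5[coord] = first, second
        else:
            fa4[coord], fa5[coord] = second, first
    return fa4, fa5


def filter_high_quality(
    fa4: ReadSet, fa5: ReadSet, q_min: float = 60
) -> Tuple[ReadSet, ReadSet]:
    """Fa6 keeps Fa4 reads with Q >= q_min; Fa7 holds their partners from Fa5."""
    fa6 = {coord: read for coord, read in fa4.items() if read.q >= q_min}
    fa7 = {coord: fa5[coord] for coord in fa6}
    return fa6, fa7


fa2 = {"c1": Read("ACGT", 70), "c2": Read("GGTA", 55), "c3": Read("TTAC", 65)}
reads2 = {"c1": Read("ACGA", 62), "c2": Read("GGTT", 80), "c3": Read("TTAC", 65)}
fa4, fa5 = split_by_quality(fa2, reads2)
fa6, fa7 = filter_high_quality(fa4, fa5)
```

Because Fa7 is derived from the keys of Fa6, the two sets stay coordinate-matched by construction.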

Step nineteen: the Reads in Fa6 obtained in step eighteen are aligned one by one with the Reads in Fa7 and graded according to the sequence similarity between them. Fa6 is corrected with Fa7 as the reference sequence: positions in the Fa6 sequence that differ from Fa7 are marked, and the correction model constructed in step fourteen is used to judge, one by one, whether the base at each differing position is a Deletion or an Insertion, so as to obtain the corrected Consensus Reads Part1 for output.

A "different position" in this step means a position at which a base is detected on only one of the two reads (Fa6 or Fa7). In that case, the correction model constructed in step fourteen determines whether the base should be retained. If bases are detected on both Fa6 and Fa7 at a position but the base types differ, the base type of Fa6 is taken as the standard; the model does not correct this situation.
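To make the notion of a "different position" concrete, the sketch below aligns two reads and lists the columns where only one of them carries a base. The source does not specify the alignment algorithm, so a plain Needleman-Wunsch global alignment with arbitrary scores is used here as a stand-in. All names and score values are assumptions.

```python
from typing import List, Optional, Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment; gaps are written as '-'."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def difference_sites(aln_a: str, aln_b: str) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Columns where exactly one read has a base: (column, base_a, base_b)."""
    sites = []
    for col, (x, y) in enumerate(zip(aln_a, aln_b)):
        if (x == "-") != (y == "-"):
            sites.append((col, None if x == "-" else x, None if y == "-" else y))
    return sites


aln_primary, aln_partner = global_align("ACGTTA", "ACGTA")
sites = difference_sites(aln_primary, aln_partner)
```

Columns where both reads carry different bases are deliberately not reported, since the text states that the base of the higher-quality read is kept there without consulting the model.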

Step twenty: the Reads in Fa3 obtained in step sixteen are further filtered, and the set of Reads with a Q value ≥ 60 in Fa3 is named Fa8.

Step twenty-one: the set of Reads in Reads2 obtained in step thirteen whose coordinates correspond one to one to those in Fa8 is extracted and named Fa9.

Step twenty-two: the Reads in Fa8 are aligned one by one with the Reads in Fa9 and graded according to sequence similarity. Fa8 is corrected with Fa9 as the reference sequence: positions in Fa8 that differ from Fa9 are marked, and the correction model constructed in step fourteen is used to judge, one by one, whether the base at each differing position is a Deletion or an Insertion, so as to obtain the corrected Consensus Reads Part2 for output.

Step twenty-three: according to the requirements of different applications, the Consensus Reads Part1 of different similarity levels are merged with the Consensus Reads Part2 to obtain the output Consensus Reads.
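The final merge might look like the sketch below. The source does not define how similarity is graded, so the score used here (fraction of alignment columns with identical bases) is an assumption. So are the names and the choice to apply the minimum similarity level to Part1 only and to take Part2 as a whole.

```python
from typing import Dict, NamedTuple


class Consensus(NamedTuple):
    seq: str
    similarity: float  # 0..1, assumed similarity grade from the alignment step


def alignment_similarity(aln_a: str, aln_b: str) -> float:
    """Fraction of alignment columns in which both reads show the same base."""
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return same / len(aln_a)


def merge_consensus(
    part1: Dict[str, Consensus],
    part2: Dict[str, Consensus],
    min_similarity: float = 0.0,
) -> Dict[str, Consensus]:
    """Merge Part1 reads at or above a similarity level with all Part2 reads."""
    merged = {
        coord: read
        for coord, read in part1.items()
        if read.similarity >= min_similarity
    }
    # Part1 and Part2 derive from disjoint sets (Fa2 and Fa3), so their
    # coordinates are not expected to collide.
    merged.update(part2)
    return merged


part1 = {"c1": Consensus("ACGT", 0.95), "c2": Consensus("TTGA", 0.70)}
part2 = {"c3": Consensus("GGCA", 0.90)}
merged = merge_consensus(part1, part2, min_similarity=0.8)
```

A stricter `min_similarity` trades throughput for accuracy, which matches the text's remark that the merge depends on application requirements.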

According to the embodiment of the present invention, the library and sequencing chip surface linker hybridization process for step two comprises (reagent convention):

1) pre-denaturing the library for hybridization at 90-100 ℃ for 2-5 minutes;

2) rapidly cooling the product obtained from the step 1) on an ice-water mixture for more than 2 minutes to obtain a denatured hybridization library mother liquor;

3) diluting the denatured hybridization library mother liquor obtained from the step 2) to a proper concentration, preferably 0.1-2 nM, by using 80% GenoCare hybridization solution, so as to obtain a diluted hybridization library;

4) introducing 30-50 μL of the diluted hybridization library obtained in step 3) into a sequencing chip channel pretreated with the reconstitution reagent, and hybridizing at 40-60 ℃ for 10-30 minutes;

5) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the diluted hybridization library remaining after the hybridization in step 4);

6) introducing 200-1000 μL of cleaning solution 2 into the chip channel to remove the cleaning solution 1 of step 5), thereby completing the hybridization of the library with the linkers on the sequencing chip surface.

According to an embodiment of the invention, the GenoCare hybridization solution is a 3× SSC solution.

According to an embodiment of the invention, the components of the reconstitution reagent comprise: the cleaning solution 1 comprises the following components: 150mM sodium chloride, 15mM sodium citrate, 150mM 4-hydroxyethylpiperazine ethanesulfonic acid, 0.1% sodium dodecyl sulfate.

According to the embodiment of the invention, the cleaning solution 3 comprises the following components: 450mM sodium chloride, 45mM sodium citrate.

According to an embodiment of the present invention, the cleaning liquid 2 comprises the following components: 150mM sodium chloride, 150mM 4-hydroxyethylpiperazine ethanesulfonic acid.

According to an embodiment of the present invention, the process of complementary strand synthesis for the initial template of step three comprises:

1) introducing 200-1000 μL of extension reagent into the chip channel, and reacting at 50-70 ℃ for 5-10 minutes;

2) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the extension reagent reacted in step 1);

3) introducing 200-1000 μL of cleaning solution 2 into the chip channel to remove the cleaning solution 1 of step 2), thereby completing the synthesis of the complementary strand of the initial template.

According to an embodiment of the invention, the extension reagent comprises the following components: the DNA polymerase is preferably Bst DNA polymerase, Bsu DNA polymerase, Klenow DNA polymerase, etc., 0.2-2 mM dNTP, 0.5-2M betaine, 20mM tris, 10mM sodium chloride, 10mM potassium chloride, 10mM ammonium sulfate, 3mM magnesium chloride, 0.1% Triton X-100, and has a pH of 8.3.

According to an embodiment of the present invention, the process of blocking the 3' OH of the chain that is incompletely extended as described for step four comprises:

1) introducing 200-1000 μL of blocking reagent 1 into the chip channel, and reacting at 30-60 ℃ for 5-30 minutes;

2) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the blocking reagent 1 reacted in step 1), thereby completing the blocking of the 3' OH of the incompletely extended strands.

According to an embodiment of the invention, the components of the blocking reagent 1 comprise: the DNA polymerase is preferably Klenow DNA polymerase, Bsu DNA polymerase, N9 DNA polymerase, etc., 10-100 μM ddNTP, 5mM manganese chloride, 20mM tris, 10mM sodium chloride, 10mM potassium chloride, 10mM ammonium sulfate, 3mM magnesium chloride, 0.1% Triton X-100, and has a pH of 8.3.

According to an embodiment of the present invention, the process of removing the initial template for step five includes:

1) introducing 200-1000 μL of a denaturing reagent, preferably formamide, 0.1 M NaOH or the like, into the chip channel, and reacting at 50-60 ℃ for 2-5 minutes;

2) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the denaturing reagent reacted in step 1) and the initial template denatured and detached from the chip;

3) repeating step 1) and step 2) once to complete the removal of the initial template.

According to the embodiment of the invention, the process of blocking the 3' OH of the residual linkers on the chip surface in step six comprises the following steps:

1) introducing 200-1000 μL of cleaning solution 2 into the chip channel;

2) introducing 200-1000 μL of blocking reagent 2 into the chip channel, and reacting at 30-60 ℃ for 5-30 minutes;

3) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the blocking reagent 2 reacted in step 2), thereby completing the blocking of the 3' OH of the residual linkers on the chip surface.

According to an embodiment of the invention, the components of the blocking reagent 2 comprise: 100 U/mL of Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mM of cobalt chloride, and 10-100 μM of ddNTP.

According to the embodiment of the invention, the process of hybridizing the sequencing primer D7S1T-R2P for step seven comprises:

1) diluting the mother solution of the sequencing primer D7S1T-R2P to a proper concentration, preferably 0.1-1 μM, with cleaning solution 3 to obtain a diluted sequencing primer hybridization solution;

2) introducing 200-1000 μL of the diluted sequencing primer hybridization solution obtained in step 1) into the chip channel, and hybridizing at 50-60 ℃ for 10-30 minutes;

3) introducing 200-1000 μL of cleaning solution 1 into the chip channel to remove the sequencing primer remaining after the hybridization in step 2);

4) introducing 200-1000 μL of cleaning solution 2 into the chip channel to remove the cleaning solution 1 of step 3), thereby completing the hybridization of the sequencing primer.

According to an embodiment of the invention, the sequencing process of Read1 described for step eight was performed with reference to the description in the GenoCare single-molecule two-color sequencing Universal kit (docket No.: Yuetui Med 20190887).

According to an embodiment of the present invention, the process of removing nascent sequencing strands described for step nine is performed with reference to step five.

According to an embodiment of the present invention, the process of blocking the 3' OH of the residual nascent strand described in step ten is performed with reference to step four.

According to an embodiment of the present invention, the process for hybridizing the sequencing primer D7S1T-R2P described in step eleven is performed with reference to step seven.

According to an embodiment of the present invention, the sequencing process for Read2 described in step twelve is performed with reference to step eight.

According to an embodiment of the present invention, the process of splitting the sequencing data to obtain two-part sequences of Reads1 and Reads2 with one-to-one coordinates as described in step thirteen includes:

1) using the Python language, dividing each Read in the ".fa_" file output by BaseCall evenly into two parts from the middle according to the number of sequencing cycles, and outputting two ".fa_" files, "Reads1.fa_" and "Reads2.fa_", with consistent sequence coordinates;

2) using the Python language, removing the "_" characters from the Reads in the "Reads1.fa_" and "Reads2.fa_" files obtained in step 1), and outputting the "Reads1.fa" and "Reads2.fa" files, thereby completing the splitting of the sequencing data into two sets of sequences, Reads1 and Reads2, whose coordinates correspond one to one.
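A minimal sketch of this splitting step is given below. It assumes that each record in the BaseCall output has already been parsed into a coordinate string and a per-cycle string in which "_" marks a cycle without a base call. That reading of the "_" character, together with the function name and record layout, is an assumption and is not confirmed by the source.

```python
from typing import Dict, Iterable, Tuple


def split_two_pass(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split each per-cycle read in the middle and strip "_" placeholders.

    records: (coordinate, cycle_string) pairs, one character per sequencing cycle.
    Returns two coordinate-matched dictionaries, Reads1 and Reads2.
    """
    reads1: Dict[str, str] = {}
    reads2: Dict[str, str] = {}
    for coord, cycles in records:
        half = len(cycles) // 2  # first half = first pass, second half = second pass
        reads1[coord] = cycles[:half].replace("_", "")
        reads2[coord] = cycles[half:].replace("_", "")
    return reads1, reads2


records = [("x10_y20", "AC_GT_A_CGTA"), ("x11_y42", "____TTTT")]
reads1, reads2 = split_two_pass(records)
```

Splitting before removing the placeholders matters: once "_" characters are stripped, the midpoint by cycle count can no longer be located.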

According to an embodiment of the present invention, the construction process of the correction model for step fourteen includes:

1) extracting Reads with the same coordinate and the reading length of more than or equal to 25bp in the sequences of Reads1 and Reads2 obtained in the step thirteen, and respectively outputting two fast files of T1(Read1) and T2(Read 2);

2) performing sliding alignment on each pair of corresponding Reads in the T1 and T2 files obtained in step 1), and marking the bases that are identical or different between the two Reads in the alignment result to obtain Common Reads;

3) mapping the T1 and T2 files obtained in the step 1) with reference sequences respectively to obtain Sam1 and Sam2 files;

4) finding the longest common substring Ref Reads in the reference sequence according to the corresponding Reads mapped to the same position in Sam1 and Sam2 obtained from step 3);

5) comparing different bases sequenced twice in Common Reads obtained in the step 2) with Ref Reads obtained in the step 4), and calculating the probability of the middle base generating Deletion or Insertion under different front and back base combinations by using a naive Bayes method to complete the construction of the correction model.
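The model-building step can be illustrated with a small categorical naive Bayes classifier over the preceding, middle and following bases. This is a sketch under stated assumptions, not the model of the source: the feature layout, the two labels, the Laplace smoothing and the class name `ContextNaiveBayes` are all hypothetical. The training tuples are assumed to have been derived beforehand by comparing Common Reads with Ref Reads.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

BASES = ("A", "C", "G", "T")
LABELS = ("deletion", "insertion")
Observation = Tuple[str, str, str, str]  # (previous, middle, next, label)


class ContextNaiveBayes:
    """Naive Bayes over (previous, middle, next) bases with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.label_counts: Counter = Counter()
        # (label, feature position) -> counts of the base seen at that position
        self.feature_counts = defaultdict(Counter)

    def fit(self, observations: Iterable[Observation]) -> "ContextNaiveBayes":
        for prev_b, mid_b, next_b, label in observations:
            self.label_counts[label] += 1
            for pos, base in enumerate((prev_b, mid_b, next_b)):
                self.feature_counts[(label, pos)][base] += 1
        return self

    def predict_proba(self, prev_b: str, mid_b: str, next_b: str) -> Dict[str, float]:
        total = sum(self.label_counts.values())
        scores = {}
        for label in LABELS:
            n_label = self.label_counts[label]
            # Smoothed prior P(label)
            p = (n_label + self.alpha) / (total + self.alpha * len(LABELS))
            # Multiply by smoothed likelihoods P(base at position | label)
            for pos, base in enumerate((prev_b, mid_b, next_b)):
                seen = self.feature_counts[(label, pos)][base]
                p *= (seen + self.alpha) / (n_label + self.alpha * len(BASES))
            scores[label] = p
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}


training = [("A", "T", "A", "deletion")] * 8 + [("G", "C", "G", "insertion")] * 2
model = ContextNaiveBayes().fit(training)
proba = model.predict_proba("A", "T", "A")
```

With smoothing, base contexts that never occurred in training still receive a finite probability instead of causing a division by zero.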

According to the method provided by the embodiment of the invention, a set of sequencing methods which jointly use a joint D7-S1-T/D9-S2 and a sequencing primer D7S1T-R2P and use the GenoCare single-molecule sequencing platform to perform Two-Pass sequencing to obtain Reads1 and Reads2 are provided by combining the characteristics of the GenoCare single-molecule sequencing platform. In another aspect, the present invention provides an assay for Consensus Reads using the Reads1 and Reads2 obtained by the Two-Pass sequencing method. This analysis method can significantly reduce the noise sequence and base error rate in the output Consensus Reads.

Sequencing result analysis system

In a third aspect of the present invention, the present invention also provides a sequencing result analysis system capable of implementing the above-mentioned sequencing result analysis method. Referring to FIGS. 10-12, according to an embodiment of the invention, the system includes: the sequencing device is suitable for obtaining a sequencing result through a double sequencing method, wherein the sequencing result comprises first sequencing data and second sequencing data, the first sequencing data and the second sequencing data are both composed of a plurality of reads, and at least one part of the reads in the first sequencing data have corresponding reads in the second sequencing data; an analysis device comprising a correction module adapted to perform a mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data in order to obtain final sequence information.

The system can effectively implement the sequencing result analysis method and the sequencing method, so that the accuracy of the sequencing result can be improved by mutually correcting the results of two rounds of sequencing. In addition, as described above, by blocking the 3' -end of the nascent sequencing strand remaining on the chip surface after the first round of sequencing, i.e., the first sequencing, it is possible to effectively avoid the generation of interfering signals during the second round of sequencing, i.e., the second sequencing. This can further improve the accuracy of the sequencing result.

The invention further provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method as set forth above.

The present invention also provides an electronic device, comprising: the computer readable storage medium as described above; and one or more processors for executing the program in the computer-readable storage medium.

The invention will be further explained with reference to specific examples. The experimental procedures used in the following examples are all conventional procedures unless otherwise specified. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Examples

This example provides a set of sequencing and analysis methods for reducing the noise and error rate of the output sequences of the GenoCare single-molecule sequencing platform. The GenoCare single-molecule sequencing platform detects the species of incorporated nucleotides using a TIRF imaging system. GenoCare sequencing can be performed in several modes. In the first mode, the four nucleotides carry the same fluorescent signal, and one nucleotide is added in each reaction for signal detection. In the second mode, the four nucleotides carry two different fluorescent signals, and two nucleotides are added in each reaction for signal detection. In the third mode, the four nucleotides carry four different fluorescent signals, and four nucleotides are added in each reaction cycle for signal detection. Specific sequencing procedures can be found in Single molecule targeted sequencing for cancer gene mutation detection, Scientific Reports | 6:26110 | DOI:10.1038/srep26110, and in patents CN201680047468.3, CN201910907555.7, CN201880077576.4 and CN201911331502.1.

Further, the sequencing and analysis method provided by this embodiment includes:

1) A sequencing method that jointly uses the linker D7-S1-T/D9-S2 and the sequencing primer D7S1T-R2P and performs Two-Pass sequencing on the GenoCare single molecule sequencing platform to obtain Reads1 and Reads2. The linker D7-S1-T/D9-S2 consists of an oligonucleotide chain D7-S1-T and D9-S2 with 5' phosphate group modification. The sequence of D7-S1-T is SEQ ID NO: 1, the sequence of D9-S2 is SEQ ID NO: 2, and the sequence of the sequencing primer D7S1T-R2P is SEQ ID NO: 3. Specifically, the sequences and names of the primers involved in the present invention are shown in Table 1.

Table 1: primer sequences and names

2) An analysis method that analyzes the Reads1 and Reads2 obtained by the above Two-Pass sequencing method to obtain Consensus Reads. This analysis method can significantly reduce the noise sequences and the base error rate in the output Consensus Reads.

Further, the sequencing method provided in this example for obtaining Reads1 and Reads2 by performing Two-Pass sequencing using the GenoCare single molecule sequencing platform using the linker D7-S1-T/D9-S2 and the sequencing primer D7S1T-R2P in combination comprises:

Step one: constructing a Two-Pass sequencing library. Using the Universal DNA Library Prep Kit for Illumina V2 (ND606-01), the annealed D7-S1-T/D9-S2 linker was ligated to the prepared fragmented human gDNA. After ligation, the target library was obtained by direct purification with VAHTS DNA Clean Beads (N411-01), without PCR amplification.

Specifically, the steps for constructing the Two-Pass sequencing library in this example include:

1) Human gDNA fragmentation: using a Covaris instrument with the parameters Peak Power 75, Duty Factor 25, Cycles/Burst 50 and Time (s) 250, 0.1-1 μg of human gDNA was sheared by sonication to obtain DNA fragments of 100-300 bp. Alternatively, this step can be performed by enzymatic digestion.

2) The DNA fragments were subjected to end repair and A-tailing, and the reaction system is shown in Table 2.

Table 2: reaction system

The reaction conditions are as follows: the reaction was carried out at 20 ℃ for 15 minutes, followed by 65 ℃ for 10 minutes.

3) The end-repaired, A-tailed product was ligated to the linker; the reaction system is shown in Table 3.

Table 3: reaction system

Component                          Volume
End-repaired, A-tailed product     20 μL
D7-S1-T/D9-S2 linker (20 μM)       5 μL
Ligation Mix                       25 μL
Total                              50 μL

Reaction conditions: the mixture was mixed thoroughly and then left at room temperature for 15 minutes.

4) Ligation product purification

The ligation product was purified using the reagents and procedure described in the instructions for VAHTS DNA Clean Beads (N411-01), and 10 μL of product was recovered to complete construction of the sequencing library. The specific steps are as follows:

a) transferring the ligation reaction system to a 1.5 mL EP tube, adding 0.8x (40 μL) magnetic beads, mixing by pipetting 10 times, and letting stand at room temperature for 3 minutes;

b) placing the 1.5 mL EP tube on a magnetic rack, letting stand for 2-3 minutes, and removing the supernatant;

c) rinsing the magnetic beads with 200 μL of 80% ethanol, incubating at room temperature for 30 seconds, and carefully removing the supernatant;

d) opening the lid and air-drying the magnetic beads for about 5-10 minutes until the residual ethanol has completely evaporated;

e) adding 22 μL of deionized water, removing the tube from the magnetic rack for elution, mixing thoroughly, letting stand at room temperature for 3 minutes, and placing on the magnetic rack for 3 minutes; once the liquid is clear, recovering 20 μL of product, adding 1.2x (24 μL) magnetic beads, mixing by pipetting 10 times, and letting stand at room temperature for 3 minutes;

f) placing the 1.5 mL EP tube on the magnetic rack, letting stand for 2-3 minutes, and removing the supernatant;

g) repeating steps c) to d) once;

h) adding 11 μL of deionized water, removing the tube from the magnetic rack for elution, mixing thoroughly, letting stand at room temperature for 3 minutes, and placing on the magnetic rack for 3 minutes; once the liquid is clear, recovering 10 μL of product to complete construction of the sequencing library.

5) Quantification and detection

The constructed library was tested for concentration using a Qubit 3.0 instrument and the Qubit dsDNA HS detection kit.

Fragment distribution detection was performed on the constructed library using a Labchip DNA HS detection kit and a Labchip instrument.

Step two: hybridizing the library obtained in step one with the probes on the sequencing chip surface.

Chip selection: the chip used is an epoxy-modified chip. The probe, with the sequence 5'-TTTTTTTTTTTCCTTGATACCTGCGACCATCCAGTTCCACTCAGATGTGTATAAGAGACAGT-3' (SEQ ID NO: 4), is attached to the chip via the amino group on the probe.

The library was hybridized to the on-chip probe as follows:

1) taking 3 μL of the 20 nM sequencing library constructed in step one, adding 3 μL of deionized water, mixing well, and heat-denaturing at 95 ℃ for 5 minutes;

2) rapidly placing the denatured library from step 1) in an ice-water mixture and cooling for more than 2 minutes;

3) adding 24 μL of GenoCare hybridization solution to the product of step 2) to dilute the library to a working concentration of 2 nM. The hybridization solution is 3x SSC buffer, prepared by diluting 20x SSC buffer (Sigma, #S6639-1L) with nuclease-free water (RNase-free water);

4) introducing 30 μL of the diluted hybridization library from step 3) into one channel of the chip, carrying out the hybridization reaction at 42 ℃ for 30 minutes, and then cooling to room temperature;

5) introducing 200 μL of cleaning solution 1 into the hybridization channel from step 4) to remove the library that has not hybridized to the chip surface;

6) introducing 200 μL of cleaning solution 2 into the chip hybridization channel to replace the cleaning solution 1 in the channel, completing the hybridization of the library to the probes on the sequencing chip surface.

The cleaning solution 1 comprises the following components: 150mM sodium chloride, 15mM sodium citrate, 150mM 4-hydroxyethylpiperazine ethanesulfonic acid, 0.1% sodium dodecyl sulfate.

The cleaning liquid 2 comprises the following components: 150mM sodium chloride, 150mM 4-hydroxyethylpiperazine ethanesulfonic acid.
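The dilution arithmetic in the hybridization steps above can be checked with a minimal sketch that applies C1·V1 = C2·V2. The function name and units are illustrative assumptions and are not part of the original protocol.

```python
def diluted_concentration(stock_nM: float, stock_uL: float, diluent_uL: float) -> float:
    """Concentration (nM) after adding diluent_uL of diluent to stock_uL of stock."""
    total_uL = stock_uL + diluent_uL
    if total_uL <= 0:
        raise ValueError("total volume must be positive")
    return stock_nM * stock_uL / total_uL


# 3 uL of 20 nM library plus 3 uL of deionized water (denaturation step)
after_water = diluted_concentration(20.0, 3.0, 3.0)
# the resulting 6 uL plus 24 uL of hybridization solution
working = diluted_concentration(after_water, 6.0, 24.0)
print(after_water, working)
```

Under these volumes the sketch gives 10 nM after adding water and 2 nM after adding the hybridization solution, matching the stated working concentration.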

Step three: the initial template is subjected to complementary strand synthesis.

The initial template is a library which is subjected to hybridization with the probe in the step two, and the specific steps of the complementary strand synthesis of the initial template are as follows:

1) placing the chip which completes library hybridization in the step two in a GenoCare sequencer;

2) pumping 750 μL of extension reagent into the chip hybridization channel, wherein the extension reagent comprises the following components: 120 U/mL Bst DNA polymerase (NEB, #M0275M), 0.2 mM dNTP (a mixture of 0.2 μM each of dATP, dTTP, dCTP and dGTP), 1 M betaine, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, 0.1% Triton X-100, pH 8.3;

3) heating the chip to 60 ± 0.5 ℃ and reacting for 10 minutes;

4) pumping 220 μL of cleaning solution 1 into the chip hybridization channel to remove the extension reagent;

5) pumping 440 μL of cleaning solution 2 into the chip hybridization channel to remove the cleaning solution 1 from step 4), completing the synthesis of the complementary strand of the initial template.

Step four (optional): blocking the 3' OH of the nascent strands that were not fully extended in step three. The specific blocking steps are as follows:

1) cooling the chip to 37 ± 0.5 ℃ and holding for 90 seconds;

2) pumping 750 μL of blocking reagent 1 into the channel extended in step three and reacting for 10 minutes. Blocking reagent 1 comprises the following components: 100 U/mL Klenow DNA polymerase large fragment (3'→5' exo-, NEB, #M0212M), 12.5 μM ddNTP mix (12.5 μM each of ddATP, ddTTP, ddCTP and ddGTP), 5 mM manganese chloride, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, 0.1% Triton X-100, pH 8.3;

3) introducing 220 μL of cleaning solution 1 into the channel blocked in step 2) to remove the blocking reagent remaining after the blocking reaction, completing the blocking of the 3' OH of the incompletely extended nascent strands.

Step five: removing the initial template by denaturation. The procedure is as follows:

1) cooling the chip to 55 ± 0.5 ℃;

2) introducing 800 μL of formamide into the channel blocked in step four and denaturing for 2 minutes;

3) introducing 220 μL of cleaning solution 1 into the channel denatured in step 2) to remove the denatured initial template;

4) repeating step 2) and step 3) once to complete the removal of the initial template.

Step six: blocking the 3' OH of the residual probes on the chip surface. The procedure is as follows:

1) cooling the chip to 37 ± 0.5 ℃;

2) introducing 440 μL of cleaning solution 2 into the channel treated in step five to replace the cleaning solution 1 remaining in the channel;

3) introducing 750 μL of blocking reagent 2 into the channel treated in step 2) and reacting for 15 minutes. Blocking reagent 2 comprises the following components: 100 U/mL Terminal Transferase (NEB, M0315L), 1x Terminal Transferase Buffer, 0.25 mM cobalt chloride, 100 μM ddNTP mix (100 μM each of ddATP, ddTTP, ddCTP and ddGTP);

4) introducing 220 μL of cleaning solution 1 into the channel blocked in step 3), completing the blocking of the 3' OH of the residual probes on the chip surface.

Step seven: hybridizing the sequencing primer D7S1T-R2P. The procedure is as follows:

1) heating the chip to 55 ± 0.5 ℃ and holding for 1 minute;

2) introducing 800 μL of diluted sequencing primer hybridization solution into the channel blocked in step six and carrying out the hybridization reaction for 30 minutes. The diluted sequencing primer hybridization solution is cleaning solution 3 containing 0.1 μM of primer D7S1T-R2P, and cleaning solution 3 comprises the following components: 450 mM sodium chloride, 45 mM sodium citrate;

3) cooling the chip to 37 ± 0.5 ℃ and holding for 90 seconds;

4) introducing 220 μL of cleaning solution 1 into the channel hybridized in step 2) to remove the sequencing primer that has not hybridized;

5) introducing 440 μL of cleaning solution 2 into the channel treated in step 4) to replace the cleaning solution 1 remaining in the channel, completing the hybridization of the sequencing primer.

Step eight: Read1 sequencing was performed. The sequencing procedure for Read1 is as follows:

80 cycles of sequencing were carried out on the GenoCare single-molecule sequencing platform. The sequencing used four nucleotides carrying two different fluorescent signals, and two nucleotides labeled with different fluorescent signals were added in each reaction cycle for signal detection.

Step nine: the nascent sequencing strand was removed.

The process of removing the nascent sequencing strand was performed according to the procedure in step five.

Step ten: blocking the 3' OH of the residual nascent strand.

The process of blocking the 3' OH of the residual nascent strand proceeds as in step four.

Step eleven: the sequencing primer D7S1T-R2P was hybridized.

The process of hybridizing the sequencing primer D7S1T-R2P was performed according to the procedure in step seven.

Step twelve: read2 sequencing was performed.

The sequencing process of Read2 is performed according to the steps in step eight.

Step thirteen: splitting the sequencing data to obtain two sets of sequences, Reads1 and Reads2, whose coordinates correspond one to one.

Specifically, the process of splitting the sequencing data to obtain two-part sequences of Reads1 and Reads2 with one-to-one coordinates in this embodiment includes:

Using Python, each Read in the ".fa" file output by BaseCall for the 160-cycle sequencing run is split into a first 80-cycle part and a second 80-cycle part, and the non-base characters are removed from all Reads. The two parts are output as two ".fa" files with consistent sequence coordinates, "Reads1.fa" and "Reads2.fa", which completes the splitting of the sequencing data into the two coordinate-matched sets of sequences, Reads1 and Reads2.
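The splitting step can be sketched as follows. This is a minimal illustration, not the original script: it assumes one character per sequencing cycle, a FASTA-like layout with a ">" header line followed by a single sequence line, and "-" as the non-base placeholder character. The function names `split_read` and `split_records` are hypothetical.

```python
def split_read(seq: str, cycles_per_pass: int = 80, placeholder: str = "-"):
    """Split one per-cycle read into its two passes and drop placeholder characters.

    Assumption: position k in `seq` corresponds to sequencing cycle k.
    """
    first = seq[:cycles_per_pass]
    second = seq[cycles_per_pass:2 * cycles_per_pass]
    return first.replace(placeholder, ""), second.replace(placeholder, "")


def split_records(lines, cycles_per_pass: int = 80, placeholder: str = "-"):
    """Return (reads1_lines, reads2_lines); both keep the same headers in the same order."""
    reads1, reads2 = [], []
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line
        elif header is not None:
            r1, r2 = split_read(line, cycles_per_pass, placeholder)
            reads1 += [header, r1]
            reads2 += [header, r2]
            header = None
    return reads1, reads2


# Toy example with 4 cycles per pass instead of 80
demo1, demo2 = split_records([">read_0", "A-CGT-TA"], cycles_per_pass=4)
print(demo1, demo2)
```

Keeping the header identical in both outputs is what preserves the one-to-one coordinate correspondence that the later steps rely on.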

Further, the set of analysis methods provided in this embodiment for analyzing the Reads1 and Reads2 obtained by the Two-pass sequencing method to obtain Consensus Reads includes:

Step fourteen: constructing a correction model.

Specifically, the process of constructing the correction model in this embodiment includes:

1) using Python, extracting from the Reads1 and Reads2 sequences obtained in step thirteen the Reads that have the same coordinate and a read length of at least 25 bp in both sequencing passes, and outputting them as two files, T1 (Read1) and T2 (Read2). When the Reads files are generated, Reads with the same coordinate in the different files are given the same ID;

2) aligning the positionally corresponding Reads in T1 and T2 with each other, and marking the bases that are consistent and inconsistent between the two Reads in the alignment result to obtain Common Reads. Positional correspondence is established by checking whether the IDs of the two Reads are identical;

3) mapping T1 and T2 to the Reference separately to obtain the Sam1 and Sam2 files. Using the positionally corresponding Reads in Sam1 and Sam2 that map to the same location, finding the longest common substring in the Reference, Ref Reads. The common substring is the region covered by the mappings of the two corresponding Reads;

4) comparing the Common Reads from step 2) with the Ref Reads from step 3). For each Base that is inconsistent in the Common Reads, marking whether it is actually present in the Reference. If it is present, the Read in which it was not detected is marked as having a Deletion. If it is absent, the Read in which it was detected is marked as having an Insertion;

5) counting the Deletion and Insertion events from step 4), together with the types of the bases before and after each inconsistent position. This gives the probability of an Insertion or a Deletion occurring between different flanking Base types.

Specifically, the naive bayes model used in this example is as follows:
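Assuming the model is Bayes' rule applied to the flanking-base context, which is an assumption because only the symbol definitions are given here, it can be written as follows. The denominator P(XY), the overall frequency of the flanking pair XY, is likewise assumed.

```latex
P(D \mid XY) = \frac{P(XY \mid D)\,P(D)}{P(XY)}, \qquad
P(I \mid XY) = \frac{P(XY \mid I)\,P(I)}{P(XY)}
```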

wherein: p (D | XY) represents the probability of Deletion occurring for a base when X and Y bases are at the beginning and end, respectively, X, Y ∈ [ A, C, G, T ]. P (D) represents the probability of Deletion occurring for a base; p (I) represents the probability of an insert occurring for a base.

P (XY | D) and P (XY | I) can be obtained by counting the occurrence frequency of the front and back bases when Deletion or Insertion occurs under different bases, so that P (D | XY) and P (I | XY) can be obtained by calculation.
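The counting described above can be sketched in Python. This is a hedged illustration rather than the original implementation: it assumes Bayes' rule with P(XY) estimated from the frequency of each flanking pair over all counted positions, and the function name `context_posteriors` and the count tables are hypothetical.

```python
from collections import Counter


def context_posteriors(event_ctx: Counter, all_ctx: Counter) -> dict:
    """Estimate P(event | XY) = P(XY | event) * P(event) / P(XY) from raw counts.

    event_ctx: counts of (X, Y) flanking pairs at positions where the event
               (Deletion or Insertion) occurred.
    all_ctx:   counts of (X, Y) flanking pairs over all positions examined.
    """
    n_events = sum(event_ctx.values())
    n_total = sum(all_ctx.values())
    if n_total == 0:
        return {}
    p_event = n_events / n_total
    posteriors = {}
    for ctx, n_ctx in all_ctx.items():
        p_ctx_given_event = event_ctx[ctx] / n_events if n_events else 0.0
        p_ctx = n_ctx / n_total
        posteriors[ctx] = p_ctx_given_event * p_event / p_ctx
    return posteriors


# Made-up counts, for illustration only
all_positions = Counter({("A", "C"): 100, ("G", "T"): 50})
deletions = Counter({("A", "C"): 10, ("G", "T"): 30})
p_del = context_posteriors(deletions, all_positions)
print(p_del)
```

With these estimators the posterior reduces to the fraction of XY positions at which the event was observed. The explicit Bayes form is kept so that the code mirrors the quantities named in the text.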

Step fifteen: filtering by read length to obtain Fa1.

Specifically, the process of read length filtering in this embodiment includes:

reading all Reads in the Reads1 file line by line using Python, and outputting to the text file Fa1 those Reads whose length is greater than or equal to 25 bp.

Step sixteen: classifying the Reads in Fa1 according to the Reads2 read length.

Specifically, the process of classifying the Reads in Fa1 according to the Reads2 read length in this example includes:

reading every Read in Fa1 together with its corresponding Read in Reads2, and classifying by the length of the Read in Reads2: if Read2 ≥ 25 bp, the corresponding Read in Fa1 is saved to the file Fa2; if 10 bp ≤ Read2 < 25 bp, the corresponding Read in Fa1 is saved to the file Fa3.
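Steps fifteen and sixteen can be sketched together as one pass over ID-matched reads. This is a minimal sketch under assumptions: reads are held in dictionaries keyed by a shared read ID, and the function name `classify_by_length` is hypothetical. The thresholds of 25 bp and 10 bp come from the text.

```python
def classify_by_length(reads1: dict, reads2: dict, min_len: int = 25, short_min: int = 10):
    """Return (fa2, fa3): Reads1 entries grouped by the length of their Reads2 partner.

    fa2: Read1 >= min_len and Read2 >= min_len
    fa3: Read1 >= min_len and short_min <= Read2 < min_len
    """
    fa2, fa3 = {}, {}
    for read_id, seq1 in reads1.items():
        if len(seq1) < min_len:
            continue  # step fifteen: not admitted to Fa1
        seq2 = reads2.get(read_id)
        if seq2 is None:
            continue  # no corresponding read in Reads2
        if len(seq2) >= min_len:
            fa2[read_id] = seq1
        elif len(seq2) >= short_min:
            fa3[read_id] = seq1
    return fa2, fa3


demo_reads1 = {"a": "A" * 30, "b": "A" * 30, "c": "A" * 20, "d": "A" * 30}
demo_reads2 = {"a": "C" * 26, "b": "C" * 12, "c": "C" * 30, "d": "C" * 5}
fa2, fa3 = classify_by_length(demo_reads1, demo_reads2)
print(sorted(fa2), sorted(fa3))
```

Reads whose Reads2 partner is shorter than 10 bp fall into neither file, which is consistent with the two cases listed in the text.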

Step seventeen: outputting confident Reads according to the Q value.

Specifically, the process of outputting the confident Reads according to the Q value in this example includes:

1) taking all the Reads in Fa2 from step sixteen together with their corresponding Reads in Reads2. The quality score (Q value for short) of each Read is obtained by splitting its Read ID.

2) comparing the Q values of the two corresponding Reads, and outputting the Read with the larger Q value to the file Fa4 and the Read with the smaller Q value to the file Fa5. If the Q values are equal, by default the Read from Reads1 is output to Fa4 and the Read from Reads2 is output to Fa5.

Step eighteen: filtering the Reads in Fa4 and Fa5 according to Q value.

Specifically, the process of filtering the Reads in Fa4 and Fa5 according to Q value in this example includes:

if the Q value of a Read in Fa4 is greater than or equal to 60, that Read and its corresponding Read in Fa5 are output to the files Fa6 and Fa7, respectively.
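Steps seventeen and eighteen can be sketched as below. The Read ID layout is not specified in the text, so this sketch assumes a hypothetical ID of the form `<coordinate>_<Q>` with the Q value as the last underscore-separated field. The function names are likewise illustrative.

```python
def q_from_id(read_id: str, sep: str = "_", field: int = -1) -> float:
    """Parse the Q value from a read ID (assumed layout: '<coordinate>_<Q>')."""
    return float(read_id.split(sep)[field])


def rank_pair(read1, read2):
    """Return (higher_q, lower_q); on a tie the Reads1 read is ranked first."""
    q1, q2 = q_from_id(read1[0]), q_from_id(read2[0])
    return (read1, read2) if q1 >= q2 else (read2, read1)


def select_confident(pairs, q_min: float = 60):
    """Split pairs into Fa6 (higher-Q reads with Q >= q_min) and Fa7 (their partners)."""
    fa6, fa7 = [], []
    for read1, read2 in pairs:
        high, low = rank_pair(read1, read2)   # Fa4 / Fa5 in the text
        if q_from_id(high[0]) >= q_min:
            fa6.append(high)
            fa7.append(low)
    return fa6, fa7


demo_pairs = [
    (("r1_70", "AAA"), ("r1_65", "AAT")),
    (("r2_40", "CCC"), ("r2_55", "CCG")),
    (("r3_60", "GGG"), ("r3_60", "GGT")),
]
fa6, fa7 = select_confident(demo_pairs)
print(fa6, fa7)
```

In this toy input the second pair is dropped because its higher Q value is below 60, and the tied third pair keeps the Reads1 read as the higher-ranked one.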

Step nineteen: using the Reads in Fa7 to correct the Reads in Fa6 to obtain Consensus Reads Part1 (CRP1).

Specifically, the process of using the Reads in Fa7 to correct the Reads in Fa6 in this example includes:

1) taking each Read in Fa6 and its corresponding Read in Fa7. The two corresponding Reads are aligned with each other to obtain their common consensus sequence portion. The two sequences are aligned using the Smith-Waterman algorithm, and the consensus sequence is the locally optimal matching sequence obtained after alignment by adding, deleting or modifying some Bases in the sequences.

2) After the consensus sequence is obtained, the inconsistent Base positions in it are evaluated one by one using the correction model constructed in step fourteen. The probability of a Deletion or an Insertion at the position is calculated from the Base types before and after it. If the probability of Deletion is more than 50%, the Base measured at the position is considered not to be present, and the Base at the position is deleted. Otherwise, the Base at that position is retained.

3) After all inconsistent Bases have been corrected, the corrected Reads, namely CRP1, are output. An inconsistent Base here is a Base that was not measured in both of the two corresponding Reads. If a Base was measured in both passes but the Base types disagree, it is outside the scope of correction in this example; in that case the final Base type follows the Read in Fa6.
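The alignment and correction steps above can be sketched as follows. This is an illustrative sketch, not the original implementation: the scoring values (match 2, mismatch -1, linear gap -1), the "N" placeholder for a missing flanking base, and the function names are assumptions. The 50% rule is applied exactly as stated in the text, with `p_del` standing in for the P(D|XY) table from step fourteen.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Best local alignment of a and b, returned as two equal-length gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def correct(aln_a: str, aln_b: str, p_del: dict, threshold: float = 0.5) -> str:
    """Apply the stated rule to an aligned pair (aln_a from Fa6, aln_b from Fa7)."""
    out = []
    for k, (x, y) in enumerate(zip(aln_a, aln_b)):
        if x != "-" and y != "-":
            out.append(x)  # measured in both passes: keep the Fa6 base
            continue
        base = x if x != "-" else y
        prev_base = out[-1] if out else "N"
        following = [c for c in aln_a[k + 1:k + 2] + aln_b[k + 1:k + 2] if c != "-"]
        next_base = following[0] if following else "N"
        if p_del.get((prev_base, next_base), 0.0) > threshold:
            continue  # probability above 50%: base treated as absent, as in the text
        out.append(base)
    return "".join(out)


aln_a, aln_b = smith_waterman("ACGTTGCA", "ACGTGCA")
print(aln_a, aln_b, correct(aln_a, aln_b, {("G", "T"): 0.9}))
```

A production pipeline would more likely rely on an established alignment library and affine gap penalties. The hand-written version is used here only to keep the sketch self-contained.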

Step twenty: filtering the Reads in Fa3 according to Q value.

Specifically, the process of filtering the Reads in Fa3 according to Q value in this example includes:

taking all the Reads in Fa3 and obtaining the Q value of each Read by splitting its Read ID. Reads with a Q value of 60 or more are output to the file Fa8.

Step twenty-one: outputting the Reads in Reads2 that correspond to the Reads in the Fa8 file.

Specifically, the process of outputting the Reads in Reads2 that correspond to the Reads in Fa8 in this example includes:

taking all the Reads in the Fa8 file, taking their corresponding Reads in Reads2, and outputting those corresponding Reads to the file Fa9.

Step twenty-two: using the Reads in Fa9 to correct the Reads in Fa8 to obtain Consensus Reads Part2 (CRP2).

Specifically, the process of using the Reads in Fa9 to correct the Reads in Fa8 in this example is performed as described in step nineteen.

Step twenty-three: according to the accuracy that different applications require of the sequencing data, merging and outputting the Reads in CRP1 and CRP2 that meet the similarity threshold to obtain the Consensus Reads.

Specifically, the process of filtering and outputting the Reads in the Consensus Reads Parts according to the accuracy that different applications require of the sequencing data in this example includes:

1) setting the corresponding similarity thresholds according to the accuracy that different applications require of the sequencing data. The similarity thresholds for Part1 and Part2 may be different;

2) calculating separately the similarity of each Read in CRP1 and CRP2, where similarity refers to the similarity between the corresponding Reads in Reads1 and Reads2. To calculate it, the two corresponding Reads are aligned with each other, and the ratio of the number of consistent Bases to the total number of Bases in the consensus sequence obtained from the alignment is computed. The alignment method, the consensus sequence and the inconsistent Base are as defined in step nineteen;

3) according to the accuracy that different applications require of the sequencing data, outputting the Reads in CRP1 and CRP2 that meet the similarity threshold to the final file to obtain the Consensus Reads; see Table 4.

TABLE 4 comparison of different similarity threshold filtered output sequences with reference genomic mapping analysis

Note: data loss occurs mainly at the read length filtering step. Because Read1 and Read2 sequencing are independent events, some sequences inevitably have inconsistent read lengths between the two passes.
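The similarity filter of step twenty-three can be sketched as below. This is a minimal illustration under assumptions: the two reads are supplied as already aligned, equal-length strings with "-" marking gaps, and the function names and record layout are hypothetical.

```python
def similarity(aln_a: str, aln_b: str) -> float:
    """Ratio of consistent bases to total aligned positions in a gapped alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    if not aln_a:
        return 0.0
    consistent = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return consistent / len(aln_a)


def filter_by_similarity(records, threshold: float):
    """Keep the IDs of reads whose Read1/Read2 alignment meets the similarity threshold.

    records: iterable of (read_id, aligned_read1, aligned_read2).
    """
    return [rid for rid, a, b in records if similarity(a, b) >= threshold]


demo_records = [
    ("r1", "ACGTTGCA", "ACG-TGCA"),   # 7 of 8 positions consistent
    ("r2", "ACGT", "AGGA"),           # 2 of 4 positions consistent
]
print(filter_by_similarity(demo_records, 0.8))
```

Because the thresholds for Part1 and Part2 may differ, the filter would be called once per part with its own threshold and the results merged afterwards.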

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict each other.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Sequence listing

<110> Shenzhen Zhenzhiji Biotech Limited

<120> sequencing result analysis method, system, computer-readable storage medium, and electronic device

<130> PIDC4200100

<160> 4

<170> PatentIn version 3.5

<210> 1

<211> 51

<212> DNA

<213> Artificial Sequence

<220>

<223> primer

<400> 1

ctcagatcct acaacgacgc tctaccgatg aagatgtgta taagagacag t 51

<210> 2

<211> 51

<212> DNA

<213> Artificial Sequence

<220>

<223> primer

<400> 2

ctgtctctta tacacatctg agtggaactg gatggtcgca ggtatcaagg a 51

<210> 3

<211> 43

<212> DNA

<213> Artificial Sequence

<220>

<223> primer

<400> 3

ctacaacgac gctctaccga tgaagatgtg tataagagac agt 43

<210> 4

<211> 62

<212> DNA

<213> Artificial Sequence

<220>

<223> Probe sequence

<400> 4

tttttttttt tccttgatac ctgcgaccat ccagttccac tcagatgtgt ataagagaca 60

gt 62
