Method for de novo genome assembly by combining third-generation ultra-long sequencing reads and second-generation linked reads

Document No.: 1398283 Publication date: 2020-03-03

Abstract: This technology, "Method for de novo genome assembly by combining third-generation ultra-long sequencing reads and second-generation linked reads", was devised by Ma Zhanshan (马占山), Zhang Yaping (张亚平), Li Lianwei (李连伟) and Peng Minsheng (彭旻晟) on 2018-08-11. The invention discloses a method for efficient, high-quality de novo genome assembly that combines third-generation ultra-long sequencing reads (ultra-long reads) with second-generation linked reads (linked-reads). Third-generation ultra-long reads are those produced by Nanopore and PacBio, currently the most widely used third-generation sequencing technologies; second-generation linked reads are those produced by the 10x Genomics sequencing platform. A high-quality genome sequence is assembled from both with efficient hybrid assembly software. The method exploits the strengths of both read types and, by using the efficient assembly software DBG2OLC and Sparc, greatly reduces the cost of applying third-generation sequencing technology. It provides an efficient, reliable and economical method for large-scale, high-quality de novo genome assembly using third-generation sequencing technology.

1. A method for assembling a genome de novo with high efficiency, high quality and low cost by combining third-generation ultra-long sequencing reads (ultra-long reads) and second-generation linked reads (linked-reads).

2. The method of claim 1, wherein the third-generation ultra-long sequencing reads are ultra-long sequencing reads generated by PacBio, Nanopore or another third-generation sequencing technology.

3. The method of claim 1, wherein the second-generation sequencing data are linked reads generated by sequencing on a 10x Genomics sequencing platform.

4. The method of claim 1, wherein the third-generation ultra-long sequencing reads and the second-generation linked reads are assembled de novo using the hybrid assembly software DBG2OLC and Sparc.

5. The method according to any one or more of claims 1-4, characterized in that its concepts, algorithms and functions are implemented, in any form of software, firmware and/or hardware, in products that provide services, including various medical instruments and scopes.

6. Any algorithms and software developed according to the integrated method of claims 1, 2, 3.

7. The method of any one or more of claims 1, 2, 3, 6, characterized in that its concepts, algorithms and functions are implemented, in any form of software, firmware and/or hardware, in products that provide services, including various medical instruments and scopes.

Technical Field

The invention relates to a method for de novo assembly of genome sequencing data, in particular to a method for hybrid assembly of third-generation and second-generation sequencing data. The third-generation sequencing data are mainly ultra-long reads generated by PacBio, Nanopore or other sequencing technologies, and the second-generation sequencing data are mainly linked reads generated by 10x Genomics sequencing. Combined with the efficient assembly software DBG2OLC, the method greatly reduces the sequencing cost and the computational cost (especially the cost of applying third-generation sequencing technology). It provides an efficient and reliable method for large-scale, high-quality de novo genome assembly using third-generation sequencing technology.

Background

With the development of sequencing technology, the genome sequence information produced by de novo genome assembly has become increasingly detailed and accurate. The 10x Genomics sequencing technology developed in recent years can assemble homologous chromosomes de novo at lower cost. Conventional second-generation sequencing produces short reads (150-300 bp), which cannot resolve the many repeated sequences in a genome, so the assembly breaks into a large number of short fragments. The 10x technology builds on Illumina sequencing: a barcode is attached to each DNA fragment to be sequenced, and the barcode marks the source of every sequencing read (hence linked reads). This greatly reduces the complexity of assembly and avoids wrongly joining reads from different sources. The 10x sequencing technology thus solves the repeat-induced problems of conventional second-generation sequencing and can assemble a longer draft genome.

Third-generation sequencing technology overcomes the shortcomings of conventional second-generation sequencing and can generate ultra-long sequencing reads. The PacBio platform from Pacific Biosciences and the Nanopore platform from Oxford Nanopore Technologies (UK) are currently the most widely used third-generation sequencing technologies. The ultra-long reads they generate can also resolve the assembly of repeated sequences in the genome. Read lengths produced by Nanopore sequencing reach hundreds of kb and can even reach the Mb level.

However, the assembly problems caused by the high error rate of third-generation sequencing have become an obstacle to its broad adoption. Errors in third-generation sequencing data occur at random positions, so they can be corrected by increasing the sequencing coverage; but higher coverage means more sequencing data, which raises both sequencing and computational costs. Although PacBio and Nanopore sequencing have been applied successfully to de novo genome sequencing, these high costs have hindered the large-scale application of third-generation sequencing technology.

A hybrid assembly strategy that combines second-generation and third-generation sequencing data overcomes, to a certain extent, both the high error rate of third-generation data and the short read length of second-generation data. The two kinds of data compensate for each other's weaknesses: assembly efficiency and accuracy improve while assembly quality is maintained, and sequencing and computational costs fall. DBG2OLC is efficient hybrid assembly software developed for this strategy, and the inventors took part in its co-development (title of the invention: method, system and apparatus for assembling genomic sequences; application No. 201510084489X, which has entered the substantive examination stage and was published on October 5, 2016). In the present invention, DBG2OLC is used to assemble the ultra-long reads generated by third-generation sequencing together with the linked reads generated by 10x Genomics sequencing, so that the strengths of both read types are combined and the sequencing and computational costs are greatly reduced.

Disclosure of Invention

The object of the invention is:

to provide a method for assembling a genome de novo efficiently and at low cost by combining the ultra-long reads of third-generation sequencing data with the linked reads of second-generation 10x Genomics sequencing. Because its sequencing and computational costs are lower, the method can be used for large-scale de novo sequencing and assembly of the genomes of multiple species at the population or community level, and it provides technical support for the wider adoption of third-generation sequencing technology.

The technical scheme adopted by the invention is as follows:

The invention combines the ultra-long reads of third-generation sequencing data with the linked reads of second-generation 10x Genomics sequencing to assemble a genome de novo efficiently and at low cost. The main technical process is divided into four steps (shown in FIG. 1):

(1) Assemble the 10x Genomics linked reads using the Supernova software. The 10x Genomics sequencing platform uses barcodes to mark the different DNA fragments to be sequenced. The resulting linked reads have the property that reads from the same source can be brought back together through their barcodes, which avoids wrongly joining different DNA fragments and greatly reduces computational complexity. Supernova assembles scaffolds directly.

(2) Convert the scaffolds assembled by Supernova into contigs. The scaffolds from the previous step are converted back into contigs for the subsequent DBG2OLC assembly (DBG2OLC requires contigs as input and cannot take scaffolds).

(3) Assemble the contigs generated in step (2) together with the ultra-long reads generated by third-generation sequencing using DBG2OLC. DBG2OLC is software that efficiently performs hybrid assembly of second-generation and third-generation sequencing data. It takes as input the contigs assembled from second-generation data and the raw third-generation data, and the errors in the third-generation data are corrected with the second-generation data. In this process, the long contigs assembled from 10x Genomics data make the computation of overlap regions among the third-generation reads more efficient and reliable, which greatly improves both the assembly result and the computational efficiency.

(4) Compute a consensus of the assembly result from step (3) using the Sparc software.
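
Step (2) can be illustrated with a short sketch. This is not the inventors' conversion script, which the text does not describe. It assumes that gaps inside a scaffold are written as runs of "N", and the names `split_scaffold`, `scaffolds_to_contigs` and the `min_gap` parameter are hypothetical.

```python
import re

def split_scaffold(name, sequence, min_gap=1):
    """Split one scaffold into contigs at runs of 'N'.

    `min_gap` is a hypothetical parameter: runs of N shorter than this
    are left inside the contig instead of splitting it.
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces produced by gaps at the very start or end.
    pieces = [piece for piece in gap.split(sequence) if piece]
    return [(f"{name}_contig{i}", piece) for i, piece in enumerate(pieces, start=1)]

def scaffolds_to_contigs(scaffolds, min_gap=1):
    """Apply split_scaffold to every (name, sequence) pair."""
    contigs = []
    for name, sequence in scaffolds:
        contigs.extend(split_scaffold(name, sequence, min_gap))
    return contigs

# Toy input, made up for illustration only.
example = [("scaffold1", "ACGTNNNNGGCCNTTAA"), ("scaffold2", "NNTTTT")]
for contig_name, contig_seq in scaffolds_to_contigs(example):
    print(contig_name, contig_seq)
```

With the default `min_gap=1` every N is treated as a gap, so the resulting contigs contain no N at all; a larger value would keep isolated ambiguous bases inside a contig.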

The invention has the following effects:

The invention provides a method for assembling a genome de novo efficiently and at low cost by combining the ultra-long reads of third-generation sequencing data with the linked reads of second-generation 10x Genomics sequencing. The method makes full use of the advantages of both the second-generation 10x Genomics linked reads and the third-generation ultra-long reads, and a high-quality genome sequence is assembled with the efficient hybrid assembly software DBG2OLC. The method has low sequencing and computational costs and yields high-quality assemblies, laying a foundation for the application of third-generation sequencing data.

Drawings

FIG. 1 is a flow chart of the computational process of the invention, which is divided into four main steps. The software involved is mainly Supernova, DBG2OLC and Sparc. Supernova is software dedicated to assembling 10x Genomics linked reads. DBG2OLC is software for efficient hybrid assembly of second-generation and third-generation data. Sparc is software that efficiently performs long-read alignment and, on that basis, computes a consensus of the assembly result.

FIG. 2 shows the sequencing cost of the data used in the invention. The abscissa represents sequencing coverage and the ordinate represents sequencing cost (in dollars). The thin line represents the sequencing cost of 35x Nanopore data, and the thick line represents the sequencing cost of 56x linked reads + 7x Nanopore reads (the sequencing data used by the invention). The figure shows that the sequencing cost of the protocol of the invention is much lower than that of 35x Nanopore sequencing.

FIG. 3 compares the assembly result of the method of the invention with that of 35x Nanopore sequencing data. The assembled contigs are sorted by length from longest to shortest and their lengths are accumulated one by one, starting from the first contig. The abscissa represents the number of contigs and the ordinate represents the cumulative sequence length. The thick line is the cumulative curve of the assembly from 56x linked reads + 7x Nanopore reads (the method of the invention), and the thin line is the cumulative curve of the assembly from 35x Nanopore data. The figure shows that the two methods reach the same cumulative length at the 215th contig. The contig lengths and contig ranks corresponding to N25, N50 and N75 are marked in the figure.
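
The cumulative curve described for FIG. 3 can be computed along the following lines. This is a minimal sketch: the helper names `cumulative_lengths` and `first_meeting_index` are hypothetical, and the contig lengths are made-up toy values, not data from the invention.

```python
from itertools import accumulate

def cumulative_lengths(contig_lengths):
    """Sort contig lengths from longest to shortest and return running totals."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))

def first_meeting_index(curve_a, curve_b):
    """Return the 1-based contig count at which curve_b first reaches curve_a.

    Returns None if curve_b never catches up within the shared range.
    """
    for count, (a, b) in enumerate(zip(curve_a, curve_b), start=1):
        if b >= a:
            return count
    return None

# Toy contig lengths for two hypothetical assemblies.
assembly_a = cumulative_lengths([50, 30, 10, 5])   # [50, 80, 90, 95]
assembly_b = cumulative_lengths([40, 35, 20, 10])  # [40, 75, 95, 105]
print(first_meeting_index(assembly_a, assembly_b))  # 3
```

Plotting each list against the contig count (1, 2, 3, ...) gives curves of the kind shown in the figure.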

Detailed Description

We used the human genome to verify the effectiveness of the method. Scaffolds were assembled with Supernova from 56x second-generation linked-read sequencing data and then converted into contigs, and 7x third-generation ultra-long reads were selected. DBG2OLC was then used for hybrid assembly of the contigs and the 7x ultra-long reads. For comparison of assembly quality and sequencing cost, we also used the assembly of the second-generation linked reads alone and the assemblies of third-generation sequencing reads at 30x and 35x (see Table 1 for results).
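
The coverage figures above translate into data volumes by simple arithmetic: average coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes the approximate human genome size of 3,000,000,000 bp quoted in this description; the function names are hypothetical.

```python
# Approximate human genome size quoted in this description (an assumption
# for illustration; real projects use a measured or estimated size).
HUMAN_GENOME_BP = 3_000_000_000

def bases_for_coverage(coverage, genome_size_bp=HUMAN_GENOME_BP):
    """Total sequenced bases needed to reach a given average coverage."""
    return coverage * genome_size_bp

def coverage_from_bases(total_bases, genome_size_bp=HUMAN_GENOME_BP):
    """Average coverage obtained from a given number of sequenced bases."""
    return total_bases / genome_size_bp

hybrid_long_reads = bases_for_coverage(7)    # 21,000,000,000 bases
long_reads_only = bases_for_coverage(35)     # 105,000,000,000 bases
print(hybrid_long_reads / long_reads_only)   # 0.2, one fifth of the long-read data
```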

Table 1. Comparison of 10x Genomics linked-read assembly results, Nanopore sequencing data assembly results and hybrid assembly results

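(Table 1 is not reproduced in this text. The assemblies it compares are 10x, Nanopore (30x), Nanopore (35x) and the method of the invention.)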

The total length in Table 1 is the total length of the assembled genome sequence. The human genome is about 3,000,000,000 bp, so the larger this value, the more complete the assembled genome. The number of sequences is the number of contigs or scaffolds assembled. The genome length assembled by our method is second only to the Nanopore (35x) result, and our method produces the fewest contigs, so our result has the largest average and median sequence lengths. This shows that the contigs we assembled are generally long, with no particularly short sequences.

The longest sequence in Table 1 is the length of the longest contig assembled, an important measure of assembly quality. The 10x value is the largest in Table 1, but its scaffolds contain gaps (a gap is a missing stretch between two contigs whose sequence was not determined but whose length is known; it is generally filled with a run of "N" of that length). The second largest is the Nanopore (35x) assembly, which reflects the advantage of third-generation long-read sequencing: it can assemble longer contigs without gaps. The longest contig produced by our method is about 30% shorter than the Nanopore (35x) result, but we used only 7x Nanopore data (1/5 of the Nanopore (35x) data), and our value is still higher than the Nanopore (30x) result.

The three values N50, N80 and N90 in Table 1 reflect the distribution of the assembled contig lengths. All contigs are sorted by length from longest to shortest and accumulated starting from the longest; when the cumulative length first exceeds 50%, 80% or 90% of the total length, the length of the last contig added is N50, N80 or N90, respectively. In Table 1, the largest values of all three are those of the 10x assembly, whose sequences contain gaps, as in the case of the longest sequence. Our method gives higher values than Nanopore (30x) for all three. Although our N50 is slightly lower than that of Nanopore (35x), our N80 and N90 are both higher.
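
The N50, N80 and N90 definition can be written as a short function. This is a sketch under two assumptions: the function name `nx` is hypothetical, and the threshold is treated as met when the cumulative length reaches or exceeds the given percentage (a common convention; the text above says "exceeds"). The contig lengths are toy values.

```python
def nx(contig_lengths, percent):
    """Return the Nx value for an integer percentage such as 50, 80 or 90.

    Contigs are taken from longest to shortest; the result is the length of
    the contig at which the running total first reaches `percent` percent of
    the total assembly length.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("contig_lengths must not be empty")

lengths = [50, 30, 10, 5, 5]  # toy contig lengths, total 100
print(nx(lengths, 50), nx(lengths, 80), nx(lengths, 90))  # 50 30 10
```

A larger Nx at the same percentage means that more of the assembly sits in long contigs, which is why these values are used to compare assemblies.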
