Metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing

Document No.: 1253961. Publication date: 2020-08-21.

Reading note: this technology, "Metagenome assembly method based on second-generation and third-generation ONT technologies" (一种基于二代和三代ont技术进行宏基因组组装方法), was designed and created by 郑洪坤, 龚雪情 and 王凡 on 2020-04-02. Main content: An embodiment of the invention provides a metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing. The method comprises: aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads; assembling the reads according to these overlaps to obtain assembled data; performing self error correction on the assembled data using the third-generation data; and performing further error correction using second-generation sequencing data to obtain the final assembly result. By first assembling from the overlaps among the ONT third-generation reads, then correcting the assembly with the third-generation data itself, and finally correcting it further with second-generation sequencing data, the method improves the accuracy of the assembly result.

1. A metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing, characterized by comprising the following steps:

aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads;

assembling the reads according to the overlaps between them to obtain assembled data;

performing self error correction on the assembled data using the third-generation data;

and performing further error correction using second-generation sequencing data to obtain a final assembly result.

2. The assembly method of claim 1, wherein aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads comprises:

dividing the sequencing data into k-mers of length k using minimap2 software;

selecting, by the minimizer method, the k-mer with the minimum z value among a plurality of adjacent k-mers as the representative k-mer;

wherein, if two sequences overlap, the two sequences share the same representative k-mer;

grouping collinear minimizers using single-linkage clustering;

and obtaining the largest collinear minimizer subset, i.e., the mapping result of minimap, by solving the longest increasing subsequence problem.

3. The assembly method of claim 1, wherein assembling the reads according to the overlaps between them to obtain the assembled data comprises:

checking the mapping relations between the reads using miniasm software, and removing adapters and chimeras;

for each read whose mapping relation with all other reads meets a preset condition, calculating the coverage of each base of the read and selecting the longest region with coverage not less than 3;

after trimming the reads, constructing an assembly graph by analyzing the mapping relation between each pair of overlapping sequences;

removing transitive edges using the miniasm method, trimming tip unitigs containing fewer than 4 reads, and popping small bubbles;

and merging a plurality of adjacent vertices of the assembly graph in series into one unitig, the unitig being the maximal path through the adjacent vertices that can be merged.

4. The assembly method of claim 3, wherein a read whose mapping relation with all other reads meets the preset condition is a read whose length is greater than 2K and whose non-overlapping region on the matched minimizers is longer than 100.

5. The assembly method of claim 1, wherein performing self error correction on the assembled data using the third-generation data comprises:

finding the mapping relation between the original third-generation reads and the initially assembled unitigs using minimap software;

loading the original third-generation reads with Racon software, and performing simple filtering based on the overlap information obtained after the initially assembled contigs are aligned with minimap;

dividing the retained reads into chunks according to non-overlapping windows on the backbone sequence, and performing fast alignment based on edit distance;

constructing a partial order alignment (POA) graph for each window and calling the consensus of the window;

and splicing the consensus of each window to obtain the final consensus.

6. The assembly method of claim 5, wherein performing simple filtering based on the overlap information obtained after the initially assembled contigs are aligned with minimap comprises:

retaining only one overlap for each read, and removing overlaps with a high error rate.

7. The assembly method of claim 1, wherein performing further error correction using second-generation sequencing data to obtain a final assembly result comprises:

aligning the second-generation reads to the assembly result corrected with the third-generation reads using BWA; obtaining the alignment result by sorting, merging, marking duplicates and building an index; and polishing the assembly result according to the alignment result using Pilon software to obtain the final assembly result.

8. The assembly method of claim 7, wherein the assembly result corrected with the third-generation reads is corrected 20 times using the second-generation reads.

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a metagenome assembly method based on second-generation and third-generation ONT technologies.

Background

Metagenomics bypasses pure-culture techniques to characterize the diversity and functions of microorganisms, and provides a new approach for discovering new genes, developing new microbial active substances and studying the structure and function of microbial communities. Second-generation sequencing offers high-quality data, low sample requirements and a simple workflow, but its limited read length and amplification bias pose great challenges for assembly. Third-generation ONT sequencing achieves long read lengths while reducing sequencing cost, but its accuracy is insufficient. Combining third-generation ONT sequencing with second-generation sequencing can greatly improve the assembly length.

Disclosure of Invention

To overcome the existing problems or at least partially solve the problems, embodiments of the present invention provide a metagenome assembly method based on second-generation and third-generation ONT technologies.

An embodiment of the invention provides a metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing, which comprises the following steps:

aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads;

assembling the reads according to the overlaps between them to obtain assembled data;

performing self error correction on the assembled data using the third-generation data;

and performing further error correction using second-generation sequencing data to obtain a final assembly result.

On the basis of the above technical solutions, the embodiments of the present invention may be further improved as follows.

Optionally, aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads includes:

dividing the sequencing data into k-mers of length k using minimap2 software;

selecting, by the minimizer method, the k-mer with the minimum z value among a plurality of adjacent k-mers as the representative k-mer;

wherein, if two sequences overlap, the two sequences share the same representative k-mer;

grouping collinear minimizers using single-linkage clustering;

and obtaining the largest collinear minimizer subset, i.e., the mapping result of minimap, by solving the longest increasing subsequence problem.
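The longest-increasing-subsequence step can be illustrated with a minimal Python sketch. It assumes that each minimizer hit is reduced to a pair of positions (position in read A, position in read B) on the same strand; the function name `longest_collinear_chain` and this data layout are hypothetical and are not taken from minimap, which additionally handles strands, gaps and scoring.

```python
from bisect import bisect_left
from typing import List, Tuple

def longest_collinear_chain(hits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the largest subset of hits that increases strictly in both coordinates.

    hits: (position in read A, position in read B) for each shared minimizer.
    This is an assumption-laden illustration, not minimap's actual chaining code.
    """
    # Sort by position in A; for equal A positions put larger B first so that
    # two hits at the same A position can never both enter one chain.
    ordered = sorted(set(hits), key=lambda h: (h[0], -h[1]))
    tails: List[int] = []      # tails[i]: smallest B position ending a chain of length i + 1
    tails_idx: List[int] = []  # index in `ordered` of the hit stored in tails[i]
    prev = [-1] * len(ordered) # back-pointers used to rebuild the chain
    for i, (_, b) in enumerate(ordered):
        j = bisect_left(tails, b)
        if j > 0:
            prev[i] = tails_idx[j - 1]
        if j == len(tails):
            tails.append(b)
            tails_idx.append(i)
        else:
            tails[j] = b
            tails_idx[j] = i
    chain: List[Tuple[int, int]] = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]

hits = [(0, 10), (5, 15), (7, 3), (9, 19), (12, 22), (14, 8)]
chain = longest_collinear_chain(hits)
print(chain)
```

In this toy input the hits (7, 3) and (14, 8) break collinearity and are left out of the chain, which is the behaviour the step relies on to discard spurious minimizer matches.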

Optionally, assembling the reads according to the overlaps between them to obtain the assembled data includes:

checking the mapping relations between the reads using miniasm software, and removing adapters and chimeras;

for each read whose mapping relation with all other reads meets a preset condition, calculating the coverage of each base of the read and selecting the longest region with coverage not less than 3;

after trimming the reads, constructing an assembly graph by analyzing the mapping relation between each pair of overlapping sequences;

removing transitive edges using the miniasm method, trimming tip unitigs containing fewer than 4 reads, and popping small bubbles;

and merging a plurality of adjacent vertices of the assembly graph in series into one unitig, the unitig being the maximal path through the adjacent vertices that can be merged.
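The graph-cleaning idea (dropping redundant edges, then merging unambiguous paths) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the overlap graph is a plain dictionary from read name to the set of reads it overlaps into, overlap lengths and strands are ignored, and tip trimming and bubble popping are not shown. The function names `remove_transitive_edges` and `unitig_paths` are hypothetical and do not reproduce miniasm's implementation.

```python
from typing import Dict, List, Set

def remove_transitive_edges(graph: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop edge u->w whenever u->v and v->w also exist (simplified rule)."""
    reduced = {node: set(targets) for node, targets in graph.items()}
    for u, targets in graph.items():
        for v in targets:
            for w in graph.get(v, set()):
                if w in targets and w != v:
                    reduced[u].discard(w)
    return reduced

def unitig_paths(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Merge nodes along unambiguous paths (one way out, one way in)."""
    preds: Dict[str, Set[str]] = {node: set() for node in graph}
    for u, targets in graph.items():
        for v in targets:
            preds.setdefault(v, set()).add(u)

    def extends_predecessor(node: str) -> bool:
        # True when the node simply continues the path of its only predecessor.
        if len(preds[node]) != 1:
            return False
        (p,) = preds[node]
        return len(graph.get(p, ())) == 1

    paths: List[List[str]] = []
    for node in sorted(preds):
        if extends_predecessor(node):
            continue
        path = [node]
        while len(graph.get(path[-1], ())) == 1:
            (nxt,) = graph[path[-1]]
            if len(preds[nxt]) != 1 or nxt in path:
                break
            path.append(nxt)
        paths.append(path)
    return paths

overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
cleaned = remove_transitive_edges(overlaps)
print(cleaned)
print(unitig_paths(cleaned))
```

Here the edge r1 to r3 is implied by r1 to r2 and r2 to r3, so it is removed, and the remaining linear path is reported as a single merged unit. Purely circular paths are not emitted by this sketch.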

Optionally, a read whose mapping relation with all other reads meets the preset condition is a read whose length is greater than 2K and whose non-overlapping region on the matched minimizers is longer than 100.

Optionally, performing self error correction on the assembled data using the third-generation data includes:

finding the mapping relation between the original third-generation reads and the initially assembled unitigs using minimap software;

loading the original third-generation reads with Racon software, and performing simple filtering based on the overlap information obtained after the initially assembled contigs are aligned with minimap;

dividing the retained reads into chunks according to non-overlapping windows on the backbone sequence, and performing fast alignment based on edit distance;

constructing a partial order alignment (POA) graph for each window and calling the consensus of the window;

and splicing the consensus of each window to obtain the final consensus.

Optionally, performing simple filtering based on the overlap information obtained after the initially assembled contigs are aligned with minimap includes:

retaining only one overlap for each read, and removing overlaps with a high error rate.

Optionally, performing further error correction using second-generation sequencing data to obtain a final assembly result includes: aligning the second-generation reads to the assembly result corrected with the third-generation reads using BWA; obtaining the alignment result by sorting, merging, marking duplicates and building an index; and polishing the assembly result according to the alignment result using Pilon software to obtain the final assembly result.

Optionally, the assembly result corrected with the third-generation reads is corrected 20 times using the second-generation reads.

In the method provided by the embodiment of the invention, the ONT third-generation sequencing reads are first aligned against themselves and assembled from the overlaps found among them; the assembly is then corrected with the third-generation data itself and corrected further with second-generation sequencing data, thereby improving the accuracy of the assembly result.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic overall flow chart of a method for assembling a metagenome based on second-generation and third-generation ONT technologies according to an embodiment of the present invention.

Detailed Description

In an embodiment of the present invention, a metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing is provided. Fig. 1 is a schematic overall flow chart of the method, which includes:

S1, aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads;

S2, assembling the reads according to the overlaps between them to obtain assembled data;

S3, performing self error correction on the assembled data using the third-generation data;

and S4, performing further error correction using second-generation sequencing data to obtain a final assembly result.

As an alternative embodiment, aligning the third-generation ONT sequencing reads against themselves to find the overlaps between different reads includes:

The sequencing data are divided into k-mers of length k using minimap2 (v2.11) software. Using the minimizer method, the k-mer with the minimum z value among a plurality of adjacent k-mers is selected as the representative k-mer; if two sequences overlap, they share the same representative k-mer. Collinear minimizers are obtained by single-linkage clustering, and the largest collinear subset, i.e., the mapping result of minimap, is obtained by solving the longest increasing subsequence problem. In this way the overlaps between different reads are found.
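The representative k-mer selection described here can be sketched in Python. This is a minimal illustration under stated assumptions: the ordering value of a k-mer (the "z value" in the text) is replaced by a plain 2-bit encoding, only the forward strand is considered, and the parameters `k` and `w` and all function names are hypothetical. minimap2 itself uses a hash function and considers both strands.

```python
from typing import List, Set, Tuple

def kmer_score(kmer: str) -> int:
    """Toy ordering value for a k-mer (assumed stand-in for the z value)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = value * 4 + code[base]
    return value

def minimizers(seq: str, k: int, w: int) -> Set[Tuple[str, int]]:
    """Pick the lowest-scoring k-mer in every window of w adjacent k-mers."""
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked: Set[Tuple[str, int]] = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Ties on the score are broken by the leftmost position.
        picked.add(min(window, key=lambda item: (kmer_score(item[0]), item[1])))
    return picked

def shared_minimizers(a: str, b: str, k: int = 5, w: int = 4) -> List[Tuple[str, int, int]]:
    """Return (k-mer, position in a, position in b) for minimizers found in both reads."""
    mins_a = minimizers(a, k, w)
    mins_b = minimizers(b, k, w)
    hits = [(ka, pa, pb) for ka, pa in mins_a for kb, pb in mins_b if ka == kb]
    return sorted(hits, key=lambda h: (h[1], h[2]))

overlap = "ACGTTGCATGCCAGTACGGA"
read_a = "TTTTGGGCCC" + overlap   # overlap starts at position 10 in read_a
read_b = overlap + "CATCATCATG"   # overlap starts at position 0 in read_b
hits = shared_minimizers(read_a, read_b)
print(hits)
```

Because every window that lies entirely inside the shared region selects the same k-mer in both reads, at least one hit has a position difference equal to the 10-base offset between the two reads. Comparing only these sampled k-mers is what keeps all-against-all read comparison tractable.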

As an optional embodiment, assembling the reads according to the overlaps between them to obtain the assembled data includes:

The mapping relations between the reads are checked using miniasm (v0.2-r168-dirty) software, and adapters, chimeras and the like are removed. A read is considered to have a good mapping relation with all other reads when its length is greater than 2K and the length of the non-overlapping region on the matched minimizers is greater than 100.

For each read that has a good mapping relation with all other reads, the coverage of each base of the read is calculated, and the longest region with coverage not less than 3 is selected. For the trimmed reads, an assembly graph is constructed by analyzing the mapping relation between each pair of sequences (for example, the two reads overlap, or one read contains the other). The miniasm method is then used to remove transitive edges, trim tip unitigs (unitigs containing fewer than 4 reads) and pop small bubbles. Without affecting the connectivity of the original assembly graph, a plurality of adjacent vertices of the assembly graph are merged in series into one unitig (a unitig is the maximal path through adjacent vertices that can be unambiguously merged), which yields the assembled data.
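The coverage-based trimming rule (keep the longest region covered at least 3 times) can be sketched in Python. The sketch assumes that the mappings of other reads onto one read are given as half-open intervals `(start, end)` in that read's coordinates; the function name `longest_covered_region` and this input layout are hypothetical, not miniasm's internal representation.

```python
from typing import List, Optional, Tuple

def longest_covered_region(read_length: int,
                           mappings: List[Tuple[int, int]],
                           min_cov: int = 3) -> Optional[Tuple[int, int]]:
    """Return the longest half-open interval whose per-base coverage is >= min_cov."""
    # Difference array: +1 where a mapping starts, -1 where it ends.
    diff = [0] * (read_length + 1)
    for start, end in mappings:
        start, end = max(0, start), min(read_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    best: Optional[Tuple[int, int]] = None
    run_start: Optional[int] = None
    coverage = 0
    for pos in range(read_length + 1):
        coverage += diff[pos]
        covered = pos < read_length and coverage >= min_cov
        if covered and run_start is None:
            run_start = pos
        elif not covered and run_start is not None:
            if best is None or pos - run_start > best[1] - best[0]:
                best = (run_start, pos)
            run_start = None
    return best

region = longest_covered_region(100, [(0, 60), (10, 80), (20, 90), (50, 100)])
print(region)
```

In this example only bases 20 to 79 are covered by at least three mappings, so the read would be trimmed to that interval. A read with no sufficiently covered region returns `None` and would be discarded.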

As an alternative embodiment, performing self error correction on the assembled data using the third-generation data includes:

The mapping relation between the original third-generation reads and the initially assembled unitigs is found using minimap software. The original third-generation reads are loaded with Racon (v1.2.1) software and, based on the overlap information obtained after the initially assembled contigs are aligned with minimap, a simple filter is applied first: only one overlap is retained for each read, and overlaps with a high error rate are removed. The retained reads are divided into chunks according to non-overlapping windows on the backbone sequence, and a fast alignment based on edit distance is performed. A partial order alignment (POA) graph is then built for each window and the consensus of that window is called. Finally, the consensus sequences of all windows are spliced to obtain the final consensus, which is the error-corrected data.
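The window-by-window consensus idea can be illustrated with a deliberately simplified Python sketch. It swaps the POA graph for a per-column majority vote, and it assumes the reads have already been aligned to the backbone and padded with `-` so that every row has the backbone's length; insertions relative to the backbone are therefore not modelled. The function name `window_consensus` and the window size are hypothetical and do not describe Racon's implementation.

```python
from collections import Counter
from typing import List

def window_consensus(backbone: str, aligned_reads: List[str], window: int = 5) -> str:
    """Call a consensus per non-overlapping window, then splice the windows.

    Simplification: majority vote per column instead of a POA graph.
    """
    pieces: List[str] = []
    for start in range(0, len(backbone), window):
        end = min(start + window, len(backbone))
        chunk = []
        for col in range(start, end):
            votes = Counter(read[col] for read in aligned_reads if read[col] != "-")
            # Fall back to the backbone base when no read covers the column.
            chunk.append(votes.most_common(1)[0][0] if votes else backbone[col])
        pieces.append("".join(chunk))
    return "".join(pieces)

backbone = "ACGTACGTAC"
reads = [
    "ACGAACGTAC",
    "ACGAACG-AC",
    "-CGAACGTAC",
]
polished = window_consensus(backbone, reads)
print(polished)
```

All three reads disagree with the backbone at position 3, so the consensus replaces the backbone base there. Processing independent windows is what allows the correction to be split into small, cheap alignment problems.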

As an alternative embodiment, performing further error correction using second-generation sequencing data to obtain the final assembly result includes:

BWA is used to align the second-generation reads to the assembly result corrected with the third-generation reads. The alignment result is obtained by sorting, merging, marking duplicates and building an index, and Pilon software is then used to polish the assembly result according to the alignment result, giving the final assembly result. The assembly result corrected with the third-generation reads is corrected 20 times using the second-generation reads, which removes a large number of SNPs and indels, with a removal ratio as high as 99.9%.
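One way to organise the repeated short-read correction is sketched below in Python. The sketch only builds the command lines for each round and does not run them. It rests on several assumptions that are not stated in the source: the second-generation data are paired-end FASTQ files, "corrected 20 times" means 20 successive rounds in which each round polishes the output of the previous one, and merging and duplicate marking are omitted for brevity. File names, the `pilon.jar` path, the thread count and the function names are hypothetical.

```python
import shlex
from typing import List, Tuple

Command = List[str]

def polishing_round(assembly: str, reads_1: str, reads_2: str, round_no: int,
                    pilon_jar: str = "pilon.jar",
                    threads: int = 8) -> Tuple[List[Command], str]:
    """Build the commands for one align-and-polish round; return them with the output FASTA."""
    bam = f"round{round_no}.sorted.bam"
    prefix = f"round{round_no}.pilon"
    # Align short reads and sort the alignments in one shell pipeline.
    align = (f"bwa mem -t {threads} {shlex.quote(assembly)} "
             f"{shlex.quote(reads_1)} {shlex.quote(reads_2)} "
             f"| samtools sort -o {shlex.quote(bam)} -")
    commands: List[Command] = [
        ["bwa", "index", assembly],
        ["sh", "-c", align],
        ["samtools", "index", bam],
        ["java", "-jar", pilon_jar, "--genome", assembly,
         "--frags", bam, "--output", prefix],
    ]
    return commands, f"{prefix}.fasta"

def polishing_plan(assembly: str, reads_1: str, reads_2: str,
                   rounds: int = 20) -> List[List[Command]]:
    """Chain the rounds so that each one polishes the previous round's output."""
    plan: List[List[Command]] = []
    current = assembly
    for round_no in range(1, rounds + 1):
        commands, current = polishing_round(current, reads_1, reads_2, round_no)
        plan.append(commands)
    return plan

plan = polishing_plan("racon_corrected.fasta", "short_R1.fq.gz", "short_R2.fq.gz")
print(len(plan), plan[0][0], plan[1][0])
```

Each command list could then be executed with `subprocess.run(cmd, check=True)` once the tools are installed. Building the plan separately from running it makes the chaining of inputs and outputs across rounds easy to inspect.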

The metagenome assembly method based on second-generation sequencing and third-generation ONT sequencing provided by the embodiments of the present invention is described below with reference to two specific embodiments.
