Chromosome-level genome assembly method for Chlamydomonas

Document No.: 193429. Published: 2021-11-02.

Abstract: This technology, "Chromosome-level genome assembly method for Chlamydomonas" (一种衣藻染色体水平的基因组组装方法), was devised by 王宝祺 and 马飞学 on 2021-07-01. The invention relates to the technical field of genome assembly and discloses a chromosome-level genome assembly method for Chlamydomonas. The method sequentially applies FGAP, alignment of the raw third-generation sequencing data, and alignment of preliminarily assembled DNA fragments to close gap regions of less than 200 bp, 200-1000 bp, and more than 1000 bp in the original genome, which substantially improves genome completeness while keeping sequencing cost in check. In addition, the python script that closes gap regions from the alignment results distinguishes 3 situations that can occur in the sequence alignments and applies a different gap-closing strategy to each, so that most gap regions can be closed and the genome becomes more complete.

1. A chromosome-level genome assembly method for Chlamydomonas, characterized by comprising the following steps:

(1) using the raw third-generation sequencing data of the Chlamydomonas, modifying an original genome obtained by second-generation sequencing of the Chlamydomonas with the FGAP tool to close gap regions shorter than 200 bp and obtain a preliminarily modified genome;

(2) aligning the raw third-generation sequencing data to the preliminarily modified genome, and closing gap regions of 200-1000 bp with a python script to obtain a second modified genome;

(3) preliminarily assembling the raw third-generation sequencing data according to the overlapping regions between reads to obtain preliminarily assembled DNA fragments;

(4) aligning the preliminarily assembled DNA fragments to the second modified genome, extracting the required sequences with a python script according to the alignment result, and closing gap regions longer than 1000 bp to complete the genome assembly.

2. The genome assembly method of claim 1, wherein in steps (2) and (4), the specific method for closing the gap region with the python script is as follows:

case 1: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is the same as in the material sequence A; the material sequence B has a matching region at one end of the gap region; and the sequence in the material sequence A at the position corresponding to the gap region is selected to fill the gap region;

in step (2), the raw third-generation sequencing data are used as the material sequences and the preliminarily modified genome is used as the sequence to be modified; in step (4), the preliminarily assembled DNA fragments are used as the material sequences and the second modified genome is used as the sequence to be modified.

3. The genome assembly method of claim 2, wherein in steps (2) and (4), the specific method for closing the gap region with the python script is as follows:

case 2: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is opposite to their order in the material sequence A; the material sequence B has a matching region 3 at one end of the gap region; and the sequence in the material sequence B at the position corresponding to the gap region is selected to fill the gap region.

4. The genome assembly method of claim 3, wherein in steps (2) and (4), the specific method for closing the gap region with the python script is as follows:

case 3: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the matching region 1 and the matching region 2 overlap in the material sequence A; the material sequence B has a matching region 3 at one end of the gap region; and the sequence in the material sequence B at the position corresponding to the gap region is selected to fill the gap region.

5. The genome assembly method according to claim 2, wherein in case 1, a mismatch region exists between the sequence running from the gap region to the matching region 1 and/or the matching region 2 in the sequence to be modified and the sequence at the corresponding position in the material sequence A, and the mismatch region in the sequence to be modified is left unchanged when the gap region is filled.

6. The genome assembly method according to claim 3 or 4, wherein in case 2 or case 3, a mismatch region exists between the sequence running from the gap region to the matching region 3 in the sequence to be modified and the sequence at the corresponding position in the material sequence B, and the mismatch region in the sequence to be modified is left unchanged when the gap region is filled.

7. The genome assembly method according to claim 1, wherein in step (1), the raw third-generation sequencing data of the Chlamydomonas are obtained by Nanopore sequencing.

8. The genome assembly method according to claim 1, wherein in step (2), the raw third-generation sequencing data are aligned to the preliminarily modified genome using the Minimap tool.

9. The genome assembly method according to claim 1, wherein in step (3), the raw third-generation sequencing data are preliminarily assembled using the Necat, Smartdenovo, or Canu tool.

10. The genome assembly method according to claim 1, wherein in step (4), the preliminarily assembled DNA fragments are aligned to the second modified genome using the MUMmer tool.

Technical Field

The invention relates to the technical field of genome assembly, and in particular to a chromosome-level genome assembly method for Chlamydomonas.

Background

In traditional genome construction (second-generation sequencing), sample DNA is generally sheared by ultrasound into small fragments of 100-200 bases and sequenced on an NGS high-throughput sequencer to produce raw reads. Assembly software compares the short reads pairwise and joins them through their overlapping regions into longer, preliminarily assembled genome fragments (contigs). The contigs are then further assembled and merged using chromosome-positioning techniques and labeled with chromosome numbers, which yields a chromosome-level genome, that is, DNA sequences positioned on chromosomes (scaffolds). However, because read length is limited and assembly is imperfect, some stretches of the DNA are difficult to resolve; these are commonly called gaps in the genome. Gap regions are a major obstacle to genomic analysis.

The emergence of third-generation single-molecule sequencing offers hope for solving this problem. Third-generation sequencing requires no DNA amplification or fragmentation, and its reads are very long, up to 1 M (one million) bases, so preliminary assembly is simpler and long gap regions in a traditional genome can be readily covered. Third-generation sequencing data can therefore be used to close gaps in a traditional genome obtained by second-generation sequencing.

The document "FGAP: an automated gap closing tool" (Piro, Vitor C., et al. BMC Research Notes 7.1 (2014): 371) discloses the gap-closing tool FGAP, which closes gap regions by aligning contig sequences to draft genome sequences with BLAST and searching for the best sequences overlapping the gap regions. The tool closes small gap regions fairly accurately but has difficulty closing large ones, so the resulting whole genome has low completeness.

Disclosure of Invention

To solve the above technical problems, the invention provides a chromosome-level genome assembly method for Chlamydomonas. The method sequentially applies FGAP, alignment of the raw third-generation sequencing data, and alignment of preliminarily assembled DNA fragments to close gap regions of less than 200 bp, 200-1000 bp, and more than 1000 bp in the original genome, which substantially improves genome completeness while keeping sequencing cost in check.

The specific technical scheme of the invention is as follows:

A chromosome-level genome assembly method for Chlamydomonas, comprising the following steps:

(1) using the raw third-generation sequencing data of the Chlamydomonas, modifying an original genome obtained by second-generation sequencing of the Chlamydomonas with the FGAP tool to close gap regions shorter than 200 bp and obtain a preliminarily modified genome;

(2) aligning the raw third-generation sequencing data to the preliminarily modified genome, and closing gap regions of 200-1000 bp with a python script to obtain a second modified genome;

(3) preliminarily assembling the raw third-generation sequencing data according to the overlapping regions between reads to obtain preliminarily assembled DNA fragments;

(4) aligning the preliminarily assembled DNA fragments to the second modified genome, extracting the required sequences with a python script according to the alignment result, and closing gap regions longer than 1000 bp to complete the genome assembly.

Third-generation sequencing produces very long reads, so preliminary assembly is simpler and long gap regions in a traditional genome can be readily covered. Starting from the original genome obtained by traditional second-generation sequencing, the invention uses third-generation sequencing data of the same species to close the gap regions in that genome. Because the second-generation assembly has already positioned the DNA fragments on chromosomes, chromosome positioning does not have to be repeated, which balances genome completeness against sequencing cost as far as possible.

After FGAP has closed the small gap regions below 200 bp, the invention aligns the raw third-generation sequencing data of the Chlamydomonas and then the preliminarily assembled DNA fragments, and uses the python script of the invention to close the gap regions of 200-1000 bp and of more than 1000 bp, greatly improving genome completeness.
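The three length tiers can be illustrated with a short sketch. It assumes that gaps are stored as runs of N in the scaffold sequence; the function names `find_gaps` and `gap_tier` are hypothetical and are not taken from the script of the invention.

```python
import re


def find_gaps(sequence):
    """Return half-open (start, end) intervals for every run of N in a sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def gap_tier(start, end):
    """Name the step that would handle a gap, based only on its length in bp."""
    length = end - start
    if length < 200:
        return "step 1 (FGAP)"
    if length <= 1000:
        return "step 2 (raw long reads)"
    return "step 4 (preliminarily assembled fragments)"


# Toy scaffold with one gap in each tier.
scaffold = "ACGT" + "N" * 150 + "GGCC" + "N" * 500 + "TTAA" + "N" * 1500 + "ACGT"
for start, end in find_gaps(scaffold):
    print(start, end, end - start, gap_tier(start, end))
```

Routing by length alone is a simplification: in the method itself the tiers arise because each step is run in order on whatever gaps the previous step left open.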

Preferably, in steps (2) and (4), the specific method for closing the gap region with the python script is as follows: case 1: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is the same as in the material sequence A; the material sequence B has a matching region at one end of the gap region; and the sequence in the material sequence A at the position corresponding to the gap region is selected to fill the gap region. In step (2), the raw third-generation sequencing data are used as the material sequences and the preliminarily modified genome is used as the sequence to be modified; in step (4), the preliminarily assembled DNA fragments are used as the material sequences and the second modified genome is used as the sequence to be modified.

In the python script, the material sequence used to fill a gap region is selected according to how the material sequences match the sequence to be modified.

In case 1, the material sequence A matches the sequence to be modified at both ends of the gap region, whereas the material sequence B matches it at only one end, so filling the gap region with the material sequence A gives higher accuracy.
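Case 1 can be sketched as follows. This is an illustration under stated assumptions, not the script of the invention: each match is represented as a hypothetical tuple `(t_start, t_end, m_start, m_end)` of half-open coordinates on the sequence to be modified (t) and on the material sequence (m), both on the same strand, and any unmatched bases between a match and the gap are assumed to have the same length in both sequences.

```python
def fill_gap_case1(target, material_a, gap, match1, match2):
    """Fill one gap from a material sequence that is anchored on both sides.

    gap    -- (start, end) of the N run in target, half-open
    match1 -- (t_start, t_end, m_start, m_end), upstream of the gap
    match2 -- (t_start, t_end, m_start, m_end), downstream of the gap
    """
    g_start, g_end = gap
    # Project the gap boundaries onto the material sequence. Target bases that
    # lie between a match and the gap are stepped over by the same distance in
    # the material sequence, so they are left as they are in the target.
    m_from = match1[3] + (g_start - match1[1])
    m_to = match2[2] - (match2[0] - g_end)
    if m_from > m_to:
        raise ValueError("anchors leave no material sequence for the gap")
    return target[:g_start] + material_a[m_from:m_to] + target[g_end:]


target = "AAACCC" + "NNNN" + "GGGTTT"
material_a = "AAACCC" + "TGCATG" + "GGGTTT"
filled = fill_gap_case1(target, material_a, (6, 10), (0, 6, 0, 6), (10, 16, 12, 18))
print(filled)
```

Note that the filled stretch takes its length from the material sequence, not from the number of N characters, so the scaffold may grow or shrink when a gap is closed.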

Preferably, in steps (2) and (4), the specific method for closing the gap region with the python script is as follows: case 2: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is opposite to their order in the material sequence A; the material sequence B has a matching region 3 at one end of the gap region; and the sequence in the material sequence B at the position corresponding to the gap region is selected to fill the gap region.

In case 2, the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is opposite to their order in the material sequence A (for example, the matching region 1 lies upstream of the matching region 2 in the sequence to be modified but downstream of it in the material sequence A). This indicates that the matches between the material sequence A and the sequence to be modified may be spurious, so the material sequence B is selected to fill the gap region.

Preferably, in steps (2) and (4), the specific method for closing the gap region with the python script is as follows: case 3: a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified; the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified; the matching region 1 and the matching region 2 overlap in the material sequence A; the material sequence B has a matching region 3 at one end of the gap region; and the sequence in the material sequence B at the position corresponding to the gap region is selected to fill the gap region.

In case 3, because the matching region 1 and the matching region 2 overlap, the material sequence A contains no sequence that could fill the gap region, and its matches with the sequence to be modified may be spurious, so the material sequence B is selected to fill the gap region.

When the python script handles all 3 cases, most gap regions can be closed, giving the genome higher completeness.
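The distinction between the three cases comes down to how the two matched intervals sit on the material sequence A. A minimal sketch, assuming each match is a hypothetical tuple `(t_start, t_end, m_start, m_end)` of half-open coordinates in which match 1 lies upstream of the gap and match 2 downstream of it in the sequence to be modified:

```python
def classify_anchors(match1, match2):
    """Return 1, 2 or 3 according to the layout of the matches on sequence A."""
    _, _, m1_start, m1_end = match1
    _, _, m2_start, m2_end = match2
    # Case 3: the two matched intervals overlap on the material sequence.
    if m1_start < m2_end and m2_start < m1_end:
        return 3
    # Case 1: same upstream-downstream order as in the sequence to be modified.
    if m1_end <= m2_start:
        return 1
    # Case 2: the order is reversed on the material sequence.
    return 2


def choose_material(case):
    """Sequence A is trusted only in case 1; otherwise fall back to sequence B."""
    return "A" if case == 1 else "B"


print(classify_anchors((0, 6, 0, 6), (10, 16, 12, 18)))
print(classify_anchors((0, 6, 20, 26), (10, 16, 2, 8)))
print(classify_anchors((0, 6, 0, 6), (10, 16, 4, 10)))
```

Intervals that merely touch (the end of one equals the start of the other) are treated here as non-overlapping, which is a choice made for the sketch and not something the text specifies.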

Further, in case 1, a mismatch region exists between the sequence running from the gap region to the matching region 1 and/or the matching region 2 in the sequence to be modified and the sequence at the corresponding position in the material sequence A, and the mismatch region in the sequence to be modified is left unchanged when the gap region is filled.

Further, in case 2 or case 3, a mismatch region exists between the sequence running from the gap region to the matching region 3 in the sequence to be modified and the sequence at the corresponding position in the material sequence B, and the mismatch region in the sequence to be modified is left unchanged when the gap region is filled.
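For cases 2 and 3, only one anchor (the matching region 3) is available on the material sequence B. The sketch below shows one way to locate "the sequence at the same position as the gap region" from a single anchor while leaving a mismatch region untouched. It rests on two assumptions that the text does not state: the match is a hypothetical tuple `(t_start, t_end, m_start, m_end)` of half-open, same-strand coordinates, and the fill is taken to be as long as the N run because there is no second anchor to bound it.

```python
def fill_gap_one_anchor(target, material_b, gap, match3):
    """Fill one gap from a material sequence anchored on a single side."""
    g_start, g_end = gap
    t_start, t_end, m_start, m_end = match3
    gap_len = g_end - g_start
    if t_end <= g_start:
        # Anchor upstream of the gap: step over the mismatch region (if any),
        # then take gap_len bases from the material sequence.
        m_from = m_end + (g_start - t_end)
        m_to = m_from + gap_len
    elif t_start >= g_end:
        # Anchor downstream of the gap: the same projection, mirrored.
        m_to = m_start - (t_start - g_end)
        m_from = m_to - gap_len
    else:
        raise ValueError("matching region overlaps the gap")
    if m_from < 0 or m_to > len(material_b):
        raise ValueError("material sequence B does not span the gap")
    # Only the N run is replaced; target bases outside the gap are kept as is.
    return target[:g_start] + material_b[m_from:m_to] + target[g_end:]


target = "AAACCC" + "NNNN" + "GGGTTT"
upstream_b = "TTAAACCCTGCAGG"   # positions 2-8 match target positions 0-6
print(fill_gap_one_anchor(target, upstream_b, (6, 10), (0, 6, 2, 8)))
```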

Preferably, in step (1), the raw third-generation sequencing data are obtained by Nanopore sequencing.

Preferably, in step (2), the raw third-generation sequencing data are aligned to the preliminarily modified genome using the Minimap tool.

Preferably, in step (3), the raw third-generation sequencing data are preliminarily assembled using the Necat, Smartdenovo, or Canu tool.

Preferably, in step (4), the preliminarily assembled DNA fragments are aligned to the second modified genome using the MUMmer tool.

Compared with the prior art, the invention has the following advantages:

(1) FGAP, alignment of the raw third-generation sequencing data, and alignment of the preliminarily assembled DNA fragments are applied in sequence to close gap regions of less than 200 bp, 200-1000 bp, and more than 1000 bp in the original genome, which substantially improves genome completeness while keeping sequencing cost in check;

(2) the python script that closes gap regions from the alignment results distinguishes 3 cases that can occur in the sequence alignments and applies a different gap-closing strategy to each, so that most gap regions can be closed and the genome has higher completeness.

Drawings

FIG. 1 is a flow chart of the chromosome-level genome assembly of the present invention;

FIG. 2 is a schematic diagram of closing the gap region with the python script of the present invention; panel (A) shows case 1, panel (B) shows case 2, and panel (C) shows case 3; in each of panels (A) to (C), paired lines indicate matching regions in the alignment results;

FIG. 3 is a schematic diagram of alignment before and after the genomic gap of Chlamydomonas is closed; dark gray dots indicate a forward match (red in the original), and light gray dots indicate a reverse match (blue in the original).

Detailed Description

The present invention will be further described with reference to the following examples.

Example 1

A chromosome-level genome assembly method for Chlamydomonas, as shown in FIG. 1, comprising the following steps:

(1) obtaining the raw third-generation sequencing data of the Chlamydomonas by Nanopore sequencing, and using them to modify an original genome obtained by second-generation sequencing of the Chlamydomonas with the FGAP tool, closing gap regions shorter than 200 bp to obtain a preliminarily modified genome;

(2) aligning the raw third-generation sequencing data to the preliminarily modified genome with the Minimap tool, and closing gap regions of 200-1000 bp with a python script to obtain a second modified genome;

(3) preliminarily assembling the raw third-generation sequencing data with the Necat tool according to the overlapping regions between reads to obtain preliminarily assembled DNA fragments;

(4) aligning the preliminarily assembled DNA fragments to the second modified genome with the MUMmer tool, extracting the required sequences with a python script according to the alignment result, and closing gap regions longer than 1000 bp to obtain the final genome and complete the genome assembly.
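The text does not say in which format the alignments of step (2) are passed to the python script. As a sketch only, assuming they are exported as PAF (a tab-separated format whose first nine columns are query name, query length, query start, query end, strand, target name, target length, target start, and target end, with 0-based half-open coordinates), candidate anchors next to a gap could be collected as below. The names `PafHit`, `parse_paf_line`, and `hits_near_gap` and the `window` parameter are hypothetical.

```python
from collections import namedtuple

PafHit = namedtuple(
    "PafHit", "read read_len r_start r_end strand target t_len t_start t_end"
)


def parse_paf_line(line):
    """Parse the first nine mandatory PAF columns of one alignment line."""
    f = line.rstrip("\n").split("\t")
    return PafHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                  f[5], int(f[6]), int(f[7]), int(f[8]))


def hits_near_gap(hits, target, gap, window=5000):
    """Keep hits on `target` that end just before or start just after the gap."""
    g_start, g_end = gap
    near = []
    for h in hits:
        if h.target != target:
            continue
        upstream = h.t_end <= g_start and g_start - h.t_end <= window
        downstream = h.t_start >= g_end and h.t_start - g_end <= window
        if upstream or downstream:
            near.append(h)
    return near


line = "read1\t1000\t10\t510\t+\tchr1\t5000\t100\t600\t480\t500\t60"
hit = parse_paf_line(line)
print(hits_near_gap([hit], "chr1", (650, 900), window=100))
```

A read that yields both an upstream and a downstream hit for the same gap is a candidate material sequence A; a read with a hit on only one side is a candidate material sequence B.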

In steps (2) and (4), the specific method for closing the gap region with the python script is shown in FIG. 2. A different gap-closing strategy is used for each of the following 3 cases that can occur in the sequence alignment result:

case 1: as shown in FIG. 2(A), if a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified, the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified, the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is the same as in the material sequence A, and the material sequence B has a matching region at one end of the gap region, then the sequence in the material sequence A at the position corresponding to the gap region ("selecting region" in FIG. 2(A)) is selected to fill the gap region; if mismatch regions ("mismatch region 1" and "mismatch region 2" in FIG. 2(A)) exist between the sequence running from the gap region to the matching region 1 and/or the matching region 2 in the sequence to be modified and the sequence at the corresponding position in the material sequence A, the mismatch regions in the sequence to be modified are left unchanged when the gap region is filled;

case 2: as shown in FIG. 2(B), if a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified, the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified, the upstream-downstream order of the matching region 1 and the matching region 2 in the sequence to be modified is opposite to their order in the material sequence A, and the material sequence B has a matching region 3 at one end of the gap region, then the sequence in the material sequence B at the position corresponding to the gap region ("selecting region" in FIG. 2(B)) is selected to fill the gap region; if a mismatch region ("mismatch region" in FIG. 2(B)) exists between the sequence running from the gap region to the matching region 3 in the sequence to be modified and the sequence at the corresponding position in the material sequence B, the mismatch region in the sequence to be modified is left unchanged when the gap region is filled;

case 3: as shown in FIG. 2(C), if a matching region 1 and a matching region 2 exist between the material sequence A and the sequence to be modified, the matching region 1 and the matching region 2 are located at the two ends of a gap region in the sequence to be modified, the matching region 1 and the matching region 2 overlap in the material sequence A, and the material sequence B has a matching region 3 at one end of the gap region, then the sequence in the material sequence B at the position corresponding to the gap region ("selecting region" in FIG. 2(C)) is selected to fill the gap region; if a mismatch region ("mismatch region" in FIG. 2(C)) exists between the sequence running from the gap region to the matching region 3 in the sequence to be modified and the sequence at the corresponding position in the material sequence B, the mismatch region in the sequence to be modified is left unchanged when the gap region is filled;

in step (2), the raw third-generation sequencing data are used as the material sequences and the preliminarily modified genome is used as the sequence to be modified; in step (4), the preliminarily assembled DNA fragments are used as the material sequences and the second modified genome is used as the sequence to be modified.

The python script code that achieves the above objectives is as follows:

Application example

The Chlamydomonas genome (Chlamydomonas_reinhardtii_v5.5) from NCBI was used as the original genome, the corresponding Nanopore data were collected as the third-generation sequencing data required for gap closure, and the original genome was assembled (gap-closed) using the method of Example 1. The results are shown in Table 1.

TABLE 1

N50 in the table is obtained by sorting all sequences in the genome from longest to shortest and accumulating their lengths; N50 is the length of the sequence at which the accumulated length first reaches 50% of the total length of all sequences. N50 reflects the quality of a genome assembly to a certain extent: the larger the N50, the better the assembly. After gap closure of the Chlamydomonas genome, N50 rose from 7783580 to 8171174, and all sequences increased in length while remaining within the normal range; the sequence length increased from 9730733 to 10652672, and the 4050214 bases of N sequence were completely closed, forming a complete genome.

The sequences before and after gap closure were aligned with the MUMmer software and the alignment was visualized for evaluation; the result is shown in FIG. 3. Sequence integrity before and after closing the gap regions is essentially the same, which indicates that closing the gap regions with the method of the invention does not alter the original integrity of the chromosomes.
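The N50 definition above translates directly into a few lines of code. This is a generic sketch of the statistic, not the tool used to produce Table 1; the function name `n50` and the toy lengths are illustrative.

```python
def n50(lengths):
    """Length of the sequence at which the running total, accumulated from
    the longest sequence downwards, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequences given")


# Toy example: total length 20, so the halfway point is reached at 6 + 5 = 11.
print(n50([2, 3, 4, 5, 6]))
```

Because closing a gap replaces N characters with real bases and can change scaffold lengths, N50 is recomputed on the gap-closed assembly and compared with the value for the original genome.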

The raw materials and equipment used in the invention are common raw materials and equipment in the field if not specified; the methods used in the present invention are conventional in the art unless otherwise specified.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all simple modifications, alterations and equivalents of the above embodiments according to the technical spirit of the present invention are still within the protection scope of the technical solution of the present invention.
