Pangenome construction method and corresponding structural variation mining method

Document No.: 170920 | Publication date: 2021-10-29

Reading note: This technology, "Pangenome construction method and corresponding structural variation mining method" (一种泛基因组的构建方法及其相应的结构变异挖掘方法), was designed by 赵均良, 李方平, 王健, 杨武, 刘斌, 杨梯丰 and 陈洛 on 2021-08-06. Abstract: The invention belongs to the technical field of genome data analysis, and particularly relates to a pan-genome construction method and a corresponding structural variation mining method. The structural variations obtained by genome comparison are put back onto a linear genome, and a structural variation site information file is added, so as to construct a pan-genome that is linear in form and supports efficient analysis of multiple types of structural variation. The pan-genome captures additional, novel structural variations that are absent from the reference genome, and the linearization combined with the variation site information file presents the captured structural variations more clearly, so that the constructed pan-genome is easier to understand and analyze and better suited to downstream applications. The invention also provides a method and workflow for analyzing genome structural variation using the constructed pan-genome together with second-generation sequencing data, for which complete program code has been written, achieving efficient and accurate mining of structural variation from relatively low-cost second-generation sequencing data.

1. A method for constructing a pan-genome, comprising the steps of:

1) setting a reference genome and one or more alignment genomes; if there are multiple alignment genomes, ordering them as the first-round alignment genome, the second-round alignment genome, and so on, according to the order set by the user;

2) splitting the reference genome and the alignment genome sequences into individual chromosome sequences, and setting the name correspondence between the same chromosomes of the reference genome and the alignment genome;

3) aligning the sequences of the same chromosomes set in step 2) using alignment software, to obtain the sequence collinearity features of the reference genome and the alignment genome on each chromosome;

4) identifying structural variation sites from the sequence collinearity features obtained in step 3) using structural variation extraction software;

5) screening the structural variation sites generated in step 4), selecting the structural variations that are insertions relative to the reference genome, and inserting the sequence of each insertion into the corresponding position of the reference genome, thereby generating a new genome sequence that contains all newly inserted fragments together with the original reference genome sequence and forms the pan-genome, and generating a file recording the insertion site information;

6) if the user inputs more than two genomes, taking the second and subsequent alignment genomes in turn as the alignment genome file, taking the pan-genome generated in the previous round as the reference genome, and repeating steps 2)-5), finally generating a pan-genome file and a file containing the insertion site variation information.

2. The method for constructing a pan-genome of claim 1, wherein the alignment software of step 3) comprises MUMmer and Lastz.

3. The method for constructing a pan-genome of claim 1, wherein the structural variation extraction software of step 4) comprises SVMU.

4. A method for mining a structural variation corresponding to a pan-genome constructed according to any one of claims 1 to 3, comprising the steps of:

a) taking the pan-genome constructed by the method as the reference genome, and aligning Illumina second-generation sequencing data to the reference genome using alignment software to generate an alignment file;

b) extracting the sequencing coverage of the inserted variation sites according to the variation site information file generated in step 6) of claim 1, to generate a variation site sequencing coverage file;

c) determining the presence or absence of each insertion structural variation by applying a coverage threshold to the variation site sequencing coverage file of step b), thereby obtaining the genome structural variation profile of the sample from Illumina second-generation sequencing data.

5. The method for mining a structural variation corresponding to a pan-genome of claim 4, wherein the alignment software used in step a) comprises Bowtie 2.

6. The method for mining the corresponding structural variation of the pan-genome according to claim 4, wherein the sequence extraction software of step b) comprises Samtools.

7. The method for mining structural variation corresponding to a pan-genome of claim 4, wherein the coverage threshold of step c) is set by the user; if the alignment coverage of the Illumina short-read sequencing data at a variation site is greater than the threshold, the sample genome sequence covers the variant fragment and the variation is present in the sample; otherwise, the variation is absent from the sample.

Technical Field

The invention belongs to the technical field of genome data analysis, and particularly relates to a pan-genome construction method and a corresponding structural variation mining method.

Background

A pan-genome is the sum of all genomic variation in a population. By capturing and presenting the entire genomic variation of the population, the pan-genome provides functional genomics studies with a complete reference genome that encompasses that variation. The pan-genome has important applications in genome variation analysis, especially in the analysis of genome structural variation.

Current pan-genome construction strategies and technologies have major limitations. Common approaches include iterative assembly from second-generation sequencing data and the map-to-pan strategy, in which sequencing data are aligned to a reference genome. Pan-genomes constructed with these approaches are of low quality and poor completeness, which greatly limits their use in subsequent analysis. A pan-genome constructed by comparing multiple completely assembled genomes has very high completeness and quality, and this is currently the better construction strategy.

Several technical schemes currently exist for constructing a pan-genome from whole-genome comparison, among which constructing a graph pan-genome based on graph theory is the strategy currently in use. This approach preserves more of the genomic variation within the population, but the graph pan-genome has significant drawbacks in presenting and subsequently using that variation. First, the graph pan-genome organizes all genomic variation in an extremely complex way, forming a complex multi-dimensional variation information structure that researchers find hard to understand intuitively and hard to process and analyze directly, so it is very difficult to apply widely in research. In addition, the graph pan-genome demands large amounts of computing resources in use, which limits its application in large-scale analysis. Therefore, how to present complex genome structural variation reasonably and efficiently, and provide researchers with a pan-genome that is simple and easy to work with, is a fundamental technical problem to be solved in the field of pan-genome construction and application.

The most important application of the pan-genome is the analysis of genome structural variation. A pan-genome-based structural variation analysis method is therefore closely tied to the pan-genome construction method. Pan-genomes constructed by different methods organize their variation data differently, so a structural variation analysis method matched to each construction scheme must be created. Only when the pan-genome construction method is used together with its matching structural variation analysis method can the advantages of the pan-genome be fully exploited and efficient, accurate analysis of genome structural variation be achieved.

Disclosure of Invention

To address the above problems, the invention provides a pan-genome construction method and a corresponding structural variation mining method. By adding a variation information file, the method simplifies multi-dimensional variation information and constructs a pan-genome that is linear in form and supports efficient analysis of multiple types of structural variation. Using the constructed pan-genome as the reference genome makes it possible to identify and analyze genome structural variation from Illumina second-generation sequencing data efficiently and accurately.

The technical content of the invention is as follows:

The invention provides a pan-genome construction method, which comprises the following steps:

1) setting a reference genome and one or more alignment genomes; if there are multiple alignment genomes, ordering them as the first-round alignment genome, the second-round alignment genome, and so on, according to the order set by the user;

2) splitting the reference genome and the alignment genome sequences into individual chromosome sequences, and setting the name correspondence between the same chromosomes of the reference genome and the alignment genome;

3) aligning the sequences of the same chromosomes set in step 2) using alignment software, to obtain the sequence collinearity features of the reference genome and the alignment genome on each chromosome;

4) identifying structural variation sites from the sequence collinearity features obtained in step 3) using structural variation extraction software;

5) screening the structural variation sites generated in step 4), selecting the structural variations that are insertions relative to the reference genome, and inserting the sequence of each insertion into the corresponding position of the reference genome, thereby generating a new genome sequence that contains all newly inserted fragments together with the original reference genome sequence and forms the pan-genome, and generating a file recording the insertion site information;

6) if the user inputs more than two genomes, taking the second and subsequent alignment genomes in turn as the alignment genome file, taking the pan-genome generated in the previous round as the reference genome, and repeating steps 2)-5), finally generating a pan-genome file and a file containing the insertion site variation information.

The alignment software of step 3) comprises MUMmer and Lastz;

the structural variation extraction software of step 4) comprises SVMU;

the insertion site information of step 5) comprises the insertion site, the insertion length, and the like.

The invention also provides a structural variation mining method corresponding to the pan-genome, which comprises the following steps:

a) taking the pan-genome constructed by the above method as the reference genome, and aligning Illumina second-generation sequencing data to the reference genome using alignment software to generate an alignment file;

b) extracting the sequencing coverage of the inserted variation sites according to the variation site information file generated in step 6) of the construction method, to generate a variation site sequencing coverage file;

c) determining the presence or absence of each insertion structural variation by applying a coverage threshold to the variation site sequencing coverage file of step b), thereby obtaining the genome structural variation profile of the sample from Illumina second-generation sequencing data.

Illumina second-generation sequencing is currently the most mainstream low-cost sequencing solution, but because its reads are short, it performs very poorly in analyzing genome structural variation. By constructing the pan-genome, the invention achieves efficient and accurate analysis and identification of genome structural variation and overcomes the current technical bottleneck of structural variation analysis with second-generation sequencing.

The alignment software used in step a) comprises Bowtie 2;

the sequence extraction software of step b) comprises Samtools;

the coverage threshold of step c) is set by the user: if the alignment coverage of the Illumina short-read sequencing data at a variation site is greater than the threshold, the sample genome sequence covers the variant fragment and the variation is present in the sample; otherwise, the variation is absent from the sample.

The invention has the following beneficial effects:

The construction of a linear pan-genome is a new pan-genome construction strategy and method. The pan-genome is obtained by comparing completely assembled genomes, which yields a high-quality construction. The invention puts the structural variations obtained by genome comparison back onto the linear genome and adds a structural variation site information file, thereby simplifying multi-dimensional variation information and constructing a pan-genome that is linear in form and supports efficient analysis of multiple types of structural variation. This construction scheme not only captures the variation among genomes but also organizes complex variation structures linearly, so that the pan-genome is easier to read and understand and, more importantly, easier to use downstream. Because the genome is organized linearly, the computing resources required in subsequent applications are greatly reduced, which makes large-scale application feasible.

With the pan-genome constructed by this method as the reference genome, sequencing data from the Illumina second-generation sequencing platform are used to mine and analyze the corresponding structural variations, which enables accurate structural variation analysis at large scale. The invention constructs a high-quality pan-genome, takes it as the reference genome, aligns second-generation sequencing data to it, and combines the result with the variation information file of the pan-genome, thereby achieving efficient and accurate analysis and identification of genome structural variation and overcoming the current technical bottleneck of structural variation analysis with second-generation sequencing.

Drawings

FIG. 1 is a schematic diagram of the pan-genome construction strategy and flow of the present invention;

FIG. 2 is a schematic diagram of the principle and process for identifying genomic structural variation using the pan-genomic and Illumina sequencing data constructed in the present invention.

Detailed Description

The invention is described in further detail below with reference to specific embodiments and the accompanying drawings. These embodiments merely illustrate the invention and do not limit its scope, which is defined by the appended claims; equivalent modifications made by those skilled in the art after reading this disclosure also fall within that scope.

All the raw materials and reagents of the invention are conventional market raw materials and reagents unless otherwise specified.

Example 1

Construction of rice pan-genome:

The rice Nipponbare genome (IRGSP 1.0, downloaded from the website https://rapdb.dna.affrc.go.jp/) has the genome sequence file Nipponbare.fasta; L32 and P106 are completely assembled genomes of two rice varieties, with the genome sequence files L32.fasta and P106.fasta, respectively.

The pan-genome was constructed as follows:

1) Generate a file named location, with the following contents:

Mummer=/home/lfp/soft/mummer-4.0.0beta2/

Lastz=/home/lfp/soft/lastz/src/

svmu=/home/lfp/soft/svmu/

bowtie2=/home/lfp/miniconda3/bin/

Samtools=/home/lfp/miniconda3/bin/

ref=Nipponbare.fasta

query=L32.fasta, P106.fasta

This file sets the locations of the Mummer, Lastz, svmu, bowtie2 and Samtools executables so that they can be called at run time. It sets ref (the reference genome) to Nipponbare.fasta and query (the alignment genomes) to L32.fasta and P106.fasta, so the pan-genome is constructed first with L32 and then with P106.
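The configuration file above is a flat list of `key=value` lines. As a minimal sketch of how such a file could be read (the function name `parse_location_config` is hypothetical, and splitting the `query` value on commas to preserve the round order is an assumption based on the listing, not the patent's actual program code):

```python
def parse_location_config(text):
    """Parse 'key=value' lines into a dict.

    Assumption: the 'query' value is a comma-separated list whose order
    defines the order of the construction rounds.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key == "query":
            config[key] = [q.strip() for q in value.split(",") if q.strip()]
        else:
            config[key] = value
    return config


# Contents copied from the listing in the text.
EXAMPLE = """\
Mummer=/home/lfp/soft/mummer-4.0.0beta2/
Lastz=/home/lfp/soft/lastz/src/
svmu=/home/lfp/soft/svmu/
bowtie2=/home/lfp/miniconda3/bin/
Samtools=/home/lfp/miniconda3/bin/
ref=Nipponbare.fasta
query=L32.fasta, P106.fasta
"""

config = parse_location_config(EXAMPLE)
print(config["ref"], config["query"])
```

Keeping `query` as an ordered list mirrors the requirement that the alignment genomes be processed in the user-defined order.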

2) Generate the pair.cfg file, which pairs the names of the same chromosomes in the reference genome file and the alignment genome file. Its contents are as follows:

chr01——chr01_RaGOO;

chr02——chr02_RaGOO;

and so on, through

chr12——chr12_RaGOO.

According to the pair.cfg file, the reference genome and the alignment genome are split by chromosome, and the corresponding chromosomes specified by the user are passed in pairs to the subsequent steps;
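The pairing and splitting in step 2) can be sketched as follows. This is an illustration only: the function names are hypothetical, the separator is taken literally from the listing above, and the toy FASTA strings stand in for the real genome files.

```python
def parse_pairs(text, sep="——"):
    """Return [(reference_chromosome, alignment_chromosome), ...].

    Assumption: each line has the form '<ref name><sep><query name>;'
    with the separator exactly as printed in the listing.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip().rstrip(";。.")
        if sep not in line:
            continue  # ignore filler lines such as "and so on"
        ref_name, query_name = line.split(sep, 1)
        pairs.append((ref_name.strip(), query_name.strip()))
    return pairs


def split_fasta(text):
    """Split FASTA text into {sequence name: sequence}.

    The name is the first word of each '>' header line.
    """
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


pairs = parse_pairs("chr01——chr01_RaGOO;\nchr02——chr02_RaGOO;")
# Toy sequences, not real rice data.
ref = split_fasta(">chr01 toy\nACGT\nAC\n>chr02\nGG\n")
qry = split_fasta(">chr01_RaGOO\nACGTTTAC\n>chr02_RaGOO\nGGA\n")
# One job per chromosome pair, ready to hand to the alignment step.
jobs = [(r, q, ref[r], qry[q]) for r, q in pairs]
print(jobs[0][:2])
```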

3) The same chromosomes set in step 2) are aligned using the alignment software MUMmer and Lastz: the program calls the alignment software and aligns each chromosome of the reference genome against the same chromosome of the alignment genome, yielding the sequence collinearity features of each chromosome of the two genomes;

4) The structural variation sites are identified and mined from the sequence collinearity features obtained in step 3) using structural variation extraction software. The structural variations obtained are screened, those that are insertions relative to the reference genome are selected, and the information of these insertion variations is extracted into an information file as follows:

chr01 319631 chr01_RaGOO 399526 401720 2194;

where column 1 is the chromosome name in the reference genome, column 2 is the physical position of the insertion on the reference genome, column 3 is the chromosome name in the alignment genome, column 4 is the start position on the alignment genome of the sequence inserted relative to the reference genome, column 5 is the end position of the inserted sequence on the alignment genome, and column 6 is the length of the inserted sequence.
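A record in this six-column layout can be read into a small typed structure. The sketch below is illustrative only (the class and function names are hypothetical); it also checks that, in the example row, column 6 equals column 5 minus column 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertionCall:
    ref_chrom: str     # column 1: chromosome name in the reference genome
    ref_pos: int       # column 2: insertion position on the reference genome
    query_chrom: str   # column 3: chromosome name in the alignment genome
    query_start: int   # column 4: start of the inserted sequence on the alignment genome
    query_end: int     # column 5: end of the inserted sequence on the alignment genome
    length: int        # column 6: length of the inserted sequence


def parse_insertion_line(line):
    """Parse one whitespace-separated record; a trailing ';' is tolerated."""
    fields = line.strip().rstrip(";").split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    ref_chrom, ref_pos, query_chrom, q_start, q_end, length = fields
    return InsertionCall(ref_chrom, int(ref_pos), query_chrom,
                         int(q_start), int(q_end), int(length))


# Row copied from the text.
call = parse_insertion_line("chr01 319631 chr01_RaGOO 399526 401720 2194;")
print(call)
print(call.query_end - call.query_start == call.length)
```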

5) According to the source of each insertion variation extracted in step 4) and its physical position on the alignment genome, the sequence is extracted from the alignment genome sequence file and inserted into the reference genome at the insertion site obtained in step 4). This generates a linear pan-genome sequence file, together with a file recording the insertion site information, as follows:

1-11-chr01 chr01 325910 325911 328104 2194;

where column 1 is the name of the structural variation (named according to the chromosome and the ordinal number of the variation), column 2 is the chromosome name in the reference genome, column 3 is the physical position on the reference genome before the insertion, column 4 is the start position of the inserted sequence after insertion into the reference genome, column 5 is the end position of the inserted sequence after insertion into the reference genome, and column 6 is the length of the inserted sequence.

The resulting pan-genome is 381 Mb in size, 8 Mb larger than the 373 Mb reference genome (Nipponbare).
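The core of step 5) is splicing insertion sequences into a linear reference while recording where each one lands. The following is a minimal sketch under stated assumptions, not the patent's program: coordinates are taken to be 1-based and inclusive, each insertion is placed immediately after reference base `ref_pos`, and the function name and record layout are hypothetical. The inclusive convention agrees with the example row above, where 328104 - 325911 + 1 = 2194.

```python
def insert_segments(ref_seq, insertions):
    """Splice insertions into ref_seq and report their new coordinates.

    insertions: list of (ref_pos, segment); the segment is placed right
    after the 1-based reference position ref_pos (0 means "before base 1").
    Returns (pan_seq, records), where each record is
    (index, ref_pos, pan_start, pan_end, length) with 1-based inclusive
    pan-genome coordinates. This coordinate convention is an assumption.
    """
    pieces = []
    records = []
    prev = 0     # number of reference bases already copied
    offset = 0   # total inserted bases so far
    ordered = sorted(insertions, key=lambda item: item[0])
    for idx, (ref_pos, segment) in enumerate(ordered, start=1):
        if not prev <= ref_pos <= len(ref_seq):
            raise ValueError(f"insertion position out of range: {ref_pos}")
        pieces.append(ref_seq[prev:ref_pos])
        pieces.append(segment)
        pan_start = ref_pos + offset + 1
        pan_end = pan_start + len(segment) - 1
        records.append((idx, ref_pos, pan_start, pan_end, len(segment)))
        offset += len(segment)
        prev = ref_pos
    pieces.append(ref_seq[prev:])
    return "".join(pieces), records


# Toy example, not real rice data.
pan_seq, records = insert_segments("AAAACCCC", [(4, "GG"), (6, "TTT")])
print(pan_seq)
for rec in records:
    print(rec)
```

Tracking the running offset is what lets later insertions be reported in pan-genome coordinates even though their positions were called against the original reference.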

6) For the second alignment genome supplied, namely the P106 genome, the pan-genome generated in step 5) is used as the reference genome and P106.fasta as the alignment genome, and steps 2)-5) are repeated, finally generating a pan-genome file and the corresponding insertion site variation information file. The resulting pan-genome is 391 Mb in size, 18 Mb larger than the reference genome and 10 Mb larger than the first-round pan-genome.
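The reported sizes are mutually consistent: 373 Mb + 8 Mb = 381 Mb after the first round, and 381 Mb + 10 Mb = 391 Mb after the second. The round-by-round control flow can be sketched as a loop in which each round's output becomes the next round's reference. Here `one_round` is a hypothetical callable standing in for steps 2)-5), and `toy_round` is a deliberately trivial stand-in rather than a real alignment.

```python
def build_pangenome(reference, queries, one_round):
    """Run the construction rounds in the user-defined order.

    one_round(current_reference, query) must return
    (new_reference, insertion_records); it stands in for steps 2)-5).
    """
    pan = reference
    all_records = []
    for round_no, query in enumerate(queries, start=1):
        pan, records = one_round(pan, query)
        # Tag each record with the round that produced it.
        all_records.extend((round_no, rec) for rec in records)
    return pan, all_records


def toy_round(ref, query):
    """Stand-in only: append the query if it is not already contained."""
    if query in ref:
        return ref, []
    return ref + query, [(len(ref), len(query))]


pan, recs = build_pangenome("AAAA", ["CC", "AA", "GGG"], toy_round)
print(pan)
print(recs)
```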

FIG. 1 is a schematic diagram of the pan-genome construction strategy and workflow, in which step 1 shows the first round of pan-genome construction and step 2 shows the second round. The structural variations obtained by genome comparison are put back onto the linear genome and a structural variation site information file is added, which simplifies the multi-dimensional variation information and yields a high-quality pan-genome in linear form.

Example 2

A method for mining genome structural variation based on Illumina second-generation sequencing data and the pan-genome comprises the following steps:

a) The pan-genome sequence constructed in Example 1 was used as the reference genome, and the genome structural variation of rice material R91 was analyzed and identified using its Illumina sequencing data. The R91 sequencing data were aligned to the pan-genome using the alignment software Bowtie2 to generate an alignment file. Only 82.52% of the R91 data aligned to the original reference genome (Nipponbare), whereas 93.25% aligned to the pan-genome. This shows that the pan-genome constructed in Example 1 is more representative than the original reference genome (Nipponbare), significantly improves the alignment rate of sequencing data, and provides an important data basis for capturing more structural variations;
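The text names Bowtie2 and Samtools but does not give the command lines used. As a sketch only, the commands for step a) might be assembled as follows; paired-end input, the thread count, and every file name here are assumptions, and only standard options (`bowtie2-build`, `bowtie2 -x/-1/-2/-S/-p`, `samtools sort -o`, `samtools index`) are used.

```python
import subprocess


def alignment_commands(pan_fasta, reads_1, reads_2, prefix, threads=4):
    """Build (but do not run) the commands for indexing, aligning and sorting.

    All file names are hypothetical placeholders; paired-end reads are assumed.
    """
    index = f"{prefix}.idx"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        ["bowtie2-build", pan_fasta, index],
        ["bowtie2", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
    ]


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


cmds = alignment_commands("pan.fasta", "R91_1.fq.gz", "R91_2.fq.gz", "R91")
for cmd in cmds:
    print(" ".join(cmd))
```

Building the commands as argument lists and running them separately keeps the sketch testable without the tools installed; `run_all(cmds)` would execute them where Bowtie2 and Samtools are on the PATH.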

b) According to the variation site information file generated in step 6) of Example 1, the short-read coverage at each site is extracted from the alignment file using the sequence extraction software Samtools to generate a variation site sequencing coverage file, as follows:

1-11-chr01 chr01 325911 328104 2194 14;

where column 1 is the name of the variation, column 2 is the chromosome name in the reference genome (the pan-genome), column 3 is the start position of the structural variation in the pan-genome, column 4 is the end position of the structural variation in the pan-genome, column 5 is the length of the inserted fragment, and column 6 is the average coverage of the sequencing data over the structural variation segment.
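Column 6 is an average over the inserted interval. One way to compute such an average, sketched here under assumptions, is from per-base depth records in the three-column form produced by `samtools depth` (chromosome, 1-based position, depth). Dividing by the interval length rather than by the number of records counts positions missing from the input as zero depth. The function name and toy values are hypothetical, and the patent does not state how its average was computed.

```python
def mean_coverage(depth_lines, chrom, start, end):
    """Average depth over the 1-based inclusive interval [start, end].

    depth_lines: iterable of 'chrom<TAB>pos<TAB>depth' strings.
    Positions absent from the input contribute zero depth.
    """
    if end < start:
        raise ValueError("end must be >= start")
    total = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, pos, depth = fields[0], int(fields[1]), int(fields[2])
        if name == chrom and start <= pos <= end:
            total += depth
    return total / (end - start + 1)


# Toy depth records, not real data.
toy_depth = [
    "chr01\t5\t10",
    "chr01\t6\t20",
    "chr01\t9\t30",   # outside the interval queried here
    "chr02\t5\t99",   # different chromosome
]
avg = mean_coverage(toy_depth, "chr01", 5, 8)
print(avg)
```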

c) The structural variation status of the sequenced sample (R91) is determined by applying a coverage threshold to the variation site coverage file of step b).

This example uses Illumina sequencing data with an average depth of 15×; fragments with coverage below 5 are called absent, and fragments with coverage greater than 5 are called present.

d) From the coverage data of step c), the presence or absence of each structural variation fragment is analyzed against the pan-genome of Example 1 to obtain the structural variation result for R91.

The Illumina short-read sequencing data are data generated by sequencers from the US company Illumina.

The coverage threshold is set by the user: if the alignment coverage of the Illumina short-read sequencing data at a variation site is greater than the threshold, the variant fragment is present in the sample genome; otherwise, it is absent.
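The presence/absence rule can be sketched as a single comparison per record of the coverage file. The function name is hypothetical; the threshold of 5 comes from this example, and because the text does not say how a coverage exactly equal to the threshold is treated, calling it absent here is an assumption.

```python
def call_presence(coverage_rows, threshold=5.0):
    """Map each variation name to True (present) or False (absent).

    coverage_rows: lines in the six-column coverage file layout, where
    column 1 is the variation name and column 6 the average coverage.
    Assumption: coverage exactly equal to the threshold is called absent.
    """
    calls = {}
    for line in coverage_rows:
        fields = line.strip().rstrip(";").split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        name, coverage = fields[0], float(fields[5])
        calls[name] = coverage > threshold
    return calls


rows = [
    "1-11-chr01 chr01 325911 328104 2194 14;",  # row from the text
    "toy-2 chr01 100 199 100 3",                # hypothetical low-coverage row
]
calls = call_presence(rows)
print(calls)
```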

FIG. 2 is a schematic diagram of the principle and workflow for identifying genome structural variation using the pan-genome constructed by the invention and Illumina sequencing data; the specific operation is as described in Example 2. Sequencing data from the Illumina second-generation sequencing platform are aligned to the constructed pan-genome, and the presence or absence of each specific structural variation is identified by analyzing the coverage of the sequencing data, which enables low-cost and accurate analysis of genome structural variation.
