Population variation detection and analysis method without a reference genome

Document No.: 685295 Publication date: 2021-04-30

Reading note: this technology, "A population variation detection and analysis method without a reference genome", was designed by Xu Hao, Jiang Lirong and Sun Zikui on 2020-12-29. Main content: the invention discloses a population variation detection and analysis method without a reference genome, comprising the following steps: 1) samples are sequenced by the dd-RAD method; for the data obtained for each sample after sequencing, overlapping read1 and read2 are joined using the flash software package, the sequences of each sample are clustered with clustering software, and the consensus sequences of each sample are extracted; 2) the consensus sequences of all samples obtained in step 1) are merged, then clustered and filtered to obtain the consensus sequences of the population. The invention makes population evolution analysis without a reference genome more efficient and can greatly improve the speed and accuracy of variation detection.

1. A population variation detection and analysis method without a reference genome, characterized by comprising the following steps:

1) sequencing the samples by the dd-RAD method; for the data obtained for each sample after sequencing, joining overlapping read1 and read2 using the flash software package, clustering the sequences of each sample with clustering software, and extracting the consensus sequences of each sample;

2) merging the consensus sequences of all samples obtained in step 1), then clustering and filtering to obtain the consensus sequences of the population;

3) joining the consensus sequences with runs of N bases to obtain a pseudo-reference genome;

4) performing variant detection and filtering against the pseudo-reference genome of step 3) according to a reference-based variant detection pipeline to obtain the detection information.

2. The reference genome-free population variation detection and analysis method of claim 1, wherein:

for the dd-RAD sample sequencing in step 1), the library is constructed with a paired-end read length of 150 bp and an insert size of 200-500 bp.

3. The reference genome-free population variation detection and analysis method of claim 1, wherein:

the clustering software in step 1) is ustacks from the Stacks software package.

4. The reference genome-free population variation detection and analysis method of claim 1, wherein:

the filtering condition in step 2) is that the nucleic acid sequence similarity during clustering is greater than 98%, and the coverage of both reference and query during clustering is greater than 95%.

5. The reference genome-free population variation detection and analysis method of claim 1, wherein: the default number of N bases used as a spacer is 1000.

6. The reference genome-free population variation detection and analysis method of claim 1, wherein: the reference-based variant detection pipeline in step 4) uses the bwa software package and the gatk software package.

Technical Field

The invention relates to the technical field of gene detection, in particular to a group variation detection analysis method without a reference genome.

Background

Reference-free reduced-representation variant detection targets species that have no reference genome or whose reference assembly is of poor quality. A reduced-representation genome sequencing technology (single enzyme digestion, RAD; double enzyme digestion, GBS) is generally used: software clusters and aligns the short sequence fragments (tags) of different samples, finds the variation among sites, and supports the development of molecular markers.

Population evolution analysis can be used to study differences in population structure and gene flow between subpopulations of the same species, as well as population structure characteristics across different species. Many species still have no reference genome, however, so population evolution analysis without a reference genome is required. Samples are sequenced by the dd-RAD method, and the reference-free reduced-representation population evolution analysis is carried out once the data are obtained.

The variant detection tool currently used in reference-free reduced-representation population evolution projects comes from the Stacks package (v1.48). In practice this step consumes a large amount of computing time and resources, and the consumption grows rapidly as the number of samples increases, which severely restricts normal project operation.

Disclosure of Invention

The invention provides a population variation detection analysis method without a reference genome.

The scheme of the invention is as follows:

A population variation detection and analysis method without a reference genome comprises the following steps:

1) sequencing the samples by the dd-RAD method; for the data obtained for each sample after sequencing, joining overlapping read1 and read2 using the flash software package, clustering the sequences of each sample with clustering software, and extracting the consensus sequences of each sample;

2) merging the consensus sequences of all samples obtained in step 1), then clustering and filtering to obtain the consensus sequences of the population;

3) joining the consensus sequences with runs of N bases to obtain a pseudo-reference genome;

4) performing variant detection and filtering against the pseudo-reference genome of step 3) according to a reference-based variant detection pipeline to obtain the detection information.

As a preferred technical scheme, for the dd-RAD sample sequencing in step 1), the library is constructed with a paired-end read length of 150 bp and an insert size of 200-500 bp.

As a preferred technical scheme, the clustering software in step 1) is ustacks from the Stacks software package.

As a preferred technical scheme, the filtering condition in step 2) is that the nucleic acid sequence similarity during clustering is greater than 98%, and the coverage of both reference and query during clustering is greater than 95%.

Preferably, the default number of N bases used as a spacer is 1000.

Preferably, the reference-based variant detection pipeline in step 4) uses the bwa software package and the gatk software package.

The method for population variation detection and analysis without a reference genome adopts the following technical scheme: 1) samples are sequenced by the dd-RAD method; for the data obtained for each sample after sequencing, overlapping read1 and read2 are joined using the flash software package, the sequences of each sample are clustered with clustering software, and the consensus sequences of each sample are extracted; 2) the consensus sequences of all samples obtained in step 1) are merged, then clustered and filtered to obtain the consensus sequences of the population; 3) the consensus sequences are joined with runs of N bases to obtain a pseudo-reference genome; 4) variant detection and filtering are performed against the pseudo-reference genome of step 3) according to a reference-based variant detection pipeline to obtain the detection information.

The invention has the following advantages: 1. population evolution analysis without a reference genome becomes more efficient, and the speed and accuracy of variant detection can be greatly improved;

2. data can be filtered and screened more flexibly, operation is convenient, and the workflow is simplified.

Drawings

FIG. 1 is a block diagram of the framework of the present invention.

Detailed Description

In order to make up for the above deficiencies, the present invention provides a method for detecting and analyzing population variation without reference genome to solve the above problems in the background art.

A population variation detection and analysis method without a reference genome comprises the following steps:

1) sequencing the samples by the dd-RAD method; for the data obtained for each sample after sequencing, joining overlapping read1 and read2 using the flash software package, clustering the sequences of each sample with clustering software, and extracting the consensus sequences of each sample;

2) merging the consensus sequences of all samples obtained in step 1), then clustering and filtering to obtain the consensus sequences of the population;

3) joining the consensus sequences with runs of N bases to obtain a pseudo-reference genome;

4) performing variant detection and filtering against the pseudo-reference genome of step 3) according to a reference-based variant detection pipeline to obtain the detection information.

For the dd-RAD sample sequencing in step 1), the library is constructed with a paired-end read length of 150 bp and an insert size of 200-500 bp.

The clustering software in step 1) is ustacks from the Stacks software package.

The filtering condition in step 2) is that the nucleic acid sequence similarity during clustering is greater than 98%, and the coverage of both reference and query during clustering is greater than 95%.

The default number of N bases used as a spacer is 1000.

The reference-based variant detection pipeline in step 4) uses the bwa software package and the gatk software package.

In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.

Example:

1) sequencing the samples by the dd-RAD method; for the data obtained for each sample after sequencing, joining overlapping read1 and read2 using the flash software package, clustering the sequences of each sample with clustering software, and extracting the consensus sequences of each sample;

2) merging the consensus sequences of all samples obtained in step 1), then clustering and filtering to obtain the consensus sequences of the population;

3) joining the consensus sequences with runs of N bases to obtain a pseudo-reference genome;

4) performing variant detection and filtering against the pseudo-reference genome of step 3) according to a reference-based variant detection pipeline to obtain the detection information.

For the dd-RAD sample sequencing in step 1), the library is constructed with a paired-end read length of 150 bp and an insert size of 200-500 bp.
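To see which fragments can be joined under these library settings, a quick check helps: two 150 bp reads taken from opposite ends of a fragment overlap by 2 × 150 minus the fragment length. The sketch below assumes that the insert size is the full length of the sequenced fragment, and it uses a minimum overlap of 10 bp as a hypothetical threshold chosen for illustration (the source does not state one).

```python
def read_overlap(fragment_len: int, read_len: int = 150) -> int:
    """Bases shared by read1 and read2 sequenced from opposite ends of one fragment."""
    # If the fragment is shorter than a read, the whole fragment is covered twice.
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


def can_merge(fragment_len: int, read_len: int = 150, min_overlap: int = 10) -> bool:
    """True if the read pair overlaps enough to be joined (min_overlap is an assumed threshold)."""
    return read_overlap(fragment_len, read_len) >= min_overlap


if __name__ == "__main__":
    for fragment_len in (200, 250, 290, 300, 500):
        print(fragment_len, read_overlap(fragment_len), can_merge(fragment_len))
```

Under these assumptions, only pairs from fragments of up to about 290 bp can be joined; pairs from longer fragments in the 200-500 bp range share no bases and stay unjoined at this step.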

The clustering software in step 1) is ustacks from the Stacks software package.

The filtering condition in step 2) is that the nucleic acid sequence similarity during clustering is greater than 98%, and the coverage of both reference and query during clustering is greater than 95%.
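The filtering condition can be sketched as a simple predicate over pairwise cluster hits. This is a minimal illustration rather than the implementation used in the example: the `Hit` record and its field names are hypothetical, identity is assumed to be given as a percentage, and coverage is assumed to be the aligned length divided by the full length of each sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between a query and a reference consensus (hypothetical record)."""
    query_id: str
    ref_id: str
    identity_pct: float   # percent identity over the aligned region
    aligned_len: int      # number of aligned bases
    query_len: int        # full length of the query sequence
    ref_len: int          # full length of the reference sequence


def passes_filter(hit: Hit, min_identity: float = 98.0, min_coverage: float = 0.95) -> bool:
    """Keep a hit only if identity > 98% and both query and reference coverage > 95%."""
    query_cov = hit.aligned_len / hit.query_len
    ref_cov = hit.aligned_len / hit.ref_len
    return (hit.identity_pct > min_identity
            and query_cov > min_coverage
            and ref_cov > min_coverage)


if __name__ == "__main__":
    hits = [
        Hit("s1_locus7", "s2_locus3", 99.2, 290, 295, 292),   # passes all three checks
        Hit("s1_locus8", "s2_locus9", 97.5, 290, 295, 292),   # identity too low
        Hit("s1_locus9", "s3_locus1", 99.5, 200, 295, 205),   # query coverage too low
    ]
    for hit in hits:
        print(hit.query_id, hit.ref_id, passes_filter(hit))
```

Requiring coverage on both sides means a short consensus that matches only part of a longer one is not merged into it.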

The default number of N bases used as a spacer is 1000.
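A minimal sketch of step 3), assuming the consensus sequences are joined end to end with a fixed-length run of N bases between neighbours (1000 by default, as stated above). The function names and the offset table are illustrative additions that do not come from the source; the offsets make it possible to map a position on the pseudo-reference back to its locus.

```python
from typing import Dict, List, Optional, Tuple


def build_pseudo_reference(
    consensus: List[Tuple[str, str]], n_spacer: int = 1000
) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join (locus_id, sequence) pairs with runs of N.

    Returns the pseudo-reference string and a map of
    locus_id -> (start, end) as 0-based half-open coordinates.
    """
    spacer = "N" * n_spacer
    offsets: Dict[str, Tuple[int, int]] = {}
    parts: List[str] = []
    position = 0
    for index, (locus_id, seq) in enumerate(consensus):
        if index > 0:
            parts.append(spacer)
            position += n_spacer
        offsets[locus_id] = (position, position + len(seq))
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets


def locate(offsets: Dict[str, Tuple[int, int]], pos: int) -> Optional[Tuple[str, int]]:
    """Map a 0-based pseudo-reference position to (locus_id, position within locus)."""
    for locus_id, (start, end) in offsets.items():
        if start <= pos < end:
            return locus_id, pos - start
    return None  # the position falls inside an N spacer


if __name__ == "__main__":
    loci = [("locus1", "ACGTACGT"), ("locus2", "TTGACA")]
    reference, offsets = build_pseudo_reference(loci, n_spacer=5)
    print(reference)
    print(offsets)
    print(locate(offsets, 14))
```

A plausible reason for a spacer much longer than the 150 bp reads is that no single read can then align across the junction between two unrelated loci, although the source does not state the rationale.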

The reference-based variant detection pipeline in step 4) uses the bwa software package and the gatk software package.
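As a rough sketch of what a reference-based pipeline built on bwa and gatk could look like, the function below only assembles command lines and runs nothing. The source names the two packages but not the subcommands or options, so the specific choices here (`bwa index`, `bwa mem`, `gatk HaplotypeCaller`, plus samtools for sorting and indexing, which the source does not mention) are assumptions, as are all file names.

```python
import shlex
from typing import List


def variant_calling_commands(
    reference: str, read1: str, read2: str, sample: str, threads: int = 8
) -> List[str]:
    """Return shell command strings for one sample; nothing is executed here."""
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    bam = f"{sample}.sorted.bam"
    ref_q, bam_q = shlex.quote(reference), shlex.quote(bam)
    return [
        # Index the pseudo-reference for bwa, samtools, and gatk.
        f"bwa index {ref_q}",
        f"samtools faidx {ref_q}",
        f"gatk CreateSequenceDictionary -R {ref_q}",
        # Align the paired reads and sort the alignments (samtools is an assumed helper).
        f"bwa mem -t {threads} -R {shlex.quote(read_group)} {ref_q} "
        f"{shlex.quote(read1)} {shlex.quote(read2)} | samtools sort -o {bam_q} -",
        f"samtools index {bam_q}",
        # Call variants for the sample against the pseudo-reference.
        f"gatk HaplotypeCaller -R {ref_q} -I {bam_q} -O {shlex.quote(sample + '.vcf.gz')}",
    ]


if __name__ == "__main__":
    for command in variant_calling_commands(
        "pseudo_ref.fa", "sample1_R1.fq.gz", "sample1_R2.fq.gz", "sample1"
    ):
        print(command)
```

Returning the commands as strings keeps the sketch easy to inspect before anything is run; the filtering of the resulting variants mentioned in step 4) is left out because the source gives no criteria for it.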

The method of the above example was compared with the existing reference-free reduced-representation variant detection method, as follows.

Test data:

2 groups, 3 samples per group, about 1G of data per sample (500M each for read1 and read2);

Computing resources:

8 CPUs, 16 GB RAM;

Time consumed by the key steps (from locus results to variant detection results):

the existing reference-free reduced-representation variant detection pipeline requires 16h28m30s;

the pipeline of the invention requires 4h15m2s;

the gap between the two widens as the number of samples and the amount of data increase.

Therefore, the invention effectively shortens the run time while achieving accurate detection results.
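The two run times reported for the test can be turned into a speed-up factor with a few lines of arithmetic. The parsing helper is illustrative; only the two durations come from the test.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a string such as '16h28m30s' to seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    existing = to_seconds("16h28m30s")   # existing reference-free pipeline
    proposed = to_seconds("4h15m2s")     # pipeline of the invention
    saved = existing - proposed
    print(f"existing: {existing} s, proposed: {proposed} s")
    print(f"speed-up: {existing / proposed:.2f}x")
    print(f"time saved: {saved // 3600}h{saved % 3600 // 60}m{saved % 60}s")
```

For this six-sample test, the pipeline of the invention is roughly 3.9 times faster and saves a little over 12 hours.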

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
