Analysis method based on family de novo mutations and application thereof

Document No.: 1818155. Published: 2021-11-09.

Summary: This technology, "Analysis method based on family de novo mutations and application thereof" (一种基于家系denovo突变的分析方法及其应用), was created by 刘志岩, 郭方 and 郑青松 on 2021-07-21. It relates to an analysis method based on family de novo mutations and its application, and belongs to the technical field of bioinformatics. To address the lack of a complete workflow for family de novo mutation analysis in the prior art, the invention filters and aligns de novo sequencing data, performs family analysis of SNVs and Indels, and filters the analysis results, thereby providing a more accurate and more comprehensive analysis method based on family de novo mutations. The method plays an important role in predicting common diseases, rare diseases and genetic diseases.

1. An analysis method based on family de novo mutations, comprising the steps of:

S1, data filtering: filtering the raw (off-machine) de novo sequencing data with fastp;

S2, alignment: aligning the filtered data to the human reference genome hg19, and performing quality control and statistics on the alignment results;

S3, detection of family de novo mutations: performing family analysis of SNVs and Indels on the parents and children in the family;

S4, filtering the results of the SNV and Indel family analysis.

2. The analysis method according to claim 1, wherein S1 specifically comprises:

S11, removing reads that contain adapters, with the adapter sequence identified automatically and filtered out;

S12, removing reads in which the proportion of N bases exceeds 10%;

S13, removing low-quality reads, wherein a low-quality read is a read in which bases with a quality value Q ≤ 5 account for more than 50% of the whole read.

3. The analysis method according to claim 1, wherein S2 specifically comprises:

S21, aligning the filtered data to the human reference genome hg19 with bwa;

S22, sorting the resulting BAM files with samtools;

S23, marking duplicates in the sorted BAM files and removing the duplicate sequences with picard;

S24, computing statistics such as sequencing coverage and alignment rate.

4. The analysis method according to claim 1, wherein S3 specifically comprises:

S31, obtaining the gVCF of each sample with GATK, and then performing family SNV and Indel detection on the parent and child samples in the family;

S32, predicting de novo mutation sites from the SNV and Indel results by a statistical method;

S33, annotating the family SNP and Indel mutation sites.

5. The analysis method according to claim 4, wherein in S32 the de novo mutation sites are predicted by filtering according to the genotypes of the parents and the child, the number of reads supporting the mutation at the site, and the genotype quality value.

6. The analysis method according to claim 4, wherein the annotation in S33 comprises: basic information annotation of the variant sites, gene and region information annotation, normal-population database annotation, and conservation prediction annotation.

7. The analysis method according to claim 1, wherein S4 specifically comprises:

S41, selecting missense, frameshift and non-frameshift mutation sites that are loss-of-function or predicted to be deleterious;

S42, filtering against normal-population mutation frequency databases, with the MAF filtering threshold adjustable according to project requirements;

S43, filtering out variants for which the number of reads supporting the variant site is less than or equal to 4;

S44, filtering by genotype: if the genotype of the variant site is homozygous, filtering out variants for which the ratio of reads supporting the mutation to reads covering the site is less than 0.8; if the genotype of the variant site is heterozygous, filtering out variants for which that ratio is < 0.2 or > 0.8;

S45, inspecting the coverage of the results with IGV and filtering out false-positive sites.

8. The analysis method according to claim 7, wherein the databases selected in S42 include the 1000 Genomes Project database, the esp6500siv2all database, the ExAC ALL database and the ExAC EAS database; and the filtering threshold is less than or equal to 0.05.

9. Use of the analysis method of any one of claims 1-8 for disease prediction.

Technical Field

The invention belongs to the technical field of biology, and particularly relates to an analysis method based on family de novo mutations and its application.

Background

With the development of high-throughput genome sequencing, many recent studies have shown that de novo mutations play an important role in familial diseases, and that potential pathogenic genes can be screened by whole-exome or whole-genome sequencing. At present, however, no complete workflow is available for family de novo mutation analysis. A method for analyzing family mutations based on de novo sequencing is therefore highly desirable.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a method for analyzing family mutations based on de novo sequencing, the method comprising the steps of:

S1, data filtering: filtering the raw (off-machine) de novo sequencing data with fastp;

S2, alignment: aligning the filtered data to the human reference genome hg19, performing quality control on the result, and computing alignment statistics after duplicate sequences have been removed;

S3, detection of family de novo mutations: performing family analysis of SNVs and Indels on the parents and children in the family;

S4, filtering the SNV and Indel results.

In an embodiment of the present invention, S1 specifically comprises:

S11, removing reads that contain adapters, with the adapter sequence identified automatically and filtered out;

S12, removing reads in which the proportion of N bases exceeds 10%;

S13, removing low-quality reads, wherein a low-quality read is a read in which bases with a quality value Q ≤ 5 account for more than 50% of the whole read.

In an embodiment of the present invention, S2 specifically comprises:

S21, aligning the filtered data to the human reference genome hg19 with bwa;

S22, sorting the resulting BAM files with samtools;

S23, marking duplicates in the sorted BAM files and removing the duplicate sequences with picard;

S24, computing statistics such as sequencing coverage and alignment rate.

In an embodiment of the present invention, S3 specifically comprises:

S31, obtaining the gVCF of each sample with GATK, and then performing family SNV and Indel detection on the parent and child samples in the family;

S32, predicting de novo mutation sites from the SNV and Indel results by a statistical method;

S33, annotating the family SNP and Indel mutation sites.

In one embodiment of the present invention, the de novo mutation sites in S32 are predicted by filtering according to the genotypes of the father, mother and child, the number of reads supporting the mutation at the site, and the genotype quality value.

In an embodiment of the present invention, the annotation in S33 comprises: basic information annotation of the variant sites, gene and region information annotation, normal-population database annotation, and conservation prediction annotation.

In an embodiment of the present invention, S4 specifically comprises:

S41, selecting missense, frameshift and non-frameshift mutation sites that are loss-of-function or predicted to be deleterious;

S42, filtering against normal-population mutation frequency databases, with the MAF screening threshold adjustable according to project requirements;

S43, filtering out variants for which the number of reads supporting the variant site is less than or equal to 4;

S44, filtering by genotype: if the genotype of the variant site is homozygous, filtering out variants for which the ratio of reads supporting the mutation to reads covering the site is less than 0.8; if the genotype of the variant site is heterozygous, filtering out variants for which that ratio is < 0.2 or > 0.8;

S45, inspecting the coverage of the results with IGV and filtering out false-positive sites.

In one embodiment of the present invention, the databases selected in S42 include the 1000 Genomes Project database, the esp6500siv2all database, the ExAC ALL database and the ExAC EAS database.

The invention also provides use of the analysis method in disease prediction.

The beneficial effects of the invention are as follows:

The analysis method provided by the invention can screen for possible pathogenic genes and predict the occurrence of common diseases, rare diseases or genetic diseases.

Drawings

FIG. 1 shows the coverage results, wherein A in FIG. 1 is the proportion of bases at each sequencing depth, the abscissa being the sequencing depth and the ordinate being the proportion of all bases that are covered at that depth, and B in FIG. 1 is the cumulative proportion of bases at different depths;

FIG. 2 shows the coverage of the results as viewed in IGV.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment 1: Analysis method based on family de novo mutations

The analysis method based on family de novo mutations comprises the following steps:

S1, data filtering: the raw (off-machine) de novo sequencing data are filtered with fastp, with the following specific filtering steps:

S11, removing reads that contain adapters, with the adapter sequence identified automatically and filtered out;

S12, removing reads in which the proportion of N bases exceeds 10%;

S13, removing low-quality reads, wherein a low-quality read is a read in which bases with a quality value Q ≤ 5 account for more than 50% of the whole read.
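The read-level criteria of S12 and S13 can be illustrated with a minimal Python sketch. This is not fastp itself and does not cover the adapter removal of S11; the function names `keep_read` and `phred33`, and the assumption that quality strings use Phred+33 encoding, are introduced here for illustration only.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def keep_read(seq, quals, max_n_frac=0.10, low_q=5, max_low_q_frac=0.50):
    """Return True if a read passes the S12 and S13 criteria.

    seq   : base string of the read
    quals : list of per-base Phred quality values, same length as seq
    """
    if not seq or len(seq) != len(quals):
        return False
    # S12: discard reads in which N bases make up more than 10% of the read.
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > max_n_frac:
        return False
    # S13: discard reads in which bases with Q <= 5 make up more than 50%.
    low_q_frac = sum(1 for q in quals if q <= low_q) / len(quals)
    if low_q_frac > max_low_q_frac:
        return False
    return True
```

Both thresholds are strict ("more than"), so a read with exactly 10% N bases or exactly 50% low-quality bases is kept in this sketch.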

S2, alignment: the filtered data are aligned to the human reference genome hg19, quality control is performed on the result, and alignment statistics are computed after duplicate sequences have been removed, specifically as follows:

S21, aligning the filtered data to the human reference genome hg19 with bwa;

S22, sorting the resulting BAM files with samtools;

S23, marking duplicates in the sorted BAM files and removing the duplicate sequences with picard;

S24, computing statistics such as sequencing coverage and alignment rate, wherein the coverage distribution is an important indicator of sequencing uniformity; the coverage statistics are shown in FIG. 1.
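The two coverage curves described for FIG. 1 can be sketched in Python from a list of per-base depths. This is only an illustration under stated assumptions: the function names are hypothetical, the depths are assumed to have been extracted already from the deduplicated BAM, and the cumulative curve is assumed to mean "proportion of bases with depth at least d", which the source does not spell out.

```python
from collections import Counter


def depth_distribution(depths):
    """Panel A: proportion of bases observed at each exact depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    counts = Counter(depths)
    total = len(depths)
    return {d: counts[d] / total for d in sorted(counts)}


def cumulative_distribution(depths):
    """Panel B (assumed definition): proportion of bases with depth >= d."""
    if not depths:
        raise ValueError("depths must not be empty")
    counts = Counter(depths)
    total = len(depths)
    result = {}
    running = 0
    # Walk from the deepest bin downwards, accumulating base counts.
    for d in sorted(counts, reverse=True):
        running += counts[d]
        result[d] = running / total
    return dict(sorted(result.items()))
```

A sharply peaked panel A curve and a panel B curve that stays close to 1 up to the target depth would both indicate uniform coverage.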

S3, detection of family de novo mutations: family analysis of SNVs and Indels is performed on the parents and children in the family, specifically as follows:

S31, obtaining the gVCF of each sample with GATK, and then performing family SNV and Indel detection on the parent and child samples in the family;

S32, predicting de novo mutation sites from the SNV and Indel results by a statistical method (filtering according to the genotypes of the father, mother and child, the number of reads supporting the mutation at the site, and the genotype quality value);

S33, annotating the family SNP and Indel mutation sites; the annotation includes: basic information annotation of the variant sites, gene and region information annotation, normal-population database (frequency) annotation, and conservation (deleteriousness) prediction annotation.

Basic information annotation of variant sites: this part gives detailed information on each variant site, including its coverage depth, the bases before and after the mutation, the zygosity (homozygous or heterozygous), and the quality value of the variant. This information plays an important role in family analysis and screening, and also allows the accuracy of the result to be evaluated.

Gene and region information annotation: this annotation gives the specific position of the variant site within the gene structure and the region it falls in (including the corresponding amino acid), which helps in understanding the association between the variant and disease.

Normal-population database (frequency) annotation: many variant sites in the population are polymorphisms (high frequency), whereas truly deleterious variant sites are generally of low frequency. The databases used mainly include the 1000 Genomes Project, ESP6500 and the like, which help in determining the frequency of a variant site and in finding pathogenic variant sites.

Conservation (deleteriousness) prediction annotation: an individual typically carries a great many variants, of which truly deleterious ones are rare. This part of the annotation therefore uses several internationally used prediction tools and databases to predict and evaluate the deleteriousness of the variant sites, and the prediction results assist in finding the truly deleterious sites.

Database annotation: annotating the gene containing the variant site against disease-related databases shows whether the site is associated with a particular type of disease and in which pathways the gene participates, which is of great significance for understanding the biological function of the gene.

Family genotype information: the genotypes of the father, mother and child, together with the number of reads supporting the mutation at the site and the genotype quality values.
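The kind of trio filter described in S32 can be sketched as follows. The source names the inputs (the three genotypes, the number of reads supporting the mutation, and the genotype quality) but gives no rule or cut-offs for this step, so the rule below (both parents homozygous reference, child carrying an alternate allele) and the thresholds `min_gq`, `min_child_alt_reads` and `max_parent_alt_reads` are assumptions chosen for illustration, as are the names `SampleCall` and `is_de_novo_candidate`.

```python
from dataclasses import dataclass


@dataclass
class SampleCall:
    genotype: str   # VCF-style genotype, e.g. "0/0", "0/1", "1/1"
    alt_reads: int  # reads supporting the alternate allele
    gq: int         # genotype quality


def alleles(genotype):
    """Split a VCF-style genotype, accepting phased or unphased separators."""
    return genotype.replace("|", "/").split("/")


def is_de_novo_candidate(father, mother, child,
                         min_gq=20, min_child_alt_reads=5,
                         max_parent_alt_reads=0):
    """Return True if the site looks like a de novo variant in the child.

    All thresholds are hypothetical defaults, not values from the source.
    """
    parent_alleles = alleles(father.genotype) + alleles(mother.genotype)
    parents_are_ref = all(a == "0" for a in parent_alleles)
    child_has_alt = any(a not in ("0", ".") for a in alleles(child.genotype))
    if not (parents_are_ref and child_has_alt):
        return False
    # Require confident genotype calls in all three family members.
    if min(father.gq, mother.gq, child.gq) < min_gq:
        return False
    # Require read support in the child and none in either parent.
    if child.alt_reads < min_child_alt_reads:
        return False
    if max(father.alt_reads, mother.alt_reads) > max_parent_alt_reads:
        return False
    return True
```

Sites with a missing parental genotype ("./.") are rejected by this sketch, since the absence of the variant in that parent cannot be confirmed.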

S4, filtering the SNV and Indel results, specifically as follows:

S41, selecting missense, frameshift and non-frameshift mutation sites that are loss-of-function or predicted to be deleterious;

S42, filtering against normal-population mutation frequency databases, with the MAF screening threshold adjustable according to project requirements; the selected databases include the 1000 Genomes Project database, the esp6500siv2all database, the ExAC ALL database and the ExAC EAS database. If most of the family samples come from Chinese families, filtering on the mutation frequency in Asian populations is added to the workflow for screening pathogenic mutation sites.

S43, filtering out variants for which the number of reads supporting the variant site is less than or equal to 4;

S44, filtering by genotype: if the genotype of the variant site is homozygous, filtering out variants for which the ratio of reads supporting the mutation to reads covering the site is less than 0.8; if the genotype of the variant site is heterozygous, filtering out variants for which that ratio is < 0.2 or > 0.8.

S45, inspecting the coverage of the results with IGV (see FIG. 2) and filtering out false-positive sites.
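The numeric filters in S42 to S44 can be combined in a short Python sketch. The read-count and allele-ratio cut-offs come from the steps above; the function name, the argument layout and the reading of the 0.05 threshold as "keep variants whose highest population frequency is at most 0.05" are assumptions made for this example. S41 and the manual IGV review in S45 are not modelled.

```python
def passes_s4_filters(pop_freqs, alt_reads, total_reads, zygosity,
                      maf_threshold=0.05):
    """Return True if a variant survives the S42, S43 and S44 filters.

    pop_freqs   : dict mapping database name to allele frequency, with
                  None (or a missing key) meaning the variant is absent
    alt_reads   : reads supporting the mutation at the site
    total_reads : reads covering the site
    zygosity    : "hom" or "het"
    """
    # S42: drop variants that are common in any normal-population database.
    known = [f for f in pop_freqs.values() if f is not None]
    if known and max(known) > maf_threshold:
        return False
    # S43: drop variants supported by 4 reads or fewer.
    if alt_reads <= 4:
        return False
    if total_reads <= 0:
        return False
    # S44: the allele ratio must be consistent with the called genotype.
    ratio = alt_reads / total_reads
    if zygosity == "hom":
        return ratio >= 0.8
    if zygosity == "het":
        return 0.2 <= ratio <= 0.8
    raise ValueError("zygosity must be 'hom' or 'het'")
```

For example, a heterozygous call with 20 supporting reads out of 40 and a highest population frequency of 0.001 is kept, while the same call with only 5 supporting reads out of 40 is dropped because its ratio of 0.125 falls below 0.2.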

The above embodiments represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. The scope of protection of this patent shall therefore be determined by the appended claims.
