Method and device for performing variation detection based on methylation sequencing data

Document No.: 1863201. Publication date: 2021-11-19.

Abstract: The present technology, a method and device for performing variation detection based on methylation sequencing data, was created by 黄毅, 林浩翔, 李俊, 朱彬彬, 易鑫 and 杨玲 on 2021-08-20. A method and a device for performing variation detection based on methylation sequencing data are provided. The method comprises: a candidate variation extraction step, which comprises extracting candidate variations from the sequencing data of a sample to be tested; an alignment file extraction step, which comprises locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region; a realignment step, which comprises realigning the alignment file to generate a realigned alignment file, namely a realignment file; an identification and correction step, which comprises identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file; and a variation detection step, which comprises performing variation detection on the corrected alignment file to obtain a detection result. Variation detection based on methylation sequencing data is achieved through realignment and through the identification and correction of conversion sites.

1. A method for performing variation detection based on methylation sequencing data, comprising:

a candidate variation extraction step, which comprises extracting candidate variations from the sequencing data of a sample to be tested;

an alignment file extraction step, which comprises locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region;

a realignment step, which comprises realigning the alignment file to generate a realigned alignment file, namely a realignment file;

an identification and correction step, which comprises identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file;

and a variation detection step, which comprises performing variation detection on the corrected alignment file to obtain a detection result.

2. The method of claim 1, wherein in the realignment step, the realignment method is at least one of multiple sequence alignment and consensus-sequence-based alignment.

3. The method of claim 2, wherein the multiple sequence alignment specifically comprises: performing multiple sequence alignment on the alignment file without the assistance of a reference genome, and aligning the multiple sequence alignment result to the reference genome to obtain a new alignment result.

4. The method of claim 2, wherein the consensus-sequence-based alignment specifically comprises: assembling the sequencing reads to obtain an assembly graph, aligning the paths in the assembly graph to a reference genome to obtain a path-reference genome position correspondence, aligning the sequencing reads to all paths in the assembly graph to obtain a sequencing read-path position correspondence, and selecting the sequencing read-path-reference genome position that satisfies a screening rule as a new alignment result.

5. The method of claim 4, wherein the sequencing reads are assembled by at least one of a de Bruijn graph assembly method and an overlap-then-extend assembly method;

the screening rule comprises at least one of the following rules:

1) the overall alignment score is the highest;

2) the overall optimal alignment probability is the greatest.

6. The method of claim 1, wherein in the identification and correction step, the conversion sites are identified by a method comprising: reading the realignment file, determining from the alignment at each genomic position whether a conversion pattern specific to methylation sequencing is present, and identifying the positions of conversion sites in the sequencing data.

7. The method of claim 6, wherein whether a conversion pattern specific to methylation sequencing is present is determined from the alignment at each genomic position, and the positions of conversion sites in the sequencing data are identified by filtering.

8. The method of claim 7, wherein the filtering is performed according to the number of sequencing reads that support the conversion pattern specific to methylation sequencing, to identify the positions of conversion sites in the sequencing data;

and/or the filtering is performed according to how the number of sequencing reads that support the conversion pattern specific to methylation sequencing compares with a threshold, to identify the positions of conversion sites in the sequencing data;

and/or a position is determined to be a conversion site in the sequencing data if the number of sequencing reads that support the conversion pattern specific to methylation sequencing is greater than the threshold.

9. The method of claim 6, wherein the conversion pattern specific to methylation sequencing comprises at least one of the base conversion pattern specific to TET-assisted pyridine borane methylation sequencing and the base conversion pattern specific to bisulfite-conversion-based methylation sequencing.

10. The method of claim 6, wherein the conversion pattern specific to methylation sequencing comprises at least one of the following patterns:

pattern 1):

in sequencing reads aligned to the forward strand of the reference genome, methylated cytosine (C) is converted to thymine (T);

in sequencing reads aligned to the reverse strand of the reference genome, guanine (G) that pairs with methylated cytosine (C) is converted to adenine (A);

pattern 2):

in sequencing reads aligned to the forward strand of the reference genome, unmethylated cytosine (C) is converted to thymine (T);

in sequencing reads aligned to the reverse strand of the reference genome, guanine (G) that pairs with unmethylated cytosine (C) is converted to adenine (A).

11. The method of claim 1, wherein in the identification and correction step, the method of correcting the base information at the conversion sites comprises: correcting the converted bases in the sequencing data back to the bases before conversion.

12. The method of claim 1, wherein the candidate variations comprise at least one of insertion variations and deletion variations;

the candidate variations are at least one of germline variations and somatic variations;

in the candidate variation extraction step, the sequencing data of the sample to be tested is methylation sequencing data;

the sample to be tested comprises at least one of a tissue sample and a body fluid sample;

the tissue sample comprises a tumor tissue sample;

the body fluid sample comprises at least one of a blood sample and a plasma sample;

the sample to be tested is DNA;

in the candidate variation extraction step, the sequencing data of the sample to be tested is at least one of targeted methylation sequencing data and whole-genome methylation sequencing data;

the candidate variation extraction step comprises extracting candidate variations by parsing the alignment information in an alignment file;

the alignment information comprises at least one of a CIGAR string and an MD string;

in the candidate variation extraction step, after the candidate variations are extracted, a candidate variation set is obtained by preliminary filtering on the number of sequencing reads containing the variation and on the frequency of those sequencing reads;

in the preliminary filtering, data that satisfies at least one of the following conditions is retained:

1) the number of sequencing reads containing the variation in the sample to be tested is greater than or equal to 3;

2) the frequency of sequencing reads containing the variation in the sample to be tested is greater than or equal to 0.15;

when candidate variations are extracted by parsing the alignment information in an alignment file, the alignment file is obtained by aligning the sequencing data of the sample to be tested to a reference genome;

the genomic region in which the candidate variations lie is a region that includes the positions of the candidate variation set and some bases upstream and downstream of each variation.

13. A system for performing variation detection based on methylation sequencing data, comprising:

a candidate variation extraction device, used for extracting candidate variations from the sequencing data of a sample to be tested;

an alignment file extraction device, used for locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region;

a realignment device, used for realigning the alignment file to generate a realigned alignment file, namely a realignment file;

an identification and correction device, used for identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file;

and a variation detection device, used for performing variation detection on the corrected alignment file to obtain a detection result.

14. An apparatus for performing variation detection based on methylation sequencing data, comprising:

a memory for storing a program;

a processor for implementing the method of any one of claims 1 to 12 by executing a program stored by the memory.

15. A computer-readable storage medium, characterized in that the medium has stored thereon a program which is executable by a processor to implement the method according to any one of claims 1 to 12.

Technical Field

The invention relates to the field of gene detection, and in particular to a method and a device for performing variation detection based on methylation sequencing data.

Background

The human genome contains many kinds of variations, including but not limited to single nucleotide variations (SNVs) and insertion-deletion variations (InDels). A significant portion of these variations is closely related to the formation and development of tumors. Identifying variations quickly and accurately from genome sequencing data is very helpful for the research and treatment of tumors.

Recently, methylation sequencing technology has been increasingly applied to tumor genomes. Compared with ordinary sequencing technology, methylation sequencing can provide abundant methylation modification information in addition to genomic variation information.

However, methylation sequencing changes the base information of the DNA. The predominant methylation sequencing methods on the market use bisulfite treatment, which converts unmethylated C bases to T. A newer methylation sequencing method, combined treatment with TET enzyme and pyridine borane (TAPS), converts methylated C bases to T. These base changes interfere with sequence alignment and with variation detection, and variation detection software designed for ordinary genome sequencing is not well suited to methylation sequencing data. For single-base variations (SNPs), a variety of methods and software for processing methylation data already exist, such as Bis-SNP, BS-SNPer and gemBS. For insertion-deletion variations (InDels), however, no mature method exists. There are three main problems: first, the interference introduced by methylation sequencing affects sequence alignment, and gap alignment errors, which are closely tied to insertion-deletion variations, occur easily and cause InDels to be misplaced; second, methylation sequencing alters the inserted sequence itself; third, unlike SNPs, insertion-deletion variations cannot be handled by modifying the genotype probability model to accommodate base conversion.

Disclosure of Invention

According to a first aspect, in an embodiment, there is provided a method of performing variation detection based on methylation sequencing data, comprising:

a candidate variation extraction step, which comprises extracting candidate variations from the sequencing data of a sample to be tested;

an alignment file extraction step, which comprises locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region;

a realignment step, which comprises realigning the alignment file to generate a realigned alignment file, namely a realignment file;

an identification and correction step, which comprises identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file;

and a variation detection step, which comprises performing variation detection on the corrected alignment file to obtain a detection result.

According to a second aspect, in an embodiment, there is provided a system for performing variation detection based on methylation sequencing data, comprising:

a candidate variation extraction device, used for extracting candidate variations from the sequencing data of a sample to be tested;

an alignment file extraction device, used for locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region;

a realignment device, used for realigning the alignment file to generate a realigned alignment file, namely a realignment file;

an identification and correction device, used for identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file;

and a variation detection device, used for performing variation detection on the corrected alignment file to obtain a detection result.

According to a third aspect, in an embodiment, there is provided an apparatus for performing variation detection based on methylation sequencing data, comprising:

a memory for storing a program;

a processor for implementing the method as described in the first aspect by executing the program stored by the memory.

According to a fourth aspect, in an embodiment, there is provided a computer readable storage medium having a program stored thereon, the program being executable by a processor to implement the method according to the first aspect.

According to the method and the device for variation detection based on methylation sequencing data, variation detection based on methylation sequencing data is achieved through realignment and through the identification and correction of conversion sites.

Drawings

FIG. 1 is a partial screenshot of the InDel candidate set in Example 1;

FIG. 2 is an alignment view of one InDel site in Example 1;

FIG. 3 is a view of the realignment in Example 1;

FIG. 4 is an alignment view after site correction in Example 1.

Detailed Description

The present invention will be described in further detail with reference to the following detailed description and the accompanying drawings, in which like elements in different embodiments are given like reference numbers. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced with other elements, materials or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail, to avoid obscuring the core of the present application with excessive description; a detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and from general knowledge in the art.

Furthermore, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the steps or actions in the method descriptions may be reordered or adjusted in ways that will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings serve only to describe certain embodiments and are not intended to imply a required order, unless it is otherwise stated that a particular order must be followed.

Ordinal labels applied to components herein, e.g. "first", "second", etc., are used only to distinguish the objects described and do not carry any sequential or technical meaning. The terms "connected" and "coupled", when used herein, include both direct and indirect connections (couplings) unless otherwise indicated.

Definitions

As used herein, "mutation" refers to an alteration in the nucleotide sequence of the genome of an organism or virus, or of extrachromosomal DNA. Mutations include small-scale mutations, which affect one or several nucleotides in a gene, and large-scale mutations; mutations that affect only one nucleotide are referred to as point mutations. Small-scale mutations include: 1) insertion: one or more additional nucleotides are added to the DNA; 2) deletion: one or more nucleotides are removed from the DNA; 3) substitution: a single nucleotide in a gene is replaced with another nucleotide, usually as a result of chemical substances or DNA replication errors. Large-scale mutations involve changes in chromosomal structure, including: 1) amplification (or gene duplication): the copy number of chromosomal regions increases, thereby increasing the dosage of the genes they contain; 2) deletion: large segments of a chromosome are deleted, resulting in loss of the genes in that region. According to whether they are heritable, mutations are classified into germ cell mutations and somatic mutations. Herein, "mutation" and "variation" are used interchangeably.

As used herein, "germline variation" refers to a variation that is already carried during the embryonic development of an animal body, particularly a human (inherited almost exclusively from the parents). A variation present in somatic cells is not passed on to offspring, whereas a variation present in germ cells is. Germline variations are also known as germ cell mutations and originate from germ cells such as sperm and eggs; therefore, in general, all cells in an organism carry the germline variations.

Herein, "somatic mutation", also referred to as acquired mutation, is a mutation acquired during the growth and development of an organism, or after birth under the influence of environmental factors. It refers to a mutation occurring in somatic cells rather than germ cells, and in general only a portion of the cells in an organism carry a given somatic mutation. Herein, "somatic mutation" and "somatic variation" are used interchangeably.

As used herein, "DNA methylation" is a chemical modification of DNA that alters the genetic material without altering the DNA sequence. DNA methylation was discovered as early as 1925. Numerous studies have shown that DNA methylation plays an epigenetic role in gene regulation. The most studied form of DNA methylation is 5-methylcytosine (5mC), a modification that is generally considered to be a stable repressive regulator of gene expression.

Herein, TET-assisted pyridine borane methylation sequencing (TAPS) is a single-base-resolution DNA methylation sequencing method that is less destructive and more efficient. Without using bisulfite, 5mC (5-methylcytosine) and 5hmC (5-hydroxymethylcytosine) are oxidized to 5caC (5-carboxylcytosine) by a TET (ten-eleven translocation) enzyme; 5caC is then reduced to dihydrouracil (DHU) using an organoborane (including but not limited to pyridine borane, 2-methylpyridine borane, etc.); and PCR then converts DHU to thymine (T).

Herein, bisulfite-conversion-based methylation sequencing is a DNA methylation detection method based on second-generation sequencing technology. Bisulfite converts unmethylated cytosine (C) into uracil (U); during PCR, a U-tolerant polymerase reads U as thymine (T), which achieves C-to-T conversion; and during analysis, the sequencing data are aligned to the C-to-T converted and the G-to-A converted reference genomes, respectively, to determine the DNA methylation level of the sample. In normal human DNA, about 3% to 6% of C is methylated, so more than 90% of C appears as T in bisulfite-converted sequencing data.
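The in-silico conversion of the reference mentioned above can be illustrated with a minimal sketch. The function names `c_to_t` and `g_to_a` and the example sequence are assumptions made here for illustration; the text does not name any particular tool or implementation.

```python
def c_to_t(seq: str) -> str:
    """Return the C-to-T converted copy of a reference sequence."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Return the G-to-A converted copy of a reference sequence."""
    return seq.upper().replace("G", "A")


# Hypothetical reference fragment, used only for illustration.
reference = "ACGTCCGGA"
converted_ct = c_to_t(reference)  # "ATGTTTGGA"
converted_ga = g_to_a(reference)  # "ACATCCAAA"
print(converted_ct, converted_ga)
```

Reads are aligned against both converted copies because the orientation of a converted read is not known in advance.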

Herein, the term "reference sequence" refers to a known sequence to which a candidate sequence can be compared, e.g., a sequence from a public or internal database. The reference sequence can be a reference genomic sequence.

The terms "genomic location" or "genomic region" are used interchangeably to refer to a region of a genome (e.g., an animal or plant genome, specifically, for example, the genome of a human, monkey, rat, fish, or insect or plant).

According to a first aspect, in an embodiment, there is provided a method of performing variation detection based on methylation sequencing data, comprising:

a candidate variation extraction step, which comprises extracting candidate variations from the sequencing data of a sample to be tested;

an alignment file extraction step, which comprises locating the genomic region in which the candidate variations lie and extracting the alignment file for that genomic region;

a realignment step, which comprises realigning the alignment file to generate a realigned alignment file, namely a realignment file;

an identification and correction step, which comprises identifying conversion sites in the realignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases, namely a corrected alignment file;

and a variation detection step, which comprises performing variation detection on the corrected alignment file to obtain a detection result.

In one embodiment, in the realignment step, the realignment method includes at least one of multiple sequence alignment, consensus-sequence-based alignment, and the like.

In one embodiment, the multiple sequence alignment specifically comprises: performing multiple sequence alignment on the alignment file without the assistance of a reference genome, and aligning the multiple sequence alignment result to the reference genome to obtain a new alignment result.

In one embodiment, the consensus-sequence-based alignment specifically comprises:

assembling the sequencing reads to obtain an assembly graph, aligning the paths in the assembly graph to a reference genome to obtain a path-reference genome position correspondence, aligning the sequencing reads to all paths in the assembly graph to obtain a sequencing read-path position correspondence, and selecting the sequencing read-path-reference genome position that satisfies a screening rule as a new alignment result.

Both methods readjust the alignment between the sequencing reads and the reference genome, including the alignment position of each read and the position, number and length of gaps. The purpose of this step is to eliminate false-positive insertion-deletion variations caused by misalignment of methylation sequencing data.

In one embodiment, the sequencing reads are assembled by at least one of the methods including but not limited to de Bruijn graph assembly, overlap-then-extend assembly, and the like.

For the de Bruijn graph assembly method and the overlap-then-extend assembly method, reference may be made to existing literature. For example, for the de Bruijn graph assembly method, see "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler" (Ruibang Luo, Binghang Liu, Yinlong Xie, et al., GigaScience, Volume 1, Issue 1, December 2012, 2047-).

For the overlap-then-extend assembly method, see "A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads" (Tobias Rausch, Sergey Koren, et al., Bioinformatics, Volume 25, Issue 9, 1 May 2009, Pages 1118-1124, https://doi.org/10.1093/bioinformatics/btp131, published 5 March 2009).

The above documents are only examples; reference may also be made to the de Bruijn graph assembly method or the overlap-then-extend assembly method described in other prior-art documents. Alternatively, the alignment file may be input into open-source software to obtain the realigned alignment file.
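To make the idea of an assembly graph and its paths concrete, the following is a minimal sketch of a de Bruijn graph built from a few reads. It is not the implementation of the cited assemblers; the k-mer size, the helper names `build_de_bruijn` and `walk_unambiguous`, and the toy reads are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a path from `start` while the current node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


# Toy overlapping reads (hypothetical data).
reads = ["ACGTTG", "GTTGCA", "TGCATC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ACG")
print(contig)  # ACGTTGCATC
```

In this toy case the graph has a single unbranched path; in real data, branches in the graph give several candidate paths, which is why the method aligns every path to the reference and then applies a screening rule.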

In one embodiment, the screening rules include, but are not limited to, at least one of the following:

1) the overall alignment score is the highest;

2) the overall optimal alignment probability is the greatest.

In one embodiment, the maximum overall optimal alignment probability is used as the screening rule; that is, the sequencing read-path-reference genome position with the maximum overall optimal alignment probability is retained as the new alignment result. The alignment score and the optimal alignment probability are usually correlated. For example, suppose a sequencing reads can be aligned to b assembled fragments; pairing them gives a × b combinations, each with a probability Pab. Similarly, the b assembled fragments can be aligned to c genomic positions, giving b × c combinations, each with a probability Pbc. The total probability of each read-fragment-position combination is calculated as Pab × Pbc, and the combination with the highest total probability is selected as the final alignment result.
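The selection by maximum total probability described above can be sketched for a single read as follows. The function name `best_read_placement`, the dictionary layout and the probability values are hypothetical; the text specifies only that the product Pab × Pbc is maximized.

```python
def best_read_placement(p_read_path, p_path_ref):
    """Pick the (path, reference position) pair with the largest Pab * Pbc.

    p_read_path: dict mapping path name -> P(read aligns to path), i.e. Pab
    p_path_ref:  dict mapping path name -> {reference position: P(path aligns there)}, i.e. Pbc
    Returns (path, reference position, total probability), or None if nothing is alignable.
    """
    best = None
    for path, p_ab in p_read_path.items():
        for ref_pos, p_bc in p_path_ref.get(path, {}).items():
            total = p_ab * p_bc
            if best is None or total > best[2]:
                best = (path, ref_pos, total)
    return best


# Hypothetical probabilities for one read, two paths and two reference positions.
p_read_path = {"path1": 0.9, "path2": 0.6}
p_path_ref = {"path1": {1000: 0.5, 2000: 0.4}, "path2": {1000: 0.95}}
print(best_read_placement(p_read_path, p_path_ref))
```

Here the read matches path1 best, but path2 places far more confidently on the reference, so the combined choice is path2 at position 1000 (0.6 × 0.95 versus 0.9 × 0.5).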

In one embodiment, in the identification and correction step, the method for identifying the conversion sites comprises: reading the realignment file, determining from the alignment at each genomic position whether a conversion pattern specific to methylation sequencing is present, and identifying the positions of conversion sites in the sequencing data.

In one embodiment, whether a conversion pattern specific to methylation sequencing is present is determined from the alignment at each genomic position, and the positions of conversion sites in the sequencing data are identified by filtering.

In one embodiment, the positions of conversion sites in the sequencing data can be identified by filtering according to the number of sequencing reads that support the conversion pattern specific to methylation sequencing.

In one embodiment, the positions of conversion sites in the sequencing data are identified by filtering according to how the number of sequencing reads that support the conversion pattern specific to methylation sequencing compares with a threshold.

In one embodiment, a position is determined to be a conversion site in the sequencing data if the number of sequencing reads that support the conversion pattern specific to methylation sequencing is greater than a threshold.

In one embodiment, the threshold includes but is not limited to 3. The value 3 is an empirical value: a higher threshold makes the identification more reliable, a lower threshold makes it more sensitive, and the threshold can be set according to actual needs.

In one embodiment, the conversion pattern specific to methylation sequencing includes, but is not limited to, at least one of the base conversion pattern specific to TET-assisted pyridine borane methylation sequencing and the base conversion pattern specific to bisulfite-conversion-based methylation sequencing.

In one embodiment, the conversion pattern specific to methylation sequencing comprises at least one of the following patterns:

pattern 1):

in sequencing reads aligned to the forward strand of the reference genome, methylated cytosine (C) is converted to thymine (T);

in sequencing reads aligned to the reverse strand of the reference genome, guanine (G) that pairs with methylated cytosine (C) is converted to adenine (A); this pattern is the base conversion pattern specific to TET-assisted pyridine borane methylation sequencing;

pattern 2):

in sequencing reads aligned to the forward strand of the reference genome, unmethylated cytosine (C) is converted to thymine (T);

in sequencing reads aligned to the reverse strand of the reference genome, guanine (G) that pairs with unmethylated cytosine (C) is converted to adenine (A). This pattern is the base conversion pattern specific to bisulfite-conversion-based methylation sequencing.
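Both patterns look the same in the alignment (reference C read as T on forward-strand reads, reference G read as A on reverse-strand reads), so site identification with a read-support threshold can be sketched as below. The observation tuple layout and the names `is_conversion` and `find_conversion_sites` are assumptions for illustration, and the strand handling is simplified relative to a real paired-end library.

```python
from collections import Counter


def is_conversion(ref_base, read_base, is_reverse):
    """True if the read base shows the methylation-sequencing conversion pattern."""
    if is_reverse:
        return ref_base == "G" and read_base == "A"
    return ref_base == "C" and read_base == "T"


def find_conversion_sites(observations, threshold=3):
    """Return sorted positions whose supporting read count is greater than the threshold.

    observations: iterable of (position, ref_base, read_base, is_reverse),
    one entry per aligned read base (a hypothetical, pileup-like layout).
    """
    support = Counter()
    for pos, ref_base, read_base, is_reverse in observations:
        if is_conversion(ref_base, read_base, is_reverse):
            support[pos] += 1
    return sorted(pos for pos, n in support.items() if n > threshold)


# Hypothetical pileup: position 101 has 4 supporting forward reads, position 105 only 2.
observations = (
    [(101, "C", "T", False)] * 4
    + [(101, "C", "C", False)]
    + [(105, "G", "A", True)] * 2
)
print(find_conversion_sites(observations, threshold=3))  # [101]
```

The default threshold of 3 follows the empirical value given in the text; raising it trades sensitivity for reliability.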

In one embodiment, in the identification and correction step, the method for correcting the base information at the conversion sites comprises: correcting the converted bases in the sequencing data back to the bases before conversion. For example, for pattern 1), thymine (T) in sequencing reads aligned to the forward strand of the reference genome is corrected to cytosine (C), the base before conversion, and adenine (A) in sequencing reads aligned to the reverse strand of the reference genome is corrected to guanine (G). Likewise, for pattern 2), thymine (T) in sequencing reads aligned to the forward strand of the reference genome is corrected to cytosine (C), and adenine (A) in sequencing reads aligned to the reverse strand of the reference genome is corrected to guanine (G).
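The correction can be sketched for one read as follows. This is a minimal illustration that assumes a gap-free alignment, 0-based coordinates and a precomputed set of conversion sites; the function name `correct_read` and the example sequences are hypothetical.

```python
def correct_read(read_seq, ref_seq, read_start, is_reverse, conversion_sites):
    """Restore converted bases at identified conversion sites.

    Assumes the read aligns to ref_seq without gaps, starting at the 0-based
    reference coordinate read_start. Forward-strand reads get T -> C,
    reverse-strand reads get A -> G, and only where the reference carries
    the pre-conversion base.
    """
    converted, original = ("A", "G") if is_reverse else ("T", "C")
    bases = list(read_seq)
    for i, base in enumerate(bases):
        pos = read_start + i
        if pos in conversion_sites and base == converted and ref_seq[pos] == original:
            bases[i] = original
    return "".join(bases)


ref = "ACGTCGA"
# Forward-strand read with conversions at reference positions 1 and 4.
print(correct_read("ATGTTGA", ref, 0, False, {1, 4}))  # ACGTCGA
# Reverse-strand read with conversions at reference positions 2 and 5.
print(correct_read("ACATCAA", ref, 0, True, {2, 5}))   # ACGTCGA
```

Restricting the change to identified sites leaves a genuine T at position 3 untouched, so true variants are not overwritten by the correction.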

In one embodiment, the candidate variations include at least one of insertion variations and deletion variations. An insertion variation is the insertion of at least one nucleotide into the DNA, and a deletion variation is the deletion of at least one nucleotide from the DNA.

In one embodiment, the present invention is also applicable to SNP (single nucleotide polymorphism) detection.

In one embodiment, the candidate variation is at least one of germline variation and somatic variation.

In one embodiment, the candidate variation is a germline variation.

In one embodiment, in the candidate variation extraction step, the sequencing data of the sample to be tested is methylation sequencing data. Methylation sequencing data refers to sequencing data obtained by methylation sequencing.

In one embodiment, methylation sequencing includes, but is not limited to, oxidative bisulfite sequencing (oxBS-Seq), TET-assisted pyridine borane sequencing (TAPS), and the like.

In some embodiments, methylation sequencing is performed at a depth of no more than about 40 x. In some embodiments, methylation sequencing is performed at a depth of no greater than about 30 x. In some embodiments, methylation sequencing is performed at a depth of no greater than about 25 x. In some embodiments, methylation sequencing is performed at a depth of no more than about 20 x. In some embodiments, methylation sequencing is performed at a depth of no greater than about 12 x. In some embodiments, methylation sequencing is performed at a depth of no greater than about 10 x. In some embodiments, methylation sequencing is performed at a depth of no more than about 8 x. In some embodiments, methylation sequencing is performed at a depth of no more than about 6 x. In some embodiments, methylation sequencing is performed at a depth of no greater than about 5 x, no greater than about 4 x, no greater than about 3 x, no greater than about 2 x, or no greater than about 1 x.

In one embodiment, the sample to be tested includes, but is not limited to, at least one of a tissue sample and a body fluid sample.

In one embodiment, the tissue sample includes, but is not limited to, a tumor tissue sample.

In one embodiment, the body fluid sample includes, but is not limited to, at least one of a blood sample, a plasma sample.

In one embodiment, the sample to be tested is DNA.

In one embodiment, in the candidate variation extraction step, the sequencing data of the sample to be tested is at least one of targeted methylation sequencing data and whole-genome methylation sequencing data. The targeted methylation sequencing data can be whole-exome methylation sequencing data or methylation sequencing data from captured regions. The sequencing method may be a first-generation sequencing method (the fluorescently labeled Sanger method), a second-generation sequencing method (cyclic array sequencing by synthesis), a third-generation sequencing method (excluding third-generation sequencing methods that do not require chemical treatment), or the like.

In one embodiment, the candidate variation extraction step includes extracting the candidate variation by analyzing the alignment information in the alignment file.

In one embodiment, the alignment information includes, but is not limited to, at least one of a CIGAR string and an MD string.

The CIGAR string records differences between the sequenced sequence and the reference genomic sequence, such as single base changes, base indels, and the like.
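As a minimal illustration of how insertions and deletions can be read off a CIGAR string, the following Python sketch walks the CIGAR operations and reports each indel with a reference coordinate. It assumes only the standard SAM CIGAR semantics; the function name `indels_from_cigar` and the layout of the returned tuples are hypothetical and are not part of the described method.

```python
import re

# SAM CIGAR operations that consume reference bases / query (read) bases.
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def indels_from_cigar(pos, cigar, seq):
    """Return (anchor_pos, kind, length, inserted_bases) for each I/D operation.

    pos        -- 1-based leftmost reference position of the alignment
    cigar      -- CIGAR string, e.g. "5M2I3M1D4M"
    seq        -- read sequence as stored in the alignment record
    anchor_pos -- 1-based reference position of the base just before the event
    """
    events = []
    ref = pos   # next reference position to be consumed
    qry = 0     # next 0-based offset into the read
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "I":
            events.append((ref - 1, "INS", n, seq[qry:qry + n]))
        elif op == "D":
            events.append((ref - 1, "DEL", n, ""))
        if op in CONSUMES_REF:
            ref += n
        if op in CONSUMES_QUERY:
            qry += n
    return events
```

Single-base changes are not visible in a plain `M` operation; they would have to be recovered from the MD string or from `=`/`X` operations, which this sketch does not attempt.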

In one embodiment, the candidate variation extraction step further includes, after the candidate variations are extracted, applying a preliminary filter on the number of sequencing reads supporting each variation and on the frequency of those reads, to obtain a candidate variation set.

In one embodiment, the preliminary filtering retains data that satisfies the following condition:

the number of sequencing reads containing the variation in the sample to be tested (i.e., the number of reads supporting the variation) is greater than or equal to 3;

the frequency of sequencing reads containing the variation in the sample to be tested is greater than or equal to 0.15. The read frequency is the number of reads supporting the variation at the variant site divided by the sequencing depth at that site. These thresholds on read support and read frequency are merely exemplary, and other thresholds may be set as needed.
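The two conditions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program used in the embodiments: the function name `filter_candidates`, the dictionary inputs and the variant keys are hypothetical, and the default thresholds are simply the exemplary values 3 and 0.15 from the text.

```python
MIN_SUPPORT = 3        # exemplary read-support threshold from the text
MIN_FREQUENCY = 0.15   # exemplary read-frequency threshold from the text


def filter_candidates(support, depth,
                      min_support=MIN_SUPPORT, min_frequency=MIN_FREQUENCY):
    """Keep variants whose read support and read frequency pass both thresholds.

    support -- dict mapping a variant key to its number of supporting reads
    depth   -- dict mapping the same key to the sequencing depth at that site
    Returns a dict mapping each retained variant key to its read frequency.
    """
    kept = {}
    for variant, n_support in support.items():
        site_depth = depth.get(variant, 0)
        if site_depth <= 0:
            continue  # frequency is undefined without coverage
        frequency = n_support / site_depth
        if n_support >= min_support and frequency >= min_frequency:
            kept[variant] = frequency
    return kept
```

Both thresholds are exposed as parameters because the text states that other values may be chosen as needed.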

In one embodiment, when the candidate variation is extracted by analyzing the alignment information in the alignment file, the alignment file is obtained by aligning the sequencing data of the sample to be tested to a reference genome. Alignment files include, but are not limited to, BAM files, CRAM files, and SAM files.

In one embodiment, the genomic region in which the candidate variation is located refers to a region that covers the positions of the candidate variation set together with a number of bases upstream and downstream of each variation. An alignment file containing only this range is generated, which speeds up subsequent analysis.

In one embodiment, the genomic region in which the candidate variation is located is a region that covers the positions of the candidate variation set together with 200 bp upstream and downstream of each variation, i.e., a region spanning from 200 bp upstream to 200 bp downstream of the variant site. The number of flanking bases is not limited to 200 bp and may be any other number.
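A minimal Python sketch of how such flanked regions might be built is given below. The helper names `candidate_regions` and `to_region_string` are hypothetical, merging of overlapping windows is an added convenience not stated in the text, and the windows are not clipped at chromosome ends. The `chrom:start-end` string is the region syntax accepted by samtools.

```python
def candidate_regions(variants, flank=200):
    """Expand each (chrom, pos) by `flank` bp on both sides and merge overlaps.

    variants -- iterable of (chromosome, 1-based position) tuples
    Returns a sorted list of (chromosome, start, end) windows, 1-based inclusive.
    """
    windows = sorted((chrom, max(1, pos - flank), pos + flank)
                     for chrom, pos in variants)
    merged = []
    for chrom, start, end in windows:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))  # extend the last window
        else:
            merged.append((chrom, start, end))
    return merged


def to_region_string(region):
    """Format a window as 'chrom:start-end'."""
    chrom, start, end = region
    return f"{chrom}:{start}-{end}"
```

Merging keeps the number of extracted intervals small when candidate variations cluster, which serves the stated goal of speeding up the subsequent analysis.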

According to a second aspect, in an embodiment, there is provided a system for performing variation detection based on methylation sequencing data, comprising:

the candidate variation extraction device is used for extracting candidate variation in the sequencing data of the sample to be detected;

the alignment file extraction device is used for locating the genomic region where the candidate variation is located and extracting the alignment file for that genomic region;

the realignment device is used for realigning the alignment file to generate the alignment file after realignment, namely the realignment file;

and the variation detection device is used for performing variation detection according to the corrected alignment file to obtain a detection result.

According to a third aspect, in an embodiment, there is provided an apparatus for performing variation detection based on methylation sequencing data, comprising:

a memory for storing a program;

a processor for implementing the method as in the first aspect by executing a program stored in a memory.

In one embodiment, a method for germline indel variation detection based on methylation sequencing data is provided, comprising: rapidly extracting candidate insertion-deletion variations in a sample; locating the genomic region where the candidate insertion-deletion variations are located, and extracting the alignment file for that region; realigning the extracted alignment file to generate a realigned alignment file; identifying the conversion sites in the realigned alignment file, correcting the base information at the conversion sites, and regenerating the alignment file with the corrected bases; and performing variation detection on the corrected alignment file using genomic indel detection software. The method can be applied to targeted methylation sequencing data and whole-genome methylation sequencing data.

In one embodiment, a method for performing germline indel mutation detection based on methylation sequencing data includes the following steps:

(1) Candidate insertion-deletion variations in the sample are extracted rapidly. Candidate variations are extracted by analyzing the CIGAR strings in the alignment file (which may be, for example, a BAM file or a CRAM file). The CIGAR string records differences between the sequencing read and the reference genome sequence, such as single-base changes and base insertions or deletions. A candidate insertion-deletion variation set is then obtained by preliminary filtering on the read support number and the read frequency.

(2) The genomic region where the candidate insertion-deletion variations are located is determined, and the alignment file for that region is extracted. To speed up subsequent analysis, the sequencing reads at and near the positions of the candidate variation set, for example within ±200 bp, are extracted, and an alignment file containing only this range is generated.

(3) The extracted alignment file is realigned to generate a realigned alignment file. The realignment may be obtained by one of two methods. The first is multiple sequence alignment of the sequencing reads without the aid of a reference genome; the multiple alignment result is then aligned to the reference genome to obtain a new alignment. The second is alignment based on consensus sequences (consensus alignment), performed as follows. First, the sequencing reads are assembled by an assembly method such as a de Bruijn graph or Overlap-Layout-Consensus. All paths in the assembly graph are then aligned to the reference genome to obtain the correspondence between paths and reference genome positions. The sequencing reads are aligned to all paths in the assembly graph to obtain the correspondence between reads and path positions. The pairwise correspondences among the three (read, path, reference genome) are then considered together, for example by taking the highest combined alignment score or the highest combined alignment probability, and the optimal read-path-reference position is selected as the final new alignment. Both methods readjust the alignment of the sequencing reads to the reference genome, including the alignment position of each read and the position, number and length of gaps. The purpose of this step is to eliminate false-positive indel variations caused by misalignment of methylation sequencing reads.

(4) The conversion sites in the realigned alignment file are identified, the base information at the conversion sites is corrected, and the alignment file with the corrected bases is regenerated. Methylation sequencing produces a characteristic pattern of base conversion at methylated sites. For example, TAPS sequencing changes the methylated base C to T on read 1 aligned to the forward strand of the reference genome (F1) or on read 2 aligned to the reverse strand (R2), and changes the base G (paired with a methylated C) to A on read 2 aligned to the forward strand (F2) or on read 1 aligned to the reverse strand (R1). Bisulfite sequencing, by contrast, converts the unmethylated base C in the same manner. The alignment file is read, and for each genomic position it is determined whether the alignments show this characteristic conversion pattern. A filter is applied, for example requiring that the number of reads supporting the methylation-specific pattern be greater than 3, and the positions of the conversion sites in the sequencing data are thereby identified. The conversion sites then undergo base repair, in which the converted T or A base is changed back to the original C or G base, and an alignment file with the repaired bases is generated. The purpose of this step is to eliminate erroneous insertions and deletions caused by methylation sequencing.

(5) Variation detection is performed on the corrected alignment file using genomic indel detection software. This step may use common variation detection software for ordinary genomic sequencing data, including but not limited to GATK, samtools, FreeBayes, and the like.
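For the consensus-based realignment in step (3), the assembly stage can be pictured with a toy de Bruijn graph. The Python sketch below builds the graph from reads and spells out every acyclic path between two anchor nodes as a candidate haplotype. It covers only graph construction and path enumeration: aligning paths to the reference, aligning reads to paths, and scoring are omitted. The function names, the choice of k, and the anchor nodes are assumptions made for illustration, and the sketch does not reproduce the implementation of any particular realignment tool.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, source, sink, max_len=200):
    """Spell every acyclic path from source to sink as a candidate haplotype."""
    haplotypes = []
    stack = [(source, source, {source})]   # (current node, spelled sequence, visited nodes)
    while stack:
        node, seq, seen = stack.pop()
        if node == sink:
            haplotypes.append(seq)
            continue
        if len(seq) >= max_len:
            continue                        # guard against runaway paths
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:             # skip cycles
                stack.append((nxt, seq + nxt[-1], seen | {nxt}))
    return sorted(haplotypes)


# Toy reads covering two haplotypes that differ by an inserted "CAT".
reads = ["AACGTTGC", "CGTTGCA", "AACGTCATT", "GTCATTGCA"]
graph = build_de_bruijn(reads, k=4)
paths = enumerate_paths(graph, source="AAC", sink="GCA")
```

In this toy input, `paths` contains one sequence per haplotype. In methylation sequencing data, converted bases would add further branches to the graph, which is one reason the conversion sites are repaired in step (4) before variation detection.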

Example 1

The sample tested in this example was the Coriell human genomic DNA standard NA12878 (for sample information, see http://www.f-biology.com/pd.jsp?id=10384).

This example performed high depth whole genome methylation sequencing of samples.

TAPS sequencing was performed on the sample. The library was constructed according to Example 4 of Chinese patent application No. 201911159400.6, "a whole genome methylation non-bisulfite sequencing library and construction"; at the DNA fragmentation stage, the Coriell human genomic DNA standard NA12878 and a positive internal control (methylated pUC19) were mixed and fragmented together. Sequencing was performed on an MGISEQ-T7 sequencer (MGI) in PE100 mode at a sequencing depth of 47×.

The off-machine data were preprocessed and then aligned to the reference genome (Hs37d5) using BWA software to obtain a compressed alignment file in CRAM/BAM format (a BAM file in this example). The CRAM/BAM file is the input file of the method.

The preprocessing of the off-machine data comprises: filtering out low-quality reads (removing reads with too high a proportion of low-quality bases or of N bases) and filtering out reads contaminated with adapter sequences.

The detection method comprises the following steps:

(1) The CRAM/BAM file is read, and a candidate InDel set is rapidly extracted from the CIGAR information using a variation detection program. The InDels are then filtered: any InDel with a read support number below 3 or a read support frequency below 0.15 is removed, as shown in FIG. 1.

The mutation detection program is carried out according to paragraphs 81 and 82 of the specification of "a method and apparatus for detecting somatic cell mutation" of the Chinese patent application No. 202011158198.8.

(2) All InDel positions are extended by 100 bp on each side (specifically, taking the InDel position on the genome sequence as the starting point and extending 100 bp upstream and 100 bp downstream) to obtain candidate processing intervals.

(3) The sequencing reads aligned within the intervals from step (2) are extracted from the BAM file using the view module of the samtools software. FIG. 2 shows the alignment at one of the InDel sites.

The mutation information at this point is shown in the table below, and there are two insertion mutations of different base types.

TABLE 1

#CHROM	POS	ID	REF	ALT
1	3533921	.	A	ACCGGCT
1	3533921	.	A	ACTGGCT

(4) The BAM file obtained in step (3) is realigned using the realignment function of the Bis-SNP software (open-source software; see https://sourceforge.net/projects/BisSNP/files/BisSNP-0.82.2/) to obtain the realigned BAM file. This step is a consensus-sequence-based alignment that uses the OLC (Overlap-Layout-Consensus) assembly method. The OLC assembly method is as follows: first, an overlap graph is constructed (an overlap means that the end of one sequence and the start of another share a run of identical bases); the overlaps are then bundled into contigs; finally, the most probable nucleotide sequence is selected for each contig.

FIG. 3 shows the realignment of the exemplary site in Table 1.

(5) The file from step (4) is read, and the methylated sites are identified and repaired using base repair software. FIG. 4 shows the alignment of the exemplary site in Table 1 after base repair.

(6) Variation detection is performed on the BAM file from step (5) using the germline variation detection program Platypus (open-source software; see https://www.well.ox.ac.uk/research/research-group/tester-group/program-a-hash-based-variant-capacitor-for-next-generation-sequence-data), and the resulting InDel set is taken as the final result. In the results, the false insertion ACTGGCT generated by methylation sequencing has been removed, and only the genuinely present ACCGGCT is retained.

TABLE 2

#CHROM	POS	ID	REF	ALT
1	3533921	.	A	ACCGGCT
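The two alleles at this site, ACCGGCT and ACTGGCT, differ by a single C-to-T change, which is the signature of TAPS conversion. The following Python sketch illustrates the idea of conversion-site identification and base repair on a toy input. It rests on several assumptions that are not in the source: reads are gapless and already in forward-reference orientation, the unconverted sequence ACCGGCT stands in for the reference, and the `Read` class, the function names and the read counts are hypothetical. The strict "greater than 3" support filter follows the exemplary value given in the description.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Read:
    start: int        # 0-based reference offset of the first aligned base
    seq: str          # read bases in forward-reference orientation (gapless)
    is_read1: bool    # True for read 1 of the pair, False for read 2
    is_reverse: bool  # True if aligned to the reverse strand


def conversion_type(read):
    """F1/R2 reads show C->T conversions; F2/R1 reads show G->A conversions."""
    shows_c_to_t = read.is_read1 != read.is_reverse   # F1 or R2
    return ("C", "T") if shows_c_to_t else ("G", "A")


def find_conversion_sites(reads, reference, min_support=3):
    """Return {(position, original_base)} with more than min_support converted reads."""
    support = Counter()
    for read in reads:
        original, converted = conversion_type(read)
        for offset, base in enumerate(read.seq):
            pos = read.start + offset
            if pos < len(reference) and reference[pos] == original and base == converted:
                support[(pos, original)] += 1
    return {site for site, count in support.items() if count > min_support}


def repair_reads(reads, sites):
    """Change converted bases back to the original base at identified sites."""
    repaired = []
    for read in reads:
        original, converted = conversion_type(read)
        bases = list(read.seq)
        for offset, base in enumerate(bases):
            if base == converted and (read.start + offset, original) in sites:
                bases[offset] = original
        repaired.append(Read(read.start, "".join(bases), read.is_read1, read.is_reverse))
    return repaired


# Toy input: four F1 reads carry the converted allele, one F2 read does not.
reference = "ACCGGCT"
reads = [Read(0, "ACTGGCT", True, False) for _ in range(4)]
reads.append(Read(0, "ACCGGCT", False, False))
sites = find_conversion_sites(reads, reference)
fixed = repair_reads(reads, sites)
```

After repair every toy read spells ACCGGCT, so a downstream caller would no longer see ACTGGCT as a separate allele. A production implementation would also need to handle gaps, inserted bases that have no reference coordinate, and genuine C-to-T variants, none of which this sketch addresses.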

To compare the high-depth whole-genome methylation sequencing result of this example with an existing ordinary DNA sequencing result, high-depth ordinary whole-genome DNA sequencing was also performed on the sample.

For high-depth ordinary whole-genome DNA sequencing (sequencing depth 100×), the library was constructed by TA ligation of adapters: after end repair, A-tailing, adapter ligation with index introduction, phosphorylation and PCR, a DNA WGS library recognizable by the sequencer was obtained. The sequencing method was second-generation sequencing, and the variation detection software was Platypus.

For high-depth whole-genome methylation sequencing, the off-machine data were preprocessed and aligned to the reference genome (using BWA software) to obtain a compressed alignment file in CRAM/BAM format (a BAM file in this example). The CRAM/BAM file is the input file of the method of this example. The alignment file was analyzed with the methylation sequencing detection method of this example to obtain an InDel variation set. The performance of the method was evaluated by comparing, within the whole-genome and whole-exome regions, the consistency among three variation sets: the variations obtained by methylation sequencing, the variations obtained by ordinary DNA sequencing, and the high-confidence standard answer set. The high-confidence standard answers come from the Illumina Platinum Genomes data (see https://www.illumina.com/platinumgenomes.html) and the Genome in a Bottle data (see https://www.nist.gov/programs-projects/genome-bottle).

TABLE 3 comparison of test results for standard NA12878 (genome-wide InDel consensus)

"-" indicates no data.

The consistency is calculated as follows, taking methylation whole-genome sequencing as an example: methylation whole-genome sequencing consistency = number of consensus sites / number of sites detected by methylation whole-genome sequencing × 100% = (549480/608462) × 100% = 90.30%.

The answer set consistency is the number of answer-set variations that were detected, expressed as a percentage of the total number of variations in the answer set.
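Both percentages are simple ratios. The Python sketch below computes them from counts and from sets of variation keys; the function names and the set-based interface are hypothetical, and the key format (for example a tuple of chromosome, position, REF and ALT) is an assumption.

```python
def percent(numerator, denominator):
    """Return numerator / denominator as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator * 100.0


def sequencing_consistency(detected, consensus):
    """Share of the variations detected by one method that are consensus sites."""
    return percent(len(detected & consensus), len(detected))


def answer_set_consistency(detected, answers):
    """Share of the answer-set variations that the method detected."""
    return percent(len(detected & answers), len(answers))


# Counts quoted in the text for methylation whole-genome sequencing.
methylation_wgs_consistency = percent(549480, 608462)
```

With the counts quoted above, `methylation_wgs_consistency` evaluates to roughly 90.3%, in line with the reported figure.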

As can be seen from Table 3, measured as the percentage of consensus sites among the variations detected from each set of sequencing data, the ordinary whole-genome sequencing consistency was 84.55% and the methylation whole-genome sequencing consistency reached 90.30%.

The answer set consistency of ordinary whole-genome sequencing was 94.20%, and the answer set consistency of methylation whole-genome sequencing was 92.02%.

The detection results from both sets of sequencing data are highly consistent with the answer set, indicating that both methods have high sensitivity, above 90%. The answer set consistency of methylation whole-genome sequencing is close to that of ordinary whole-genome sequencing. The sequencing consistency over the whole genome is also high, indicating high accuracy, above 84%. These results show that the variation detection method based on methylation whole-genome sequencing data has high accuracy.

TABLE 4 Standard NA12878 test result comparison (Whole exome InDel consistency)

"-" indicates no data.

As can be seen from Table 4, measured as the percentage of consensus sites among the variations detected from each set of sequencing data, the ordinary whole-exome sequencing consistency was 94.02% and the methylation whole-exome sequencing consistency was 96.75%.

The answer set consistency of ordinary whole-exome sequencing was 97.61%, and the answer set consistency of methylation whole-exome sequencing was 95.86%.

Compared with whole-genome sequencing, whole-exome sequencing has higher sensitivity, higher accuracy and stronger variant interpretability, and is therefore a commonly used sequencing approach in medical testing (such as testing for genetic diseases). The whole-exome consistency of the method exceeds 94%, which indicates that the method based on methylation whole-exome sequencing has high accuracy.

Example 2

High-depth capture methylation sequencing was performed on plasma cfDNA samples from 3 tumor patients. The library construction method, the sequencing instrument and the sequencing method were the same as in Example 1.

In Table 5, the subject corresponding to sample 190038718BPD had liver cancer; the sample was transported at room temperature in a Streck blood collection tube.

The subject corresponding to sample 200037881BP1D had liver cancer; the sample was transported at room temperature in a Streck blood collection tube.

The subject corresponding to sample 208002626BP1D had left breast cancer with bone and liver metastases; previous testing: immunohistochemistry of the left breast, HER2 (positive). The sample was transported at room temperature in a Streck blood collection tube.

The off-machine data were preprocessed and aligned to the reference genome (using BWA software) to obtain a compressed alignment file in CRAM/BAM format (a BAM file in this example). The CRAM/BAM file is the input file of the method of this example. The alignment file was analyzed with the same methylation sequencing detection method as in Example 1 to obtain an InDel variation set. The performance of the method was evaluated by comparing, within the captured sequencing region, the consistency between the variations obtained by methylation sequencing and those obtained by ordinary DNA sequencing.

The ordinary DNA sequencing method was the same as in Example 1; the library was constructed with the Hieff NGS Ultima DNA Library Prep Kit for MGI.

TABLE 5

As can be seen from Table 5, although the use of tumor cfDNA samples may affect the detection of germline variation, the methylation sequencing consistency and the whole-genome resequencing (WGS) consistency both exceed 80%, indicating high sensitivity and accuracy. Some samples approach 90%, close to the consistency level of ordinary whole-genome resequencing. These results indicate that variation detection based on methylation sequencing data has high accuracy.

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.

The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.
