Method and device for detecting copy number variation

Document No.: 1289187 Publication date: 2020-08-28

Reading note: This technology, Method and device for detecting copy number variation (拷贝数变异的检测方法和装置), was created by 曹善柏, 王文平, 张萌萌, 郭璟 and 楼峰 on 2020-05-13. The invention provides a method and a device for detecting copy number variation. The detection method comprises the following steps: obtaining sequencing comparison data of a sample to be detected; calculating the sequencing depth of each base site in the sequencing comparison data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site; and merging bins whose copy numbers differ from the ploidy of the designated contig to obtain the regions where germline copy number variation occurs. The method can detect deletions or duplications of gene exons longer than 1000 bp. Compared with the chip method in the prior art, it offers higher coverage, higher resolution and more accurate copy number evaluation; it can detect copy number variation not only at certain known sites but also at unknown sites, which improves detection sensitivity.

1. A method for detecting copy number variation, the method comprising:

obtaining sequencing comparison data of a sample to be detected;

calculating the sequencing depth of each base site in the sequencing comparison data;

dividing a reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;

and combining bins with copy numbers different from the ploidy of the designated contigs to obtain a region with germline copy number variation.

2. The method of claim 1, wherein obtaining sequencing comparison data for a sample to be tested comprises:

obtaining sequencing original data of a sample to be detected;

performing quality control on the sequencing original data to obtain sequencing comparison data;

preferably, performing quality control on the sequencing raw data to obtain the sequencing comparison data comprises:

preprocessing the sequencing raw data by removing at least one of the following kinds of reads: (1) reads containing an adapter; (2) reads with a quality lower than a threshold value, to obtain preprocessed data;

comparing the preprocessed data with a reference genome sequence to obtain comparison result data;

filtering the comparison result data to remove reads with repeated comparison results to obtain the sequencing comparison data;

more preferably, the comparison result data is filtered, and the filtering process further includes filtering to remove reads outside the target capture area.

3. The method according to claim 1, wherein before calculating the copy number of each bin of the sample to be detected, the method further comprises:

dividing a reference genome into a plurality of bins, and carrying out normalization processing on the sequencing depth of each bin of a sample to be detected;

then calculating the copy number of each bin by using the sequencing depth after normalization;

preferably, the normalization process comprises:

establishing a normalization model by using a principal component analysis method according to the sequencing depth of the sample for constructing the base line in each bin;

normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model;

preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested by using the sequencing depth after normalization.

4. The method of claim 1, wherein combining bins with copy numbers different from the ploidy of a given contig to obtain regions of copy number variation comprises:

screening, according to the copy number of each bin, for bins whose copy number differs from the ploidy of the designated contig, to obtain a differential bin set;

combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the copy number variation.

5. An apparatus for detecting copy number variation, the apparatus comprising:

the acquisition module is used for acquiring sequencing comparison data of a sample to be detected;

the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data;

the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;

and the merging module is used for merging bins with copy numbers different from the ploidy of the designated contigs to obtain a region with germline copy number variation.

6. The detection device according to claim 5, wherein the acquisition module comprises:

the acquisition submodule is used for acquiring sequencing original data of a sample to be detected;

the quality control module is used for performing quality control on the sequencing original data to obtain the sequencing comparison data;

preferably, the quality control module includes:

the removing module is used for preprocessing the sequencing original data by removing at least one of the following kinds of reads: (1) reads containing an adapter; (2) reads with a quality lower than a threshold value, to obtain preprocessed data;

the comparison module is used for comparing the preprocessed data with a reference genome sequence to obtain comparison result data;

the first filtering module is used for filtering the comparison result data, filtering and removing reads with repeated comparison results, and obtaining the sequencing comparison data;

more preferably, the quality control module further includes a second filtering module, configured to filter the comparison result data so as to remove reads outside the target capture area.

7. The detection apparatus according to claim 5, wherein the copy number calculation module comprises:

the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested;

a copy number calculation submodule for calculating the copy number of each bin using the sequencing depth after normalization;

preferably, the normalization module comprises:

the model establishing module is used for establishing a normalization model by utilizing a principal component analysis method according to the sequencing depth of the sample for establishing the base line in each bin;

a normalization submodule, configured to normalize the sequencing depth of each bin in the sample to be tested by using the normalization model;

more preferably, the copy number calculation submodule is a Viterbi module.

8. The detection apparatus according to claim 5, wherein the merging module comprises:

the screening module is used for screening, according to the copy number of each bin, for bins whose copy number differs from the ploidy of the designated contig, to obtain a differential bin set;

and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the copy number variation.

9. A storage medium comprising a stored program, wherein the program, when executed, controls a device in which the storage medium is located to perform the method for detecting copy number variation according to any one of claims 1 to 4.

10. A processor configured to execute a program, wherein the program executes the method for detecting copy number variation according to any one of claims 1 to 4.

Technical Field

The invention relates to the field of biological information analysis, in particular to a method and a device for detecting copy number variation.

Background

CNV refers to copy number polymorphism of DNA segments longer than 1 kb and is a type of genomic structural variation (SV). It includes deletions, insertions, duplications and complex multi-site variants. One of the mechanisms that produce CNV is DNA recombination, including non-allelic homologous recombination (NAHR), non-homologous end joining (NHEJ), and the like. CNV caused by DNA recombination can affect gene expression in several ways: (1) gene dosage; (2) gene disruption; (3) gene fusion; (4) position effects; (5) dominant and recessive alleles, and the like.

The detection of CNV is currently performed by the following methods:

Multiplex ligation-dependent probe amplification (MLPA): two adjacent probes are designed for each target gene to be detected. After the probes hybridize to the target sequence, the two adjacent probes are joined by a ligation reaction, and the amount of ligation product is directly proportional to the copy number of the target gene. The ligation products are amplified by PCR with universal primers, and the gene copy number is then analyzed from the electrophoresis results.

Chip technology: the targets of interest are made into a microarray chip, which is used to systematically scan key regions of the genome. At present, the most widely used chips are comparative genomic hybridization (CGH) chips and SNP chips. This technique can only detect known CNVs.

Disclosure of Invention

The invention mainly aims to provide a method and a device for detecting copy number variation, so as to solve the problem of low detection sensitivity for copy number variation in the prior art.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method for detecting copy number variation, the method comprising: obtaining sequencing comparison data of a sample to be detected; calculating the sequencing depth of each base site in the sequencing comparison data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site; and merging bins whose copy numbers differ from the ploidy of the designated contig to obtain the regions where germline copy number variation occurs.
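The depth steps of the method can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: it assumes aligned reads are already available as half-open `(start, end)` intervals on one contig, and the function names `per_base_depth` and `bin_mean_depth` as well as the fixed bin width are hypothetical choices that the text does not specify.

```python
def per_base_depth(read_intervals, contig_length):
    """Return the sequencing depth at every base of one contig.

    read_intervals: iterable of (start, end) pairs, 0-based and half-open
    (an assumed representation of the aligned reads).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in read_intervals:
        start = max(start, 0)
        end = min(end, contig_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for position in range(contig_length):
        running += diff[position]
        depth.append(running)
    return depth


def bin_mean_depth(depth, bin_size):
    """Split the contig into fixed-width bins and average the per-base depth."""
    means = []
    for offset in range(0, len(depth), bin_size):
        window = depth[offset:offset + bin_size]
        means.append(sum(window) / len(window))
    return means
```

With three reads `[(0, 4), (2, 6), (5, 8)]` on a contig of length 8, the per-base depth is `[1, 1, 2, 2, 1, 2, 1, 1]`, and bins of width 4 have mean depths `[1.5, 1.25]`. Converting bin depth into a copy number needs a baseline and a calling step, which this sketch leaves out.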

Further, obtaining sequencing comparison data of the sample to be tested comprises: obtaining sequencing original data of a sample to be detected; and performing quality control on the sequencing original data to obtain the sequencing comparison data. Preferably, the quality control of the sequencing raw data to obtain the sequencing comparison data comprises: preprocessing the sequencing raw data by removing at least one of the following kinds of reads: (1) reads containing an adapter; (2) reads with a quality lower than a threshold value, to obtain preprocessed data; comparing the preprocessed data with a reference genome sequence to obtain comparison result data; and filtering the comparison result data to remove reads with repeated comparison results, to obtain the sequencing comparison data. More preferably, filtering the comparison result data further comprises removing reads outside the target capture area.
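The quality control steps can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the `Read` record, the use of mean base quality as the quality measure, the reading of "reads with repeated comparison results" as reads sharing contig, start and strand, and the half-open target intervals. The comparison (alignment) step itself is not shown; `filter_alignments` expects reads that already carry a position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical read record; fields are illustrative, not from the text."""
    name: str
    sequence: str
    mean_quality: float
    contig: str = ""
    start: int = -1
    strand: str = "+"


def preprocess(reads, adapter, min_quality):
    """Drop reads that contain the adapter or fall below the quality threshold."""
    return [
        read for read in reads
        if adapter not in read.sequence and read.mean_quality >= min_quality
    ]


def filter_alignments(aligned_reads, targets=None):
    """Drop repeated alignments and, optionally, reads outside the target regions.

    targets: dict mapping contig name to a list of half-open (start, end)
    capture intervals; None disables the target filter.
    """
    seen = set()
    kept = []
    for read in aligned_reads:
        key = (read.contig, read.start, read.strand)
        if key in seen:
            continue  # same alignment as an earlier read: keep only the first
        seen.add(key)
        if targets is not None:
            end = read.start + len(read.sequence)
            regions = targets.get(read.contig, [])
            on_target = any(
                read.start < region_end and end > region_start
                for region_start, region_end in regions
            )
            if not on_target:
                continue
        kept.append(read)
    return kept
```

A typical call chain would be `filter_alignments(preprocess(reads, adapter, min_quality), targets)`, with the alignment step running between the two functions in a real pipeline.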

Further, before the copy number of each bin of the sample to be detected is calculated, the detection method further comprises: dividing the reference genome into a plurality of bins and normalizing the sequencing depth of each bin of the sample to be detected; the copy number of each bin is then calculated from the normalized sequencing depth. Preferably, the normalization comprises: establishing a normalization model by principal component analysis from the sequencing depth, in each bin, of the samples used to construct the baseline; and normalizing the sequencing depth of each bin of the sample to be tested by using the normalization model. Preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested from the normalized sequencing depth.
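One way to picture the principal-component normalization is the sketch below. It is a simplified stand-in, not the model described in the patent: it assumes only the first principal component of the baseline samples is removed, it estimates that component by power iteration instead of a full PCA library, and the function names are hypothetical. The text does not state how many components are used or how depths are scaled beforehand.

```python
import math


def fit_baseline_model(baseline_depths, iterations=100):
    """Fit per-bin means and the first principal component of baseline samples.

    baseline_depths: list of samples, each a list of per-bin depths.
    """
    n_samples = len(baseline_depths)
    n_bins = len(baseline_depths[0])
    means = [
        sum(sample[j] for sample in baseline_depths) / n_samples
        for j in range(n_bins)
    ]
    centred = [
        [sample[j] - means[j] for j in range(n_bins)]
        for sample in baseline_depths
    ]
    # Power iteration on X^T X, starting from a non-uniform vector.
    component = [float(j + 1) for j in range(n_bins)]
    for _ in range(iterations):
        scores = [sum(c * v for c, v in zip(row, component)) for row in centred]
        updated = [
            sum(scores[i] * centred[i][j] for i in range(n_samples))
            for j in range(n_bins)
        ]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0.0:
            component = [0.0] * n_bins  # no shared variation to remove
            break
        component = [u / norm for u in updated]
    return means, component


def normalize_sample(depths, means, component):
    """Return per-bin deviations from the baseline with the component removed."""
    centred = [d - m for d, m in zip(depths, means)]
    score = sum(c * v for c, v in zip(centred, component))
    return [c - score * v for c, v in zip(centred, component)]
```

The residual returned by `normalize_sample` is what a downstream caller would inspect: systematic bin-to-bin bias shared with the baseline shrinks, while a bin with a real loss of depth keeps a large negative value. Because a true deletion also projects slightly onto the removed component, the correction is only approximate when few bins are used.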

Further, combining bins with copy numbers different from the ploidy of the designated contigs to obtain regions with copy number variation comprises: screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.
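The screening and merging steps can be sketched as follows. The bin record layout, the gene and exon labels, and the decision to keep deletions and duplications in separate regions are assumptions made for this illustration; the text only states that differential bins belonging to the same exon of the same gene are merged.

```python
from collections import defaultdict


def merge_differential_bins(bins, contig_ploidy):
    """Merge bins whose copy number differs from the designated contig ploidy.

    bins: iterable of dicts with keys "contig", "start", "end", "gene",
    "exon" and "copy_number" (a hypothetical layout).
    contig_ploidy: dict mapping contig name to its expected ploidy.
    """
    groups = defaultdict(list)
    for bin_record in bins:
        expected = contig_ploidy[bin_record["contig"]]
        observed = bin_record["copy_number"]
        if observed == expected:
            continue  # not a differential bin
        kind = "deletion" if observed < expected else "duplication"
        key = (bin_record["contig"], bin_record["gene"], bin_record["exon"], kind)
        groups[key].append(bin_record)

    regions = []
    for (contig, gene, exon, kind), members in groups.items():
        regions.append({
            "contig": contig,
            "gene": gene,
            "exon": exon,
            "type": kind,
            "start": min(m["start"] for m in members),
            "end": max(m["end"] for m in members),
            "n_bins": len(members),
        })
    return sorted(regions, key=lambda region: (region["contig"], region["start"]))
```

Two adjacent bins with copy number 1 in the same exon on an autosome would come back as one deletion region spanning both bins, while a bin with copy number 2 on a contig of ploidy 1 would be reported as a duplication.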

According to a second aspect of the present application, there is provided a device for detecting copy number variation, the device comprising: the acquisition module is used for acquiring sequencing comparison data of a sample to be detected; the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data; the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base locus; and the merging module is used for merging bins with copy numbers different from the ploidy of the designated contigs to obtain a region with germline copy number variation.

Further, the acquisition module includes: the acquisition submodule is used for acquiring sequencing original data of a sample to be detected; the quality control module is used for performing quality control on the sequencing original data to obtain sequencing comparison data; preferably, the quality control module comprises: the removing module is used for preprocessing the sequencing raw data by removing at least one of the following kinds of reads: (1) reads containing an adapter; (2) reads with a quality lower than a threshold value, to obtain preprocessed data; the comparison module is used for comparing the preprocessed data with the reference genome sequence to obtain comparison result data; the first filtering module is used for filtering the comparison result data to remove reads with repeated comparison results, to obtain the sequencing comparison data; more preferably, the quality control module further includes a second filtering module, configured to filter the comparison result data so as to remove reads outside the target capture area.

Further, the copy number calculation module includes: the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested; a copy number calculation submodule for calculating the copy number of each bin using the normalized sequencing depth; preferably, the normalization module comprises: the model establishing module is used for establishing a normalization model by utilizing a principal component analysis method according to the sequencing depth of the sample for establishing the base line in each bin; the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model; more preferably, the copy number calculation sub-module is a Viterbi module.
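A Viterbi-based copy number calculation of the kind mentioned above might look like the following sketch. The hidden Markov model behind it is entirely assumed: integer copy number states, Gaussian emissions around `copy_number / ploidy` for a depth ratio scaled so that 1.0 is the expected value, a uniform start, and a single "stay" probability. None of these parameters come from the text, which only names the Viterbi algorithm.

```python
import math


def viterbi_copy_numbers(ratios, ploidy=2, states=(0, 1, 2, 3, 4),
                         stay=0.99, sigma=0.15):
    """Decode the most likely copy number per bin from normalized depth ratios."""
    if not ratios:
        return []

    def log_emission(ratio, copy_number):
        # Gaussian log-density up to a constant shared by all states.
        expected = copy_number / ploidy
        return -((ratio - expected) ** 2) / (2.0 * sigma ** 2)

    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / (len(states) - 1))

    scores = [log_emission(ratios[0], cn) for cn in states]
    backpointers = []
    for ratio in ratios[1:]:
        new_scores = []
        pointers = []
        for k, cn in enumerate(states):
            candidates = [
                scores[p] + (log_stay if p == k else log_switch)
                for p in range(len(states))
            ]
            best_prev = max(range(len(states)), key=lambda p: candidates[p])
            pointers.append(best_prev)
            new_scores.append(candidates[best_prev] + log_emission(ratio, cn))
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(len(states)), key=lambda k: scores[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[k] for k in path]
```

The transition penalty is what separates this from calling each bin independently: with the default parameters, three consecutive bins near a ratio of 0.5 are called as copy number 1, whereas a single dip between normal bins is treated as noise and stays at copy number 2.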

Further, the merging module includes: the screening module is used for screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.

According to a third aspect of the present application, there is provided a storage medium including a stored program, wherein the apparatus on which the storage medium is located is controlled to perform any one of the above-described copy number variation detection methods when the program is executed.

According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program executes any one of the above methods for detecting copy number variation.

By applying the technical scheme of the invention, the ploidy (namely the copy number) of each bin is obtained in a bin-based manner and compared with the ploidy of the designated contig; differential bins that were split into several different bins but belong to the same exon of the same gene are then merged, so that deletions or duplications of gene exons longer than 1000 bp are detected. Compared with the chip method in the prior art, the method offers higher coverage, higher resolution and more accurate copy number evaluation; it can detect copy number variation not only at certain known sites but also at unknown sites, which improves detection sensitivity.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart illustrating a method for detecting copy number variation according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart showing details of a method for detecting copy number variation according to example 2 of the present invention;

FIG. 3 is a graph showing verification of the detection result of copy number variation of a known sample according to example 3 of the present invention;

FIG. 4 is a schematic structural diagram of a copy number variation detection apparatus according to a preferred embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

Interpretation of terms:

Somatic CNV: copy number alterations/aberrations (CNAs) resulting from copy number changes in somatic tissue (e.g., tumor tissue only); normal tissue is usually required as a control in the assay.

Germline CNV: copy number alterations/aberrations (CNAs) resulting from copy number changes in germline cells (and, therefore, in all tissue cells).

Reads: sequences generated by high-throughput sequencing platforms are called reads.

Contig: a sequence obtained when assembly software joins reads on the basis of the overlap regions between them is called a contig.

Designated contig: refers to the contigs of the reference genome of the species to be tested. The designated contigs of human are the 24 chromosomes. Ploidy of the designated contig: the ploidy of the autosomes is 2, and the ploidy of the X and Y chromosomes is 1.

Sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome to be detected. For example, if a genome of 2M in size is sequenced at a depth of 10X, the total amount of data obtained is 20M.
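The arithmetic in this definition can be written out as a short sketch; the function names are hypothetical, and sizes are expressed in bases.

```python
def total_bases(genome_size_bp, depth):
    """Total sequenced bases needed to reach a given average depth."""
    return genome_size_bp * depth


def sequencing_depth(total_bases_sequenced, genome_size_bp):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases_sequenced / genome_size_bp


print(total_bases(2_000_000, 10))               # 20000000
print(sequencing_depth(20_000_000, 2_000_000))  # 10.0
```

This reproduces the example above: a 2M target at 10X corresponds to 20M sequenced bases.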

As mentioned in the background art, existing CNV detection methods can only detect known CNVs and cannot detect other, possibly unknown, CNVs. To overcome this defect of low detection sensitivity in the prior art, the present application proposes a new improved scheme.
