High-throughput sequencing mutation detection method, equipment, device and readable storage medium

Document No.: 1818157; publication date: 2021-11-09

Reading note: This technology, "High-throughput sequencing mutation detection method, equipment, device and readable storage medium", was designed by Li Chao (李超) on 2021-08-24. Abstract: The invention relates to the field of biotechnology, in particular to a high-throughput sequencing mutation detection method, equipment, a device and a readable storage medium. The invention provides a high-throughput sequencing mutation detection method comprising: acquiring characteristic information of each candidate mutation site in a gene sample, wherein the characteristic information is derived from high-throughput sequencing data of the gene sample; performing hierarchical clustering analysis on the candidate mutation sites; obtaining a background value pbg for each class of candidate mutation sites; obtaining the lowest detection lower limit of each class of candidate mutation sites; and determining the detection threshold of each candidate mutation site. Using an adaptively optimized algorithm, the method sets the analysis threshold according to the data characteristics of each individual sample, removes as many noise-induced false positives as possible while still detecting the positive sites, and thereby detects mutations in the sample more accurately, giving it good prospects for industrial application.

1. A high throughput sequencing mutation detection method comprising:

s1) obtaining characteristic information of each candidate mutation site in a gene sample, wherein the characteristic information is derived from high-throughput sequencing data of the gene sample;

s2) performing hierarchical clustering analysis on the candidate mutation sites based on the characteristic information of each candidate mutation site and a target value s, wherein the target value s is calculated from the average value a of the distances d between each candidate mutation site and the other candidate mutation sites in the class in which it is located, and the average value b of the distances d between each candidate mutation site and the candidate mutation sites in the nearest class;

s3) obtaining, according to the hierarchical clustering analysis result, a background value pbg for each class of candidate mutation sites based on the mutation abundance of the candidate mutation sites;

s4) obtaining, according to the hierarchical clustering analysis result, the lowest detection lower limit of each class of candidate mutation sites based on the depth of the candidate mutation sites and the background value pbg of the class in which they are located;

s5) determining the detection threshold of each candidate mutation site according to the background value pbg of the class in which the site is located and the lowest detection lower limit of the site.

2. The high-throughput sequencing mutation detection method of claim 1, further comprising: aligning the high-throughput sequencing data of the gene sample to human reference genome data to identify candidate mutation sites in the gene sample.

3. The high-throughput sequencing mutation detection method according to claim 2, wherein the high-throughput sequencing data of the gene sample are aligned to the human reference genome data by the BWA algorithm;

and/or candidate mutation sites in the gene sample are identified by the VarDict algorithm.

4. The high-throughput sequencing mutation detection method according to claim 1, wherein the characteristic information comprises one or more of: depth, variation depth, plus-strand reference base depth, minus-strand reference base depth, plus-strand variant base depth, minus-strand variant base depth, genotype, mutation abundance, strand bias, position on the read, standard deviation of position on the read, average base quality score, standard deviation of base quality score, alignment quality, proportion of high-quality fragments, mutation abundance in high-quality fragments, whether the site is a microsatellite site, microsatellite unit length, total number of mismatches on the fragment, 5'-end sequence, 3'-end sequence, mutation type, and duplication ratio.

5. The method for high throughput sequencing mutation detection according to claim 1 wherein the target value s is calculated as follows:

wherein a is the average value of the distance d between each candidate mutation site and other candidate mutation sites in the classification where the candidate mutation site is located; b is the average of the distance d between each candidate mutation site and the candidate mutation site in the closest one of the classes.

6. The high-throughput sequencing mutation detection method according to claim 1, wherein the distance d between two sites is the difference between the abundances of the two sites, preferably the absolute value of that difference;

and/or the background value pbg of each class of candidate mutation sites is the median of the mutation abundances of the candidate mutation sites in that class;

and/or the lowest detection lower limit of each candidate mutation site is calculated as:

f = ln(1 - p) / (-n)

wherein f is the lowest detection lower limit of the candidate mutation site;

p is the background value pbg of the class in which the candidate mutation site is located;

and n is the depth of the candidate mutation site;

and/or the detection threshold of each candidate mutation site is the greater of the background value pbg of the class in which the site is located and the lowest detection lower limit of the site.

7. The high-throughput sequencing mutation detection method of claim 1, further comprising: obtaining the mutation detection result of each candidate mutation site according to the detection threshold of that site.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method for high throughput sequencing mutation detection according to any one of claims 1 to 7.

9. An apparatus, comprising: a processor and a memory, the memory for storing a computer program, the processor for executing the computer program stored by the memory to cause the apparatus to perform the high throughput sequencing mutation detection method of any one of claims 1-7.

10. An apparatus, the apparatus comprising:

the characteristic information acquisition module is used for acquiring the characteristic information of each candidate mutation site in the gene sample, wherein the characteristic information is derived from the high-throughput sequencing data of the gene sample;

the hierarchical clustering analysis module is used for carrying out hierarchical clustering analysis on each candidate mutation site based on the characteristic information and the target value s of each candidate mutation site, wherein the target value s is obtained by calculating the average value a of the distance d between each candidate mutation site and other candidate mutation sites in the classification where the candidate mutation site is located and the average value b of the distance d between each candidate mutation site and the candidate mutation site in the closest classification;

a background value pbg calculation module for obtaining background values pbg of various candidate mutation sites based on the mutation abundance of the candidate mutation sites according to the hierarchical clustering analysis result;

the lowest detection lower limit calculation module is used for acquiring the lowest detection lower limit of each type of candidate mutation sites based on the depth of the candidate mutation sites and the classified background value pbg of the candidate mutation sites according to the hierarchical clustering analysis result;

the detection threshold calculation module is used for determining the detection threshold of each candidate mutation site according to the background value pbg of the type of each candidate mutation site and the lowest detection lower limit of each candidate mutation site;

preferably, the apparatus further comprises a candidate mutation site identification module for aligning the high-throughput sequencing data of the gene sample to human reference genome data to identify candidate mutation sites in the gene sample;

preferably, the apparatus further comprises a mutation detection result calculation module for obtaining the mutation detection result of each candidate mutation site according to the detection threshold of that site.

Technical Field

The invention relates to the field of biotechnology, in particular to a high-throughput sequencing mutation detection method, equipment, a device and a readable storage medium.

Background

Mutation detection of tumors by high-throughput sequencing is widely applied to basic and clinical research of tumors. However, since a large amount of interfering noise from unnatural sources is introduced in sample preparation, storage, experiments and analysis, a key step in mutation detection is to accurately distinguish between actual mutations and noise signals from different sources.

At present, the experimental and data-analysis approaches to noise removal mainly fall into the following types:

1. Noise generated randomly during sequencing. This type of noise appears randomly at low frequency and can be corrected by combining high-depth sequencing with techniques such as using the duplicate reads generated during sequencing, molecular tag combination (CN106834275A) and virtual molecular tag combination (CN107944225B);

2. Reproducible non-random noise generated during the experiment, such as noise introduced during DNA extraction, fragmentation and capture. Because this noise occurs at high frequency and follows certain statistical regularities, a background library built from a large number of negative samples can be used to establish a background correction model that corrects for it and distinguishes it from real signal (CN105574365B);

3. Different analysis thresholds for specific mutation types. Different variant types have different background noise levels, so setting thresholds by class, for example separate detection thresholds for point mutations and for insertion/deletion mutations, can improve the accuracy of the analysis (CN108690871A).

The technical solutions above address low-frequency random noise, high-frequency inherent noise and the inherent noise of different mutation types, respectively. In practice, however, there is a further noise type: non-random, low-frequency, sample-specific noise. Many factors can cause it, for example: 1. damage to the DNA of the sample itself, which is common in formalin-fixed tumor samples and often produces (C>T | G>A) noise variants; 2. noise caused by inconsistent fragment lengths (fragments that are too long or too short), which cannot be reproduced and eliminated by an ideal background noise model because the experimental conditions differ for each sample; 3. PCR errors arising from differences between samples in the number of PCR amplification cycles and in polymerase fidelity: the number of cycles differs because the starting amount of each sample differs, and the error rate introduced in each amplification depends on the condition of the polymerase in the experiment, so it varies from sample to sample. What these factors have in common is that the noise is sample-specific and cannot be reproduced across samples, yet it is non-random and recurs within the same experiment. It therefore cannot be removed by the technical solutions described above, and it affects the accuracy of the detection result.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention is directed to a high throughput sequencing mutation detection method, which solves the problems of the prior art.

To achieve the above and other related objects, one aspect of the present invention provides a high throughput sequencing mutation detection method, comprising:

s3) obtaining background values pbg of various candidate mutation sites based on the mutation abundance of the candidate mutation sites according to the hierarchical clustering analysis result;

In some embodiments of the invention, the method further comprises: aligning the high-throughput sequencing data of the gene sample to human reference genome data to identify candidate mutation sites in the gene sample.

In some embodiments of the invention, the high-throughput sequencing data of the gene sample are aligned to the human reference genome data by the BWA algorithm;

and/or candidate mutation sites in the gene sample are identified by the VarDict algorithm.

In some embodiments of the invention, the characteristic information comprises one or more of: depth, variation depth, plus-strand reference base depth, minus-strand reference base depth, plus-strand variant base depth, minus-strand variant base depth, genotype, mutation abundance, strand bias, position on the read, standard deviation of position on the read, average base quality score, standard deviation of base quality score, alignment quality, proportion of high-quality fragments, mutation abundance in high-quality fragments, whether the site is a microsatellite site, microsatellite unit length, total number of mismatches on the fragment, 5'-end sequence, 3'-end sequence, mutation type, and duplication ratio.

In some embodiments of the present invention, the target value s is calculated as follows:

wherein a is the average value of the distance d between each candidate mutation site and other candidate mutation sites in the classification where the candidate mutation site is located;

b is the average of the distance d between each candidate mutation site and the candidate mutation site in the closest one of the classes.

In some embodiments of the invention, the distance d between two sites is the difference between the abundances of the two sites, preferably the absolute value of that difference;

and/or the background value pbg of each class of candidate mutation sites is the median of the mutation abundances of the candidate mutation sites in that class;

and/or the lowest detection lower limit of each candidate mutation site is calculated as:

f = ln(1 - p) / (-n)

wherein f is the lowest detection lower limit of the candidate mutation site;

p is the background value pbg of the class in which the candidate mutation site is located;

and n is the depth of the candidate mutation site.

In some embodiments of the invention, the method further comprises: obtaining the mutation detection result of each candidate mutation site according to the detection threshold of that site.

Another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the high throughput sequencing mutation detection method described above.

In another aspect, the invention provides an apparatus comprising: a processor and a memory, the memory for storing a computer program, the processor for executing the computer program stored by the memory to cause the apparatus to perform the high throughput sequencing mutation detection method described above.

In another aspect, the present invention provides an apparatus, comprising:

Drawings

FIG. 1 is a schematic flow chart of the high throughput sequencing mutation detection method provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments, and other advantages and effects of the present invention will be apparent to those skilled in the art from the disclosure of the present specification.

The invention provides a high-throughput sequencing mutation detection method in a first aspect, which comprises the following steps:

s3) obtaining background values pbg of various candidate mutation sites based on the mutation abundance of the candidate mutation sites according to the hierarchical clustering analysis result;

The high-throughput sequencing mutation detection method provided by the invention can comprise: acquiring characteristic information of each candidate mutation site in the gene sample, wherein the characteristic information is derived from high-throughput sequencing data of the gene sample. Generally, those skilled in the art can select an appropriate method to determine the candidate mutation sites of interest from the high-throughput sequencing data of the gene sample, and then obtain the characteristic information of each candidate mutation site. For example, the method may further include aligning the high-throughput sequencing data of the gene sample to human reference genome data to identify candidate mutation sites in the gene sample. The high-throughput sequencing data of the gene sample can be a Fastq file or the like, which can be obtained by converting the off-instrument data of high-throughput sequencing (e.g., with software such as bcl2fastq). The sequencing data can then be aligned to the human reference genome data (e.g., by the BWA algorithm), the alignment result converted into a BAM file (e.g., with software such as samtools), and candidate mutation sites in the gene sample identified from the alignment result (e.g., by the VarDict algorithm).
For another example, the characteristic information can be extracted from an appropriate file (e.g., a BAM file) and specifically includes one or more of depth, variation depth, plus-strand base depth, minus-strand base depth, genotype, mutation abundance, strand bias, position on the read, standard deviation of position on the read, average base quality score, standard deviation of base quality score, alignment quality, proportion of high-quality fragments, mutation abundance in high-quality fragments, whether the site is a microsatellite site, microsatellite unit length, total number of mismatches on the fragment, 5'-end sequence, 3'-end sequence, mutation type, duplication ratio, etc. These can be calculated with reference to: Lai Z, Markovets A, Ahdesmaki M, Chapman B, Hofmann O, McEwen R, Johnson J, Dougherty B, Barrett JC, Dry JR. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016 Jun 20;44(11):e108. doi:10.1093/nar/gkw227. Epub 2016 Apr 7. PMID: 27060149; PMCID: PMC4914105. The names and descriptions of the characteristic information, in Chinese and English, are shown in Table 1.

TABLE 1

In the high-throughput sequencing mutation detection method, the characteristic information obtained for the candidate mutation sites may be organized as a matrix. For example, a feature matrix M = (M_{i,j}) covering all candidate mutation sites may be formed, where M_{i,j} is the value of the j-th item of characteristic information for the i-th candidate mutation site.
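As an illustration of the matrix layout described above, the following minimal sketch builds such a matrix as a list of rows. The feature names and their column order are hypothetical choices made for the example, not taken from the patent, and Python indices start at 0 whereas the text counts from 1.

```python
# Hypothetical subset of the characteristic information; names and order are assumptions.
FEATURES = ["depth", "variant_depth", "mutation_abundance"]

def build_feature_matrix(sites):
    """Return M as a list of rows, where M[i][j] is feature j of candidate site i."""
    return [[float(site[name]) for name in FEATURES] for site in sites]

# Two made-up candidate sites, for illustration only.
sites = [
    {"depth": 1200, "variant_depth": 6, "mutation_abundance": 0.005},
    {"depth": 800, "variant_depth": 40, "mutation_abundance": 0.05},
]
M = build_feature_matrix(sites)
```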

The high-throughput sequencing mutation detection method provided by the invention can further comprise: performing hierarchical clustering analysis on the candidate mutation sites based on the characteristic information of each candidate mutation site and a target value s, wherein the target value s is calculated from the average value a of the distances d between each candidate mutation site and the other candidate mutation sites in the class in which it is located, and the average value b of the distances d between each candidate mutation site and the candidate mutation sites in the nearest class. In hierarchical clustering analysis, the height threshold at which the clustering tree is cut is generally denoted h. The value of h can be optimized adaptively and generally satisfies min(h) ≤ h ≤ max(h), where the value range of h can generally be determined from the set to be optimized (e.g., the feature matrix M described above). The algorithm is standard hierarchical clustering analysis and is available through the hclust function of the R software. The quantity being optimized can be the target value s described above: as h varies within its value range, different hierarchical clustering results correspond to different target values s. Generally, a smaller target value s indicates a smaller intra-cluster difference and thus a better clustering effect. For example, the hierarchical clustering analysis result used is the one for which the target value s is smallest. For another example, the target value s may be calculated as follows:

wherein a is the average value of the distance d between each candidate mutation site and other candidate mutation sites in the classification where the candidate mutation site is located;

b is the average value of the distance d between each candidate mutation site and the candidate mutation site in the nearest classification;

in the above formula, the distance d between two sites is usually the difference of the abundances of two sites, and more specifically, the absolute value of the difference of the abundances of two sites.
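The quantities a and b defined above can be sketched as follows for a single site. This is a minimal illustration under stated assumptions: the "nearest" class is taken to be the other class with the smallest mean distance to the site, a is set to 0 for a site that is alone in its class, and the function name `a_and_b` is hypothetical. The formula that combines a and b into the target value s is not reproduced in this text, so the sketch stops at a and b.

```python
from statistics import mean

def distance(x, y):
    """Distance d between two sites: absolute difference of their mutation abundances."""
    return abs(x - y)

def a_and_b(abundances, labels, i):
    """Return (a, b) for site i.

    a: mean distance from site i to the other sites in its own class.
    b: mean distance from site i to the sites of the nearest other class,
       taken here (an assumption) as the class with the smallest mean distance.
    """
    own = labels[i]
    same = [distance(abundances[i], abundances[j])
            for j in range(len(abundances)) if j != i and labels[j] == own]
    a = mean(same) if same else 0.0  # convention chosen here for a single-site class
    other_means = []
    for lab in set(labels) - {own}:
        ds = [distance(abundances[i], abundances[j])
              for j in range(len(abundances)) if labels[j] == lab]
        other_means.append(mean(ds))
    b = min(other_means) if other_means else None  # None when there is only one class
    return a, b

# Made-up abundances in two classes, for illustration only.
abundances = [0.010, 0.012, 0.050, 0.054]
labels = [0, 0, 1, 1]
a, b = a_and_b(abundances, labels, 0)
```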

The high-throughput sequencing mutation detection method provided by the invention can further comprise the following steps: and obtaining background values pbg of various candidate mutation sites based on the mutation abundance of the candidate mutation sites according to the hierarchical clustering analysis result. After the hierarchical clustering analysis result is obtained, the background value pbg of each type of candidate mutation site can be obtained based on the mutation abundance of the same type of candidate mutation site according to the mutation abundance of each candidate mutation site and the classification result thereof. For example, the background value pbg for each class of candidate mutation sites can be the median of the abundance of mutations for each candidate mutation site in the class.
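A minimal sketch of the median-based background value, assuming the class labels have already been produced by the clustering step; the function name `background_values` and the sample numbers are illustrative only.

```python
from collections import defaultdict
from statistics import median

def background_values(abundances, labels):
    """Map each class label to pbg, the median mutation abundance of the sites in that class."""
    by_class = defaultdict(list)
    for af, lab in zip(abundances, labels):
        by_class[lab].append(af)
    return {lab: median(vals) for lab, vals in by_class.items()}

# Made-up abundances: three sites in class 0, two in class 1.
pbg = background_values([0.010, 0.012, 0.014, 0.050, 0.054], [0, 0, 0, 1, 1])
```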

The high-throughput sequencing mutation detection method provided by the invention can further comprise the following steps: and according to the hierarchical clustering analysis result, acquiring the lowest detection lower limit of each type of candidate mutation sites based on the depth of the candidate mutation sites and the background value pbg of the classification in which the candidate mutation sites are located. After the hierarchical clustering analysis result is obtained, the lowest detection lower limit of each type of candidate mutation site can be obtained based on the depth of each candidate mutation site and the background value pbg of the classification of the candidate mutation site according to the classification result. For example, the lowest detection limit for each candidate mutation site can be calculated by:

f = ln(1 - p) / (-n)

wherein f is the lowest detection lower limit of the candidate mutation sites;

p is the background value pbg of the classification in which the candidate mutation site is located;

and n is the depth of the candidate mutation site.
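The formula above translates directly into a small function. This is a sketch only: the function name and the input checks are assumptions added for the example, and the input values are made up.

```python
import math

def lowest_detection_limit(p, n):
    """f = ln(1 - p) / (-n), with p the class background value pbg and n the site depth."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.log(1 - p) / (-n)

# Illustrative inputs: background value 0.01 at depth 1000.
f = lowest_detection_limit(0.01, 1000)
```

Because ln(1 - p) is negative for 0 < p < 1, dividing by -n yields a positive value, and the value shrinks as the depth n grows.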

The high-throughput sequencing mutation detection method provided by the invention can further comprise: determining the detection threshold of each candidate mutation site according to the background value pbg of the class in which the site is located and the lowest detection lower limit of the site. Generally, the larger of these two values can be used as the detection threshold of the site. The lowest detection lower limit is the theoretical lowest value that can be reached at the depth of the site, whereas the background value reflects the magnitude of the background noise. If the lowest detection lower limit is greater than the background value, the lowest detection lower limit is used as the threshold; otherwise detection can only go down to the background value, which then serves as the threshold.
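The threshold rule can be sketched as the maximum of the two quantities. The function name `detection_threshold` is hypothetical, and the lowest detection lower limit is recomputed inline from the formula f = ln(1 - p) / (-n) given earlier so that the block stands alone.

```python
import math

def detection_threshold(pbg, depth):
    """Return the larger of the class background value and the site's lowest detection lower limit."""
    lower_limit = math.log(1 - pbg) / (-depth)  # f = ln(1 - p) / (-n)
    return max(pbg, lower_limit)

# Illustrative inputs only.
t_deep = detection_threshold(0.01, 1000)  # background value dominates here
t_shallow = detection_threshold(0.5, 1)   # lower limit dominates at depth 1
```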

The high-throughput sequencing mutation detection method provided by the invention can further comprise: obtaining the mutation detection result of each candidate mutation site according to the detection threshold of that site. Generally, the detection threshold of each candidate mutation site is compared with the mutation abundance in the characteristic information of that site, and the mutation detection result is obtained from the comparison. For example, when the mutation abundance of a candidate mutation site is greater than or equal to its detection threshold, the mutation at that site in the gene sample can be considered positive. For another example, when the mutation abundance of a candidate mutation site is less than its detection threshold, the mutation at that site in the gene sample can be considered negative.
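The comparison described above amounts to a one-line rule. In this sketch the function name and the example values are illustrative assumptions.

```python
def call_mutation(abundance, threshold):
    """Positive when the mutation abundance is at or above the detection threshold, else negative."""
    return "positive" if abundance >= threshold else "negative"

# (abundance, threshold) pairs, made up for illustration.
pairs = [(0.02, 0.01), (0.01, 0.01), (0.004, 0.01)]
calls = [call_mutation(af, thr) for af, thr in pairs]
```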

A second aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the high throughput sequencing mutation detection method provided by the first aspect of the present invention.

A third aspect of the invention provides an apparatus comprising: a processor and a memory, the memory being configured to store a computer program, the processor being configured to execute the computer program stored by the memory to cause the apparatus to perform the method for high throughput sequencing mutation detection provided by the first aspect of the present invention.

A fourth aspect of the present invention provides an apparatus, comprising:

the detection threshold calculation module is used for determining the detection threshold of each candidate mutation site according to the background value pbg of the category where each candidate mutation site is located and the lowest detection lower limit of each candidate mutation site;

In the present invention, the operation principle of each module in the above-mentioned apparatus can refer to the high throughput sequencing mutation detection method provided in the first aspect of the present invention, and is not described herein again.

The high-throughput sequencing mutation detection method provided by the application can intelligently set the analysis threshold value aiming at the specific data characteristics of each sample through the self-adaptive optimization algorithm, remove false positives caused by noise as far as possible on the premise of ensuring the detection of positive sites, and can detect mutation from the sample (such as a tumor sample) more accurately, thereby having good industrialization prospect.

The present application is further illustrated by the following examples, which are not intended to limit the scope of the present application.

Example 1

Taking as an example the complete analysis workflow that starts from the raw off-instrument data of a tumor sample, the method specifically comprises the following steps:

1) the raw off-instrument sequencing data are demultiplexed from the BCL files with bcl2fastq and the sample data are converted into fastq files, using the parameters: bcl2fastq --barcode-mismatches 1 -o ./multiplex --ignore-missing-bcls --no-lane-splitting;

2) the fastq files are aligned to the human reference genome with the BWA algorithm and converted into a BAM file with samtools, using the parameters: bwa mem -t 16 -R "@RG\tID:DNA\tLB:DNA\tSM:S2100019497-Plasma\tPL:ILLUMINA" -M human_g1k_v37_decoy.fasta;

3) all candidate mutations of the sample are identified with the VarDict algorithm, using the parameters: VarDict -b bam -p -G REF -c 1 -S 2 -E 3 -g 5;

4) for every candidate mutation in the sample, extracting all the features listed in Table 1 with VarDict and constructing the feature matrix;

5) performing hierarchical clustering on the feature matrix with the hclust function in R, calculating the s value at each h with h varied in steps of 0.01, and determining the optimal classification according to s;

6) defining the median mutation abundance within each class as the background value of that class, and defining an AF (allele frequency) threshold in combination with the mutation depth, the threshold being the greater of the background value pbg of the class and the lowest detection lower limit of each candidate mutation site;

7) screening the candidate mutation list and marking mutations above the threshold as real mutations;

8) a standard sample with known mutation sites (e.g., Horizon HD780) was subjected to standard library construction and sequencing, and the resulting sequencing data were processed as described above; the comparison is shown in Table 1:

TABLE 1

Treatment method	True positive sites	False positive sites
Method of the present invention	8/8	5
Standard analysis procedure (steps 1-3)	8/8	61

Therefore, the screening algorithm provided by the invention can obviously reduce the detection of false positive sites on the premise of detecting true positive sites.
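Step 5 of the example sweeps h in increments of 0.01 and keeps the classification with the best s. The sketch below illustrates that sweep under several loudly stated assumptions: it clusters one-dimensional mutation abundances only, it swaps in single-linkage clustering (for which cutting the tree at height h is the same as splitting the sorted values wherever the gap exceeds h) in place of R's hclust, and the scoring function is supplied by the caller because the formula for s is not reproduced in this text. The names `cut_single_linkage_1d`, `sweep_h` and `toy_score` are hypothetical, and `toy_score` is a placeholder, not the patent's target value.

```python
def cut_single_linkage_1d(values, h):
    """Label 1-D values by splitting the sorted values wherever the gap exceeds h.

    For single linkage this matches cutting the clustering tree at height h.
    Returns one integer label per input value, in input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > h:
            current += 1  # gap larger than h: start a new class
        labels[nxt] = current
    return labels

def sweep_h(values, h_min, h_max, score, step=0.01):
    """Try h = h_min, h_min + step, ... up to h_max; keep the labelling with the smallest score."""
    best = None
    n_steps = int(round((h_max - h_min) / step))
    for k in range(n_steps + 1):
        h = h_min + k * step  # computed from k to avoid accumulating rounding error
        labels = cut_single_linkage_1d(values, h)
        s = score(values, labels)
        if best is None or s < best[0]:
            best = (s, h, labels)
    return best

def toy_score(values, labels):
    """Placeholder score (NOT the patent's s): prefers exactly two classes."""
    return abs(len(set(labels)) - 2)

# Made-up abundances, for illustration only.
best_s, best_h, best_labels = sweep_h([0.0, 0.05, 1.0, 1.05], 0.01, 0.2, toy_score)
```

Selecting the smallest score mirrors the statement in the description that the clustering result used is the one for which the target value s is smallest.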

In conclusion, the present invention effectively overcomes various disadvantages of the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.
