CYP21A2 gene NGS data analysis method, device and application

Document No.: 1906619 Publication date: 2021-11-30

Reading notes: This technology, "CYP21A2 gene NGS data analysis method, device and application", was designed by 刘风侠, 孙隽, 周梅珍, 许莹硕, 樊春娜, 王垚燊 and 彭智宇 on 2021-09-09. Abstract: The application discloses a method, a device and an application for analyzing NGS data of the CYP21A2 gene. The method comprises: setting windows and sliding windows over the chip capture regions, correcting depth according to the depth and GC content of each window, and calculating the copy number of each window from the corrected depth with a hidden Markov model; collecting the reads in the CYP21A2 gene region, selecting paired reads, determining the bases at the target site and the auxiliary sites according to their positions within the interval covered by the read pair, and, by comparison with the true gene and pseudogene bases at the target site in the reference sequence, determining whether the read pair supports a mutated or an unmutated true gene; using values corrected by the copy number of the sample to be tested and comparing them with the threshold set for each site, identifying samples below the threshold as mutation candidates; and combining these results, adding the number of mutated sites in each sample as an indication of CNV, to obtain the CNV and point mutation results for the CYP21A2 gene. Using high-throughput sequencing data, the method can accurately and effectively obtain copy number variation and point mutation information for the CYP21A2 gene.

1. A method for analyzing NGS data of the CYP21A2 gene, characterized by comprising the following steps:

a copy number variation analysis step, comprising: obtaining high-throughput sequencing data of a sample to be tested; setting a window length and a sliding window for the chip capture regions; recording the start and end coordinates of each window in each capture region; calculating the average depth and GC content of each window; performing LOESS regression of window depth on GC content for each chromosome of each sample to obtain the GC-corrected depth; on the corrected sample depth, resetting the window length and sliding length of the capture regions according to parameters and taking the window depth; performing batch correction on the GC-corrected window depth; calculating correlation coefficients and performing quality control to remove low-quality samples; estimating the copy number of each window from the corrected window depth with a hidden Markov model to identify abnormal windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold number, joining the consecutive abnormal windows into a copy number variation fragment and calculating its average posterior probability; outputting the fragment when the average posterior probability reaches a set threshold and otherwise filtering it out; and annotating and outputting the retained copy number variation fragments to obtain the copy number variation analysis result;

a point mutation analysis step, comprising: aligning the sequences of the CYP21A2 gene and the CYP21A1P gene on the human reference genome to find all difference sites between the true gene and the pseudogene, and outputting the position of each difference site and the bases at the corresponding positions to obtain a true gene/pseudogene difference site table; realigning all reads of the sample to be tested that align to the true gene or the pseudogene back to the true gene, searching for variants present in most samples, checking and confirming that these are difference sites between the true gene and the pseudogene, and adding them to the difference site table; marking hotspot mutations based on the difference site information of the sample to be tested; searching the alignment file for the IDs of the reads at each hotspot position while recording the alignment positions of their paired reads; looping over the hotspots and, within a certain region around each hotspot in the alignment file, searching for the mates of the marked hotspot reads; analyzing each read pair, locating the position of the point mutation or insertion/deletion, and determining the bases at the target site and at the other, auxiliary, difference sites against the difference site table; determining from the bases at the sites other than the target site whether the read pair belongs to the true gene or the pseudogene; and, for read pairs belonging to the true gene, determining whether the base at the target site is the mutant base or the original base of the true gene, thereby confirming whether the target site of the true gene is mutated or unmutated;

a true gene base ratio cue signal analysis step, comprising: calculating the true gene base ratio at each difference site between the true gene and the pseudogene; for single-base difference sites, directly counting the number of each base at the corresponding positions of the true gene and the pseudogene, then combining the counts to calculate the true gene base ratio and the total depth; for insertions and deletions, where the pseudogene carries an insertion relative to the true gene, counting the reads with the insertion and the reads matching the reference sequence without the insertion at the true gene position, counting the reads without the variant and the reads with the deletion at the pseudogene position, and combining the two sets of counts to calculate the number and ratio of reads carrying the true gene sequence; compiling statistics over multiple samples of the same panel, calculating the mean and standard deviation of the normal samples, and calculating a small-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis step as a correction factor to convert the value to the true gene base ratio at normal copy number, comparing it with the set threshold, and, if it is smaller than the threshold, indicating that a mutation is present;

a detection information integration and statistics step, comprising: integrating and counting the point mutations detected by the different approaches in each sample as supporting evidence for fragment-level copy number variation; and finally obtaining, from the NGS data, the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested.

2. The method of claim 1, wherein: the method further comprises a high-throughput sequencing data filtering step;

the high-throughput sequencing data filtering step comprises filtering the raw data obtained by high-throughput sequencing, wherein the filtering criteria comprise: removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases account for more than 10% of the bases, thereby obtaining high-quality high-throughput sequencing data.

3. The method of claim 1, wherein: the average sequencing depth of the target region of the high-throughput sequencing data is not less than 100×, and the whole-genome sequencing depth is not less than 40×.

4. A method according to any one of claims 1-3, characterized in that: in the copy number variation analysis step, if the number of continuous abnormal windows reaches a set threshold number, connecting the continuous abnormal windows into a copy number variation fragment, wherein the threshold number is 5;

preferably, the point mutation analysis step further comprises counting the reads supporting the mutation and the reads supporting the reference sequence to determine the number and proportion of mutation-supporting reads, and taking as positive mutation sites those sites with at least 2 supporting reads, a supporting proportion of at least 10%, and more than 20 reads covering the site.

5. An apparatus for analyzing NGS data of the CYP21A2 gene, characterized in that: the apparatus comprises a copy number variation analysis module, a point mutation analysis module, a true gene base ratio cue signal analysis module and a detection information integration and statistics module;

the copy number variation analysis module is configured to: obtain high-throughput sequencing data of a sample to be tested; set a window length and a sliding window for the chip capture regions; record the start and end coordinates of each window in each capture region; calculate the average depth and GC content of each window; perform LOESS regression of window depth on GC content for each chromosome of each sample to obtain the GC-corrected depth; on the corrected sample depth, reset the window length and sliding length of the capture regions according to parameters and take the window depth; perform batch correction on the GC-corrected window depth; calculate correlation coefficients and perform quality control to remove low-quality samples; estimate the copy number of each window from the corrected window depth with a hidden Markov model to identify abnormal windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold number, join the consecutive abnormal windows into a copy number variation fragment and calculate its average posterior probability; output the fragment if the average posterior probability is greater than a set threshold and otherwise filter it out; and annotate and output the retained copy number variation fragments to obtain the copy number variation analysis result;

the point mutation analysis module is configured to: align the sequences of the CYP21A2 gene and the CYP21A1P gene on the human reference genome to find all difference sites between the true gene and the pseudogene, and output the position of each difference site and the bases at the corresponding positions to obtain a true gene/pseudogene difference site table; realign all reads of the sample to be tested that align to the true gene or the pseudogene back to the true gene, search for variants present in most samples, check and confirm that these are difference sites between the true gene and the pseudogene, and add them to the difference site table; mark hotspot mutations based on the difference site information of the sample to be tested; search the alignment file for the IDs of the reads at each hotspot position while recording the alignment positions of their paired reads; loop over the hotspots and, within a certain region around each hotspot in the alignment file, search for the mates of the marked hotspot reads; analyze each read pair, locate the position of the point mutation or insertion/deletion, and determine the bases at the target site and at the other, auxiliary, difference sites against the difference site table; determine from the bases at the sites other than the target site whether the read pair belongs to the true gene or the pseudogene; and, for read pairs belonging to the true gene, determine whether the base at the target site is the mutant base or the original base of the true gene, thereby confirming whether the target site of the true gene is mutated or unmutated;

the true gene base ratio cue signal analysis module is configured to: calculate the true gene base ratio at each difference site between the true gene and the pseudogene; for single-base difference sites, directly count the number of each base at the corresponding positions of the true gene and the pseudogene, then combine the counts to calculate the true gene base ratio and the total depth; for insertions and deletions, where the pseudogene carries an insertion relative to the true gene, count the reads with the insertion and the reads matching the reference sequence without the insertion at the true gene position, count the reads without the variant and the reads with the deletion at the pseudogene position, and combine the two sets of counts to calculate the number and ratio of reads carrying the true gene sequence; compile statistics over multiple samples of the same panel, calculate the mean and standard deviation of the normal samples, and calculate a small-probability threshold from the normal distribution; and, for the sample to be tested, use the result of the copy number variation analysis module as a correction factor to convert the value to the true gene base ratio at normal copy number, compare it with the set threshold, and, if it is smaller than the threshold, indicate that a mutation is present;

the detection information integration and statistics module is configured to integrate and count the point mutations detected by the different approaches in each sample as supporting evidence for fragment-level copy number variation, and finally to obtain, from the NGS data, the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested.

6. The apparatus of claim 5, wherein: the apparatus further comprises a high-throughput sequencing data filtering module;

the high-throughput sequencing data filtering module is configured to filter the raw data obtained by high-throughput sequencing, wherein the filtering criteria comprise: removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases account for more than 10% of the bases, thereby obtaining high-quality high-throughput sequencing data.

7. The apparatus of claim 5, wherein: the average sequencing depth of the target region of the high-throughput sequencing data is not less than 100×, and the whole-genome sequencing depth is not less than 40×.

8. The apparatus according to any one of claims 5-7, wherein: in the copy number variation analysis module, when the number of consecutive abnormal windows reaches a set threshold number, the consecutive abnormal windows are joined into a copy number variation fragment, wherein the threshold number is 5;

preferably, the point mutation analysis module is further configured to count the reads supporting the mutation and the reads supporting the reference sequence to determine the number and proportion of mutation-supporting reads, and to take as positive mutation sites those sites with at least 2 supporting reads, a supporting proportion of at least 10%, and more than 20 reads covering the site.

9. An apparatus for analyzing NGS data of CYP21A2 gene, characterized in that: the apparatus includes a memory and a processor;

the memory is configured to store a program;

the processor is configured to implement the CYP21A2 gene NGS data analysis method of any one of claims 1 to 4 by executing the program stored in the memory.

10. Use of the method according to any one of claims 1 to 4 or the apparatus according to any one of claims 5 to 9 for the manufacture of a kit, gene chip or device for detecting 21-hydroxylase deficiency mutations.

Technical Field

The application relates to the technical field of high-throughput sequencing data analysis, in particular to a method, a device and application for analyzing CYP21A2 gene NGS data.

Background

Congenital adrenal hyperplasia (CAH) is an autosomal recessive genetic disease with a global incidence of 1/15000. The related pathogenic genes include CYP21A2 (21-hydroxylase), CYP11B1, HSD3B2 and CYP17A1. CAH caused by the CYP21A2 gene accounts for 90%-95% of CAH patients, with an incidence of 1/10000-1/20000. 21-hydroxylase deficiency is divided into classic and non-classic forms; the classic form comprises the severe salt-wasting type and the simple virilizing type, the former accounting for about 75% of patients and the latter for 25%. In carrier screening programs, expanding the range of metabolic diseases screened and moving prevention forward to the pre-pregnancy/prenatal stage can meet the needs of different populations, facilitates early detection and early prevention of metabolic diseases, and guides scientific pregnancy preparation.

The CYP21A2 gene and its pseudogene CYP21A1P each have a full length of about 3.3 kb and each contain 10 exons. Their sequences are highly homologous, with 98% homology in the exon regions and 96% in the intron regions, and the difference sites are mostly concentrated in intron 2. Pathogenic variants include point mutations, copy number deletions, and loss of gene function resulting from gene fusion. Of these, about 25% are 30 kb deletions resulting from unequal crossover, 75% are point mutations resulting from gene conversion, and fewer than 5% are spontaneous mutations of the true gene.

Because of the homologous sequences and the micro-conversions between the true gene and the pseudogene, conventional CYP21A2 testing relies on supplementary experiments, including first-generation (Sanger) sequencing, MLPA and multiplex PCR. These approaches not only increase cost but also make delivery difficult, which seriously restricts the detection and application of the CYP21A2 gene.

Second-generation sequencing (next-generation sequencing, abbreviated NGS), i.e., high-throughput sequencing, can sequence hundreds of thousands to millions of DNA molecules in parallel in a single run and is an important technique for genetic testing and research. NGS data are the data obtained by high-throughput sequencing. However, existing NGS data analysis cannot accurately and effectively detect mutations in highly homologous genes, which greatly limits the application of NGS to the detection of the CYP21A2 gene.

Therefore, how to use NGS to distinguish mutations of the CYP21A2 gene from its highly homologous pseudogene CYP21A1P more accurately and effectively is a problem to be solved in the field.

Disclosure of Invention

The application aims to provide a novel method, a device and application for analyzing NGS data of CYP21A2 gene.

In order to achieve the purpose, the following technical scheme is adopted in the application:

A first aspect of the present application discloses a method for analyzing NGS data of the CYP21A2 gene, comprising the following steps:

a copy number variation analysis step, comprising: obtaining high-throughput sequencing data (i.e., NGS data) of a sample to be tested; setting a window length and a sliding window for the chip capture regions; recording the start and end coordinates of each window in each capture region; calculating the average depth and GC content of each window; performing LOESS regression of window depth on GC content for each chromosome of each sample to obtain the GC-corrected depth; on the corrected sample depth, resetting the window length and sliding length of the capture regions according to parameters (in one implementation of the present application, the window length and sliding length are reset to 30 and 25, respectively) and taking the window depth; performing batch correction on the GC-corrected window depth; calculating correlation coefficients and performing quality control to remove low-quality samples (in one implementation of the present application, the condition is that the average correlation coefficient between a chromosome and the chromosomes of other samples is less than 0.8, and a sample with fewer than 4 chromosomes meeting the condition is a low-quality sample); and estimating the copy number of each window from the corrected window depth with a hidden Markov model to obtain abnormal windows with abnormal copy number; in one implementation of the present application, a length greater than a set threshold specifically means a fragment length of at least 130 bp;

a point mutation analysis step, comprising: aligning the sequences of the CYP21A2 gene and the CYP21A1P gene on the human reference genome to find all difference sites between the true gene and the pseudogene, and outputting the position of each difference site and the bases at the corresponding positions to obtain a true gene/pseudogene difference site table; realigning all reads of the sample to be tested that align to the true gene or the pseudogene back to the true gene, searching for variants present in most samples, checking and confirming that these are difference sites between the true gene and the pseudogene, and adding them to the difference site table; marking hotspot mutations based on the difference site information of the sample to be tested; searching the alignment file for the IDs of the reads at each hotspot position while recording the alignment positions of their paired reads; looping over the hotspots and, within a certain region around each hotspot in the alignment file (for example, a region extended by 2 kb to the left and right of the hotspot), searching for the mates of the marked hotspot reads; analyzing each read pair, locating the position of the point mutation or insertion/deletion, and determining the bases at the target site and at the other, auxiliary, difference sites against the difference site table; determining from the bases at the sites other than the target site whether the read pair belongs to the true gene or the pseudogene; for read pairs belonging to the true gene, determining whether the base at the target site is the mutant base or the original base of the true gene, thereby confirming whether the target site of the true gene is mutated or unmutated; and, at the same time, counting the reads supporting the mutation and the reads supporting the reference sequence, where the reference sequence refers to the human reference genome, to determine the number and proportion of mutation-supporting reads as the basis for subsequent quality filtering; in one implementation of the present application, the specific conditions of the subsequent quality filtering are that the number of supporting reads is at least 2, the supporting proportion is at least 10%, and more than 20 reads cover the site;

a true gene base ratio cue signal analysis step, comprising: calculating the true gene base ratio at each difference site between the true gene and the pseudogene; for single-base difference sites, directly counting the number of each base at the corresponding positions of the true gene and the pseudogene, i.e., the number of sequences (reads) supporting each base, then combining the counts to calculate the true gene base ratio and the total depth; for insertions and deletions, where the pseudogene carries an insertion relative to the true gene, counting the reads with the insertion and the reads matching the reference sequence (i.e., the human reference genome) without the insertion at the true gene position, and counting the reads without the variant and the reads with the deletion at the pseudogene position, where, because the direction of the difference is reversed at the pseudogene position, the counts there are matched to the equivalent true gene categories, and the two sets of counts are combined in the same way as in the single-base calculation to calculate the number and ratio of reads carrying the true gene sequence; compiling statistics over multiple samples of the same panel (specifically, 545 samples were counted), calculating their mean and standard deviation, and calculating a small-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis step as a correction factor to convert the value to the true gene base ratio at normal copy number, comparing it with the set threshold, and, if it is smaller than the threshold, predicting that a mutation is present; in one implementation of the present application, the set threshold is the mean minus two standard deviations; it can be understood that the threshold differs from site to site, with a general range of 0.21-0.36;

a detection information integration and statistics step, comprising: integrating and counting the point mutations detected by the different approaches in each sample as supporting evidence for fragment-level copy number variation; and finally obtaining, from the NGS data, the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested. As supporting evidence for a fragment-level CNV, for example, if the number of mutated sites in one sample is >= 3, the presence of a CNV is suggested.
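The integration rule in the last sentence above can be sketched as follows. This is only an illustrative sketch: the function name cnv_hint, the method labels and the site labels are hypothetical, and taking the union of the sites reported by the different detection approaches is an assumption about how the counts are combined; only the cutoff of 3 sites comes from the text.

```python
from typing import Dict, Set, Tuple


def cnv_hint(sites_by_method: Dict[str, Set[str]], min_sites: int = 3) -> Tuple[int, bool]:
    """Count distinct mutated sites in one sample and flag a possible CNV.

    sites_by_method maps a detection approach (hypothetical labels) to the
    set of difference sites it reported as mutated in this sample.
    """
    # Assumption: a site reported by several approaches is counted once.
    mutated = set().union(*sites_by_method.values())
    return len(mutated), len(mutated) >= min_sites


# Example with made-up site labels.
count, suggests_cnv = cnv_hint({
    "read_pair": {"site_a", "site_b"},
    "base_ratio": {"site_b", "site_c"},
})
print(count, suggests_cnv)
```

A sample flagged in this way would still rely on the copy number variation analysis step for the actual fragment call; the flag is only an auxiliary hint.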

In the present application, "true gene" refers to the CYP21A2 gene and "pseudogene" refers to the CYP21A1P gene. The method of the present application uses NGS data to detect mutations in a gene with highly homologous sequences, enabling NGS-based detection of the gene mutations associated with the metabolic disease CAH. It makes fuller use of NGS data and expands the range of genetic diseases detectable by NGS. Compared with conventional testing of highly homologous genes, it simplifies the testing workflow, increases throughput, reduces cost and improves product competitiveness. When the analysis method is implemented as a computer program, known pathogenic variants can be called and output automatically, saving the cost of manual variant interpretation; the method is suitable for screening healthy populations or for diagnosing the genetic etiology of CAH patients.
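As an illustration of how the read-pair judgment in the point mutation analysis step might be automated, the following minimal sketch assigns one read pair to the true gene or the pseudogene from its auxiliary difference sites and then reads the target site. The function name, the data layout, the positions and the simple majority vote over auxiliary sites are assumptions made for illustration; the application does not specify the voting rule.

```python
from typing import Dict, Tuple


def classify_pair(observed: Dict[int, str],
                  site_table: Dict[int, Tuple[str, str]],
                  target: int) -> str:
    """Classify one read pair at a target difference site.

    observed   : position -> base seen in the read pair
    site_table : position -> (true gene base, pseudogene base)
    target     : position of the hotspot being evaluated
    """
    true_votes = pseudo_votes = 0
    for pos, base in observed.items():
        if pos == target or pos not in site_table:
            continue  # only auxiliary difference sites vote
        true_base, pseudo_base = site_table[pos]
        if base == true_base:
            true_votes += 1
        elif base == pseudo_base:
            pseudo_votes += 1
    # Assumption: origin is decided by majority; ties stay undetermined.
    if true_votes == pseudo_votes:
        return "undetermined"
    if pseudo_votes > true_votes:
        return "pseudogene"
    true_base, pseudo_base = site_table[target]
    target_base = observed.get(target)
    if target_base == true_base:
        return "true_reference"
    if target_base == pseudo_base:
        return "true_mutant"
    return "undetermined"


# Made-up positions and bases, for illustration only.
table = {100: ("C", "T"), 150: ("A", "G"), 220: ("G", "C")}
print(classify_pair({100: "C", 150: "G", 220: "G"}, table, target=150))
```

Counting the "true_mutant" and "true_reference" pairs per hotspot would then give the supporting-read numbers used for the quality filtering described in the point mutation analysis step.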

The high-throughput sequencing data analysis method of the present application can be used to detect gene mutations of 21-hydroxylase deficiency and thus determine the mutation type of the tested subject, providing intermediate reference data for early detection and early prevention of metabolic diseases and for guiding scientific pregnancy preparation. It can be understood that the analysis method of the present application only detects gene mutations; whether the subject has the disease must still be judged from specific clinical signs. The analysis method of the present application is therefore only a method for analyzing high-throughput sequencing data of the CYP21A2 gene; its direct output is only the mutation status of the CYP21A2 gene, and it is not a method for diagnosing 21-hydroxylase deficiency.

In one implementation of the present application, the method for analyzing NGS data of the CYP21A2 gene further comprises a high-throughput sequencing data filtering step. The filtering step comprises filtering the raw data obtained by high-throughput sequencing, wherein the filtering criteria comprise: removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases account for more than 10% of the bases, thereby obtaining high-quality high-throughput sequencing data. The high-quality data obtained are used in the subsequent copy number variation analysis step, point mutation analysis step and true gene base ratio cue signal analysis step.
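The three filtering criteria above can be expressed as a per-read check. The sketch below is illustrative only: the function name and parameter names are hypothetical, quality values are assumed to be already decoded Phred scores, and the reading of the first criterion as "bases with quality of 10 or less" is an interpretation of the text.

```python
from typing import Sequence


def passes_filter(seq: str, quals: Sequence[int],
                  low_q: int = 10, max_low_fraction: float = 0.5,
                  min_mean_q: float = 20.0, max_n_fraction: float = 0.10) -> bool:
    """Return True if a read survives the three filtering criteria."""
    n = len(seq)
    if n == 0 or len(quals) != n:
        return False  # malformed or empty read
    # Criterion 1: low-quality bases (quality <= 10) exceed 50% of the read.
    if sum(q <= low_q for q in quals) / n > max_low_fraction:
        return False
    # Criterion 2: average base quality below 20.
    if sum(quals) / n < min_mean_q:
        return False
    # Criterion 3: N bases exceed 10% of the read.
    if seq.upper().count("N") / n > max_n_fraction:
        return False
    return True


print(passes_filter("ACGTACGTAC", [30] * 10))            # high-quality read
print(passes_filter("NNGTACGTAC", [30] * 10))            # too many N bases
print(passes_filter("ACGTACGTAC", [5] * 6 + [30] * 4))   # too many low-quality bases
```

For paired-end data, a pipeline would typically discard both mates when either fails; that choice is not stated in the text and is left to the implementation.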

In one implementation of the present application, the average sequencing depth of the target region of the high-throughput sequencing data is not less than 100×, and the whole-genome sequencing depth is not less than 40×.

In an implementation manner of the present application, in the copy number variation analysis step, if the number of consecutive abnormal windows reaches a set threshold number, the consecutive abnormal windows are connected to form a copy number variation fragment, where the threshold number is 5.
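A minimal sketch of joining consecutive abnormal windows into copy number variation fragments is given below. It assumes the per-window copy numbers and posterior probabilities have already been estimated by the model; the function name, the requirement that windows in one fragment share the same copy number, the normal copy number of 2 and the posterior cutoff of 0.9 are illustrative assumptions, while the minimum of 5 consecutive windows comes from the text.

```python
from typing import List, Sequence, Tuple


def call_cnv_segments(copy_numbers: Sequence[int],
                      posteriors: Sequence[float],
                      normal_cn: int = 2,
                      min_windows: int = 5,
                      min_mean_posterior: float = 0.9
                      ) -> List[Tuple[int, int, int, float]]:
    """Return (first window, end window (exclusive), copy number, mean posterior)."""
    segments = []
    i, n = 0, len(copy_numbers)
    while i < n:
        cn = copy_numbers[i]
        if cn == normal_cn:
            i += 1
            continue
        # Extend the run while the same abnormal copy number continues.
        j = i
        while j < n and copy_numbers[j] == cn:
            j += 1
        mean_post = sum(posteriors[i:j]) / (j - i)
        # Keep the fragment only if it is long enough and well supported.
        if j - i >= min_windows and mean_post >= min_mean_posterior:
            segments.append((i, j, cn, mean_post))
        i = j
    return segments


cns = [2, 2, 1, 1, 1, 1, 1, 2, 3, 3, 2]
print(call_cnv_segments(cns, [1.0] * len(cns)))
```

In this made-up example the run of five windows with copy number 1 is reported, while the run of two windows with copy number 3 is too short and is filtered out.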

In one implementation of the present application, the point mutation analysis step further comprises counting the reads supporting the mutation and the reads supporting the reference sequence to determine the number and proportion of mutation-supporting reads, and taking as positive mutation sites those sites with at least 2 supporting reads, a supporting proportion of at least 10%, and more than 20 reads covering the site.
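The quality filter for positive mutation sites can be written as a single predicate. In this sketch the function name is hypothetical, and it is assumed that the site depth is the sum of mutation-supporting and reference-supporting reads and that the proportion is computed against that sum; the three cutoffs are those stated above.

```python
def is_positive_site(mut_reads: int, ref_reads: int,
                     min_mut_reads: int = 2,
                     min_fraction: float = 0.10,
                     min_depth: int = 20) -> bool:
    """Apply the three cutoffs to the read counts at one site."""
    depth = mut_reads + ref_reads  # assumption: depth = mutant + reference reads
    if depth <= min_depth:         # the site must be covered by more than 20 reads
        return False
    return mut_reads >= min_mut_reads and mut_reads / depth >= min_fraction


print(is_positive_site(5, 30))   # 5 of 35 reads support the mutation
print(is_positive_site(2, 30))   # enough reads, but the proportion is too low
print(is_positive_site(5, 10))   # depth too low
```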

The second aspect of the application discloses a device for analyzing the NGS data of the CYP21A2 gene, which comprises a copy number variation analysis module, a point mutation analysis module, a true gene base proportion cue signal analysis module and a detection information integration statistical module;

the copy number variation analysis module is configured to: obtain high-throughput sequencing data of a sample to be tested; set a window length and a sliding window for the chip capture regions; record the start and end coordinates of each window in each capture region; calculate the average depth and GC content of each window; perform LOESS regression of window depth on GC content for each chromosome of each sample to obtain the GC-corrected depth; on the corrected sample depth, reset the window length and sliding length of the capture regions according to parameters and take the window depth; perform batch correction on the GC-corrected window depth; calculate correlation coefficients and perform quality control to remove low-quality samples; estimate the copy number of each window from the corrected window depth with a hidden Markov model to identify abnormal windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold number, join the consecutive abnormal windows into a copy number variation fragment and calculate its average posterior probability; output the fragment if the average posterior probability is greater than a set threshold and otherwise filter it out; and annotate and output the retained copy number variation fragments to obtain the copy number variation analysis result;

the point mutation analysis module is configured to: align the sequences of the CYP21A2 gene and the CYP21A1P gene on the human reference genome to find all difference sites between the true gene and the pseudogene, and output the position of each difference site and the bases at the corresponding positions to obtain a true gene/pseudogene difference site table; realign all reads of the sample to be tested that align to the true gene or the pseudogene back to the true gene, search for variants present in most samples, check and confirm that these are difference sites between the true gene and the pseudogene, and add them to the difference site table; mark hotspot mutations based on the difference site information of the sample to be tested; search the alignment file for the IDs of the reads at each hotspot position while recording the alignment positions of their paired reads; loop over the hotspots and, within a certain region around each hotspot in the alignment file, search for the mates of the marked hotspot reads; analyze each read pair, locate the position of the point mutation or insertion/deletion, and determine the bases at the target site and at the other, auxiliary, difference sites against the difference site table; determine from the bases at the sites other than the target site whether the read pair belongs to the true gene or the pseudogene; and, for read pairs belonging to the true gene, determine whether the base at the target site is the mutant base or the original base of the true gene, thereby confirming whether the target site of the true gene is mutated or unmutated;

a true gene base ratio prompt signal analysis module, used for counting the true-gene base ratio at each true gene/pseudogene differential site; for single-base differential sites, directly counting the number of each base at the true-gene and pseudogene positions and combining the counts to calculate the true-gene base proportion and the total depth; for insertions and deletions, for example where the pseudogene carries an insertion relative to the true gene, counting the reads with and without the insertion at the true-gene position, counting the reads without mutation and with the deletion at the pseudogene position, combining the two sets of counts, and calculating the number and proportion of reads carrying the true-gene sequence; collecting statistics over a plurality of samples of the same panel, calculating the mean and standard deviation of normal samples, and calculating a low-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis module as a correction factor to convert the ratio to the true-gene base proportion at normal copy number and comparing it with the set threshold, a value smaller than the threshold indicating that a mutation is present;

a detection information integration statistical module, used for integrating and counting, for each sample, the number of point mutations detected by the different approaches, as supporting evidence for segment-level copy number variation; and finally obtaining the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested based on the NGS data.

In one implementation of the present application, the device for analyzing CYP21A2 gene NGS data further comprises a high-throughput sequencing data filtering module, which filters the raw data obtained by high-throughput sequencing. The filtering criteria comprise removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases exceed 10%, so as to obtain high-quality high-throughput sequencing data. The high-quality data are then used by the copy number variation analysis module, the point mutation analysis module and the true gene base ratio prompt signal analysis module.

The device for analyzing CYP21A2 gene NGS data of the present application in effect implements each step of the method for analyzing CYP21A2 gene NGS data of the present application through its modules. For the specific limitations of each module, reference may therefore be made to the method, and they are not repeated here.

A third aspect of the present application discloses a device for analyzing CYP21A2 gene NGS data, the device comprising a memory and a processor; the memory is used for storing a program; the processor is used for implementing the method for analyzing CYP21A2 gene NGS data of the present application by executing the program stored in the memory.

It is understood that the method for analyzing CYP21A2 gene NGS data of the present application may be implemented by a program, and that the program may be stored in a memory or a computer-readable storage medium. When a program capable of implementing the method is stored in a computer-readable storage medium, the storage medium can be used or sold independently as a product. Accordingly, the present application further discloses a computer-readable storage medium storing a program that is executable by a processor to implement the method for analyzing CYP21A2 gene NGS data of the present application.

A fourth aspect of the present application discloses the use of the method or the device for analyzing CYP21A2 gene NGS data of the present application in the preparation of a kit, gene chip or device for detecting 21-hydroxylase deficiency mutations.

It can be understood that preparing a kit for detecting 21-hydroxylase deficiency mutations with the method or the device of the present application mainly means that, on the basis of the analysis method, the reagents needed for the relevant experiments are assembled into a kit dedicated to detecting 21-hydroxylase deficiency mutations. Similarly, the use in preparing a gene chip means that the capture chip or sequencing chip involved in high-throughput sequencing is made into a gene chip dedicated to detecting 21-hydroxylase deficiency mutations.

Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:

the method and the device for analyzing the NGS data of the CYP21A2 gene realize mutation analysis and detection of the CYP21A2 gene and the pseudogene CYP21A1P with high homology by using high-throughput sequencing data, can accurately and effectively obtain copy number variation and point mutation information of the CYP21A2 gene, and lay a foundation for popularization and application of high-throughput sequencing in CYP21A2 gene mutation detection.

Drawings

FIG. 1 is a flow chart of the method for analyzing NGS data of the CYP21A2 gene in an example of the present application;

FIG. 2 is a block diagram showing the structure of an apparatus for analyzing NGS data of the CYP21A2 gene in the example of the present application.

Detailed Description

The present application is described in further detail below through specific embodiments with reference to the accompanying drawings. Numerous details are set forth in the following description to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced with other elements, materials or methods, in different situations. In some instances, certain operations related to the present application are not shown or described in detail in this specification, so that the core of the present application is not obscured by excessive description. A detailed description of these operations is unnecessary for those skilled in the art, who can fully understand them from the description in this specification and the general knowledge of the art.

The method for analyzing the NGS data of the CYP21A2 gene comprises a copy number variation analysis step 12, a point mutation analysis step 13, a true gene base ratio prompt signal analysis step 14 and a detection information integration statistical step 15 as shown in figure 1.

The copy number variation analysis step 12 comprises obtaining high-throughput sequencing data of a sample to be tested; setting a window length and a sliding step for the chip capture regions, and recording the start and end coordinates of each window in each capture region; calculating the average depth and GC content of each window; performing LOESS regression of window depth against GC content for each chromosome of each sample to obtain GC-corrected depths; on the corrected sample depths, setting the window length and sliding step of the capture regions again according to the parameters and taking the window depths; performing batch correction on the GC-corrected window depths, calculating correlation coefficients, and performing quality control to remove low-quality samples; estimating the copy number of each window from the corrected window depths with a hidden Markov model and identifying windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold, joining the consecutive abnormal windows into a copy number variation segment and calculating its average posterior probability; outputting the segment if the average posterior probability is greater than a set threshold and filtering it out otherwise; and annotating and outputting the retained copy number variation segments to obtain the copy number variation analysis result.
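The GC correction in this step can be illustrated with a simplified stand-in. The sketch below does not fit a regression curve; instead it rescales each window by the median depth of windows with similar GC content, which captures the same idea of removing GC-dependent depth bias. The function name `gc_correct` and the 5% GC bin width are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median

def gc_correct(depths, gcs, bin_width=0.05):
    """Rescale window depths so that every GC bin has the same median depth.

    depths: average depth per window; gcs: GC fraction (0..1) per window.
    This is a binned-median stand-in for the regression-based correction.
    """
    bins = defaultdict(list)
    for depth, gc in zip(depths, gcs):
        bins[int(gc / bin_width)].append(depth)
    global_median = median(depths)
    bin_median = {b: median(values) for b, values in bins.items()}
    corrected = []
    for depth, gc in zip(depths, gcs):
        m = bin_median[int(gc / bin_width)]
        # Leave the depth untouched if the bin median is zero.
        corrected.append(depth * global_median / m if m > 0 else depth)
    return corrected

# Two GC-poor windows at depth 100 and two GC-rich windows at depth 50
# are brought to a common level after correction.
corrected = gc_correct([100, 100, 50, 50], [0.42, 0.43, 0.61, 0.62])
```

A regression-based correction would replace the per-bin median with a smooth fitted curve, but the rescaling step stays the same.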

In the present application, copy number variation (CNV) analysis sets a window length and a sliding step for the chip capture regions, corrects the depth of each window according to its depth and GC content, sets the window size and sliding step again on the corrected sample depths, corrects for batch effects across samples, and calculates the copy number of each window with a hidden Markov model. A CNV segment is output when the CNV signal is present in a threshold number of consecutive windows. Samples with a CNV in the CYP21A2 gene region are then screened out from the output.
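As a minimal sketch of how a hidden Markov model can assign a copy number to each window, the code below runs Viterbi decoding over depth ratios (corrected window depth divided by the expected diploid depth). The state set, the Gaussian emission model, the standard deviation of 0.15 and the self-transition probability of 0.99 are all illustrative assumptions, not parameters taken from this application.

```python
import math

def viterbi_copy_number(ratios, states=(1, 2, 3), stay=0.99, sd=0.15):
    """Return the most likely copy-number path for a list of depth ratios."""
    if not ratios:
        return []

    def log_emit(x, cn):
        # Gaussian emission centred on cn / 2 (ratio 1.0 means two copies).
        mu = cn / 2.0
        return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    switch = (1.0 - stay) / (len(states) - 1)

    def log_trans(a, b):
        return math.log(stay if a == b else switch)

    score = {s: math.log(1.0 / len(states)) + log_emit(ratios[0], s) for s in states}
    back = []
    for x in ratios[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans(p, s))
            score[s] = prev[best] + log_trans(best, s) + log_emit(x, s)
            ptr[s] = best
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Six consecutive half-depth windows are decoded as a one-copy stretch.
path = viterbi_copy_number([1.0] * 5 + [0.5] * 6 + [1.0] * 5)
```

The high self-transition probability is what makes a single noisy window stay at copy number 2 while a run of low-depth windows switches state.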

In one implementation of the present application, the parameters used for CNV analysis, namely the depth-calculation window size and sliding step, the CNV detection window size and sliding step, and the minimum number of consecutive windows in a CNV segment, were tuned against the detection depth of the workflow. The positive and negative detection rates were calculated under different parameter settings and ROC curves were plotted. The depth-calculation window was finally set to 200 bp with a 20 bp sliding step, the CNV detection window to 30 bp with a 25 bp sliding step, and a CNV is output only when at least 5 consecutive windows carry a CNV signal.
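The depth-calculation windows can be sketched as follows, using the 200 bp window and 20 bp sliding step given above. The helper names, the half-open coordinate convention and the per-base depth list are assumptions chosen for illustration.

```python
def sliding_windows(start, end, size=200, step=20):
    """Yield half-open (win_start, win_end) windows that fit inside [start, end)."""
    pos = start
    while pos + size <= end:
        yield pos, pos + size
        pos += step

def gc_content(seq):
    """GC fraction of a sequence, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def mean_depth(per_base_depth, region_start, win_start, win_end):
    """Average depth of one window; per_base_depth[0] is the depth at region_start."""
    chunk = per_base_depth[win_start - region_start:win_end - region_start]
    return sum(chunk) / len(chunk)

# A hypothetical 300 bp capture region starting at coordinate 1000.
windows = list(sliding_windows(1000, 1300))
depth = [10] * 100 + [20] * 200
first_window_depth = mean_depth(depth, 1000, *windows[0])
```

Each window's start and end coordinates, mean depth and GC content would then be recorded for the GC correction that follows.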

The point mutation analysis step 13 comprises finding all true gene/pseudogene differential sites by aligning the sequences of the CYP21A2 gene and the CYP21A1P pseudogene on the human reference genome, and outputting the positions of these sites and the bases at the corresponding positions to obtain a true gene/pseudogene differential site table; aligning the reads of the samples to the true gene and pseudogene sequences, realigning all reads from both regions back to the true gene, searching for variants present in most samples, checking and confirming which of them are true gene/pseudogene differential sites, and adding these to the differential site table; marking hotspot mutations on the basis of the differential site information; for each hotspot in turn, searching the alignment file for the IDs of the reads covering the hotspot position while recording the alignment positions of their mates, and retrieving the mates of the marked reads from the alignment file within a set region around the hotspot; analyzing each read pair and locating the position of the point mutation or insertion/deletion; determining the bases at the target site and at the other, auxiliary differential sites against the differential site table; assigning the read pair to the true gene or the pseudogene according to whether the bases at the sites other than the target site belong to the true gene or the pseudogene; and, for read pairs assigned to the true gene, determining whether the target site carries the mutant base or the original true-gene base, thereby confirming whether the true-gene target site is mutated.

In the present application, the point mutation analysis step analyzes point mutations with an auxiliary site method. The reads in the CYP21A2 gene region are collected and those covering the target site are screened out. Their mates are then selected according to the read ID and the recorded mate alignment position. From the start and end positions of each read pair, the auxiliary sites falling within the interval are determined, and the bases at the target site and the auxiliary sites are called. By comparison with the true-gene and pseudogene bases of the reference sequence at these sites, the read pair is judged to support either a mutated or an unmutated true gene, and the number of read pairs meeting each condition is counted for each site in turn.
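The auxiliary site logic for a single read pair can be sketched as below. The data layout (a position-to-base dictionary per read pair), the coordinates, the label strings, and the rule that a pair with conflicting auxiliary sites is set aside as ambiguous are all assumptions made for illustration.

```python
def classify_pair(pair_bases, diff_table, target_pos, reference_base, mutant_base):
    """Classify one read pair at a target site using auxiliary differential sites.

    pair_bases: {position: base} observed across both mates of the pair.
    diff_table: {position: (true_gene_base, pseudogene_base)}.
    """
    true_votes = pseudo_votes = 0
    for pos, (true_base, pseudo_base) in diff_table.items():
        if pos == target_pos or pos not in pair_bases:
            continue  # only auxiliary sites covered by the pair are informative
        if pair_bases[pos] == true_base:
            true_votes += 1
        elif pair_bases[pos] == pseudo_base:
            pseudo_votes += 1
    if target_pos not in pair_bases or true_votes + pseudo_votes == 0:
        return "uninformative"
    if true_votes and pseudo_votes:
        return "ambiguous"
    if pseudo_votes:
        return "pseudogene"
    # The pair comes from the true gene: inspect the target base itself.
    if pair_bases[target_pos] == mutant_base:
        return "true_mutant"
    if pair_bases[target_pos] == reference_base:
        return "true_reference"
    return "other"

# Hypothetical differential sites around a target at position 150.
diff_table = {100: ("A", "G"), 150: ("C", "T"), 220: ("G", "A")}
label = classify_pair({100: "A", 150: "T", 220: "G"}, diff_table, 150, "C", "T")
```

Counting the `true_mutant` and `true_reference` labels over all read pairs at a site gives the supporting read numbers used for the final call.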

In one implementation of the present application, the point mutation analysis step further comprises counting the reads supporting the mutation and the reads supporting the reference sequence, so as to determine the number and proportion of reads supporting the mutation. A mutation site is taken as a positive mutation site when the number of supporting reads is greater than or equal to 2, their proportion is greater than or equal to 10%, and the number of reads supporting the site is greater than 20; mutation sites that do not meet these conditions are treated as false positives. These minimum read-number and proportion quality-control thresholds of the auxiliary site method were chosen from the positive detection rate and false positive rate of historical samples under different thresholds.
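The quality-control rule above can be written as a small predicate. This sketch assumes that "the number of reads supporting the site" means the total of mutation-supporting and reference-supporting reads at that site; the function name is hypothetical.

```python
def is_positive_site(mutant_reads, reference_reads,
                     min_reads=2, min_ratio=0.10, min_depth=20):
    """Apply the read-number, proportion and depth thresholds to one site."""
    total = mutant_reads + reference_reads
    if total <= min_depth:  # the site needs more than min_depth classified reads
        return False
    return mutant_reads >= min_reads and mutant_reads / total >= min_ratio

# 5 mutant reads out of 35 pass; 1 mutant read out of 41 does not.
calls = [is_positive_site(5, 30), is_positive_site(1, 40)]
```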

The true gene base ratio prompt signal analysis step 14 comprises counting the true-gene base ratio at each true gene/pseudogene differential site; for single-base differential sites, directly counting the number of each base at the true-gene and pseudogene positions and combining the counts to calculate the true-gene base proportion and the total depth; for insertions and deletions, for example where the pseudogene carries an insertion relative to the true gene, counting the reads with and without the insertion at the true-gene position, counting the reads without mutation and with the deletion at the pseudogene position, combining the two sets of counts, and calculating the number and proportion of reads carrying the true-gene sequence; collecting statistics over a plurality of samples of the same panel, calculating the mean and standard deviation of normal samples, and calculating a low-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis step as a correction factor to convert the ratio to the true-gene base proportion at normal copy number and comparing it with the set threshold, a value smaller than the threshold indicating that a mutation is present.

In the true gene base ratio prompt signal analysis step, the true-gene base ratio at each differential site is calculated for each actual sample, the value is corrected with the copy number detected for that sample, and the corrected value is compared with the threshold set for each site; samples whose value is below the threshold are taken as mutation candidate samples.
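The text does not spell out the correction formula, so the following is only one plausible reading: base counts are rescaled per gene copy to what they would be at two true-gene and two pseudogene copies, and the ratio is recomputed. The function names and the rescaling rule are assumptions.

```python
def true_gene_ratio(true_base_count, pseudo_base_count):
    """Return (true-gene base proportion, total depth) at one differential site."""
    total = true_base_count + pseudo_base_count
    return (true_base_count / total if total else 0.0), total

def ratio_at_normal_copy(true_base_count, pseudo_base_count,
                         true_cn, pseudo_cn, normal_cn=2):
    """Rescale counts to normal copy number before taking the ratio (assumed rule)."""
    if true_cn <= 0 or pseudo_cn <= 0:
        raise ValueError("copy numbers must be positive to rescale counts")
    t = true_base_count * normal_cn / true_cn
    p = pseudo_base_count * normal_cn / pseudo_cn
    return t / (t + p) if (t + p) else 0.0

# A sample with one true-gene copy deleted shows a raw ratio of 1/3,
# which rescales to 0.5 and therefore does not look like a point mutation.
raw_ratio, depth = true_gene_ratio(25, 50)
corrected_ratio = ratio_at_normal_copy(25, 50, true_cn=1, pseudo_cn=2)
```

Without this correction, a deletion of one true-gene copy would lower the ratio at every differential site and be mistaken for many point mutations.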

In one implementation of the present application, the true-gene base ratio threshold is determined from a large number of historical samples. Most of these samples are presumed to be normal, so once extreme values are removed, the mean and standard deviation computed from them are close to those of normal samples. Because the ratio differs from site to site, each site has its own threshold. Assuming that the ratios of normal samples in a batch are approximately normally distributed, the one-sided low-probability boundary is taken as the mean minus 2 standard deviations (a tail probability of roughly 2%), which gives site thresholds ranging from 0.21 to 0.36. In addition, for c.1360C>T the true gene and the pseudogene both carry C and the mutant base is T, so the calculated C ratio is close to 1 and the threshold is 0.99. For predicting CNV from the total number of mutations in a single sample, the threshold is set to 3, because genetically the co-occurrence of mutations at 2 sites is relatively likely, whereas co-occurrence at 3 sites is much less likely.
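A per-site threshold of this kind can be sketched as follows. The sketch takes the spread to be the standard deviation, and the way extreme values are removed (dropping a fixed 5% from each tail) is an assumption, since the text does not say how they are removed.

```python
from statistics import mean, stdev

def site_threshold(ratios, trim=0.05, k=2.0):
    """Lower one-sided threshold for one site: trimmed mean minus k spreads."""
    values = sorted(ratios)
    n_trim = int(len(values) * trim)
    core = values[n_trim:len(values) - n_trim] if n_trim else values
    return mean(core) - k * stdev(core)

# Hypothetical historical ratios for one site, with two extreme values.
historical = [0.45, 0.50, 0.55] * 10 + [0.05, 0.95]
threshold = site_threshold(historical)
```

A sample whose copy-number-corrected ratio at this site falls below `threshold` would be flagged as a mutation candidate.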

The detection information integration statistical step 15 comprises integrating and counting, for each sample, the number of point mutations detected by the different approaches, as supporting evidence for segment-level copy number variation; and finally obtaining the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested based on the NGS data.

In one implementation of the present application, specifically, the numbers of mutation sites found by the two methods are added together for each sample, and the total is reported as an indication of a possible CNV.
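As a minimal sketch of this integration, the function below adds the site counts from the two approaches and compares the total with the single-sample threshold of 3 described earlier. Treating "reaches 3" as greater than or equal to 3, and the function and argument names, are assumptions.

```python
def cnv_hint(auxiliary_sites, ratio_sites, min_sites=3):
    """Add the mutation-site counts of both methods and flag a possible CNV."""
    total = len(auxiliary_sites) + len(ratio_sites)
    return total, total >= min_sites

# Hypothetical site labels for one sample.
total, possible_cnv = cnv_hint(["site_a", "site_b"], ["site_c"])
```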

In one implementation of the present application, the method for analyzing CYP21A2 gene NGS data further comprises, as shown in FIG. 1, a high-throughput sequencing data filtering step 11. The high-throughput sequencing data filtering step 11 comprises filtering the raw data obtained by high-throughput sequencing. The filtering criteria comprise removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases exceed 10%, so as to obtain high-quality high-throughput sequencing data. The high-quality data are then used in the copy number variation analysis step, the point mutation analysis step and the true gene base ratio prompt signal analysis step.
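The three filtering criteria can be expressed as a single predicate on a read and its per-base quality values. This sketch assumes the quality values are Phred scores already decoded to integers; the function name is hypothetical.

```python
def passes_filter(seq, quals, low_q=10, max_low_frac=0.5,
                  min_mean_q=20, max_n_frac=0.10):
    """Return True if a read survives the three filtering criteria."""
    n = len(seq)
    if n == 0 or len(quals) != n:
        return False
    low_frac = sum(q <= low_q for q in quals) / n   # share of low-quality bases
    mean_q = sum(quals) / n                         # average base quality
    n_frac = seq.upper().count("N") / n             # share of N bases
    return low_frac <= max_low_frac and mean_q >= min_mean_q and n_frac <= max_n_frac

# A clean 20 bp read passes; a read with 15% N bases is removed.
keep_clean = passes_filter("ACGT" * 5, [30] * 20)
keep_n_rich = passes_filter("NNN" + "A" * 17, [30] * 20)
```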

In the present application, the high-throughput sequencing data are NGS data from conventional library construction, with an insert length of about 250 bp; the average depth of the panel or WES target region is not less than 100×, and the depth of WGS is not less than 40×. The processed sequencing data are aligned to the human reference genome (GRCh37).

Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated on a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above methods may be implemented.

Therefore, based on the method for analyzing CYP21A2 gene NGS data of the present application, the present application provides a device for analyzing CYP21A2 gene NGS data which, as shown in FIG. 2, comprises a copy number variation analysis module 22, a point mutation analysis module 23, a true gene base ratio prompt signal analysis module 24 and a detection information integration statistical module 25.

The copy number variation analysis module 22 is used for obtaining high-throughput sequencing data of a sample to be tested; setting a window length and a sliding step for the chip capture regions, and recording the start and end coordinates of each window in each capture region; calculating the average depth and GC content of each window; performing LOESS regression of window depth against GC content for each chromosome of each sample to obtain GC-corrected depths; on the corrected sample depths, setting the window length and sliding step of the capture regions again according to the parameters and taking the window depths; performing batch correction on the GC-corrected window depths, calculating correlation coefficients, and performing quality control to remove low-quality samples; estimating the copy number of each window from the corrected window depths with a hidden Markov model and identifying windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold, joining the consecutive abnormal windows into a copy number variation segment and calculating its average posterior probability; outputting the segment if the average posterior probability is greater than a set threshold and filtering it out otherwise; and annotating and outputting the retained copy number variation segments to obtain the copy number variation analysis result.
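The final part of this module, joining consecutive abnormal windows and filtering by average posterior probability, can be sketched as below. The tuple layout, the diploid normal state of 2, and the posterior cutoff of 0.9 are assumptions; the minimum of 5 consecutive windows follows the parameter given in the method description.

```python
def call_cnv_segments(windows, min_windows=5, min_posterior=0.9, normal_cn=2):
    """Join runs of abnormal windows into CNV segments.

    windows: list of (start, end, copy_number, posterior), sorted by start.
    Returns a list of (start, end, copy_number, mean_posterior).
    """
    segments, run = [], []

    def flush():
        # Keep the run only if it is long enough and confidently abnormal.
        if len(run) >= min_windows:
            mean_post = sum(w[3] for w in run) / len(run)
            if mean_post > min_posterior:
                segments.append((run[0][0], run[-1][1], run[0][2], mean_post))
        run.clear()

    for w in windows:
        abnormal = w[2] != normal_cn
        if run and (not abnormal or w[2] != run[0][2]):
            flush()
        if abnormal:
            run.append(w)
    flush()
    return segments

# Six one-copy windows form a segment; a run of three gain windows is too short.
wins = ([(i * 25, i * 25 + 30, 2, 0.99) for i in range(3)]
        + [(i * 25, i * 25 + 30, 1, 0.95) for i in range(3, 9)]
        + [(i * 25, i * 25 + 30, 2, 0.99) for i in range(9, 12)]
        + [(i * 25, i * 25 + 30, 3, 0.95) for i in range(12, 15)])
segments = call_cnv_segments(wins)
```

The retained segments would then be annotated against the CYP21A2 region before output.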

The point mutation analysis module 23 is used for finding all true gene/pseudogene differential sites by aligning the sequences of the CYP21A2 gene and the CYP21A1P pseudogene on the human reference genome, and outputting the positions of these sites and the bases at the corresponding positions to obtain a true gene/pseudogene differential site table; aligning the reads of the samples to the true gene and pseudogene sequences, realigning all reads from both regions back to the true gene, searching for variants present in most samples, checking and confirming which of them are true gene/pseudogene differential sites, and adding these to the differential site table; marking hotspot mutations on the basis of the differential site information; for each hotspot in turn, searching the alignment file for the IDs of the reads covering the hotspot position while recording the alignment positions of their mates, and retrieving the mates of the marked reads from the alignment file within a set region around the hotspot; analyzing each read pair and locating the position of the point mutation or insertion/deletion; determining the bases at the target site and at the other, auxiliary differential sites against the differential site table; assigning the read pair to the true gene or the pseudogene according to whether the bases at the sites other than the target site belong to the true gene or the pseudogene; and, for read pairs assigned to the true gene, determining whether the target site carries the mutant base or the original true-gene base, thereby confirming whether the true-gene target site is mutated.

The true gene base ratio prompt signal analysis module 24 is used for counting the true-gene base ratio at each true gene/pseudogene differential site; for single-base differential sites, directly counting the number of each base at the true-gene and pseudogene positions and combining the counts to calculate the true-gene base proportion and the total depth; for insertions and deletions, for example where the pseudogene carries an insertion relative to the true gene, counting the reads with and without the insertion at the true-gene position, counting the reads without mutation and with the deletion at the pseudogene position, combining the two sets of counts, and calculating the number and proportion of reads carrying the true-gene sequence; collecting statistics over a plurality of samples of the same panel, calculating the mean and standard deviation of normal samples, and calculating a low-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis module as a correction factor to convert the ratio to the true-gene base proportion at normal copy number and comparing it with the set threshold, a value smaller than the threshold indicating that a mutation is present.

The detection information integration statistical module 25 is used for integrating and counting, for each sample, the number of point mutations detected by the different approaches, as supporting evidence for segment-level copy number variation; and finally obtaining the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested based on the NGS data.

Further, the device for analyzing CYP21A2 gene NGS data also comprises a high-throughput sequencing data filtering module 21, which filters the raw data obtained by high-throughput sequencing. The filtering criteria comprise removing reads in which bases with a quality value of 10 or less account for more than 50% of all bases in the read, reads with an average quality below 20, and reads in which N bases exceed 10%, so as to obtain high-quality high-throughput sequencing data. The high-quality data are then used by the copy number variation analysis module, the point mutation analysis module and the true gene base ratio prompt signal analysis module.

Through the coordination of its modules, the device implements the method for analyzing CYP21A2 gene NGS data of the present application; specifically, each module of the device carries out the corresponding step of the method, thereby achieving automated detection of CYP21A2 gene mutations.

Another implementation of the present application provides a device for analyzing CYP21A2 gene NGS data, the device comprising a memory and a processor; the memory is used for storing a program; the processor is used for implementing the following method by executing the program stored in the memory: a copy number variation analysis step, comprising obtaining high-throughput sequencing data of a sample to be tested; setting a window length and a sliding step for the chip capture regions, and recording the start and end coordinates of each window in each capture region; calculating the average depth and GC content of each window; performing LOESS regression of window depth against GC content for each chromosome of each sample to obtain GC-corrected depths; on the corrected sample depths, setting the window length and sliding step of the capture regions again according to the parameters and taking the window depths; performing batch correction on the GC-corrected window depths, calculating correlation coefficients, and performing quality control to remove low-quality samples; estimating the copy number of each window from the corrected window depths with a hidden Markov model and identifying windows with abnormal copy number; when the number of consecutive abnormal windows reaches a set threshold, joining the consecutive abnormal windows into a copy number variation segment and calculating its average posterior probability; outputting the segment if the average posterior probability is greater than a set threshold and filtering it out otherwise; and annotating and outputting the retained copy number variation segments to obtain the copy number variation analysis result; a point mutation analysis step, comprising finding all true gene/pseudogene differential sites by aligning the sequences of the CYP21A2 gene and the CYP21A1P pseudogene on the human reference genome, and outputting the positions of these sites and the bases at the corresponding positions to obtain a true gene/pseudogene differential site table; aligning the reads of the samples to the true gene and pseudogene sequences, realigning all reads from both regions back to the true gene, searching for variants present in most samples, checking and confirming which of them are true gene/pseudogene differential sites, and adding these to the differential site table; marking hotspot mutations on the basis of the differential site information; for each hotspot in turn, searching the alignment file for the IDs of the reads covering the hotspot position while recording the alignment positions of their mates, and retrieving the mates of the marked reads from the alignment file within a set region around the hotspot; analyzing each read pair and locating the position of the point mutation or insertion/deletion; determining the bases at the target site and at the other, auxiliary differential sites against the differential site table; assigning the read pair to the true gene or the pseudogene according to whether the bases at the sites other than the target site belong to the true gene or the pseudogene; and, for read pairs assigned to the true gene, determining whether the target site carries the mutant base or the original true-gene base, thereby confirming whether the true-gene target site is mutated; a true gene base ratio prompt signal analysis step, comprising counting the true-gene base ratio at each true gene/pseudogene differential site; for single-base differential sites, directly counting the number of each base at the true-gene and pseudogene positions and combining the counts to calculate the true-gene base proportion and the total depth; for insertions and deletions, for example where the pseudogene carries an insertion relative to the true gene, counting the reads with and without the insertion at the true-gene position, counting the reads without mutation and with the deletion at the pseudogene position, combining the two sets of counts, and calculating the number and proportion of reads carrying the true-gene sequence; collecting statistics over a plurality of samples of the same panel, calculating the mean and standard deviation of normal samples, and calculating a low-probability threshold from the normal distribution; and, for the sample to be tested, using the result of the copy number variation analysis step as a correction factor to convert the ratio to the true-gene base proportion at normal copy number and comparing it with the set threshold, a value smaller than the threshold indicating that a mutation is present; and a detection information integration statistical step, comprising integrating and counting, for each sample, the number of point mutations detected by the different approaches, as supporting evidence for segment-level copy number variation; and finally obtaining the copy number variation and point mutation information of the CYP21A2 gene of the sample to be tested based on the NGS data.

Based on the method and the device for analyzing CYP21A2 gene NGS data of the present application, a kit, gene chip or device for detecting 21-hydroxylase deficiency mutations can further be prepared. For example, on the basis of the analysis method of the present application, the reagents for carrying out the relevant experiments are assembled into a kit dedicated to detecting 21-hydroxylase deficiency mutations; the capture chip or sequencing chip involved in high-throughput sequencing is made into a gene chip dedicated to detecting 21-hydroxylase deficiency mutations; or a device dedicated to the automated detection of 21-hydroxylase deficiency mutations is assembled according to the flow of the analysis method or the structure of the analysis device of the present application.

The following examples are intended to be illustrative of the present application only and should not be construed as limiting the present application.

Examples

CYP21A2 gene mutation analysis based on high-throughput sequencing data

DNA from 179 clinical samples was subjected to high-throughput sequencing, and the sequencing data were then analyzed. For all DNA samples, capture and library construction were performed with a specific panel probe, namely # xGen Lockdown Probe-pp150V1, according to the operating instructions of the BGISEQ/MGISEQ sequencing platform, and sequencing was carried out on a gene sequencer (BGISEQ/MGISEQ) to obtain the raw high-throughput sequencing data.

The method for analyzing the NGS data of the CYP21A2 gene in this example is as follows:

A high-throughput sequencing data filtering step, in which the raw data obtained by sequencing are filtered. The following sequences are removed: sequences in which bases with a quality value less than or equal to 10 account for more than 50% of all bases in the sequence, sequences with an average quality below 20, and sequences in which N bases account for more than 10%. This yields high-quality sequencing data, which are then aligned to the human reference genome (GRCh37).
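The three filtering rules can be illustrated with a minimal sketch. The function name `passes_filter` and its parameters are hypothetical; the sketch assumes each read is available as a base string plus a list of per-base Phred quality values, and it is not the program used in this example.

```python
from statistics import mean


def passes_filter(seq, quals, low_q=10, max_low_frac=0.5,
                  min_mean_q=20, max_n_frac=0.1):
    """Return True if a read survives the three filters described in the text.

    seq   -- read bases as a string
    quals -- per-base Phred quality values, same length as seq
    """
    if not seq or len(seq) != len(quals):
        return False
    n = len(seq)
    # Rule 1: bases with quality <= 10 must not exceed 50% of the read.
    if sum(q <= low_q for q in quals) / n > max_low_frac:
        return False
    # Rule 2: the average quality must not be below 20.
    if mean(quals) < min_mean_q:
        return False
    # Rule 3: N bases must not exceed 10% of the read.
    if seq.upper().count("N") / n > max_n_frac:
        return False
    return True
```

In practice the same checks would be applied to both reads of a pair, and a pair would presumably be dropped when either read fails; the text does not state this, so it is left out of the sketch.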

A copy number variation analysis step. Taking the high-quality high-throughput sequencing data obtained after filtering, a window length and a sliding step are set for the chip capture regions, the start and end coordinates of each window of each capture region are recorded, and the average depth and GC content of each window are calculated. For each sample, LOESS regression of window depth against GC content is performed for each chromosome to obtain the GC-corrected depth. On the corrected sample depth, the window length and the sliding length of the capture regions are set again according to the parameters and the window depths are taken; batch correction is applied to the GC-corrected window depths, a correlation coefficient is calculated, and quality control is performed to remove low-quality samples. The copy number of each window is then estimated from the corrected window depth using a hidden Markov model. When the number of consecutive windows with abnormal copy number reaches the set threshold, the consecutive abnormal windows are joined into one copy number variation fragment and its average posterior probability is calculated; the fragment is output if the average posterior probability is greater than the set threshold and filtered out otherwise. The resulting copy number variation fragments are annotated and output to obtain the copy number variation analysis result.
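The GC correction of window depth can be sketched as follows. Note that this sketch deliberately swaps the regression described above for a simpler stand-in, scaling each window by the median depth of its GC bin; the names `gc_content` and `gc_correct` and the 5% bin width are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def gc_content(seq):
    """Fraction of G/C among called (non-N) bases of a window sequence."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def gc_correct(depths, gcs, bin_width=0.05):
    """Scale window depths so that every GC bin has the same median depth.

    This is a median-per-bin stand-in for a depth-versus-GC regression,
    not the regression itself.
    """
    overall = median(depths)
    bins = defaultdict(list)
    for depth, gc in zip(depths, gcs):
        bins[int(gc / bin_width)].append(depth)
    bin_median = {b: median(v) for b, v in bins.items()}

    corrected = []
    for depth, gc in zip(depths, gcs):
        m = bin_median[int(gc / bin_width)]
        corrected.append(depth * overall / m if m > 0 else 0.0)
    return corrected
```

After correction, windows with high and low GC content have comparable depth, so that the remaining depth differences are more likely to reflect copy number than capture or sequencing bias.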

In the CNV analysis of this embodiment, a window length and a sliding window are set for the chip capture regions, and depth correction is performed according to the depth and GC content of each window. The window size and sliding size are then set again on the corrected sample depth, the batch of samples is corrected, and the corrected depth is calculated. The copy number of each window is calculated with a hidden Markov model, and a CNV fragment is output when the CNV signal of a number of consecutive windows reaches the threshold. Samples with a CNV in the CYP21A2 gene region are then screened from the output.
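How a hidden Markov model turns corrected window depths into per-window copy numbers can be shown with a small Viterbi decoder. This is a minimal sketch under stated assumptions: the depth of each window is expressed as a ratio to the expected two-copy depth, emissions are Gaussian with mean copy number / 2, and the state set, standard deviation `sd` and self-transition probability `stay` are hypothetical values, not parameters given in the text.

```python
import math


def viterbi_copy_number(ratios, states=(1, 2, 3), sd=0.15, stay=0.99):
    """Most likely copy-number path for depth ratios (1.0 = two copies)."""
    if not ratios:
        return []
    switch = (1.0 - stay) / (len(states) - 1)

    def log_emit(x, cn):
        # Gaussian log-density around the expected ratio cn / 2.
        mu = cn / 2.0
        return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    def log_trans(a, b):
        return math.log(stay if a == b else switch)

    # Uniform prior over states for the first window.
    score = {s: -math.log(len(states)) + log_emit(ratios[0], s) for s in states}
    back = []
    for x in ratios[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans(p, s))
            new_score[s] = score[prev] + log_trans(prev, s) + log_emit(x, s)
            pointers[s] = prev
        back.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The high self-transition probability is what makes the model ignore a single noisy window while still following a sustained drop in depth across several windows.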

The point mutation analysis step comprises the following. All differential sites between the true gene and the pseudogene are found by aligning the sequences of the CYP21A2 gene and the CYP21A1P gene on the human reference genome, and the positions of the differential sites and the bases at the corresponding positions are output to obtain a true gene/pseudogene differential site table. The sample to be detected is aligned to the true gene and pseudogene sequences, all sequences of the true gene and the pseudogene are re-aligned to the true gene, variants present in most samples are searched for, those that are differential sites between the true gene and the pseudogene in the sample to be detected are checked and confirmed, and they are added to the differential site table. Based on the differential site information of the sample to be detected, hot-spot mutations are marked, the IDs of the sequences at the hot-spot positions are looked up in the alignment file, and the alignment positions of their paired sequences are recorded. Looping over the hot spots, the paired sequences of the marked hot-spot sequences are searched for in the alignment file within a certain region around each hot spot. Each pair of sequences is then analyzed: the position of the point mutation or insertion/deletion is located, the bases at the target site and at the other auxiliary differential sites are determined against the differential site table, the bases outside the target site are used to confirm whether the sequence belongs to the true gene or the pseudogene, and, for sequences belonging to the true gene, the base at the target site is judged to be either the mutant base or the original true-gene base, thereby confirming whether the target site of the true gene is mutated or not.

Further, the point mutation analysis step also comprises counting the reads supporting the mutation and the reads supporting the reference sequence to determine the number and proportion of mutation-supporting reads. A mutation site with at least 2 supporting reads, a proportion of at least 10%, and more than 20 supporting sequences at the site is taken as a positive mutation site; the remaining mutation sites, which do not meet these conditions, are treated as false positive mutation sites.
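The three conditions can be written as a single predicate. This is a minimal sketch; the function name `is_positive_site` is hypothetical, and it reads the "supporting sequences at the site" as the sum of mutation-supporting and reference-supporting reads, which is an interpretation rather than something the text states explicitly.

```python
def is_positive_site(mut_reads, ref_reads,
                     min_mut_reads=2, min_ratio=0.10, min_depth=20):
    """Apply the read-count, proportion and depth thresholds to one site."""
    total = mut_reads + ref_reads
    # Assumed reading: the site needs more than 20 classified reads in total.
    if total <= min_depth:
        return False
    # At least 2 mutation-supporting reads, making up at least 10% of the total.
    return mut_reads >= min_mut_reads and mut_reads / total >= min_ratio
```

For example, `is_positive_site(5, 25)` passes all three conditions, while `is_positive_site(2, 28)` fails because the mutant proportion is below 10%.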

The point mutation analysis of this embodiment uses an auxiliary site method. Sequences in the CYP21A2 gene region are counted and those covering a target site are screened out; their paired sequences are selected according to the sequence ID and the recorded alignment positions of the paired sequences. The auxiliary sites contained in the interval are determined from the start and end positions of the paired sequences, the bases at the target site and at the auxiliary sites are determined, and the base at the target site is compared with the reference sequence to decide whether the paired sequences support a mutated or an unmutated true gene. The sequences meeting the conditions are counted by looping over the sites in turn.
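The decision made for each read pair can be sketched as a small voting procedure. This is an illustration under assumptions: each pair is reduced to a mapping from position to observed base, origin is decided by a simple majority of auxiliary sites, and the function names, labels and coordinates are hypothetical. The text does not specify how conflicting auxiliary sites are resolved.

```python
from collections import Counter


def classify_pair(observed, diff_table, target_pos, target_ref, target_mut):
    """Classify one read pair at a target site using auxiliary differential sites.

    observed   -- {position: base} seen on the read pair
    diff_table -- {position: (true_gene_base, pseudogene_base)}
    """
    if target_pos not in observed:
        return "uninformative"
    true_votes = pseudo_votes = 0
    for pos, base in observed.items():
        if pos == target_pos or pos not in diff_table:
            continue  # only auxiliary differential sites vote on origin
        true_base, pseudo_base = diff_table[pos]
        if base == true_base:
            true_votes += 1
        elif base == pseudo_base:
            pseudo_votes += 1
    if true_votes == pseudo_votes:
        return "uninformative"  # no auxiliary evidence, or a tie
    if pseudo_votes > true_votes:
        return "pseudogene"
    if observed[target_pos] == target_mut:
        return "true_gene_mutant"
    if observed[target_pos] == target_ref:
        return "true_gene_reference"
    return "uninformative"


def count_support(pairs, diff_table, target_pos, target_ref, target_mut):
    """Tally the classifications of all read pairs covering one target site."""
    return Counter(
        classify_pair(p, diff_table, target_pos, target_ref, target_mut)
        for p in pairs
    )
```

The counts of `true_gene_mutant` and `true_gene_reference` correspond to the mutation-supporting and reference-supporting read numbers that are later compared with the thresholds.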

A true-gene base ratio prompt signal analysis step, comprising counting the true-gene base ratio at each differential site between the true gene and the pseudogene. For single-base differential sites, the numbers of the various bases at the true gene position and at the pseudogene position are counted directly and then combined to calculate the proportion of true-gene bases and the total depth. For insertions and deletions, if the pseudogene carries an insertion relative to the true gene, the numbers of inserted and non-inserted reference sequences at the position of the true gene are counted, the numbers of sequences without mutation and with the deletion at the position of the pseudogene are counted, the two counts are combined, and the number and proportion of true-gene base sequences are calculated. A plurality of samples of the same panel are counted, the mean and standard deviation of the normal samples are calculated, and a small-probability threshold is calculated from the normal distribution. For the sample to be detected, the result of the copy number variation analysis step is used as a correction factor to convert the true-gene base proportion to that of a normal copy number; the converted proportion is compared with the set threshold, and a value smaller than the threshold indicates that a mutation exists.

In the true-gene base ratio prompt signal analysis of this embodiment, the true-gene base ratio of each differential site is calculated for each actual sample, the value of the sample is corrected using its copy number, and, by comparison with the threshold set for each site, samples below the threshold are determined to be mutation candidate samples.
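The ratio calculation and its copy-number correction can be sketched as below. The text does not give the correction formula, so the one used here, rescaling the true-gene and pseudogene read counts to two copies each before taking the ratio, is purely an assumption, as are the function names.

```python
def true_gene_ratio(true_reads, pseudo_reads):
    """Share of reads carrying the true-gene base at one differential site."""
    total = true_reads + pseudo_reads
    return true_reads / total if total else None


def corrected_ratio(true_reads, pseudo_reads, true_cn=2, pseudo_cn=2, normal_cn=2):
    """Ratio after rescaling both read counts to the normal copy number.

    Assumed correction: reads are taken to scale linearly with copy number.
    Returns None when a copy number of zero makes rescaling impossible.
    """
    if true_cn <= 0 or pseudo_cn <= 0:
        return None
    scaled_true = true_reads * normal_cn / true_cn
    scaled_pseudo = pseudo_reads * normal_cn / pseudo_cn
    return true_gene_ratio(scaled_true, scaled_pseudo)


def is_candidate(ratio, threshold):
    """A site is a mutation candidate when its corrected ratio is below threshold."""
    return ratio is not None and ratio < threshold
```

With this correction, a sample carrying a one-copy deletion of the true gene (25 true-gene reads against 50 pseudogene reads) is converted back to a ratio of 0.5, so a low ratio caused only by the deletion is not also reported as a point mutation.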

A detection information integration and statistics step, comprising counting, for each sample, the number of point mutations detected by the different modes as auxiliary evidence of copy number variation caused by fragment mutation. Finally, the copy number variation and point mutation information of the CYP21A2 gene of the sample to be detected is obtained from the NGS data.

In this example, the detection information summary adds to each sample the total number of mutation sites found by the two methods as a hint that a CNV is present. Specifically, one method gives the CNV intervals obtained by direct CNV analysis, and the other gives the number of sites whose snp ratio is smaller than the threshold; a site count >= 3 suggests that a CNV may be present.
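The integration rule can be illustrated with a short sketch. The function name, the dictionary layout and the site labels are hypothetical; the output keys loosely follow the column names cnv-del and mut_pos_n used in the result tables, and any interval passed in is assumed to be a deletion call from the direct CNV analysis.

```python
def summarize_sample(deletion_intervals, site_ratios, site_thresholds, min_sites=3):
    """Combine the direct CNV call with the count of low-ratio sites.

    deletion_intervals -- intervals reported by the direct CNV analysis
    site_ratios        -- {site: corrected true-gene base ratio}
    site_thresholds    -- {site: per-site threshold}
    """
    mut_pos_n = sum(
        1
        for site, ratio in site_ratios.items()
        if site in site_thresholds and ratio < site_thresholds[site]
    )
    return {
        "cnv_del": "del" if deletion_intervals else "",
        "mut_pos_n": mut_pos_n,
        # A CNV is suggested by a direct call or by three or more low-ratio sites.
        "cnv_suggested": bool(deletion_intervals) or mut_pos_n >= min_sites,
    }
```

A sample without a direct CNV call but with three sites below their thresholds is therefore still flagged for a possible CNV.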

In the CYP21A2 gene NGS data analysis method of this example, each threshold is set as follows:

For the parameters used in CNV analysis, namely the size and sliding step of the depth calculation window, the size and sliding step of the minimum CNV window, and the minimum number of CNV windows, the detection depth of the process was adjusted, the positive and negative detection rates were calculated under the different parameter settings, and ROC curves were drawn. The depth calculation window was finally set to 200 bp sliding by 20 bp, the CNV detection window to 30 bp sliding by 25 bp, and a CNV is output when at least 5 consecutive windows carry a CNV signal.
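The windowing and the "at least 5 consecutive windows" rule can be sketched as follows. The function names and the half-open coordinate convention are assumptions, and the average posterior probability filter described for the CNV fragments is left out of this sketch.

```python
def sliding_windows(start, end, size=200, step=20):
    """Half-open windows [s, s + size) tiled across one capture region."""
    windows = []
    pos = start
    while pos + size <= end:
        windows.append((pos, pos + size))
        pos += step
    return windows


def merge_abnormal_runs(windows, copy_numbers, normal_cn=2, min_run=5):
    """Join runs of consecutive windows sharing the same abnormal copy number.

    Only runs of at least min_run windows are reported, as
    (segment_start, segment_end, copy_number) tuples.
    """
    segments = []
    i, n = 0, len(copy_numbers)
    while i < n:
        j = i
        while j + 1 < n and copy_numbers[j + 1] == copy_numbers[i]:
            j += 1
        if copy_numbers[i] != normal_cn and j - i + 1 >= min_run:
            segments.append((windows[i][0], windows[j][1], copy_numbers[i]))
        i = j + 1
    return segments
```

Requiring a minimum run length trades sensitivity to very short events for fewer false positive calls, which is consistent with the parameters having been chosen from ROC curves.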

The minimum read number and proportion quality control thresholds of the auxiliary site method were set from 164 point mutation samples analyzed among historical samples: number of supporting reads >= 2, proportion >= 10%, and more than 20 supporting sequences at the site.

The true-gene base ratio thresholds were confirmed from 545 samples, most of which are inferred to be normal. Extreme values were removed during the statistics so that the mean and standard deviation are close to those of normal samples. Because the ratio differs from site to site, each site has its own threshold. The distribution of this batch of normal samples is assumed to be approximately normal; under this assumption, the one-sided small-probability boundary is taken as the mean minus 2 standard deviations (a probability of roughly 2%), and the resulting thresholds range from 0.21 to 0.36. In addition, for c.1360C>T the base is C in both the true gene and the pseudogene and the mutant base is T, so the calculated C ratio is close to 1 and the threshold for this site is set to 0.99.
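A per-site threshold of this kind can be sketched as below. The sketch takes the "mean minus 2-fold" boundary to be measured in standard deviations, which is an assumption, and it removes extreme values by trimming a fixed fraction from each tail; the text does not say how extreme values were identified, so `trim_frac` and the function name are hypothetical.

```python
from statistics import mean, stdev


def site_threshold(ratios, trim_frac=0.05, k=2.0):
    """One-sided lower threshold for one site: mean - k * SD after trimming.

    ratios    -- true-gene base ratios of the reference samples at this site
    trim_frac -- fraction of samples dropped from each tail (assumed rule)
    """
    values = sorted(ratios)
    cut = int(len(values) * trim_frac)
    kept = values[cut:len(values) - cut] if cut else values
    if len(kept) < 2:
        raise ValueError("need at least two samples after trimming")
    return mean(kept) - k * stdev(kept)
```

Under a normal distribution about 2.3% of normal samples fall below the mean minus two standard deviations, which matches the order of magnitude of the small-probability boundary described above.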

For predicting a CNV from the total number of mutated sites in a single sample, the threshold was set to 3, because, genetically, the co-occurrence of mutations at 2 sites is relatively likely, whereas the co-occurrence of mutations at 3 sites is much less likely.

Second, analysis results

In this example, DNA was extracted from 179 clinical samples and then captured, built into libraries and sequenced.

Quality control filtering was performed on the raw data from the sequencer. The following sequences were removed: sequences in which bases with a quality value less than or equal to 10 account for more than 50% of all bases in the sequence, sequences with an average quality below 20, and sequences in which N bases account for more than 10%. The resulting high-quality sequencing data were aligned to the human reference genome (GRCh37).

CNV analysis: a purpose-written program was used to detect and call CNVs from the bam files according to the set parameters and to output the results. Some of the results are shown in Table 1.

TABLE 1 CNV test results

In Table 1, sample is the sample name; QC indicates whether quality control is passed; exon-CYP21A2 is the exon involved in the CNV; CN is the copy number; and "-" indicates that no mutation was detected.

Point mutation analysis by the auxiliary site method: the numbers of reads supporting mutation and non-mutation at the hot spots were counted from the bam files, and the output was judged against the thresholds. Some of the results are shown in Table 2.

TABLE 2 Point mutation results of the auxiliary site method

sample QC GENE cHGVS CYP21A2-ref-all CYP21A2-mut CYP21A2-total_reads ratio
L2_DX_L120-100 Qualified CYP21A2 c.955C>T 7 7 14 0.5
L2_DX_L121-101 Qualified CYP21A2 c.955C>T 10 9 19 0.474
L2_XSE_091-65 Qualified CYP21A2 c.518T>A 8 12 20 0.6
L2_YW_L127-109 Qualified CYP21A2 c.955C>T 14 7 21 0.333
L1_XSE_028-61 Qualified CYP21A2 c.955C>T 20 4 24 0.167
L2_XSE_081-17 Qualified CYP21A2 c.719T>A 18 6 24 0.25
L2_XSE_079-61 Qualified CYP21A2 c.518T>A 17 7 24 0.292

In Table 2, sample is the sample name; QC indicates whether quality control is passed; GENE is the name of the gene tested; cHGVS is the standard (HGVS coding DNA) name of the variant; CYP21A2-ref-all is the number of reads supporting non-mutation; CYP21A2-mut is the number of reads supporting the mutation; CYP21A2-total_reads is the sum of mutant and non-mutant reads; ratio is the mutation ratio.

True-gene base ratio prompt signal: the true-gene ratios at the differential sites between the true gene and the pseudogene were counted and output. Some of the results are shown in Table 3.

TABLE 3 True-gene ratio statistics at the true gene/pseudogene differential sites

In Table 3, sample is the sample name; order is the sequence number of the mutation; ref ratio is the ratio of the base at the corresponding position of the CYP21A2 gene, the denominator being the sum of the bases at the corresponding positions of the true gene and the pseudogene; total_reads is the sum of the bases at the corresponding positions of the true gene and the pseudogene; pos is the chromosomal position in CYP21A2; cHGVS is the standard name of the mutation at the corresponding position.

The results of the three analyses were integrated and output. Some of the results are shown in Table 4.

TABLE 4 Integrated output results

In Table 4, sample is the sample name; QC indicates whether quality control is passed; GENE is the name of the gene tested; cHGVS is the standard name of the variant; exon is the functional region of the gene; CYP21A2-mut is the number of reads supporting the mutation; CYP21A2-total_reads is the sum of mutant and non-mutant reads; ratio is the mutation ratio; cnv-del indicates whether a deletion was detected in the true gene, where the marker del means that a deletion was detected, i.e. a CNV-positive sample, and the field is otherwise empty; snp_ratio_tag indicates the range into which the base ratio of the site falls according to the threshold, labeled mut-down, mut-up or ref; mut_pos_n is the number of mutated sites detected in the sample by the snp_ratio method.

A total of 179 samples were tested with the method, and all of them met the quality control requirements. Point mutation detection was evaluated on 164 samples: 159 true positive mutation sites, 41 false positive sites and 3 false negative sites were detected in total, giving a positive predictive value PPV = TP/(TP + FP) = 0.795, a recall = TP/(TP + FN) = 0.98, and an F1 score = 2 × PPV × recall/(PPV + recall) = 0.88. CNV detection was evaluated on 56 cases, with 6 false positives, 1 false negative, 32 true negatives and 17 true positives, giving PPV = TP/(TP + FP) = 0.74, recall = TP/(TP + FN) = 0.94, and F1 score = 2 × PPV × recall/(PPV + recall) = 0.83.
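The reported figures can be reproduced from the stated counts with the standard definitions of positive predictive value, recall and F1 score. The helper name `ppv_recall_f1` is only for illustration.

```python
def ppv_recall_f1(tp, fp, fn):
    """Positive predictive value, recall and F1 score from confusion counts."""
    ppv = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * ppv * recall / (ppv + recall)
    return ppv, recall, f1


# Counts taken from the text: point mutations, then CNV.
point = ppv_recall_f1(tp=159, fp=41, fn=3)
cnv = ppv_recall_f1(tp=17, fp=6, fn=1)
print([round(v, 3) for v in point])
print([round(v, 3) for v in cnv])
```

Rounded to two decimals, the CNV values are 0.74, 0.94 and 0.83, and the point mutation recall and F1 score are 0.98 and 0.88, in agreement with the values given above.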

The above results show that the method for analyzing NGS data of the CYP21A2 gene of this example can use high-throughput sequencing data to accurately and effectively distinguish the CYP21A2 gene from its highly homologous pseudogene CYP21A1P and detect CYP21A2 mutations, thereby accurately and effectively obtaining the copy number variation and point mutation information of the CYP21A2 gene.

The foregoing is a more detailed description of the present application in connection with specific embodiments thereof, and it is not intended that the present application be limited to the specific embodiments thereof. It will be apparent to those skilled in the art from this disclosure that many more simple derivations or substitutions can be made without departing from the spirit of the disclosure.
