Detection method and quality control system for homologous recombination defects based on NGS platform

Document No.: 1848112 Publication date: 2021-11-16

Reading note: This technology, "Detection method and quality control system for homologous recombination defects based on NGS platform", was designed and created by 杨元, 邓望龙, 叶雷, 陆光华, 丁然, 范峰, 李诗濛 and 任用 on 2021-08-20. Main content: The invention provides a tumor purity correction method for use in bioinformatics analysis and a homologous recombination defect (HRD) detection method based on an NGS platform. By comparing clinical samples with negative samples in terms of sequencing depth in the target regions and allele frequency at single nucleotide polymorphism sites, the method effectively corrects for tumor purity and ploidy and thereby achieves HRD assessment.

1. A method for correcting tumor purity in a bioinformatics analysis process, characterized by comprising the following bioinformatics analysis steps:

1) obtaining raw NGS sequencing data;

2) analyzing the raw sequencing data to obtain the copy number (CN) of the Backbone regions;

3) analyzing the raw sequencing data to obtain the SNP allele frequency (AF);

4) removing homozygous SNPs: the removal is based on the SNP mBAF (normalized B Allele frequency) or TSUM (triple-SUM);

5) merging Segment fragments and determining the Segments with mBAF deviation;

6) assessing tumor purity based on the Segments with mBAF deviation.

2. The correction method of claim 1, wherein the removal in step 4) is as follows: when the raw sequencing data are free of contamination, the removal criterion is mBAF ≥ 0.95 or TSUM ≥ 0.80; when the raw sequencing data are slightly contaminated, namely the proportion of exogenous DNA contamination is ≤ 5%, the removal criterion is mBAF ≥ 0.90 or TSUM ≥ 0.80.

3. The correction method of claim 2, characterized in that the mBAF and TSUM in step 4) are calculated by the following formulas:

mBAF = |AF - 0.5| + 0.5;

TSUM_i = |mBAF_{i-1} - mBAF_i| + |mBAF_{i+1} - mBAF_i| + mBAF_i - 0.5;

wherein i is the index of an SNP after the SNPs have been filtered by mBAF and then sorted by chromosome and by position in ascending order.

4. The correction method according to any one of claims 1 to 3, wherein the Segment with mBAF deviation in step 5) is a Segment whose detected mBAF value deviates from 0.5.

5. The correction method of claim 4, wherein the merging of Segment fragments in step 5) is to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and to establish the CN and mBAF detection values of each Segment; preferably, a circular binary segmentation algorithm is used to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and the medians of the CN and mBAF of the Backbone regions are taken as the CN and mBAF detection values of the Segment.

6. The correction method of any one of claims 1 to 5, characterized in that the assessment in step 6) comprises the following steps:

a) performing two-dimensional clustering on the CN and mBAF of the Segments to obtain Clusters, each consisting of a plurality of Segments with similar CN and mBAF, and selecting the Cluster with the most Segments in order of priority from 1 to 5 according to the following table;

b) calculating the theoretical values of CN and mBAF of the Segments in the Cluster, wherein the theoretical values of CN and mBAF are calculated by the following formulas:

CN = Ploidy × Purity + 2 × (1 - Purity)

wherein Ploidy, Purity and nB are the copy number of the tumor cells, the tumor purity and the minor-allele copy number of the tumor cells, respectively; wherein the value ranges of Ploidy, Purity and nB are shown in the following table;

c) comparing each combination of Ploidy, Purity and nB, and calculating the distance between the theoretical value and the detection value of the mBAF, wherein the Purity in the combination with the minimum distance is the tumor purity.

7. The correction method of claim 1, wherein the raw sequencing data of step 1) are raw NGS sequencing data of a probe hybridization capture library;

preferably, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 40-60 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream; each selected SNP site is extended by 40-80 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region.

8. A method for detecting homologous recombination defects based on an NGS platform, comprising the method of any one of claims 1 to 7, and further comprising the following steps:

7) correcting the copy number of all Segments using the tumor purity assessed in step 6);

8) calculating a score for each of the three indicators LOH, TAI and LST according to its definition, the sum of the three being the HRD score;

9) visualization: plotting scatter diagrams of the copy number of the Backbone regions and of the SNP AF.

9. A detection system for homologous recombination defects based on an NGS platform, characterized by comprising the following modules:

1) a module for obtaining raw NGS sequencing data;

2) a module for analyzing the raw sequencing data to obtain the copy number (CN) of the Backbone regions;

3) a module for analyzing the raw sequencing data to obtain the SNP allele frequency (AF);

4) a homozygous SNP removal module: the removal is based on the SNP mBAF (normalized B Allele frequency) or TSUM (triple-SUM);

5) a module for merging Segment fragments and determining the Segments with mBAF deviation;

6) a module for assessing tumor purity based on the Segments with mBAF deviation;

7) a copy number correction module: correcting the Segment copy number using the tumor purity assessed in 6);

8) an HRD score calculation module: calculating a score for each of the three indicators LOH, TAI and LST according to its definition, the sum of the three being the HRD score;

9) a visualization module: plotting scatter diagrams of the copy number of the Backbone regions and of the SNP AF;

said modules 1)-9) respectively performing steps 1)-9) of the above claims 1-7.

10. A sequencing library construction method, characterized by comprising the following steps:

1) dividing each chromosome into equal, non-overlapping regions 40-60 kb in length, and selecting in each region the SNP site whose population frequency is closest to 50%; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream;

2) extending each selected SNP site by 40-80 bp on each side to form a Backbone region, and designing a corresponding probe for the Backbone region;

3) constructing a sequencing library based on the probes.

Technical Field

The invention belongs to the field of bioinformatics analysis, and particularly relates to a homologous recombination defect detection method (HRDkit) and a quality control system based on an NGS platform.

Background

Homologous Recombination Defect (HRD) refers to a functional defect in the homologous recombination pathway that repairs DNA double-strand breaks, caused by BRCA1/2 gene variation, promoter methylation, other genetic variation and the like; genomic instability is the outward manifestation of HRD. HRD can lead to genomic scars, including Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI), and Large-scale State Transitions (LST). The published Myriad myChoice HRD test combines the LOH, TAI and LST scores, and defines a sample as HRD-positive when the score is ≥ 42 or when the sample carries a (suspected) deleterious variant in the BRCA1/2 genes.

LOH refers to the state in which a region of a pair of homologous chromosomes has lost the copy inherited from the father (or mother), so that all heterozygous single nucleotide polymorphism (SNP) sites in the region become homozygous. According to copy number status, LOH is classified into copy-number-loss LOH and copy-number-neutral LOH. TAI refers to an allelic imbalance that extends to the telomere but does not cross the centromere. Normally the allelic copy number ratio is 1:1; after copy number amplification the ratio may become 2:1, 3:1 and so on, and LOH is a special case of TAI in which the allelic copy number ratio is 1:0 or 2:0. LST refers to a large-fragment state transition: after fragments of length ≤ 3 Mb (megabases) are filtered out of the genome, two adjacent regions are each ≥ 10 Mb long, differ in copy number, and are separated by a distance of ≤ 3 Mb.

HRD-positive tumor cells are sensitive to PARP inhibitors (poly ADP-ribose polymerase inhibitors, PARPi), and several PARP inhibitors have been approved for marketing in China and the United States. PARP inhibitors induce apoptosis of tumor cells through "synthetic lethality". The PARP protein participates in the repair of DNA single-strand damage. In HRD-positive tumor cells, the PARP inhibitor blocks the repair of DNA single-strand damage, which accumulates and gradually gives rise to DNA double-strand damage; because the homologous recombination pathway is defective, the double-strand damage cannot be repaired and the tumor cells undergo apoptosis. In normal cells, the homologous recombination pathway functions normally, DNA double-strand damage can be repaired, and the cells survive.

HRD measures genomic instability in tumor cells, whereas clinical tumor samples usually contain a fraction of normal cells (i.e., tumor purity < 100%). If the LOH, TAI and LST scores are not corrected for tumor purity, the resulting HRD score describes a mixture of tumor cells and normal cells and does not truly reflect the HRD status of the tumor cells. As tumor purity decreases, low-copy-number CNVs (Copy Number Variations) in tumor cells are diluted by normal cells, so that the detected copy number gradually approaches the normal 2-copy state; the HRD score of the sample is reduced and the accuracy of the HRD result for the clinical sample is affected. At present, the tumor purity of tumor tissue is generally assessed by histopathology, which involves a complicated experimental procedure and is highly subjective; moreover, pathological assessment cannot be routinely performed on some tumor tissues, which affects the reliability of HRD detection. On the other hand, the mainstream software for estimating tumor purity on the NGS platform is currently PureCN and ABSOLUTE, but these bioinformatics tools have limitations in accuracy and applicability.
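As a worked illustration of this dilution effect, the relation CN = Ploidy × Purity + 2 × (1 - Purity) given later in this document can be evaluated for a single-copy loss in the tumor cells (Ploidy = 1). The three purity values are chosen here purely as examples and are not taken from the source:

```latex
\begin{align*}
\text{Purity}=0.8:&\quad CN = 1 \times 0.8 + 2 \times (1-0.8) = 1.2\\
\text{Purity}=0.3:&\quad CN = 1 \times 0.3 + 2 \times (1-0.3) = 1.7\\
\text{Purity}=0.1:&\quad CN = 1 \times 0.1 + 2 \times (1-0.1) = 1.9
\end{align*}
```

At 10% purity the detected copy number of 1.9 is already close to the normal value of 2, which is why an uncorrected analysis tends to miss such a loss.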

In actual clinical testing, the accuracy and reliability of HRD detection can also be affected by sample quality and by the experimental process, mainly in two ways: 1) contamination by other samples (human-derived contamination) during sampling, transportation, experiments and similar steps can affect the accuracy of HRD detection for a clinical tumor sample; 2) as with NGS-based detection of single nucleotide variants and small insertions/deletions, the HRD score is affected by sequencing depth, and a decrease in sequencing depth affects the stability of the HRD result.

In summary, the accuracy of HRD detection in tumor samples is affected by tumor purity, human-derived contamination and sequencing depth. It is therefore necessary to develop an HRD detection method and a quality control system based on the NGS platform, so as to ensure the accuracy of HRD detection and to establish quality control standards suited to the detection system.

The invention is provided in view of the above.

Disclosure of Invention

The invention aims to improve the accuracy of HRD detection of a tumor sample. In order to achieve the above object, the present invention specifically provides the following technical solutions.

The invention first provides a sequencing library construction method, which comprises the following steps:

1) dividing each chromosome into equal, non-overlapping regions 40-60 kb in length, and selecting in each region the SNP site whose population frequency is closest to 50%; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream;

2) extending each selected SNP site by 40-80 bp on each side to form a Backbone region, and designing a corresponding probe for the Backbone region;

3) constructing a sequencing library based on the probes.
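The SNP selection in steps 1) and 2) can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function name `pick_backbone_regions`, the input tuple layout, and the assumption that candidate SNPs have already been restricted to non-repetitive regions with normal GC content are all hypothetical. The defaults of 50 kb bins and 60 bp flanks are the preferred values stated elsewhere in this document.

```python
def pick_backbone_regions(snps, bin_size=50_000, flank=60):
    """Pick, per fixed-size bin, the SNP whose population frequency is
    closest to 50%, and extend it by `flank` bp on each side.

    snps: iterable of (chrom, pos, pop_freq) with 1-based positions.
    Assumption: the caller has already removed SNPs in repetitive
    regions or with abnormal flanking GC content.
    Returns a sorted list of (chrom, start, end) Backbone regions.
    """
    best = {}
    for chrom, pos, freq in snps:
        key = (chrom, (pos - 1) // bin_size)  # non-overlapping bins
        current = best.get(key)
        if current is None or abs(freq - 0.5) < abs(current[2] - 0.5):
            best[key] = (chrom, pos, freq)
    return [(c, p - flank, p + flank) for c, p, _ in sorted(best.values())]


candidates = [
    ("chr1", 1_000, 0.30),
    ("chr1", 20_000, 0.48),   # same bin as the SNP above, closer to 50%
    ("chr1", 60_000, 0.55),
    ("chr2", 500, 0.10),
]
regions = pick_backbone_regions(candidates)
```

Probe synthesis and library preparation themselves are wet-lab steps and are not modeled here.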

The invention also provides a method for correcting tumor purity in a bioinformatics analysis process, which comprises the following bioinformatics analysis steps:

1) obtaining raw NGS sequencing data;

2) analyzing the raw sequencing data to obtain the copy number (CN) of the Backbone regions;

3) analyzing the raw sequencing data to obtain the SNP allele frequency (AF);

4) removing homozygous SNPs: the removal is based on the SNP mBAF (normalized B Allele frequency) or TSUM (triple-SUM);

5) merging Segment fragments and determining the Segments with mBAF deviation;

6) assessing tumor purity based on the Segments with mBAF deviation.

Further, the removal in step 4) is as follows: when the raw sequencing data are free of contamination, the removal criterion is mBAF ≥ 0.95 or TSUM ≥ 0.80; when the raw sequencing data are slightly contaminated, namely the proportion of exogenous DNA contamination is ≤ 5%, the removal criterion is mBAF ≥ 0.90 or TSUM ≥ 0.80;

further, the mBAF and TSUM in step 4) are calculated by the following formulas:

mBAF = |AF - 0.5| + 0.5;

TSUM_i = |mBAF_{i-1} - mBAF_i| + |mBAF_{i+1} - mBAF_i| + mBAF_i - 0.5;

wherein i is the index of an SNP after the SNPs have been filtered by mBAF and then sorted by chromosome and by position in ascending order.
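The two formulas can be written out as a short sketch. The function names are illustrative, and the treatment of the first and last SNP (a missing neighbour contributes 0 to the sum) is an assumption, since the source does not say how the ends of the ordered list are handled.

```python
def mbaf(af):
    """mBAF = |AF - 0.5| + 0.5, which folds AF into the range [0.5, 1.0]."""
    return abs(af - 0.5) + 0.5


def tsum(mbafs):
    """TSUM_i = |mBAF_{i-1} - mBAF_i| + |mBAF_{i+1} - mBAF_i| + mBAF_i - 0.5.

    `mbafs` must already be ordered by chromosome and position.
    Assumption: a missing neighbour at either end contributes 0.
    """
    values = []
    last = len(mbafs) - 1
    for i, m in enumerate(mbafs):
        left = abs(mbafs[i - 1] - m) if i > 0 else 0.0
        right = abs(mbafs[i + 1] - m) if i < last else 0.0
        values.append(left + right + m - 0.5)
    return values


example = tsum([mbaf(0.5), mbaf(0.0), mbaf(0.5)])
```

A SNP that stands out from both neighbours and is itself far from 0.5 gets a high TSUM, which is what makes the statistic useful for flagging isolated homozygous sites among heterozygous ones.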

Further, the Segment with mBAF deviation in step 5) is a Segment whose detected mBAF value deviates from 0.5.

Further, the merging of Segment fragments in step 5) is to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and to establish the CN and mBAF detection values of each Segment;

in some preferred modes, a circular binary segmentation algorithm is used to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and the medians of the CN and mBAF of the Backbone regions are taken as the CN and mBAF detection values of the Segment.
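Once Backbone regions have been assigned to Segments, the detection values are simple medians. The sketch below covers only that summarization step and assumes the segment assignment (for example from a circular binary segmentation tool) is already available; the function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def segment_detection_values(backbones):
    """Summarize each Segment by the median CN and median mBAF of its
    Backbone regions.

    backbones: iterable of (segment_id, cn, mbaf); segment_id is assumed
    to come from an upstream segmentation step that is not shown here.
    Returns {segment_id: (median_cn, median_mbaf)}.
    """
    grouped = defaultdict(list)
    for seg_id, cn, m in backbones:
        grouped[seg_id].append((cn, m))
    return {
        seg_id: (median(cn for cn, _ in rows), median(m for _, m in rows))
        for seg_id, rows in grouped.items()
    }


rows = [
    ("1p:s1", 1.9, 0.55),
    ("1p:s1", 2.1, 0.50),
    ("1p:s1", 2.0, 0.52),
    ("1p:s2", 1.0, 0.90),
]
values = segment_detection_values(rows)
```

The median is less sensitive than the mean to a few noisy Backbone regions inside a Segment.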

Further, the assessment in step 6) comprises the following steps:

a) performing two-dimensional clustering on the CN and mBAF of the Segments to obtain Clusters, each consisting of a plurality of Segments with similar CN and mBAF, and selecting the Cluster with the most Segments in order of priority from 1 to 5 according to the following table;

Priority | CN detection value | Purity | Ploidy | nB
1 | (0.00, 1.80] | [0.10, 1.00] | 1 | 0, 1
3 | (1.80, 1.95) | [0.10, 1.00] | 1, 2 | 0, 1, 2
2 | [1.95, 2.05] | [0.10, 1.00] | 2 | 0, 2
4 | (2.05, 2.20) | [0.10, 1.00] | 2, 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6
5 | [2.20, +∞) | [0.10, 1.00] | 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6

b) calculating the theoretical values of CN and mBAF of the Segments in the Cluster, wherein the theoretical values of CN and mBAF are calculated by the following formulas:

CN = Ploidy × Purity + 2 × (1 - Purity)

wherein Ploidy, Purity and nB are the copy number of the tumor cells, the tumor purity and the minor-allele copy number of the tumor cells, respectively;

wherein the value ranges of Ploidy, Purity and nB are shown in the table above;

c) comparing each combination of Ploidy, Purity and nB, and calculating the distance between the theoretical value and the detection value of the mBAF, wherein the Purity in the combination with the minimum distance is the tumor purity.
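The search over Ploidy, Purity and nB combinations can be sketched as a small grid search. Several parts are assumptions rather than the source's method: the source gives only the CN formula, so the theoretical mBAF used here (the larger allele's share of the mixed tumor and normal copies) is a commonly used expression supplied for illustration; the Euclidean distance over both CN and mBAF and the 0.01 purity step are likewise illustrative choices.

```python
from itertools import product


def theoretical_values(ploidy, purity, n_b):
    """Theoretical CN (formula from the text) and mBAF (assumed formula)."""
    cn = ploidy * purity + 2 * (1 - purity)
    # Assumption: normal cells carry one copy of each allele, and the
    # tumor carries n_b copies of one allele and ploidy - n_b of the other.
    major = max(n_b, ploidy - n_b)
    return cn, (major * purity + (1 - purity)) / cn


def estimate_purity(cn_obs, mbaf_obs, ploidies, n_bs, step=0.01):
    """Return (purity, ploidy, n_b) minimizing the distance between
    theoretical and detected values; purity is scanned over [0.10, 1.00]."""
    best = None
    n_steps = int(round((1.00 - 0.10) / step))
    for k in range(n_steps + 1):
        purity = 0.10 + k * step
        for ploidy, n_b in product(ploidies, n_bs):
            if n_b > ploidy:
                continue  # the allele count cannot exceed the total
            cn, m = theoretical_values(ploidy, purity, n_b)
            dist = ((cn - cn_obs) ** 2 + (m - mbaf_obs) ** 2) ** 0.5
            if best is None or dist < best[0]:
                best = (dist, round(purity, 2), ploidy, n_b)
    return best[1:]


# Priority-1 Cluster from the table: Ploidy in {1}, nB in {0, 1}.
result = estimate_purity(cn_obs=1.4, mbaf_obs=1 / 1.4, ploidies=[1], n_bs=[0, 1])
```

For a single-copy loss the two nB values give the same theoretical mBAF, so only the purity and ploidy of the result are informative in this example.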

Further, the raw sequencing data of step 1) are raw NGS sequencing data of a probe hybridization capture library;

in some preferred forms, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 40-60 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream; each selected SNP site is extended by 40-80 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region;

in some more preferred forms, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 50 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 60 bp upstream and downstream; each selected SNP site is extended by 60 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region.

The invention also provides a homologous recombination defect detection method based on an NGS platform, which is characterized by comprising the above tumor purity correction method and further comprising the following steps:

7) correcting the copy number of all Segments using the tumor purity assessed in step 6);

8) calculating a score for each of the three indicators LOH, TAI and LST according to its definition, the sum of the three being the HRD score;

9) visualization: plotting scatter diagrams of the copy number of the Backbone regions and of the SNP AF.
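Steps 7) and 8) can be illustrated with two small functions. The copy number correction shown here simply inverts the relation CN = Ploidy × Purity + 2 × (1 - Purity) to recover the tumor-cell copy number; that inversion is an assumption about how the correction is applied, and the function names are hypothetical. The HRD score as the sum of the three indicator scores follows the text directly.

```python
def corrected_copy_number(cn_detected, purity):
    """Tumor-cell copy number recovered from the detected (mixed) CN by
    solving CN = Ploidy * Purity + 2 * (1 - Purity) for Ploidy."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return (cn_detected - 2 * (1 - purity)) / purity


def hrd_score(loh, tai, lst):
    """HRD score = LOH score + TAI score + LST score."""
    return loh + tai + lst


tumor_cn = corrected_copy_number(cn_detected=1.4, purity=0.6)
score = hrd_score(loh=10, tai=12, lst=20)
```

The indicator scores passed to `hrd_score` are placeholder numbers; counting LOH, TAI and LST events from the corrected Segments is a separate step not shown here.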

The invention also provides a system for correcting tumor purity in a bioinformatics analysis process, which comprises the following modules:

1) a module for obtaining raw NGS sequencing data;

2) a module for analyzing the raw sequencing data to obtain the copy number (CN) of the Backbone regions;

3) a module for analyzing the raw sequencing data to obtain the SNP allele frequency (AF);

4) a homozygous SNP removal module: the removal is based on the SNP mBAF (normalized B Allele frequency) or TSUM (triple-SUM);

5) a module for merging Segment fragments and determining the Segments with mBAF deviation;

6) a module for assessing tumor purity based on the Segments with mBAF deviation;

said modules 1)-6) respectively performing steps 1)-6) of the above claims 1-7.

Further, the removal in module 4) is as follows: when the raw sequencing data are free of contamination, the removal criterion is mBAF ≥ 0.95 or TSUM ≥ 0.80; when the raw sequencing data are slightly contaminated, namely the proportion of exogenous DNA contamination is ≤ 5%, the removal criterion is mBAF ≥ 0.90 or TSUM ≥ 0.80;

further, the mBAF and TSUM in module 4) are calculated by the following formulas:

mBAF = |AF - 0.5| + 0.5;

TSUM_i = |mBAF_{i-1} - mBAF_i| + |mBAF_{i+1} - mBAF_i| + mBAF_i - 0.5;

wherein i is the index of an SNP after the SNPs have been filtered by mBAF and then sorted by chromosome and by position in ascending order.

Further, the Segment with mBAF deviation in module 5) is a Segment whose detected mBAF value deviates from 0.5.

Further, the merging of Segment fragments in module 5) is to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and to establish the CN and mBAF detection values of each Segment;

in some preferred modes, a circular binary segmentation algorithm is used to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and the medians of the CN and mBAF of the Backbone regions are taken as the CN and mBAF detection values of the Segment.

Further, the assessment in module 6) comprises the following steps:

a) performing two-dimensional clustering on the CN and mBAF of the Segments to obtain Clusters, each consisting of a plurality of Segments with similar CN and mBAF, and selecting the Cluster with the most Segments in order of priority from 1 to 5 according to the following table;

Priority | CN detection value | Purity | Ploidy | nB
1 | (0.00, 1.80] | [0.10, 1.00] | 1 | 0, 1
3 | (1.80, 1.95) | [0.10, 1.00] | 1, 2 | 0, 1, 2
2 | [1.95, 2.05] | [0.10, 1.00] | 2 | 0, 2
4 | (2.05, 2.20) | [0.10, 1.00] | 2, 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6
5 | [2.20, +∞) | [0.10, 1.00] | 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6

b) calculating the theoretical values of CN and mBAF of the Segments in the Cluster, wherein the theoretical values of CN and mBAF are calculated by the following formulas:

CN = Ploidy × Purity + 2 × (1 - Purity)

wherein Ploidy, Purity and nB are the copy number of the tumor cells, the tumor purity and the minor-allele copy number of the tumor cells, respectively; wherein the value ranges of Ploidy, Purity and nB are shown in the table above;

c) comparing each combination of Ploidy, Purity and nB, and calculating the distance between the theoretical value and the detection value of the mBAF, wherein the Purity in the combination with the minimum distance is the tumor purity.

Further, the raw sequencing data of module 1) are raw NGS sequencing data of a probe hybridization capture library;

in some preferred forms, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 40-60 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream; each selected SNP site is extended by 40-80 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region.

The invention also provides a system for detecting homologous recombination defects based on an NGS platform, which is characterized by comprising the above modules and further comprising the following modules:

7) a copy number correction module: correcting the Segment copy number using the tumor purity assessed in 6);

8) an HRD score calculation module: calculating a score for each of the three indicators LOH, TAI and LST according to its definition, the sum of the three being the HRD score;

9) a visualization module: plotting scatter diagrams of the copy number of the Backbone regions and of the SNP AF.

The invention also provides a device for detecting the homologous recombination defect based on the NGS platform, which is characterized by comprising the following components: at least one memory for storing a program; at least one processor configured to load the program to perform the above method.

The present invention also provides a storage medium having stored therein processor-executable instructions, characterized in that the processor-executable instructions, when executed by a processor, are adapted to implement the above method.

Compared with the prior art, the invention has at least the following advantages:

(1) the invention develops a brand-new homologous recombination defect detection method (HRDkit) and system based on the NGS platform;

(2) the invention establishes an accurate tumor purity assessment method, which improves the accuracy of HRD detection and solves the problem that some samples cannot undergo pathological assessment and therefore cannot be tested for HRD;

(3) the invention establishes a quality control system for HRD detection, determines the limit of detection (LOD) for tumor purity and the minimum sequencing depth, allows HRD detection of samples with a contamination proportion of up to 5%, and thereby addresses slight sample contamination in actual testing;

(4) the method performs well with respect to both detection limit and required sequencing depth.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 Schematic of the Panel design;

FIG. 2 HRDkit analysis flow chart;

FIG. 3 mBAF distribution of SNP sites in 20 negative samples;

FIG. 4 TSUM distribution of all SNP sites in 20 negative samples;

FIG. 5 Establishment of the mBAF threshold for slightly contaminated samples;

FIG. 6 HRD scores of simulated contaminated sample 1 before and after adjustment of the mBAF threshold;

FIG. 7 HRD scores of simulated contaminated sample 2 before and after adjustment of the mBAF threshold;

FIG. 8 HRD scores of simulated contaminated sample 3 before and after adjustment of the mBAF threshold;

FIG. 9 HRD scores of simulated contaminated sample 4 before and after adjustment of the mBAF threshold;

FIG. 10 Detected and expected values of tumor purity (HRDkit);

FIG. 11 Detected and expected values of tumor purity (PureCN);

FIG. 12 Detected tumor purity versus histopathological assessment (HRDkit);

FIG. 13 Detected tumor purity versus histopathological assessment (PureCN);

FIG. 14 HRD score distribution of clinical samples;

FIG. 15 HRD score distribution at different tumor purities;

FIG. 16 HRD scores at different sequencing depths (300x vs. raw);

FIG. 17 HRD scores at different sequencing depths (250x vs. raw);

FIG. 18 HRD scores at different sequencing depths (200x vs. raw).

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following terms or definitions are provided only to aid in understanding the present invention. These definitions should not be construed to have a scope less than understood by those skilled in the art.

Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present invention are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.

As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.

Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.

The terms "about" and "substantially" in the present invention denote an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The method of the invention for correcting tumor purity in a bioinformatics analysis process generally comprises the following analysis steps (as shown in FIG. 2):

1) obtaining raw NGS sequencing data;

2) analyzing the raw sequencing data to obtain the copy number (CN) of the Backbone regions;

3) analyzing the raw sequencing data to obtain the SNP allele frequency (AF);

4) removing homozygous SNPs: the removal is based on the SNP mBAF (normalized B Allele frequency) or TSUM (triple-SUM);

5) merging Segment fragments and determining the Segments with mBAF deviation;

6) assessing tumor purity based on the Segments with mBAF deviation.

In some embodiments, the raw sequencing data of step 1) are raw NGS sequencing data of a probe hybridization capture library. It is understood that such a probe hybridization capture library can be obtained in a manner conventional in the art, and those skilled in the art can design and obtain it so long as the needs of the present invention are met; without limitation, some specific examples of probe designs are as follows:

in some embodiments, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 40-60 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 40-80 bp upstream and downstream; each selected SNP site is extended by 40-80 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region;

in some preferred embodiments, the probes are designed as follows: each chromosome is divided into equal, non-overlapping regions 50 kb in length, and in each region the SNP site whose population frequency is closest to 50% is selected; in addition, the SNP site must lie in a non-repetitive region of the genome and have normal GC content within 60 bp upstream and downstream; each selected SNP site is extended by 60 bp on each side to form a Backbone region, and a corresponding probe is designed for the Backbone region.

In some embodiments, the removal in step 4) is as follows: the removal criterion is mBAF ≥ 0.95 or TSUM ≥ 0.80. In addition, experiments show that when the raw sequencing data are slightly contaminated, namely the proportion of exogenous DNA contamination is ≤ 5%, the removal criterion needs to be adjusted to mBAF ≥ 0.90 or TSUM ≥ 0.80;
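The removal criteria can be sketched as a two-stage filter. This is one reading of the text, under stated assumptions: SNPs are first filtered by mBAF, TSUM is then computed over the surviving SNPs in genomic order (as the definition of the index i suggests), and a missing neighbour at either end contributes 0. The function names and the choice to reject contamination above 5% are illustrative.

```python
def removal_thresholds(contamination):
    """Return (mBAF cutoff, TSUM cutoff) for a given contamination
    fraction; the text only states criteria for 0 and for <= 5%."""
    if not 0.0 <= contamination <= 0.05:
        raise ValueError("criteria are only stated for contamination <= 5%")
    return (0.95 if contamination == 0.0 else 0.90), 0.80


def keep_heterozygous(afs, contamination=0.0):
    """Return the indices of SNPs kept after homozygous SNP removal.

    afs: allele frequencies ordered by chromosome and position.
    """
    mbaf_cut, tsum_cut = removal_thresholds(contamination)
    m = [abs(af - 0.5) + 0.5 for af in afs]
    survivors = [i for i, v in enumerate(m) if v < mbaf_cut]  # stage 1
    kept = []
    for k, i in enumerate(survivors):                         # stage 2
        left = abs(m[survivors[k - 1]] - m[i]) if k > 0 else 0.0
        right = (abs(m[survivors[k + 1]] - m[i])
                 if k < len(survivors) - 1 else 0.0)
        if left + right + m[i] - 0.5 < tsum_cut:
            kept.append(i)
    return kept


afs = [0.48, 0.99, 0.52, 0.93, 0.50]
clean = keep_heterozygous(afs)               # no contamination
light = keep_heterozygous(afs, 0.03)         # slight contamination
```

In this example the SNP with AF 0.93 is removed by the TSUM criterion in the uncontaminated case and by the lowered mBAF cutoff in the slightly contaminated case.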

in some embodiments, the Segment with mBAF deviation in step 5) is a Segment whose detected mBAF value deviates from 0.5.

In some embodiments, the merging of Segment fragments in step 5) is to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and to establish the CN and mBAF detection values of each Segment;

to obtain the CN and mBAF detection values, those skilled in the art may use a circular binary segmentation algorithm to merge the Backbone regions having similar CN and similar mBAF on each chromosome arm into Segments, and take the medians of the CN and mBAF of the Backbone regions as the CN and mBAF detection values of the Segment.

In some embodiments, the evaluating in step 6) comprises the steps of:

g) performing two-dimensional clustering on CN and mBAF of the segments to obtain a Cluster (Cluster) consisting of a plurality of segments with similar CN and mBAF, and selecting the Cluster with the most segments according to the sequence from 1 to 5 of the priority of the following table;

Priority | CN detection value | Purity       | Ploidy        | nB
1        | (0.00, 1.80]       | [0.10, 1.00] | 1             | 0, 1
3        | (1.80, 1.95)       | [0.10, 1.00] | 1, 2          | 0, 1, 2
2        | [1.95, 2.05]       | [0.10, 1.00] | 2             | 0, 2
4        | (2.05, 2.20)       | [0.10, 1.00] | 2, 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6
5        | [2.20, +∞)         | [0.10, 1.00] | 3, 4, 5, 6    | 0, 1, 2, 3, 4, 5, 6

h) Calculating the theoretical values of CN and mBAF of Segment in Cluster, wherein the theoretical values of CN and mBAF are calculated by the following formulas:

CN=Ploidy×Purity+2×(1-Purity)

wherein Ploidy, Purity and nB are, respectively, the copy number of the tumor cells, the tumor purity and the minor-allele copy number; the value ranges of Ploidy, Purity and nB are shown in the table above;

i) for each combination of Ploidy, Purity and nB, calculating the distance between the theoretical value and the detection value of mBAF; the Purity of the combination with the minimum distance is the tumor purity.
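Steps g) to i) amount to a small grid search, sketched below under stated assumptions. The CN formula and the value ranges are taken from the text and the table. The theoretical mBAF formula is not reproduced in this section, so the sketch assumes a common two-component mixture model in which the tumor contributes nB copies of the B allele and the normal cells contribute one. Scoring by the sum of the CN and mBAF differences, the purity step of 0.01 and all function names are likewise assumptions of this illustration.

```python
# Allowed tumour ploidies and nB values per priority class (from the table).
PLOIDY_BY_PRIORITY = {1: (1,), 2: (2,), 3: (1, 2),
                      4: (2, 3, 4, 5, 6), 5: (3, 4, 5, 6)}
NB_BY_PRIORITY = {1: (0, 1), 2: (0, 2), 3: (0, 1, 2),
                  4: tuple(range(7)), 5: tuple(range(7))}


def priority_of(cn):
    """Map a Segment CN detection value to its priority class."""
    if cn <= 1.80:
        return 1
    if cn < 1.95:
        return 3
    if cn <= 2.05:
        return 2
    if cn < 2.20:
        return 4
    return 5


def theoretical_cn(ploidy, purity):
    """CN = Ploidy x Purity + 2 x (1 - Purity), as given in the text."""
    return ploidy * purity + 2 * (1 - purity)


def theoretical_mbaf(ploidy, purity, nb):
    """ASSUMED mixture model; this formula is not stated in the text."""
    baf = (nb * purity + (1 - purity)) / theoretical_cn(ploidy, purity)
    return abs(baf - 0.5) + 0.5


def estimate_purity(cn_obs, mbaf_obs):
    """Return (purity, ploidy, nb) with the smallest distance to the data."""
    cls = priority_of(cn_obs)
    purities = [k / 100 for k in range(10, 101)]  # 0.10 .. 1.00
    best, best_dist = None, float("inf")
    for ploidy in PLOIDY_BY_PRIORITY[cls]:
        for nb in NB_BY_PRIORITY[cls]:
            if nb > ploidy:  # cannot have more B copies than total copies
                continue
            for purity in purities:
                # ASSUMPTION: CN is scored together with mBAF so that the
                # toy search is well determined.
                dist = (abs(theoretical_cn(ploidy, purity) - cn_obs)
                        + abs(theoretical_mbaf(ploidy, purity, nb) - mbaf_obs))
                if dist < best_dist:
                    best, best_dist = (purity, ploidy, nb), dist
    return best
```

Because mBAF is folded around 0.5, nB and Ploidy - nB give the same theoretical mBAF, so the returned nB is only determined up to that symmetry; the purity estimate is unaffected.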

It is understood that, once the tumor purity has been corrected in the bioinformatics analysis, the invention can proceed to the actual detection; that is, the method for detecting homologous recombination defects based on the NGS platform further comprises the following steps:

7) correcting the copy number of all Segments using the tumor purity assessed in step 6);

8) calculating scores according to the definitions of the three indices LOH, TAI and LST, the sum of the three being the HRD score;

9) visualization: plotting scatter diagrams of the copy number of the Backbone regions and of the AF of the SNPs.
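One plausible reading of steps 7) and 8) is sketched below. The correction simply solves the formula CN = Ploidy × Purity + 2 × (1 - Purity) for the tumor copy number; whether the invention corrects copy number in exactly this way is an assumption, and the function names are hypothetical. The counting of LOH, TAI and LST events is not shown.

```python
def correct_copy_number(cn_obs, purity):
    """Solve CN = Ploidy*Purity + 2*(1 - Purity) for the tumour copy number."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    return (cn_obs - 2 * (1 - purity)) / purity


def hrd_score(loh, tai, lst):
    """HRD score is the sum of the three index scores."""
    return loh + tai + lst
```

For example, an observed CN of 1.4 at a purity of 0.6 corresponds to a tumor copy number of about 1, a single-copy loss that looks much milder in the uncorrected data.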

It is understood in the art that, in practical applications, the method may be used either for diagnostic purposes, that is, risk assessment by evaluating HRD, or for non-diagnostic purposes, such as scientific research and analytical applications in non-clinical studies.

Specific examples are as follows.

Example 1 Panel design of the invention

The Panel design method of this example is as follows (exemplary, as shown in Fig. 1).

1) screening high-frequency SNP sites of the East Asian population that lie in non-repetitive regions of the genome, have normal GC content in the 60 bp regions upstream and downstream, and can be captured by a probe;

2) dividing each chromosome into non-overlapping regions of 50,000 bp, and selecting in each region the SNP site whose population frequency is closest to 50%;

3) extending each SNP site obtained in the previous step outward by 60 bp to form a Backbone region, and designing a corresponding probe for that Backbone region.

With this design, about 54,000 Backbone regions and 80,000 SNP sites are finally obtained.
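The selection in steps 2) and 3) can be sketched as follows. This is an illustration only: the input is assumed to be the SNP list that has already passed the filters of step 1), the 60 bp extension is assumed to apply on each side of the SNP, and the function name and output fields are hypothetical.

```python
def design_backbones(snps, bin_size=50_000, flank=60):
    """Pick one SNP per bin and extend it into a Backbone region.

    snps -- iterable of (chrom, pos, pop_freq) tuples that already passed
            the repeat, GC-content and capturability filters of step 1)
    """
    best = {}
    for chrom, pos, freq in snps:
        key = (chrom, pos // bin_size)  # non-overlapping 50,000 bp bins
        # Keep the SNP whose population frequency is closest to 50%.
        if key not in best or abs(freq - 0.5) < abs(best[key][2] - 0.5):
            best[key] = (chrom, pos, freq)
    return [
        {"chrom": c, "snp_pos": p,
         # ASSUMPTION: the 60 bp extension is applied on both sides.
         "start": max(p - flank, 0), "end": p + flank}
        for c, p, _ in sorted(best.values())
    ]
```

Choosing the SNP closest to 50% population frequency maximizes the chance that a given individual is heterozygous at that site, which is what makes the site informative for allele-frequency analysis.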

Example 2 establishment of the detection method (HRDkit) of the invention

Illustratively, as shown in FIG. 2, the data analysis of the present invention is divided into the following steps:

1) sequencing on an NGS platform with the probe library to obtain the raw off-machine data;

2) obtaining the CN of the Backbone regions from the off-machine data using copy number variation analysis software;

3) obtaining the allele frequency (AF) of the SNPs from the off-machine data using single nucleotide variation analysis software;

4) removing homozygous SNPs: calculating the mBAF (normalized B Allele Frequency) and TSUM (triple-SUM) of each SNP using the formulas below, and removing the SNP if mBAF ≥ 0.95 or TSUM ≥ 0.80, where i denotes the index of the SNP after filtering by mBAF and sorting by chromosome and position in ascending order.

When the sample is contaminated with exogenous DNA, mBAF is affected and homozygous SNPs are filtered incompletely: the mBAF of a homozygous SNP site in a clinical sample is 1, and if the exogenous DNA has mBAF 0.50 (heterozygous) at that site, the mBAF observed at the homozygous site decreases as the degree of contamination increases. If the sample is slightly contaminated (contamination proportion ≤ 5%), adjusting the mBAF threshold to 0.90 ensures that the homozygous SNPs are removed.

mBAF=|AF-0.5|+0.5

TSUM_i=|mBAF_{i-1}-mBAF_i|+|mBAF_{i+1}-mBAF_i|+mBAF_i-0.5

5) merging adjacent Backbone regions with similar CN and mBAF on each chromosome arm into a Segment using the Circular Binary Segmentation (CBS) algorithm, and taking the medians of the CN and mBAF of the Backbone regions as the CN and mBAF detection values of the Segment;

6) assessing tumor purity from deviating Segments: a Kernel Density Estimation (KDE) algorithm is used to judge whether the mBAF of each Segment deviates from 0.5, and the Segments whose mBAF deviates from 0.5 are selected for assessing tumor purity, specifically as follows:

(1) performing two-dimensional clustering on the CN and mBAF of the Segments to obtain Clusters, each consisting of several Segments with similar CN and mBAF, and selecting, in priority order from 1 to 5 according to the following table, the Cluster containing the most Segments;

(2) calculating theoretical values of CN and mBAF of Segment in Cluster, wherein the theoretical values of CN and mBAF are calculated by the following formulas:

CN=Ploidy×Purity+2×(1-Purity)

wherein Ploidy, Purity and nB are, respectively, the copy number of the tumor cells, the tumor purity and the minor-allele copy number; the value ranges of Ploidy, Purity and nB are shown in the following table;

(3) for each combination of Ploidy, Purity and nB, calculating the distance (absolute difference) between the theoretical value and the detection value of mBAF; the Purity of the combination with the minimum distance is the tumor purity;

Table: Value ranges of Ploidy, Purity and nB for different ranges of the Segment CN detection value

Priority | CN detection value | Purity       | Ploidy        | nB
1        | (0.00, 1.80]       | [0.10, 1.00] | 1             | 0, 1
3        | (1.80, 1.95)       | [0.10, 1.00] | 1, 2          | 0, 1, 2
2        | [1.95, 2.05]       | [0.10, 1.00] | 2             | 0, 2
4        | (2.05, 2.20)       | [0.10, 1.00] | 2, 3, 4, 5, 6 | 0, 1, 2, 3, 4, 5, 6
5        | [2.20, +∞)         | [0.10, 1.00] | 3, 4, 5, 6    | 0, 1, 2, 3, 4, 5, 6

7) correcting the copy number CN of all Segments using the assessed tumor purity;

8) calculating scores according to the definitions of the three indices LOH, TAI and LST, the sum of the three being the HRD score;

9) visualization: plotting scatter diagrams of the copy number of the Backbone regions and of the AF of the SNPs.

In establishing the above method, some of the parameters were optimized as follows (by way of example only):

a. mBAF and TSUM threshold establishment in step 4)

Since no copy number variation is present in negative samples, all SNP sites are either heterozygous (AF = 0.5, mBAF = 0.5) or homozygous (AF = 0 or 1, mBAF = 1). The thresholds were determined by calculating the mBAF and TSUM values of all SNP sites in 20 negative samples. As shown in Fig. 3, the mBAF values of the SNP sites are concentrated around 0.5 (heterozygous) and 1 (homozygous), so setting the mBAF threshold to 0.95 effectively removes homozygous SNP sites. As shown in Fig. 4, when a SNP site and its left and right neighbors are all heterozygous, the corresponding TSUM values are distributed around 0.5; when the SNP site is homozygous and its left and right neighbors are heterozygous, the corresponding TSUM values are distributed around 0.95 and 1.45. Setting the TSUM threshold to 0.80 therefore effectively removes homozygous SNP sites.

b. mBAF threshold establishment for lightly contaminated samples

However, when the sample is contaminated with exogenous DNA, mBAF is affected, resulting in incomplete filtering of homozygous SNPs (as shown in Table 1).

To verify the accuracy of the HRD scores of slightly contaminated samples, the invention used clinical samples to simulate samples with contamination proportions of 1%, 2%, 3%, 4% and 5%, analyzed them with the mBAF thresholds of 0.95 and 0.90 established herein, and compared the HRD scores and statuses obtained at the different thresholds.

The mBAF threshold for slightly contaminated samples was established as follows: the samples simulating contamination proportions of 1% to 5% were analyzed with mBAF thresholds of 0.95, 0.94, 0.93, 0.92, 0.91 and 0.90, the HRD scores and statuses at the different thresholds were compared, and the threshold at which the HRD score fluctuated least across contamination proportions while the HRD status remained unchanged was selected as the mBAF threshold for slightly contaminated samples. As shown in Fig. 5, for slightly contaminated samples, when the mBAF threshold was set to 0.90 the HRD score fluctuated least across contamination proportions and the HRD status was unchanged in every case.

Likewise, as shown in Figs. 6-9, when the contamination proportion is ≤ 5%, adjusting the mBAF threshold to 0.90 reduces the differences in HRD score between contamination proportions for the same sample, and the HRD status of every contaminated sample after adjustment is fully consistent with that of the uncontaminated sample. Handling contaminated samples in this way therefore preserves the accuracy of the HRD score and HRD status.
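The effect of contamination on a homozygous site can be made concrete with a short expected-value calculation. It assumes that a fraction c of the sequenced DNA is exogenous and that sample and contaminant contribute reads in proportion to their DNA fractions. The first case (heterozygous contaminant) is the one described in the text; the second case (contaminant homozygous for the other allele) is added here as an illustrative assumption.

```latex
% Homozygous site in the sample (AF = 1), contaminant heterozygous (AF = 0.5):
\mathrm{AF}_{\mathrm{obs}} = (1-c)\cdot 1 + c\cdot 0.5 = 1 - \tfrac{c}{2},
\qquad \mathrm{mBAF}_{\mathrm{obs}} = 1 - \tfrac{c}{2}
\quad\Rightarrow\quad c = 0.05:\ \mathrm{mBAF}_{\mathrm{obs}} = 0.975

% Contaminant homozygous for the other allele (AF = 0):
\mathrm{AF}_{\mathrm{obs}} = (1-c)\cdot 1 + c\cdot 0 = 1 - c,
\qquad \mathrm{mBAF}_{\mathrm{obs}} = 1 - c
\quad\Rightarrow\quad c = 0.05:\ \mathrm{mBAF}_{\mathrm{obs}} = 0.95
```

These are expected values only. Read-sampling noise at finite sequencing depth scatters the observed mBAF around them, so at 5% contamination some homozygous sites can fall below 0.95, while a threshold of 0.90 leaves a margin.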

Example 3 comparison of tumor purity evaluation methods (test of effectiveness of the present invention)

To verify the accuracy of the tumor purity assessment method, a cell line with 100% tumor purity was mixed with its matched sample in different proportions to dilute the tumor purity to 95%, 90%, 80%, 30% and 20%. The tumor purity was analyzed with the HRDkit of the invention and with the prior-art PureCN, and the agreement between the detected and expected tumor purity was compared. In addition, HRDkit and PureCN were each used to analyze the tumor purity of clinical samples that had undergone histopathological evaluation, and the agreement between the detected tumor purity and the histopathological result was compared.

The results of the analysis are shown in Figs. 10-13: for HRDkit, the correlation with the expected values was R² = 99.15% and the correlation with the histopathological evaluation was R² = 97.14%; for PureCN, the correlation with the expected values was R² = 38.68% and the correlation with the histopathological evaluation was R² = 43.02%.

In summary, the HRDkit assay results of the invention outperformed PureCN in both cell lines and clinical samples, with high correlation to expected values or results of histopathological evaluation.

Example 4 Performance validation of homologous recombination Defect detection (HRD score, minimum detection Limit, etc.)

1) HRD score threshold

To determine the HRD score threshold on the basis of the tumor purity assessment method of the invention, 196 clinical samples were used, comprising 77 BRCA-positive samples (carrying deleterious or suspected deleterious BRCA variants) and 119 BRCA-negative samples; their HRD scores are shown in Fig. 14. To ensure that 95% of the BRCA-positive samples are HRD-positive, the 5th percentile of the HRD scores of the BRCA-positive samples, 40, was used as the threshold. A sample is judged HRD-positive if its HRD score is ≥ 40 or if its BRCA1/2 genes carry a deleterious or suspected deleterious variant.
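The threshold rule and the status call can be sketched as follows. The percentile definition used by the inventors is not stated, so the nearest-rank definition below is an assumption, as are the function names and the "positive"/"negative" labels.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (ASSUMED; one of several common definitions).

    pct is an integer percentage, e.g. 5 for the 5th percentile.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hrd_status(score, brca_deleterious, threshold=40):
    """HRD-positive if the score reaches the threshold or BRCA1/2 carries a
    deleterious or suspected deleterious variant."""
    if score >= threshold or brca_deleterious:
        return "positive"
    return "negative"
```

Applied to the HRD scores of the BRCA-positive samples, `percentile_nearest_rank(scores, 5)` would give a candidate threshold below which about 5% of those samples fall.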

2) Detection limit of the homologous recombination defect detection of the invention

The lower the tumor purity of a sample, the closer the CN of its copy-number-variant Segments is to 2 and the closer their mBAF is to 0.5. When the tumor purity falls below a certain threshold, low-copy-number LOH and TAI Segments become indistinguishable from normal Segments, which lowers the HRD score of the sample. To determine the lower limit of detection (LOD) for tumor purity, a cell line with 100% tumor purity was mixed with its matched sample in different proportions to dilute the tumor purity to 95%, 90%, 80%, 30% and 20%, and the HRD scores at the different tumor purities were compared. As shown in Fig. 15, the HRD score at 20% tumor purity differed significantly from the HRD scores at the other tumor purities, so the LOD for tumor purity is 30%.
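The convergence toward CN = 2 and mBAF = 0.5 can be illustrated numerically for a single-copy loss (Ploidy = 1, nB = 0). The CN formula is the one given in Example 2; the BAF mixture formula and the function name are assumptions of this sketch, and the numbers it prints are model expectations, not measured data.

```python
def expected_signal(purity, ploidy=1, nb=0):
    """Expected (CN, mBAF) of a tumour Segment mixed with normal cells."""
    cn = ploidy * purity + 2 * (1 - purity)   # formula from the text
    baf = (nb * purity + (1 - purity)) / cn   # ASSUMED mixture model
    return cn, abs(baf - 0.5) + 0.5


if __name__ == "__main__":
    # Same purity levels as the dilution series described above.
    for purity in (1.00, 0.95, 0.90, 0.80, 0.30, 0.20):
        cn, mbaf = expected_signal(purity)
        print(f"purity={purity:.2f}  CN={cn:.2f}  mBAF={mbaf:.3f}")
```

Under this model the expected CN at 20% purity is about 1.8, close enough to 2 that such a Segment becomes hard to separate from a normal one.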

As with the detection of single nucleotide variants and small insertions/deletions on NGS platforms, the HRD score is also affected by sequencing depth. To determine the minimum sequencing depth, 196 clinical samples were down-sampled to 300x, 250x and 200x, and the HRD scores at the different sequencing depths were compared. As shown in Figs. 16-18, the correlation between the HRD scores before and after down-sampling decreases as the sequencing depth decreases. The minimum sequencing depth was determined by whether the HRD status of any sample changed: after down-sampling to 300x the HRD status of all samples was unchanged, after down-sampling to 250x the HRD status of 6 samples changed, and after down-sampling to 200x the HRD status of 7 samples changed. The minimum sequencing depth is therefore 300x.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
