Genome structure variation performance detection method based on reference set

Document No.: 1818154 Publication date: 2021-11-09

Abstract: This technology, a genome structural variation performance detection method based on a reference set, was created by 朱晓雷宇孟悦边奕心赵松丁云鸿李玉霞 on 2021-09-06. The invention relates to a method for detecting the performance of genome structural variation identification against a reference set. It aims to solve the problems that existing genome structural variation detection methods are not comprehensive enough and that a common method for checking variation identification results is lacking. The method proceeds as follows: step one, based on the user variation identification result set and the reference set, calculate the statistical results on quantity indices for insertion, deletion, duplication and inversion variations among the genome structural variations; step two, based on the user variation identification result set and the reference set, calculate the quantity indices of the breakpoint intervals in the translocation variation identification results. The invention is used in the field of genome structural variation performance detection.

1. A genome structural variation performance detection method based on a reference set, characterized in that the method comprises the following specific process:

step one, calculating, based on the user variation identification result set and the reference set, the statistical results on quantity indices for insertion, deletion, duplication and inversion variations among the genome structural variations;

the statistical results on quantity indices for insertion, deletion, duplication and inversion variations comprise:

the set of invalid variations, namely user insertion, deletion, duplication or inversion variations longer than 100 kb; the number of true positives and the number of false positives in the user's insertion, deletion, duplication or inversion identification results after the invalid variations are removed; the number of identified true positives and the number of unidentified false negatives among the insertion, deletion, duplication or inversion variations in the reference set; and the recall, precision and F₁ score;

the number of variations in the user's insertion, deletion, duplication or inversion variation intervals after the invalid variations longer than 100 kb are removed, the number of variations in those intervals without removing the invalid variations longer than 100 kb, and the number of variations in the insertion, deletion, duplication or inversion variation intervals in the reference set;

step two, calculating, based on the user variation identification result set and the reference set, the quantity indices of the breakpoint intervals in the translocation variation identification results;

the quantity indices of the breakpoint intervals in the translocation variation identification results comprise:

the number of true positives and the number of false positives among the breakpoint intervals in the user's translocation variation identification results; the number of identified true positives and the number of unidentified false negatives among the translocation breakpoint intervals in the reference set; and the precision, recall and F₁ score;

and the set of breakpoint intervals in the user's translocation variation identification results and the set of translocation breakpoint intervals in the reference set.

2. The genome structural variation performance detection method based on a reference set according to claim 1, characterized in that, in step one, the statistical results on quantity indices for insertion, deletion, duplication and inversion variations among the genome structural variations are calculated based on the user variation identification result set and the reference set; the specific process is as follows:

insertion, deletion, duplication and inversion variations are each represented as a five-tuple Region = (Chr, Start, End, Type, Size);

wherein Chr is the number of the chromosome on which the variation is located, Start and End are respectively the start and end positions of the variation on chromosome Chr, Type is the variation type, taking the value insertion, deletion, duplication or inversion, and Size is the variation size;

given the user's insertion, deletion, duplication or inversion variation identification result set S₁ and the reference set S₂;

calculating the set of invalid variations, namely user insertion, deletion, duplication or inversion variations longer than 100 kb;

removing the invalid user insertion, deletion, duplication or inversion variations longer than 100 kb;

calculating the number of true positives and the number of false positives in the user's insertion, deletion, duplication or inversion identification results after the invalid variations are removed; the number of identified true positives and the number of unidentified false negatives among the insertion, deletion, duplication or inversion variations in the reference set; and the recall, precision and F₁ score;

calculating the number of variations in the variation intervals of the user's insertion, deletion, duplication or inversion identification result set after the invalid variations longer than 100 kb are removed, the number of variations in those intervals without removing the invalid variations longer than 100 kb, and the number of variations in the insertion, deletion, duplication or inversion variation intervals in the reference set;

recording into a file: the set of invalid user insertion, deletion, duplication or inversion variations longer than 100 kb; the number of true positives and the number of false positives in the user's identification results after the invalid variations are removed; the number of identified true positives and the number of unidentified false negatives in the reference set; the recall, precision and F₁ score; the number of variations in the user's variation intervals with and without removal of the invalid variations longer than 100 kb; and the number of variations in the insertion, deletion, duplication or inversion variation intervals in the reference set.
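The filtering and counting steps of claim 2 can be illustrated with a minimal Python sketch. The names `Region`, `split_invalid` and `size_counts`, the type labels `"DEL"` and `"DUP"`, and the sample coordinates are hypothetical and are not taken from the patent; the 100 kb threshold and the 0 to 2 kb counting range are the ones stated in the claims.

```python
from collections import Counter, namedtuple

# Five-tuple from the claim: (Chr, Start, End, Type, Size).
Region = namedtuple("Region", ["chr", "start", "end", "type", "size"])

MAX_VALID_SIZE = 100_000  # 100 kb threshold from the claim, in bases
MAX_COUNT_SIZE = 2_000    # 2 kb upper bound for the per-size counts


def split_invalid(regions, max_size=MAX_VALID_SIZE):
    """Return (valid, invalid); a variation is invalid if its Size exceeds max_size."""
    valid = [r for r in regions if r.size <= max_size]
    invalid = [r for r in regions if r.size > max_size]
    return valid, invalid


def size_counts(regions, max_size=MAX_COUNT_SIZE):
    """Count variations per Size for 0 <= Size <= max_size, as sorted (Size, Num) pairs."""
    counts = Counter(r.size for r in regions if 0 <= r.size <= max_size)
    return sorted(counts.items())


# Hypothetical user call set: two short deletions and one over-long duplication.
user_calls = [
    Region("chr1", 1_000, 1_300, "DEL", 300),
    Region("chr1", 5_000, 5_300, "DEL", 300),
    Region("chr2", 10_000, 210_000, "DUP", 200_000),
]
valid, invalid = split_invalid(user_calls)
print(len(valid), len(invalid))  # 2 1
print(size_counts(valid))        # [(300, 2)]
```

Keeping the invalid set as a separate return value mirrors the claim, which reports the invalid variations themselves as well as the counts before and after filtering.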

3. The genome structural variation performance detection method based on a reference set according to claim 2, characterized in that the number of true positives and the number of false positives in the user's insertion, deletion, duplication or inversion identification results after the invalid variations are removed, the number of identified true positives and the number of unidentified false negatives among the insertion, deletion, duplication or inversion variations in the reference set, and the recall, precision and F₁ score are calculated; the specific process is as follows:

(1) traversing S₁ and removing the invalid variations longer than 100 kb to obtain the set S'₁ of variations whose lengths meet the requirement;

(2) counting, in each of the two sets S'₁ and S₂, the number of variations at each variation length Size from 0 kb to 2 kb inclusive, and storing each result as a pair (Size, Num);

(3) comparing each variation Region_i in S'₁ pairwise with each variation Region_j in S₂: extending both ends of Region_i and of Region_j by extSize bases, computing the overlap, and marking the extended Region_i and Region_j as true if they overlap and as false otherwise;

(4) counting the entries in S'₁ and in S₂ whose overlap mark is true, recorded respectively as TP_user, the number of true positives in the user's insertion, deletion, duplication or inversion identification results after the invalid variations are removed, and TP_benchmark, the number of identified true positives among the insertion, deletion, duplication or inversion variations in the reference set;

(5) counting the entries in S'₁ and in S₂ whose overlap mark is false, recorded respectively as FP, the number of false positives in the user's insertion, deletion, duplication or inversion identification results after the invalid variations are removed, and FN, the number of unidentified false negatives among the insertion, deletion, duplication or inversion variations in the reference set;

(6) calculating the precision Precision, the recall Recall and the F₁ score as follows:
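The formulas themselves did not survive in this text. Assuming the standard definitions of precision, recall and F₁ applied to the counts named in steps (4) and (5), they would plausibly read as below; this is a reconstruction under that assumption, not the patent's own typeset equation.

```latex
\mathrm{Precision} = \frac{TP\_user}{TP\_user + FP}, \qquad
\mathrm{Recall} = \frac{TP\_benchmark}{TP\_benchmark + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

Precision is computed over the user set (TP_user and FP both count entries of S'₁), while recall is computed over the reference set (TP_benchmark and FN both count entries of S₂).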

4. The genome structural variation performance detection method based on a reference set according to claim 3, characterized in that, in step (3), each variation Region_i in S'₁ is compared pairwise with each variation Region_j in S₂: both ends of Region_i and of Region_j are extended by 100 bp, the overlap is computed, and the extended Region_i and Region_j are marked as true if they overlap and as false otherwise; the specific process is as follows:

taking each variation Region_i = (Chr, Start_i, End_i, Type_i, Size_i) in S'₁ and traversing the reference set S₂: for any variation Region_j = (Chr, Start_j, End_j, Type_j, Size_j) in the reference set, if Region_i and Region_j come from the same chromosome Chr, both ends of Region_i and of Region_j are extended by extSize bases to obtain the new extended intervals Region'_i = (Chr, Start'_i, End'_i, Type_i, Size_i) and Region'_j = (Chr, Start'_j, End'_j, Type_j, Size_j); if Region_i and Region_j come from different chromosomes, no comparison is performed;

if Region'_i and Region'_j overlap by at least 1 base, the variation Region_i is said to be identified and is marked true; otherwise Region_i is said to be not identified and is marked false.
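The extension-and-overlap test of claim 4, together with the counting of steps (4) and (5), can be sketched as follows. This is an illustrative Python sketch under stated assumptions: intervals are treated as closed, both sets are assumed to hold variations of the same type (the sketch does not check Type), and the names `extended_overlap` and `mark_and_count` and the sample coordinates are hypothetical.

```python
from collections import namedtuple

Region = namedtuple("Region", ["chr", "start", "end", "type", "size"])


def extended_overlap(a, b, ext_size=100):
    """True if a and b lie on the same chromosome and share at least one base
    after both ends of each interval are extended by ext_size bases."""
    if a.chr != b.chr:
        return False  # variations on different chromosomes are never compared
    a_start, a_end = a.start - ext_size, a.end + ext_size
    b_start, b_end = b.start - ext_size, b.end + ext_size
    return max(a_start, b_start) <= min(a_end, b_end)  # closed intervals


def mark_and_count(user, benchmark, ext_size=100):
    """Pairwise comparison; returns (TP_user, FP, TP_benchmark, FN)."""
    user_hit = [False] * len(user)
    bench_hit = [False] * len(benchmark)
    for i, u in enumerate(user):
        for j, b in enumerate(benchmark):
            if extended_overlap(u, b, ext_size):
                user_hit[i] = True
                bench_hit[j] = True
    tp_user = sum(user_hit)
    tp_bench = sum(bench_hit)
    return tp_user, len(user) - tp_user, tp_bench, len(benchmark) - tp_bench


user = [Region("chr1", 1000, 1300, "DEL", 300),
        Region("chr1", 9000, 9200, "DEL", 200)]
bench = [Region("chr1", 1450, 1750, "DEL", 300),
         Region("chr2", 9000, 9200, "DEL", 200)]

print(mark_and_count(user, bench))              # (1, 1, 1, 1)
print(mark_and_count(user, bench, ext_size=0))  # (0, 2, 0, 2)
```

In the example, the first user call ends 150 bases before the reference variation starts, so it only counts as identified once both intervals are extended by 100 bases; this is the position-shift tolerance that the extension is meant to provide. The nested loop is quadratic in the set sizes; a sorted or indexed search would be a natural refinement for large call sets.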

5. The genome structural variation performance detection method based on a reference set according to claim 4, characterized in that, if Region'_i and Region'_j overlap by at least 1 base, the two variations overlap, and the centroid distance deviation and the length ratio of their variation intervals are calculated;

if Region'_i and Region'_j do not overlap, no calculation is needed;

the specific process is as follows:

the centroid (center of gravity) of the variation Region = (Chr, Start, End, Type, Size) is defined as follows:

if variation s in Region'_i and variation t in Region'_j overlap, the centroid distance deviation between the variation intervals of s and t is:

d = m_s - m_t

where m_s is the centroid of the variation interval of s and m_t is the centroid of the variation interval of t;

furthermore, if variation 1 in Region'_i and variation 2 in Region'_j overlap, the variation interval length ratio between the variation interval Region₇ = (Chr₇, Start₇, End₇, Type, Size₇) of variation 1 and the variation interval Region₈ = (Chr₈, Start₈, End₈, Type, Size₈) of variation 2 is defined as follows:

wherein Chr₇ is the chromosome number of variation 1, Chr₈ is the chromosome number of variation 2, Start₇ is the start position of variation 1 on chromosome Chr₇, Start₈ is the start position of variation 2 on chromosome Chr₈, End₇ is the end position of variation 1 on chromosome Chr₇, End₈ is the end position of variation 2 on chromosome Chr₈, Type is the variation type, and Size₇ and Size₈ are the variation sizes;

counting the distribution of the centroid distance deviations of the variation intervals that are less than or equal to 1 kb; and counting the distribution of the variation interval length ratios.
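The two defining formulas of claim 5 (centroid and length ratio) are missing from this text. The Python sketch below therefore rests on two loudly stated assumptions: the centroid is taken as the midpoint (Start + End) / 2, and the length ratio is taken as the closed-interval length of variation 1 divided by that of variation 2. Only d = m_s - m_t comes from the claim itself; the function names and sample coordinates are hypothetical.

```python
def centroid(start, end):
    """Assumed definition: midpoint of the variation interval."""
    return (start + end) / 2


def centroid_deviation(s, t):
    """d = m_s - m_t for two overlapping intervals given as (start, end)."""
    return centroid(*s) - centroid(*t)


def length_ratio(s, t):
    """Assumed definition: closed-interval length of s divided by that of t."""
    return (s[1] - s[0] + 1) / (t[1] - t[0] + 1)


# Hypothetical overlapping pairs (user interval, reference interval).
pairs = [((1000, 1300), (1100, 1300)),
         ((5000, 9000), (8000, 8400))]

deviations = [centroid_deviation(s, t) for s, t in pairs]
ratios = [length_ratio(s, t) for s, t in pairs]

# Keep only the deviations within 1 kb, as in the final step of the claim.
within_1kb = [d for d in deviations if abs(d) <= 1000]

print(deviations)  # [-50.0, -1200.0]
print(within_1kb)  # [-50.0]
```

A signed deviation keeps the direction of the shift (negative when the user interval lies to the left of the reference interval), which is why the 1 kb filter uses the absolute value.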

6. The genome structural variation performance detection method based on a reference set according to claim 4 or 5, characterized in that, in step two, the quantity indices of the breakpoint intervals in the translocation variation identification results are calculated based on the user variation identification result set and the reference set; the specific process is as follows:

a translocation variation is represented as an octuple Tra = (Chr'₁, Start'₁, End'₁, Chr'₂, Start'₂, End'₂, Type, Size);

wherein Chr'₁ is the number of the chromosome on which the variation lies before translocation, Chr'₂ is the number of the chromosome on which it lies after translocation, Start'₁ and End'₁ are the start and end positions on chromosome Chr'₁ before translocation, Start'₂ and End'₂ are the start and end positions on chromosome Chr'₂ after translocation, Type is the variation type and takes the value TRA (translocation), and Size is the variation size and takes the value 0;

calculating the number of true positives and the number of false positives among the breakpoint intervals in the user's translocation variation identification results; the number of identified true positives and the number of unidentified false negatives among the translocation breakpoint intervals in the reference set; and the precision, recall and F₁ score;

calculating the set of breakpoint intervals in the user's translocation variation identification results and the set of translocation breakpoint intervals in the reference set;

recording into a file: the set of breakpoint intervals in the user's translocation variation identification results; the set of translocation breakpoint intervals in the reference set; the number of true positives and the number of false positives among the breakpoint intervals in the user's translocation variation identification results; the number of identified true positives and the number of unidentified false negatives among the translocation breakpoint intervals in the reference set; and the precision, recall and F₁ score.

7. The genome structural variation performance detection method based on a reference set according to claim 6, characterized in that the number of true positives and the number of false positives among the breakpoint intervals in the user's translocation variation identification results, the number of identified true positives and the number of unidentified false negatives among the translocation breakpoint intervals in the reference set, and the precision, recall and F₁ score are calculated; the specific process is as follows:

(1) extracting the translocation variations from the user variation identification result set to form the translocation variation set T₁ of the identification results;

(2) for each translocation variation Tra = (Chr₁, Start₁, End₁, Chr₂, Start₂, End₂, TRA, 0) in T₁, constructing the breakpoint intervals Region₃, Region₄, Region₅ and Region₆;

the breakpoint intervals Region₃, Region₄, Region₅ and Region₆ are obtained as follows:

extending extSize bases on each side of position Start₁ on Chr₁ to obtain the breakpoint interval Region₃ = (Chr₁, Start₁ - extSize, Start₁ + extSize, TRA, 0);

extending extSize bases on each side of position End₁ on Chr₁ to obtain the breakpoint interval Region₄ = (Chr₁, End₁ - extSize, End₁ + extSize, TRA, 0);

extending extSize bases on each side of position Start₂ on Chr₂ to obtain the breakpoint interval Region₅ = (Chr₂, Start₂ - extSize, Start₂ + extSize, TRA, 0);

extending extSize bases on each side of position End₂ on Chr₂ to obtain the breakpoint interval Region₆ = (Chr₂, End₂ - extSize, End₂ + extSize, TRA, 0);

wherein, for a translocation variation in the user variation identification result set, Chr₁ is the number of the chromosome on which the variation lies before translocation, Chr₂ is the number of the chromosome on which it lies after translocation, Start₁ and End₁ are the start and end positions on chromosome Chr₁ before translocation, Start₂ and End₂ are the start and end positions on chromosome Chr₂ after translocation, Type is the variation type and takes the value TRA (translocation), and Size is the variation size and takes the value 0;

(3) merging the breakpoint intervals Region₃, Region₄, Region₅ and Region₆ to generate the breakpoint interval set T'₁ of the translocation variations in the user variation identification result set;

(4) extracting the translocation variations from the reference set to form the translocation variation set T₂ of the reference set, and generating the breakpoint interval set T'₂ of the translocation variations in the reference set by the method of steps (2) to (3);

(5) traversing each breakpoint interval record Region_x in T'₁ in turn and searching T'₂ for the breakpoint intervals that overlap Region_x by at least 1 base, recorded as the set {Region_y1, Region_y2, …, Region_ym}; if the set is not empty, the breakpoint interval Region_x is said to be identified, the overlap mark of Region_x is set to true, and the overlap marks of Region_y1, Region_y2, …, Region_ym are set to true; if the set is empty, the breakpoint interval Region_x is said to be not identified and is marked false; each breakpoint interval in T'₂ that is never marked true is marked false;

(6) counting the entries in T'₁ and in T'₂ whose overlap mark is true, recorded respectively as TP_BP_user, the number of true positives among the breakpoint intervals in the user's translocation variation identification results, and TP_BP_benchmark, the number of identified true positives among the translocation breakpoint intervals in the reference set;

(7) counting the entries in T'₁ and in T'₂ whose overlap mark is false, recorded respectively as FP_BP, the number of false positives among the breakpoint intervals in the user's translocation variation identification results, and FN_BP, the number of unidentified false negatives among the translocation breakpoint intervals in the reference set;

(8) calculating the precision Precision_BP, the recall Recall_BP and the F₁ score of the breakpoint intervals as follows:
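Steps (1) to (8) can be sketched in Python as below. This is a minimal sketch, not the SV_STAT implementation: the constant Type and Size fields of the octuple are omitted from the hypothetical `Tra` tuple, intervals are treated as closed, and the formulas in step (8), which are missing from this text, are assumed to be the standard precision, recall and F₁ definitions applied to TP_BP_user, FP_BP, TP_BP_benchmark and FN_BP. All function names and sample coordinates are hypothetical.

```python
from collections import namedtuple

# Translocation without the constant Type (TRA) and Size (0) fields.
Tra = namedtuple("Tra", ["chr1", "start1", "end1", "chr2", "start2", "end2"])
Region = namedtuple("Region", ["chr", "start", "end", "type", "size"])


def breakpoint_regions(tra, ext_size=100):
    """Four breakpoint intervals, each position extended by ext_size on both sides."""
    points = [(tra.chr1, tra.start1), (tra.chr1, tra.end1),
              (tra.chr2, tra.start2), (tra.chr2, tra.end2)]
    return [Region(c, p - ext_size, p + ext_size, "TRA", 0) for c, p in points]


def overlaps(a, b):
    """True if a and b share at least one base on the same chromosome."""
    return a.chr == b.chr and max(a.start, b.start) <= min(a.end, b.end)


def breakpoint_metrics(user_tras, bench_tras, ext_size=100):
    """Return (TP_BP_user, FP_BP, TP_BP_benchmark, FN_BP, precision, recall, F1)."""
    t1 = [r for t in user_tras for r in breakpoint_regions(t, ext_size)]
    t2 = [r for t in bench_tras for r in breakpoint_regions(t, ext_size)]
    hit1 = [any(overlaps(x, y) for y in t2) for x in t1]
    hit2 = [any(overlaps(x, y) for x in t1) for y in t2]
    tp_user, tp_bench = sum(hit1), sum(hit2)
    fp, fn = len(t1) - tp_user, len(t2) - tp_bench
    # Assumed standard definitions; empty sets yield 0.0 to avoid division by zero.
    precision = tp_user / len(t1) if t1 else 0.0
    recall = tp_bench / len(t2) if t2 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp_user, fp, tp_bench, fn, precision, recall, f1


user_tras = [Tra("chr1", 1000, 2000, "chr5", 50000, 51000)]
bench_tras = [Tra("chr1", 1050, 2000, "chr5", 50000, 60000)]
print(breakpoint_metrics(user_tras, bench_tras))
# (3, 1, 3, 1, 0.75, 0.75, 0.75)
```

In the example, three of the four breakpoints agree within the 100-base extension and the fourth (the end position on chr5) does not, so one user breakpoint interval is a false positive and one reference breakpoint interval is a false negative. Evaluating breakpoints rather than whole translocations gives partial credit to calls that place some, but not all, of the four breakpoints correctly.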

Technical Field

The invention relates to a genome structure variation performance detection method based on a reference set.

Background

Research on human genome structural variation is of great significance for genome evolution, population polymorphism analysis, pathogenic variation, human health and related fields. Variations in the human genome fall into three major categories: (1) single nucleotide variations (SNVs), that is, differences in a single DNA base; (2) small indels (a collective term for insertions and deletions), that is, insertions or deletions of short sequence fragments at a position in the genome, usually shorter than 50 bp; (3) large structural variations of various types, including insertions, deletions, chromosomal inversions, and intrachromosomal or interchromosomal translocations of sequences longer than 50 bp, as well as some more complex forms of variation.

To distinguish them from SNVs, variations of class 2 (small indels) and class 3 (large structural variations) are commonly referred to together as genomic structural variations. Structural variations confer greater diversity on the human genome than other forms of genetic variation. Although structural variation in the human genome is less prevalent than SNV, it has a greater potential to cause genetic disease, because its greater length alters more bases and may even alter gene structure.

The major types of genomic structural variation^[1] are: insertions, deletions, duplications, inversions and translocations. These five types of variation are shown in FIG. 2.

Genome structure variation identification method

Currently, the identification of structural variations is mainly based on high throughput sequencing data and single molecule sequencing data.

Structural variation identification from high-throughput sequencing data mainly follows four types of identification strategy^[2]: (1) read-pair-based strategies, the most common type, on which many identification methods are based, e.g. BreakDancer^[3] and PEMer^[4]; (2) read-depth-based strategies, which can effectively detect large duplication and deletion variations but cannot effectively identify shorter variations and cover fewer variation types; a typical method of this type is CNVnator^[5]; (3) split-read-based strategies, which can detect shorter deletion and insertion variations, e.g. Pindel^[6]; (4) sequence-assembly-based strategies, which can identify many types of variation and are particularly suited to long inserted sequences and complex structural variations; a typical method is CREST^[7].

Methods that identify structural variation from single-molecule sequencing data fall into two main groups: alignment-based methods, and methods that combine local assembly with alignment. The alignment-based methods mainly include PBHoney^[8], Sniffles^[9], NextSV^[10], NanoSV^[11] and Picky^[12]; the methods combining assembly and alignment mainly include SMRT-SV^[13] and SDA^[14]. Both groups rely on aligning the reads of the single-molecule sequencing data to a reference genome.

Method for detecting structure variation recognition result performance

The identification results of different variation identification methods usually differ considerably. Although some performance detection methods for structural variation exist, they are essentially private methods of individual research teams, and a common variation performance detection method is lacking, which hinders research on structural variation identification methods to some extent.

In addition, because of differences in the sequence alignment algorithms and variation identification methods used, a variation in some complex regions of the genome may be reported at shifted positions in different identification results (as shown in FIG. 3a and FIG. 3b). Existing detection methods usually compute the overlap between the variations in the user identification result set and those in the reference set directly, without considering such position shifts, so their detection results may be slightly biased.

Existing genome structural variation performance detection methods generally report only count-based statistics and do not analyze the variation intervals in detail; the analysis is therefore neither comprehensive nor detailed and cannot effectively reflect the interval deviation between variations.

Disclosure of Invention

The invention aims to provide a genome structural variation performance detection method based on a reference set, in order to solve the problems that existing genome structural variation detection methods are not comprehensive enough and that a common method for checking variation identification results is lacking.

A genome structural variation performance detection method based on a reference set comprises the following specific process:

the statistical results on quantity indices for insertion, deletion, duplication and inversion variations among the genome structural variations comprise:

the quantity indices of the breakpoint intervals in the translocation variation identification results comprise:

the number of true positives and the number of false positives among the breakpoint intervals in the user's translocation variation identification results; the number of identified true positives and the number of unidentified false negatives among the translocation breakpoint intervals in the reference set; and the precision, recall and F₁ score;

and the set of breakpoint intervals in the user's translocation variation identification results and the set of translocation breakpoint intervals in the reference set.

The invention has the beneficial effects that:

The invention provides a new variation performance detection method that makes it convenient for researchers to check and analyze different identification results from different aspects, and that facilitates data processing and analysis in research on genome structural variation.

Using the reference set, SV_STAT can effectively detect genome structural variation performance, including the number of identified variations, the number of true positives (TP), the number of false positives (FP), the number of false negatives (FN), the recall (Recall), the precision (Precision) and the F₁ score, and can further analyze in detail the interval deviation of the variation identification results. In addition, SV_STAT calculates in detail the quantity indices of the breakpoint intervals of translocation variations. The SV_STAT detection method provides a more effective and faster way to analyze genome structural variation and speeds up genome analysis.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of the main types of structural variation (drawn from the study content and the relevant literature; solid lines represent single-molecule sequencing data);

FIG. 3a is a schematic diagram of a possible position shift 1 of the same mutation in different recognition results;

FIG. 3b is a schematic diagram of a possible position shift 2 of the same mutation in different recognition results;

FIG. 4 is a graph comparing the recognition performance of different methods for insertions and deletions in the human genome chr1, with percentage on the ordinate;

FIG. 5 is a statistical chart of the results of the insertion and deletion of chr1 in human genome, with the abscissa being the size of the SVs region and the ordinate being the number;

FIG. 6a is a comparison graph of the sizes of the intervals for identifying the simulated insertions and deletions of Saccharomyces cerevisiae by the ASVCLR method;

FIG. 6b is a comparison graph of the sizes of intervals for identifying simulated insertions and deletions of Saccharomyces cerevisiae by the Sniffles method;

FIG. 6c is a comparison graph of the sizes of the intervals for identifying the simulated insertion and deletion of Saccharomyces cerevisiae by the Nextsv (positive) method;

FIG. 6d is a comparison graph of the sizes of the intervals for identifying simulated insertions and deletions of Saccharomyces cerevisiae by the Nextsv (strand) method;

FIG. 6e is a comparison graph of interval sizes of simulated insertions and deletions of Saccharomyces cerevisiae identified by the PBHoney-spots method;

FIG. 6f is a comparison graph of interval sizes of simulated insertions and deletions of Saccharomyces cerevisiae identified by the PBHoney-tails method;

FIG. 7a is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the ASVCLR method;

FIG. 7b is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the Nextsv (positive) method;

FIG. 7c is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the Nextsv (strand) method;

FIG. 7d is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the Sniffles method;

FIG. 7e is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the PBHoney-spots method;

FIG. 7f is a statistical comparison, by interval length, of the Saccharomyces cerevisiae duplication-type variations (DUP) identified by the PBHoney-tails method.

Detailed Description

Embodiment 1: the genome structural variation performance detection method based on a reference set of this embodiment comprises the following specific process:

the invention provides a common method for detecting genome structural variation performance, which can analyze the common insertion, deletion, duplication, inversion and translocation types of structural variation in a genome more systematically and in more detail.

Depending on whether the data used are simulated or real, the genome structural variation performance detection method can be applied in two settings: simulated data and real data.

On simulated data, a reference set of structural variations is available, so the method is suited to objectively analyzing and comparing the results of different structural variation identification methods.

On real sequencing data, a reference set of structural variations is usually lacking, so structural variation performance cannot be analyzed accurately. The method provided by the invention is therefore mainly used for objective analysis of structural variation identification results on simulated data, and mainly covers two aspects.

The method is applicable to both simulated data and real data as long as a reference set exists.

step one, calculating, based on the user variation identification result set and the reference set, the statistical results on quantity indices for insertion, deletion, duplication and inversion variations among the genome structural variations, and outputting them to the terminal screen;

the statistical results on quantity indices for insertion, deletion, duplication and inversion variations comprise:

the number of variations in the variation intervals of the user's insertion, deletion, duplication or inversion identification result set after the invalid variations longer than 100 kb are removed (ref_reg_size_user_long_filtered), the number of variations in those intervals without removing the invalid variations longer than 100 kb (ref_reg_size_user), and the number of variations in the insertion, deletion, duplication or inversion variation intervals in the reference set (ref_reg_size_benchmark); the counts are mainly taken for each variation length between 0 and 2 kb.

step two, calculating, based on the user variation identification result set and the reference set, the quantity indices of the breakpoint intervals in the translocation variation identification results, and outputting them to the terminal screen;

the quantity indices of the breakpoint intervals in the translocation variation identification results comprise:

and the set of breakpoint intervals in the user's translocation variation identification results and the set of translocation breakpoint intervals in the reference set.

The second embodiment: this embodiment differs from the first embodiment in that, in Step 1, the quantity statistics for insertion, deletion, duplication and inversion variations are calculated from the user's variation identification result set and the reference set and output to the terminal screen by the following specific process:

The results of existing identification methods (such as Sniffles, NextSV, PBHoney and SMRT-SV) generally contain some variations whose interval length is excessively large. Because of their length these variations carry no real meaning; SV_STAT treats them as invalid variations and removes them before the performance of genome structural variation identification is assessed, so that a more objective result is obtained. The more invalid variations a result contains, the lower its quality.

Insertion, deletion, duplication and inversion variations are each represented as a five-tuple Region = (Chr, Start, End, Type, Size);

given the user's identification result set S₁ for insertion, deletion, duplication or inversion variations and the reference set S₂:

calculate the set of invalid variations, i.e. the user's insertion, deletion, duplication or inversion variations longer than 100 kb;

remove these invalid variations from the user's result set;

calculate the number of variation intervals in the user's identification result set after the invalid variations longer than 100 kb have been removed (ref_reg_size_user_long_filtered), the number of variation intervals before they are removed (ref_reg_size_user), and the number of insertion, deletion, duplication or inversion variation intervals in the reference set (ref_reg_size_benchmark), mainly tallying the number of variations at each length between 0 and 2 kb;

record the following in a file and output it to the terminal screen: the set of invalid user variations longer than 100 kb; the true positive and false positive numbers of the user's identification result after the invalid variations are removed; the number of reference-set variations that were identified and the number that were not identified (false negatives); the recall, precision and F₁ score; the numbers of user variation intervals after and before the invalid variations are removed; and the number of insertion, deletion, duplication or inversion variation intervals in the reference set.

The three variation-interval counts for the user's insertion, deletion, duplication or inversion identification result set, obtained by steps (1) to (2) of the third embodiment, are reported in the following format:

the variation identification result set S₁ is the output of a variation identification tool (such as Sniffles, NextSV, PBHoney or SMRT-SV);

the reference set S₂ is known in advance and is obtained by other means (for example, from the published research results of others);

third-generation sequencing data are typically shorter than 100 kb in length, and second-generation data are shorter still, typically under 1 kb, so 100 kb can be taken as the filtering threshold: variations longer than this are regarded as oversized, invalid variations.
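The filtering and counting described above can be sketched as follows. This is only a minimal illustration, not the SV_STAT implementation: the `Region` class, the helper names `split_by_length` and `size_histogram`, the non-negative `size` field and the sample calls are all assumptions made for the example.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """Five-tuple (Chr, Start, End, Type, Size) from the text; field names are illustrative."""
    chrom: str
    start: int
    end: int
    sv_type: str
    size: int  # assumed to be a non-negative length in bases


MAX_VALID_LEN = 100_000  # the 100 kb filtering threshold given in the text


def split_by_length(regions: List[Region],
                    max_len: int = MAX_VALID_LEN) -> Tuple[List[Region], List[Region]]:
    """Return (valid, invalid); a variation is invalid when it is longer than max_len."""
    valid = [r for r in regions if r.size <= max_len]
    invalid = [r for r in regions if r.size > max_len]
    return valid, invalid


def size_histogram(regions: List[Region], upper: int = 2_000) -> Counter:
    """Count variations per length, for lengths between 0 and `upper` (2 kb)."""
    return Counter(r.size for r in regions if 0 <= r.size <= upper)


# Hypothetical user calls: two short deletions and one oversized call.
user_calls = [
    Region("chr1", 1_000, 1_050, "DEL", 50),
    Region("chr1", 5_000, 5_050, "DEL", 50),
    Region("chr1", 10_000, 260_000, "DEL", 250_000),
]
valid, invalid = split_by_length(user_calls)
hist = size_histogram(valid)
print(len(user_calls), len(valid), len(invalid), dict(hist))
```

Keeping the invalid calls in a separate list, rather than discarding them, makes it possible to report their number as a quality indicator, as the text suggests.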

For simulation data, indexes such as the number of identified variations, the number of true positives (TP), the number of false positives (FP), the number of false negatives (FN), recall, precision and F₁ score are generally used to assess the performance of genome structural variation identification.

Other steps and parameters are the same as those in the first embodiment.

The third embodiment: this embodiment differs from the first or second embodiment in that the following are calculated: the true positive and false positive numbers of the user's insertion, deletion, duplication or inversion identification result after the invalid variations are removed; the number of insertion, deletion, duplication or inversion variations in the reference set that were identified and the number that were not identified (false negatives); and the recall, precision and F₁ score. The specific process is as follows:

(1) traverse S₁ and remove the invalid variations longer than 100 kb to obtain the set S'₁ of variations whose length meets the requirement;

(3) compare each variation Region_i in S'₁ pairwise with each variation Region_j in S₂: extend both ends of Region_i and Region_j by 100 bp and compute their overlap; if the extended Region_i and Region_j overlap, mark the pair as true; otherwise, mark it as false;

(6) calculate the precision, recall and F₁ score as follows:

in some cases, precision or recall has to be maximized at the expense of the other; the harmonic mean of precision and recall, the F₁ score, is therefore typically used to assess the overall performance on the quantity indexes of the variations;
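As a minimal sketch, the three indexes can be computed from the counts using their standard definitions. The function name `precision_recall_f1`, the use of a single TP count for both precision and recall, and the sample counts are assumptions made for illustration; SV_STAT may count true positives separately on the user side and on the reference side.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F₁ is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over a simple average as a summary index.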

the detailed method for calculating the recall, precision, F₁ score and other quantity indexes of the variation identification result is shown in Algorithm 3:

other steps and parameters are the same as those in the first or second embodiment.

The fourth embodiment: this embodiment differs from one of the first to third embodiments in the specific process of step (3), in which each variation Region_i in S'₁ is compared pairwise with each variation Region_j in S₂, both ends of Region_i and Region_j are extended by 100 bp, the overlap is computed, and the pair is marked true if the extended Region_i and Region_j overlap and false otherwise. The specific process is as follows:

take each variation Region_i = (Chr, Start_i, End_i, Type_i, Size_i) in S'₁ and traverse the reference set S₂. For any variation Region_j = (Chr, Start_j, End_j, Type_j, Size_j) in the reference set, if Region_i and Region_j are on the same chromosome Chr, extend both ends of Region_i and Region_j by extSize bases (SV_STAT default: 100 bp) to obtain the extended regions Region'_i = (Chr, Start'_i, End'_i, Type_i, Size_i) and Region'_j = (Chr, Start'_j, End'_j, Type_j, Size_j); if Region_i and Region_j are on different chromosomes, no comparison is made (for insertion, deletion, duplication and inversion variations);

Extending both ends compensates for the positional shift that the same variation may show across identification results in certain complex regions of the genome, caused by differing alignment algorithms and variation identification methods; with a 100 bp extension, the overlap between variations in the user's result set and variations in the reference set can be computed more reliably.
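The extension and overlap test described above can be sketched as follows. This is a hedged illustration only: the `Region` class, the helper names `extend` and `overlaps`, and the treatment of coordinates as closed intervals are assumptions, not details confirmed by the text.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Five-tuple (Chr, Start, End, Type, Size); field names are illustrative."""
    chrom: str
    start: int
    end: int
    sv_type: str
    size: int


def extend(region: Region, ext_size: int = 100) -> Region:
    """Extend both ends of a region by ext_size bases (100 bp is the stated default)."""
    return region._replace(start=region.start - ext_size, end=region.end + ext_size)


def overlaps(a: Region, b: Region, ext_size: int = 100) -> bool:
    """True if the extended regions lie on the same chromosome and share at least 1 base.

    Coordinates are assumed to be closed intervals [start, end].
    """
    if a.chrom != b.chrom:
        return False  # regions on different chromosomes are not compared
    ea, eb = extend(a, ext_size), extend(b, ext_size)
    return ea.start <= eb.end and eb.start <= ea.end


user = Region("chr1", 1_000, 1_100, "DEL", 100)
bench_near = Region("chr1", 1_250, 1_300, "DEL", 50)   # 150 bp away: overlaps only after extension
bench_far = Region("chr1", 1_400, 1_500, "DEL", 100)   # 300 bp away: no overlap
print(overlaps(user, bench_near), overlaps(user, bench_far))
```

With `ext_size=0` the first pair would not overlap, which shows how the extension tolerates small shifts in reported coordinates.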

The method for determining whether two variations overlap is shown in Algorithm 1:

other steps and parameters are the same as those in one of the first to third embodiments.

The fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that, if Region'_i and Region'_j share at least 1 base, the two variations are considered to overlap, and the center-of-gravity distance deviation and the interval length ratio between the two (the user variation and the reference-set variation) are calculated;

if Region'_i and Region'_j do not overlap, no calculation is needed;

the specific process is as follows:

In the results of existing identification methods (such as Sniffles, NextSV, PBHoney and SMRT-SV), some identified variation intervals usually deviate considerably from the corresponding intervals in the reference set; invalid results of very large length may occur, and a single variation may be split into several adjacent variations. These effects distort the identification results and reduce identification accuracy; however, no detailed analysis of such deviations in variation identification results has been reported.

To measure the deviation of identification results more comprehensively and finely, the invention uses the following two indexes to quantify the deviation between an identified variation and its target variation in the reference set:

1) the center-of-gravity distance between the two;

2) the interval length ratio between the two;

these deviation measures apply only to insertions, deletions, duplications and inversions; translocation variations usually involve different chromosomes, so the measures do not apply to them.

The center of gravity of a variation Region = (Chr, Start, End, Type, Size) is defined as follows:

if the extended region Region'_i of variation s and the extended region Region'_j of variation t overlap, the center-of-gravity distance deviation between the variation intervals of s and t is:

d = m_s - m_t

where m_s is the center of gravity of variation interval s, and m_t is the center of gravity of variation interval t;

The closer the interval length ratio is to 1, the smaller the deviation and the more accurate the identification result; the further the ratio is from 1, the lower the accuracy of the identification result.
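The two deviation indexes can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: the center of gravity is taken to be the midpoint (Start + End) / 2 of the interval, and the length ratio is taken to be the user interval length divided by the reference interval length, with intervals treated as closed. The function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), assumed closed


def center_of_gravity(interval: Interval) -> float:
    """Assumed definition: midpoint of the interval."""
    start, end = interval
    return (start + end) / 2.0


def center_distance(s: Interval, t: Interval) -> float:
    """d = m_s - m_t, where s is the user variation and t the reference variation."""
    return center_of_gravity(s) - center_of_gravity(t)


def length_ratio(s: Interval, t: Interval) -> float:
    """Assumed definition: length of user interval s over length of reference interval t."""
    return (s[1] - s[0] + 1) / (t[1] - t[0] + 1)


user_iv = (1_000, 1_099)       # hypothetical user call, 100 bases long
benchmark_iv = (1_010, 1_109)  # hypothetical reference variation, 100 bases long
d = center_distance(user_iv, benchmark_iv)
ratio = length_ratio(user_iv, benchmark_iv)
print(d, ratio)
```

A distance near 0 together with a ratio near 1 indicates that the identified interval matches the reference interval in both position and extent; the sign of d shows in which direction the call is shifted.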

Translocation variations do not require interval deviation analysis.

Other steps and parameters are the same as in one of the first to fourth embodiments.

The sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that, in Step 2, the quantity indexes of the breakpoint intervals in the translocation variation identification results are calculated from the user's variation identification result set and the reference set and output to the terminal screen by the following specific process:

translocation variations are stored in BEDPE format and are usually long, so the whole variation interval is difficult to identify completely; for translocation variations, SV_STAT therefore evaluates identification results from the viewpoint of breakpoints;

a translocation variation is represented as an eight-tuple Tra = (Chr₁, Start₁, End₁, Chr₂, Start₂, End₂, Type, Size);

where Chr₁ is the number of the chromosome before the translocation and Chr₂ is the number of the chromosome after the translocation; Start₁ and End₁ are the start and end positions on chromosome Chr₁; Start₂ and End₂ are the start and end positions on chromosome Chr₂; Type is the variation type and takes the value TRA (translocation); and Size is the variation size and takes the value 0;

because identification results usually cannot recover all the coordinate information of a translocation variation, some coordinates may be missing, so each translocation variation contains at most 4 breakpoints (BP). (When a genome sequence changes, for example through translocation, a fragment is first broken and then moved, and the genome position where the break occurs is called a breakpoint. Breaking out one fragment usually produces two breakpoints; if two genome fragments are exchanged, 4 breakpoints are produced in theory.)

a breakpoint interval is constructed by extending extSize bases (default 100 bp) on both sides of each breakpoint, as follows:

extending extSize bases on both sides of Start₁ on Chr₁ gives the breakpoint interval Region₃ = (Chr₁, Start₁-extSize, Start₁+extSize, TRA, 0);

likewise, extending extSize bases on both sides of End₁ on Chr₁ gives the breakpoint interval Region₄ = (Chr₁, End₁-extSize, End₁+extSize, TRA, 0);

likewise, extending extSize bases on both sides of Start₂ on Chr₂ gives the breakpoint interval Region₅ = (Chr₂, Start₂-extSize, Start₂+extSize, TRA, 0);

likewise, extending extSize bases on both sides of End₂ on Chr₂ gives the breakpoint interval Region₆ = (Chr₂, End₂-extSize, End₂+extSize, TRA, 0);
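The construction of the four breakpoint intervals can be sketched as below. The `Translocation` and `Region` classes, the function name `breakpoint_intervals`, and the use of `None` to represent a missing coordinate are assumptions for illustration; the text only states that some coordinates may be missing, so that a translocation yields at most 4 breakpoints.

```python
from typing import List, NamedTuple, Optional


class Translocation(NamedTuple):
    """First six fields of Tra; Type is always TRA and Size is always 0 in the text."""
    chr1: str
    start1: Optional[int]
    end1: Optional[int]
    chr2: str
    start2: Optional[int]
    end2: Optional[int]


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str
    size: int


def breakpoint_intervals(tra: Translocation, ext_size: int = 100) -> List[Region]:
    """Build one interval [pos - ext_size, pos + ext_size] per known breakpoint."""
    breakpoints = (
        (tra.chr1, tra.start1),
        (tra.chr1, tra.end1),
        (tra.chr2, tra.start2),
        (tra.chr2, tra.end2),
    )
    intervals = []
    for chrom, pos in breakpoints:
        if pos is None:  # assumption: a missing coordinate is stored as None and skipped
            continue
        intervals.append(Region(chrom, pos - ext_size, pos + ext_size, "TRA", 0))
    return intervals


# Hypothetical translocation with one missing coordinate (End2).
tra = Translocation("chr1", 5_000, 6_000, "chr2", 9_000, None)
bp_intervals = breakpoint_intervals(tra)
for interval in bp_intervals:
    print(interval)
```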

determine whether a translocation variation interval in the user's variation identification result set overlaps a translocation variation interval in the reference set; if so, the translocation variation is said to be identified, and otherwise it is said not to be identified;

calculate the true positive and false positive numbers of the breakpoint intervals in the user's translocation variation identification results; the number of translocation variation breakpoint intervals in the reference set that were identified and the number that were not identified (false negatives); and the precision, recall and F₁ score;

calculate the set of breakpoint intervals in the user's translocation variation identification results and the set of translocation variation breakpoint intervals in the reference set;

Other steps and parameters are the same as those in one of the first to fifth embodiments.

The seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that the following are calculated: the true positive and false positive numbers of the breakpoint intervals in the user's translocation variation identification results; the number of translocation variation breakpoint intervals in the reference set that were identified and the number that were not identified (false negatives); and the precision, recall and F₁ score. The specific process is as follows:

(1) extract the translocation variations from the user's variation identification result set to form the translocation variation set T₁ of the identification result;

(2) for each translocation variation Tra = (Chr₁, Start₁, End₁, Chr₂, Start₂, End₂, TRA, 0) in T₁, construct the breakpoint intervals Region₃, Region₄, Region₅ and Region₆ by extending extSize (100 bp) on both sides of each breakpoint;

the breakpoint intervals Region₃, Region₄, Region₅ and Region₆ are obtained as follows:

extending extSize bases on both sides of Start₁ on Chr₁ gives the breakpoint interval Region₃ = (Chr₁, Start₁-extSize, Start₁+extSize, TRA, 0);

extending extSize bases on both sides of End₁ on Chr₁ gives the breakpoint interval Region₄ = (Chr₁, End₁-extSize, End₁+extSize, TRA, 0);

extending extSize bases on both sides of Start₂ on Chr₂ gives the breakpoint interval Region₅ = (Chr₂, Start₂-extSize, Start₂+extSize, TRA, 0);

extending extSize bases on both sides of End₂ on Chr₂ gives the breakpoint interval Region₆ = (Chr₂, End₂-extSize, End₂+extSize, TRA, 0);

(3) merge the breakpoint intervals Region₃, Region₄, Region₅ and Region₆ obtained for each translocation variation in step (2) to generate the breakpoint interval set T'₁ of the translocation variations in the user's variation identification result set;

(5) traverse each breakpoint interval record Region_x in T'₁ in turn and search T'₂ for the breakpoint intervals that overlap Region_x by at least 1 base, recorded as the set {Region_y1, Region_y2, …, Region_ym}. If the set is not empty, the breakpoint interval Region_x is said to be identified: the overlap mark of Region_x is set to true, and the overlap marks of the corresponding intervals {Region_y1, Region_y2, …, Region_ym} are set to true. If the set is empty, the breakpoint interval Region_x is said not to be identified and its overlap mark is set to false; breakpoint intervals in T'₂ that overlap no interval in T'₁ remain marked false;
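The marking procedure of step (5) can be sketched as a nested traversal. This is a minimal illustration under stated assumptions: breakpoint intervals are simplified to (chromosome, start, end) tuples with closed coordinates, the function name `mark_overlaps` is hypothetical, and the way the marks are turned into true positive, false positive and false negative counts at the end is an assumed reading, since the intermediate steps are not given in the text.

```python
from typing import List, Tuple

BpInterval = Tuple[str, int, int]  # (chromosome, start, end), assumed closed


def mark_overlaps(user: List[BpInterval],
                  bench: List[BpInterval]) -> Tuple[List[bool], List[bool]]:
    """Return overlap marks for the user set (T'1) and the reference set (T'2)."""
    user_marks = [False] * len(user)
    bench_marks = [False] * len(bench)
    for x, (chrom_x, start_x, end_x) in enumerate(user):
        for y, (chrom_y, start_y, end_y) in enumerate(bench):
            same_chrom = chrom_x == chrom_y
            if same_chrom and start_x <= end_y and start_y <= end_x:
                user_marks[x] = True   # Region_x is identified
                bench_marks[y] = True  # every overlapping Region_y is marked as well
    return user_marks, bench_marks


# Hypothetical breakpoint interval sets.
user_bp = [("chr1", 4_900, 5_100), ("chr2", 100, 300)]
bench_bp = [("chr1", 5_050, 5_250), ("chr3", 1, 200)]
user_marks, bench_marks = mark_overlaps(user_bp, bench_bp)

# Assumed mapping from marks to counts.
tp_user = sum(user_marks)                          # user intervals that hit the reference
fp_user = len(user_marks) - tp_user                # user intervals with no match
fn_bench = len(bench_marks) - sum(bench_marks)     # reference intervals never matched
print(user_marks, bench_marks, tp_user, fp_user, fn_bench)
```

The traversal is quadratic in the number of intervals; for large result sets, grouping intervals by chromosome and sorting them by start position would reduce the number of comparisons.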

(8) calculate the precision Precision_BP, recall Recall_BP and F₁ score of the breakpoint intervals as follows:

the method for detecting the overlap between two translocation variation breakpoint intervals is shown in Algorithm 4.

The detailed method for calculating the recall, precision, F₁ score and other quantity indexes of the translocation variation identification result is shown in Algorithm 5:

other steps and parameters are the same as those in one of the first to sixth embodiments.

The following examples were used to demonstrate the beneficial effects of the present invention:

Example 1:

SV_STAT performance test

Detection of insertion and deletion variations on chromosome 1 of the human genome

The identification of insertion and deletion variations in the human genome was studied using simulated single-molecule sequencing data for chromosome 1 of the human genome. A paper published by Kai Ye et al. in Nature Medicine in 2016 [15] used TCGA data to identify complex insertion and deletion variations in human cancer genomes. Based on the 1128 insertion and deletion variations on human chromosome 1 published in that paper (minimum length 2 bp, maximum length 889 bp), PacBio simulation data with 100× sequencing depth were generated, and different identification methods were then applied to identify the insertion and deletion variations. Finally, SV_STAT was used to evaluate the identification results, which are shown in Table 1.

Table 1. Main statistics of insertion and deletion identification results on simulation data for chromosome 1 of the human genome

The total number of simulated variations was 1128. The better results are shown in bold.

^a The number of true positives refers to the number of true positive variations in the result set after invalid variations longer than 10 kb have been removed.

^b The Sniffles statistics exclude 48 genomic regions longer than 10 kb, the longest of which spans 216 Mbp, i.e. 87.4% of the total length of chromosome 1.

^c The SMRT-SV statistics exclude 1 genomic region of 14.8 kb in length.

^d The NextSV (positive) statistics exclude 154 genomic regions longer than 10 kb.

As can be seen from Table 1, ASVCLR gives the best identification result: it identifies 934 variation regions in total, with low false positive and false negative numbers (19 and 208, respectively), a recall of 81.6%, a precision of 98.0% and an F₁ score of 89.1%.

In addition, the recall, precision and F₁ score of the different identification methods can be plotted from the statistics, as shown in FIG. 4. As can be seen from FIG. 4, the ASVCLR result has the higher recall, precision and F₁ score.

Based on the simulation data for chromosome 1 of the human genome, the identification of insertion and deletion variations was further analyzed with SV_STAT, as shown in FIG. 5. The Benchmark curve describes the number of variations of each length in the reference set. The data set contains 1128 variations in total, of which 1116 are shorter than 100 bp and 846 are shorter than 10 bp (75% of the total). As can be seen from FIG. 4, ASVCLR identifies the variations in the genome effectively, whereas neither NextSV nor PBHoney can effectively identify variations of such small length. SMRT-SV identifies a large number of short variations (up to 14305), but most of them are false positives, so its precision is low.

Performance analysis of multiple variation identification results on fission yeast (S. pombe)

SV_STAT was used to analyze the performance of different variation identification methods on the fission yeast (S. pombe 972h-) genome, comparing the variation interval deviations of the different identification results. Since insertions and deletions are the two most common types of variation in the genome, the center-of-gravity distances of the identified insertion and deletion intervals in the S. pombe 972h- genome were analyzed using simulation data, as shown in FIG. 5. As can be seen from FIG. 5, the center-of-gravity distances of all identification results lie near 0, and those of ASVCLR are closest to 0, indicating more precise variation boundaries than the other methods.

SV_STAT was also used to compute interval length ratio statistics for the different identification results on the simulation data, i.e. the distribution of the ratio between the length of each identified variation interval and the length of the corresponding variation interval in the reference set. The closer the ratio is to 1, the more completely the variation interval is identified and the better the accuracy. The results are shown in FIGS. 6a, 6b, 6c, 6d, 6e and 6f. As can be seen from these figures, the interval length ratios of the ASVCLR results are closest to 1, followed by Sniffles, while the other methods show large deviations in variation interval size. NextSV and PBHoney also identify fewer variations correctly and show weaker identification performance.

Taking duplication variations as an example, SV_STAT was used to analyze the results of different variation identification methods on the S. pombe 972h- genome. The simulation data contained 100 duplication variations with lengths from 50 bp to 10 kb, which were identified using the different methods. The SV_STAT results for the different variation length ranges are shown in FIGS. 7a, 7b, 7c, 7d, 7e and 7f. As can be seen from these figures, ASVCLR performs well across the different length ranges; the Sniffles results are concentrated in the range >250 bp, with weaker performance on shorter variations; and NextSV and PBHoney perform poorly, failing to identify a large number of variations.

The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

References:

[1] J Weischenfeldt, O Symmons, F Spitz, JO Korbel. Phenotypic Impact of Genomic Structural Variation: Insights from and for Human Disease. Nat Rev Genet 2013, 14(2):125-138.

[2] C Alkan, BP Coe, EE Eichler. Genome Structural Variation Discovery and Genotyping. Nat Rev Genet 2011, 12(5):363-376.

[3] K Chen, JW Wallis, MD McLellan, DE Larson, JM Kalicki, CS Pohl, SD McGrath, MC Wendl, QY Zhang, DP Locke, et al. Breakdancer: An Algorithm for High-Resolution Mapping of Genomic Structural Variation. Nat Methods 2009, 6(9):677-U676.

[4] JO Korbel, A Abyzov, XJ Mu, N Carriero, P Cayting, Z Zhang, M Snyder, MB Gerstein. Pemer: A Computational Framework with Simulation-Based Error Models for Inferring Genomic Structural Variants from Massive Paired-End Sequencing Data. Genome Biol 2009, 10(2):R23.

[5] A Abyzov, AE Urban, M Snyder, M Gerstein. Cnvnator: An Approach to Discover, Genotype, and Characterize Typical and Atypical Cnvs from Family and Population Genome Sequencing. Genome Res 2011, 21(6):974-984.

[6] K Ye, MH Schulz, Q Long, R Apweiler, ZM Ning. Pindel: A Pattern Growth Approach to Detect Break Points of Large Deletions and Medium Sized Insertions from Paired-End Short Reads. Bioinformatics 2009, 25(21):2865-2871.

[7] J Wang, CG Mullighan, J Easton, S Roberts, SL Heatley, J Ma, MC Rusch, K Chen, CC Harris, L Ding, et al. Crest Maps Somatic Structural Variation in Cancer Genomes with Base-Pair Resolution. Nat Methods 2011, 8(8):652-654.

[8] AC English, WJ Salerno, JG Reid. Pbhoney: Identifying Genomic Variants Via Long-Read Discordance and Interrupted Mapping. BMC Bioinformatics 2014, 15:180.

[9] FJ Sedlazeck, P Rescheneder, M Smolka, H Fang, M Nattestad, A von Haeseler, MC Schatz. Accurate Detection of Complex Structural Variations Using Single-Molecule Sequencing. Nat Methods 2018, 15(6):461-468.

[10] L Fang, J Hu, D Wang, K Wang. Nextsv: A Meta-Caller for Structural Variants from Low-Coverage Long-Read Sequencing Data. BMC Bioinformatics 2018, 19(1):180.

[11] M Cretu Stancu, MJ van Roosmalen, I Renkens, MM Nieboer, S Middelkamp, J de Ligt, G Pregno, D Giachino, G Mandrile, J Espejo Valle-Inclan, et al. Mapping and Phasing of Structural Variation in Patient Genomes Using Nanopore Sequencing. Nature Communications 2017, 8(1):1326.

[12] L Gong, CH Wong, WC Cheng, H Tjong, F Menghi, CY Ngan, ET Liu, CL Wei. Picky Comprehensively Detects High-Resolution Structural Variants in Nanopore Long Reads. Nat Methods 2018, 15(6):455-460.

[13] J Huddleston, MJP Chaisson, KM Steinberg, W Warren, K Hoekzema, D Gordon, TA Graves-Lindsay, KM Munson, ZN Kronenberg, L Vives, et al. Discovery and Genotyping of Structural Variation from Long-Read Haploid Genome Sequence Data. Genome Res 2017, 27(5):677-685.

[14] MR Vollger, PC Dishuck, M Sorensen, AE Welch, V Dang, ML Dougherty, TA Graves-Lindsay, RK Wilson, MJP Chaisson, EE Eichler. Long-Read Sequence and Assembly of Segmental Duplications. Nat Methods 2019, 16(1):88-94.

[15] K Ye, J Wang, R Jayasinghe, EW Lameijer, JF McMichael, J Ning, MD McLellan, M Xie, S Cao, V Yellapantula, et al. Systematic Discovery of Complex Insertions and Deletions in Human Cancers. Nature Medicine 2016, 22(1):97-104.
