Method and device for detecting TMB

Document No.: 1629550. Publication date: 2020-01-14.

Summary: This technology, "Method and device for detecting TMB" (检测tmb的方法及装置), was created by 董永芳, 郭璟, 楼峰 and 曹善柏 on 2019-10-18. The invention provides a method and a device for detecting TMB. The method comprises: removing germline mutation sites from the sequencing data of the sample to be tested by using the sequencing data of paired white blood cells, to obtain a candidate somatic mutation site set; filtering false positive somatic mutation sites out of the candidate somatic mutation site set to obtain a somatic mutation site set to be tested, wherein the false positive somatic mutation sites comprise at least one of the following: mutation sites caused by oxidative damage, and mutation sites caused by background noise; and dividing the number of load mutations in the somatic mutation site set to be tested by the total length of the sequencing data in the exon region to obtain the TMB. By making full use of paired white blood cells, a background noise mutation frequency distribution database, and oxidative damage detection to remove false positive somatic mutations, the accuracy and stability of the TMB are improved.

1. A method of detecting a TMB, the method comprising:

removing germline mutation sites in sequencing data of a sample to be detected by using sequencing data of paired white blood cells to obtain a candidate somatic mutation site set;

filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following sites: sites of mutations due to oxidative damage, sites of mutations due to background noise;

and dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data in the exon region to obtain the TMB.

2. The method of claim 1, wherein the false positive somatic mutation sites comprise oxidative damage-induced mutation sites, and wherein prior to filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are oxidative damage-induced mutation sites;

preferably, determining whether a somatic mutation site in the set of candidate somatic mutation sites is a mutation site caused by oxidative damage comprises:

searching reads supporting the candidate somatic mutation sites, and judging whether the reads are positioned in a positive strand or a negative strand;

counting the ratio of the number of reads of the positive strand to the number of reads of the negative strand that support the candidate somatic mutation site,

judging whether the ratio is greater than a first threshold or smaller than a second threshold, if so, determining that the candidate somatic mutation site is a mutation site caused by oxidative damage;

preferably, the first threshold is 2 or more, and the second threshold is 0.5 or less.

3. The method of claim 1 or 2, wherein the set of false positive somatic mutation sites comprises background noise-induced mutation sites, and wherein prior to filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are background noise-induced mutation sites;

preferably, determining whether a somatic mutation site in the set of candidate somatic mutation sites is a background noise-induced mutation site comprises:

removing germline mutation sites in sequencing data of healthy people by using the sequencing data of the white blood cells to obtain a somatic mutation site set of the healthy people;

establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using the somatic mutation site set of the healthy population;

calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected, and calculating the P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model;

judging whether the P value is larger than or equal to a third threshold value, if so, the candidate somatic mutation site is a mutation site caused by background noise;

preferably, the third threshold value is equal to or greater than 0.05.

4. The method of claim 1 or 2, wherein before dividing the number of load mutations in the set of somatic mutation sites by the total length of the sequencing data in the exon region, the method further comprises: counting the number of load mutations in the somatic mutation site set;

preferably, counting the number of load mutations in the set of somatic mutation sites comprises:

counting the total number of all mutation types in the somatic mutation site set as follows: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation;

removing at least one of the following sites from the total number to obtain the number of the load mutations: a mutation site with a population frequency in the 1000 Genomes Project greater than 0.01, and a mutation site annotated in COSMIC.

5. The method of claim 1, wherein before dividing the number of load mutations in the set of somatic mutation sites by the total length of the sequencing data in the exon region, the method further comprises: calculating the total length of the sequencing data in the exon region.

6. An apparatus for detecting a TMB, the apparatus comprising:

the detection module is used for removing the germline mutation sites in the sequencing data of the sample to be detected by using the paired sequencing data of the white blood cells to obtain a candidate somatic mutation site set;

a filtering module, configured to filter false positive somatic mutation sites in the candidate somatic mutation site set to obtain a set of somatic mutation sites to be detected, where the false positive somatic mutation sites include at least one of: sites of mutations due to oxidative damage, sites of mutations due to background noise;

and the TMB calculation module is used for dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data in the exon region to obtain the TMB.

7. The apparatus of claim 6, further comprising an oxidative damage determination module for determining whether the somatic mutation sites in the set of candidate somatic mutation sites are oxidative damage-induced mutation sites;

preferably, the oxidative damage determining module includes:

the searching module is used for searching reads supporting the candidate somatic mutation sites and judging whether the reads are positioned in a positive strand or a negative strand;

a first statistical module for counting a ratio of the number of reads of the positive strand to the number of reads of the negative strand that support the candidate somatic mutation site,

a ratio judgment module, configured to judge whether the ratio is greater than a first threshold or smaller than a second threshold, if so, the candidate somatic mutation site is a mutation site caused by oxidative damage;

preferably, the first threshold is 2 or more, and the second threshold is 0.5 or less.

8. The apparatus of claim 6 or 7, further comprising a background noise determination module for determining whether a somatic mutation site in the set of candidate somatic mutation sites is a mutation site caused by background noise;

preferably, the background noise determination module includes:

the healthy site set acquisition module is used for removing germline mutation sites in sequencing data of healthy people by utilizing the sequencing data of the white blood cells to obtain a somatic mutation site set of the healthy people;

the model establishing module is used for establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by utilizing the somatic mutation site set of the healthy population;

a P value calculation module, configured to calculate a mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be tested, and calculate a P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model;

the noise judgment module is used for judging whether the P value is greater than or equal to a third threshold value, if so, the candidate somatic mutation site is a mutation site caused by background noise;

preferably, the third threshold value is equal to or greater than 0.05.

9. The apparatus of claim 6 or 7, further comprising: the load mutation number counting module is used for counting the load mutation number in the somatic mutation site set;

preferably, the load mutation number statistic module includes:

a statistical unit for counting the total number of all the following mutation types in the set of somatic mutation sites: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation;

a removing unit, configured to remove at least one of the following sites from the total number to obtain the number of the load mutations: a mutation site with a population frequency in the 1000 Genomes Project greater than 0.01, and a mutation site annotated in COSMIC.

10. The apparatus of claim 6, further comprising: a length calculation module used for calculating the total length of the sequencing data in the exon region.

11. A storage medium having stored thereon a computer-executable program, wherein the program is configured to, when executed, perform a method of detecting a TMB according to any one of claims 1 to 5.

12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is configured to execute the computer program to perform the method of detecting a TMB of any of claims 1 to 5.

Technical Field

The invention relates to the field of gene sequencing data analysis, in particular to a method and a device for detecting TMB.

Background

Tumor mutation burden (TMB) is an indicator reflecting the total number of somatic mutations in tumor cells, usually expressed as the total number of tumor somatic mutations per megabase (Mb) of the tumor genomic region. A tumor with a high TMB carries more mutations in its cells, which in turn suggests that it may present more tumor neoantigens recognizable by the immune system, thereby helping immune cells kill tumor cells more effectively.

The method currently in common use for detecting tumor mutation burden is the strategy published by the Lawrence team in Nature in 2015, which judges the tumor mutation burden status by counting the somatic mutations across the whole exome (average depth < 200X). However, this method often produces false positives and false negatives.

Therefore, it is urgently required to develop a new method for detecting TMB.

Disclosure of Invention

The main object of the invention is to provide a method and a device for detecting TMB (tumor mutation burden), so as to solve the problem of inaccurate TMB detection in the prior art.

In order to achieve the above object, according to an aspect of the present invention, there is provided a method of detecting a TMB, the method including: removing germline mutation sites in sequencing data of a sample to be detected by using sequencing data of paired white blood cells to obtain a candidate somatic mutation site set; filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following sites: sites of mutations due to oxidative damage, sites of mutations due to background noise; and dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data in the exon region to obtain the TMB.
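The first and last steps of the method can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the function names, the representation of a variant as a `(chrom, pos, ref, alt)` tuple, and the conversion of the exon length to megabases (taken from the per-megabase definition of TMB given in the Background) are assumptions made for illustration.

```python
def remove_germline(sample_variants, leukocyte_variants):
    """Return candidate somatic variants: those seen in the sample but not
    in the paired white blood cell data. Variants are hypothetical
    (chrom, pos, ref, alt) tuples."""
    return set(sample_variants) - set(leukocyte_variants)


def tumor_mutation_burden(n_load_mutations, exon_length_bp):
    """TMB as load mutations per megabase of covered exon region.
    exon_length_bp is the total covered exon length in base pairs."""
    if exon_length_bp <= 0:
        raise ValueError("exon_length_bp must be positive")
    return n_load_mutations / (exon_length_bp / 1_000_000)


# Usage: 30 load mutations over 1.5 Mb of covered exons.
candidates = remove_germline(
    {("chr1", 100, "G", "T"), ("chr2", 5, "A", "C")},
    {("chr2", 5, "A", "C")},
)
tmb = tumor_mutation_burden(30, 1_500_000)
```

The filtering of false positive sites sits between these two steps and is left out of this sketch.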

Further, the false positive somatic mutation sites include oxidative damage-induced mutation sites, and before filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are oxidative damage-induced mutation sites.

Further, determining whether the somatic mutation sites in the candidate set of somatic mutation sites are oxidative damage-induced mutation sites comprises: searching reads supporting candidate somatic mutation sites, and judging whether the reads are positioned in a positive strand or a negative strand; counting the ratio of the number of reads of the positive strand to the number of reads of the negative strand of the candidate somatic mutation site, and judging whether the ratio is greater than a first threshold or smaller than a second threshold, if so, the candidate somatic mutation site is a mutation site caused by oxidative damage; preferably, the first threshold value is equal to or greater than 2, and the second threshold value is equal to or less than 0.5.
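The strand ratio test described above can be sketched in Python as follows. The default thresholds use the boundary values of the preferred ranges (first threshold 2, second threshold 0.5). The function name and the handling of a zero negative-strand count are assumptions, since the text does not say how a division by zero is treated.

```python
def is_oxidative_damage(pos_reads, neg_reads,
                        first_threshold=2.0, second_threshold=0.5):
    """Flag a candidate site as an oxidative damage artifact when the
    ratio of positive-strand to negative-strand supporting reads is
    strongly skewed in either direction."""
    if neg_reads == 0:
        # Assumption: with no negative-strand support the ratio is
        # unbounded, so any positive-strand support counts as skewed.
        return pos_reads > 0
    ratio = pos_reads / neg_reads
    return ratio > first_threshold or ratio < second_threshold
```

For example, a site supported by 30 positive-strand reads and 5 negative-strand reads has a ratio of 6 and is flagged, while a site supported by 10 and 9 reads is kept.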

Further, the set of false positive somatic mutation sites includes mutation sites caused by background noise, and before filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are mutation sites caused by background noise.

Further, determining whether a somatic mutation site in the candidate set of somatic mutation sites is a mutation site caused by background noise comprises: removing germline mutation sites in sequencing data of healthy people by using sequencing data of white blood cells to obtain a somatic mutation site set of the healthy people; establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using a somatic mutation site set of healthy people; calculating the mutation frequency of each candidate somatic mutation site in a candidate somatic mutation site set of a sample to be detected, and calculating the P value of the mutation frequency of each candidate somatic mutation site in a Weibull distribution model; judging whether the P value is larger than or equal to a third threshold value, if so, taking the candidate somatic mutation site as a mutation site caused by background noise; preferably, the third threshold value is equal to or greater than 0.05.
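The P value test against the Weibull background model can be sketched as below. The sketch assumes that the shape and scale parameters for a given site and mutation type have already been fitted from the healthy population data, and it assumes an upper-tail P value, that is, the probability that background noise alone produces a frequency at least as high as the observed one. The text does not state which tail is used, and the function names are hypothetical.

```python
import math


def weibull_p_value(freq, shape, scale):
    """Upper-tail probability P(X >= freq) for a two-parameter Weibull
    distribution: exp(-(freq / scale) ** shape)."""
    if freq <= 0:
        return 1.0
    return math.exp(-((freq / scale) ** shape))


def is_background_noise(freq, shape, scale, third_threshold=0.05):
    """A candidate site is treated as background noise when its P value
    is greater than or equal to the third threshold."""
    return weibull_p_value(freq, shape, scale) >= third_threshold
```

Under these assumptions, a candidate whose frequency lies well inside the noise distribution gets a large P value and is filtered, while a candidate far out in the tail gets a small P value and is retained as a somatic mutation.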

Further, before dividing the number of load mutations in the somatic mutation site set by the total length of the sequencing data in the exon region, the method further comprises: counting the number of load mutations in the somatic mutation site set.

Further, counting the number of load mutations in the somatic mutation site set comprises: counting the total number of all of the following mutation types in the somatic mutation site set: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation; and removing at least one of the following sites from the total number to obtain the number of load mutations: a mutation site with a population frequency in the 1000 Genomes Project greater than 0.01, and a mutation site annotated in COSMIC.
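The counting step can be sketched in Python as follows. The `Variant` record and its field names are hypothetical, and the sketch applies both exclusions (population frequency above 0.01 and presence in COSMIC), whereas the text only requires at least one of them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# The four mutation types that are counted; the labels are illustrative.
COUNTED_TYPES = {"synonymous", "nonsynonymous", "frameshift", "nonframeshift"}


@dataclass
class Variant:
    effect: str                 # annotated mutation type
    pop_freq: Optional[float]   # population frequency, None if absent
    in_cosmic: bool             # True if the site is annotated in COSMIC


def count_load_mutations(variants: Iterable[Variant],
                         max_pop_freq: float = 0.01) -> int:
    """Count variants of the four types, skipping common population
    variants and COSMIC-annotated sites."""
    count = 0
    for v in variants:
        if v.effect not in COUNTED_TYPES:
            continue
        if v.pop_freq is not None and v.pop_freq > max_pop_freq:
            continue
        if v.in_cosmic:
            continue
        count += 1
    return count


example = [
    Variant("synonymous", None, False),      # counted
    Variant("nonsynonymous", 0.2, False),    # common in population: removed
    Variant("frameshift", 0.001, True),      # COSMIC site: removed
    Variant("nonframeshift", 0.005, False),  # counted
    Variant("intronic", None, False),        # not one of the four types
]
n_load = count_load_mutations(example)
```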

Further, before dividing the number of load mutations in the somatic mutation site set by the total length of the sequencing data in the exon region, the method further comprises: calculating the total length of the sequencing data in the exon region.
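One plausible way to compute this length is to merge overlapping exon intervals and sum their sizes, as in the sketch below. The text does not describe how the length is obtained, so the approach, the function name, and the use of half-open `(chrom, start, end)` intervals in the style of a BED file are all assumptions.

```python
def total_interval_length(intervals):
    """Sum the lengths of (chrom, start, end) half-open intervals,
    counting overlapping stretches only once."""
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


exon_bp = total_interval_length(
    [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
)
```

In this example the two overlapping chr1 intervals merge into one 200 bp span, and the chr2 interval adds 50 bp.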

In order to achieve the above object, according to a second aspect of the present invention, there is provided an apparatus for detecting a TMB, the apparatus including: a detection module used for removing germline mutation sites in sequencing data of a sample to be detected by using the sequencing data of paired leukocytes to obtain a candidate somatic mutation site set; a filtering module used for filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following: sites of mutations due to oxidative damage, sites of mutations due to background noise; and a TMB calculation module used for dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data in the exon region to obtain the TMB.

Further, the device also comprises an oxidative damage judging module which is used for judging whether the somatic mutation sites in the candidate somatic mutation site set are the mutation sites caused by oxidative damage.

Further, the oxidation damage judgment module comprises: the device comprises a searching module, a first statistic module and a ratio judging module, wherein the searching module is used for searching reads supporting candidate somatic mutation sites and judging whether the reads are positioned in a positive strand or a negative strand; the first statistic module is used for counting the ratio of the number of reads of a positive strand and the number of reads of a negative strand supporting the candidate somatic mutation sites, the ratio judgment module is used for judging whether the ratio is larger than a first threshold or smaller than a second threshold, and if yes, the candidate somatic mutation sites are mutation sites caused by oxidative damage; preferably, the first threshold value is equal to or greater than 2, and the second threshold value is equal to or less than 0.5.

Further, the device also comprises a background noise judging module for judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by background noise.

Further, the background noise determination module comprises: a healthy site set acquisition module, a model establishment module, a P value calculation module and a noise judgment module, wherein the healthy site set acquisition module is used for removing germline mutation sites in sequencing data of healthy people by utilizing sequencing data of white blood cells to obtain a somatic mutation site set of the healthy people; the model establishment module is used for establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by utilizing the somatic mutation site set of healthy people; the P value calculation module is used for calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected and calculating the P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model; the noise judgment module is used for judging whether the P value is larger than or equal to a third threshold value, and if so, the candidate somatic mutation site is a mutation site caused by background noise; preferably, the third threshold value is equal to or greater than 0.05.

Further, the apparatus further comprises: and the load mutation number counting module is used for counting the load mutation number in the somatic mutation site set.

Further, the load mutation number statistic module comprises: a statistic unit and a removal unit, wherein the statistic unit is used for counting the total number of all the following mutation types in the somatic mutation site set: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation; the removal unit is used for removing at least one of the following sites from the total number to obtain the number of load mutations: a mutation site with a population frequency in the 1000 Genomes Project greater than 0.01, and a mutation site annotated in COSMIC.

Further, the apparatus further comprises: a length calculation module used for calculating the total length of the sequencing data in the exon region.

According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer-executable program configured to, when executed, perform any one of the above-described methods of detecting a TMB.

According to a fourth aspect of the present invention, there is provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to execute the computer program to perform any of the above-described methods of detecting a TMB.

By applying the technical scheme of the invention, first, the germline mutations carried by the sample are removed by means of the paired white blood cells, which greatly reduces the influence of germline mutations on the TMB value; second, false positive sites caused by DNA oxidative damage arising from steps such as library construction and DNA fragmentation are removed; and/or the influence on the TMB value of false positive somatic mutations caused by low-frequency background noise is removed by means of a background noise frequency distribution database built from healthy people. In other words, false positive somatic mutations are removed by making full use of the paired white blood cells, the background noise mutation frequency distribution database and oxidative damage detection, which improves the accuracy and stability of the TMB value.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 shows a flow diagram of a method of detecting a TMB in a preferred embodiment according to the present invention;

FIG. 2 shows a detailed flow diagram of a method of detecting a TMB in a preferred embodiment according to the present invention; and

FIG. 3 shows a schematic structural diagram of an apparatus for detecting a TMB in a preferred embodiment according to the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

Reference sequence (Refseq): the standard reference genomic sequence of a species.

Fusion gene: a new gene formed when the sequences of all or part of two genes are fused to each other. It may be the result of a chromosomal translocation, an interstitial deletion or a chromosomal inversion.

Tumor mutation burden (TMB): the total number of somatic gene coding errors, base substitutions, and gene insertion or deletion errors detected per million bases.

Germline mutation: a germ cell mutation, that is, a mutation derived from germ cells such as sperm or ovum.

Reads: genomic or transcriptome sequence fragments.

Synonymous mutation: a substitution mutation that does not alter the amino acid sequence of the peptide chain product.

Non-synonymous mutation: a gene mutation that changes the amino acid sequence of the polypeptide product or the base sequence of a functional RNA.

Frameshift mutation: a mutation in which the insertion or loss of one or several base pairs (a number that is not 3 or a multiple of 3) at a site in a DNA fragment shifts the reading frame of the coding sequence downstream of that site.

Non-frameshift mutation: a mutation in which the insertion or loss of base pairs (3 or a multiple of 3) at a site in a DNA fragment does not shift the reading frame of the coding sequence downstream of that site.

PE sequencing: paired-end sequencing, a sequencing method.

read1/read2: in PE sequencing data, read1 is the nucleotide sequence obtained in the first round of sequencing, and read2 is the nucleotide sequence obtained in the second round of sequencing.

bwa: alignment software used to locate reads in Refseq, ultimately producing a BAM-format file.

Adapter sequence: linker sequences flanking the DNA fragment during sequencing.

flag: a value in the BAM-format file that describes information such as how a sequence is aligned and its orientation.

cigar: a compact expression of alignment information that represents the alignment result against the reference sequence using digits plus letters.

duplicate: a duplicated sequence, that is, a sequence produced by PCR amplification.

qname: the identifier of the aligned fragment (template).

Oxidative damage of DNA: of the four bases A, T, G and C, the C8 position of G readily binds oxygen, turning the G base into 8-oxo-G; the resulting 8-oxo-G then mispairs with base A, so that a false positive G-to-T mutation is detected.

COSMIC: an abbreviation of "Catalogue Of Somatic Mutations In Cancer", a database that covers the scientific literature and data from large-scale experimental screens of the Sanger Institute Cancer Genome Project. The database is intended to collect and display information on cancer somatic mutations.
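As an illustration of how the flag value defined above can be used to decide whether a read lies on the positive or negative strand, the sketch below decodes three bits of the SAM/BAM flag: 0x4 (read unmapped), 0x10 (read aligned to the reverse strand) and 0x400 (PCR or optical duplicate). The function names are hypothetical, and skipping duplicates before counting is an assumption rather than something the text specifies.

```python
FLAG_UNMAPPED = 0x4     # read is not aligned
FLAG_REVERSE = 0x10     # read is aligned to the reverse (negative) strand
FLAG_DUPLICATE = 0x400  # read is marked as a PCR or optical duplicate


def read_strand(flag):
    """Return '+' or '-' for an aligned read, or None if unmapped."""
    if flag & FLAG_UNMAPPED:
        return None
    return "-" if flag & FLAG_REVERSE else "+"


def count_strands(flags):
    """Count positive- and negative-strand reads, skipping duplicates
    and unmapped reads."""
    pos = neg = 0
    for flag in flags:
        if flag & FLAG_DUPLICATE:
            continue
        strand = read_strand(flag)
        if strand == "+":
            pos += 1
        elif strand == "-":
            neg += 1
    return pos, neg


# Flags of six reads overlapping a hypothetical candidate site.
pos_count, neg_count = count_strands([99, 147, 83, 163, 1123, 4])
```

The two counts can then feed a strand ratio check of the kind used to detect oxidative damage.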

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As mentioned in the Background, prior-art TMB detection methods suffer from inaccurate detection. To improve on this, the inventors analyzed existing TMB detection methods and found that they cannot completely filter out leukocyte mutations and systematic background errors, and that screening by a mutation frequency threshold discards genuine mutations whose frequency falls below the threshold, so that the calculated TMB value deviates to some extent. On this basis, the inventors propose the improved scheme of the present application.
