Microsatellite instability analysis method and device

Document No.: 1075101. Publication date: 2020-10-16.

This technology, "Microsatellite instability analysis method and device", was created by 周衍庆, 陈实富, and 王周阳 on 2020-07-02. Abstract: The invention provides a microsatellite instability analysis method and device. The method comprises a data acquisition step, a primary microsatellite locus screening step, a sample training step, a secondary microsatellite locus screening step, a tertiary microsatellite locus screening step, and a microsatellite stability judgment step. The invention selects microsatellite loci based on the analysis results of existing NGS detection data and directly analyzes the MSI status of a sample. Because it uses multiple microsatellite loci to analyze MSI status, it overcomes the poor continuity of the judgment value and the tendency toward misjudgment of the PCR method.

1. A method for microsatellite instability analysis, comprising:

a data acquisition step of acquiring sequencing data of a tumor sample and a normal control sample of at least one individual;

a primary microsatellite locus screening step of screening microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training step of selecting microsatellite-stable tumor samples and microsatellite-unstable tumor samples as a training set and, for each microsatellite locus Si of the microsatellite locus set S1 in each sample of the training set, calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj, and calculating the earth mover's distance (EMD) value between the two distributions;

a secondary microsatellite locus screening step of calculating, for each microsatellite locus, the significance value of the difference in EMD values between the microsatellite-stable tumor samples and the microsatellite-unstable tumor samples in the training set, and retaining the microsatellite loci Si whose significance value is less than a first threshold to obtain a candidate microsatellite locus set S2;

a tertiary microsatellite locus screening step of setting, according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite-stable tumor samples, a Cut-off value for judging each microsatellite locus, and screening to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judgment step of acquiring sequencing data of a sample to be tested and a normal control sample from the same subject, calculating the EMD value of each microsatellite locus using the insertion length distribution calculation method applied to each microsatellite locus Si in the training set, and judging each microsatellite locus to be stable or unstable according to the Cut-off value.

2. The microsatellite instability analysis method according to claim 1, wherein, in the primary microsatellite locus screening step, each microsatellite locus satisfies at least one of the following conditions:

1) a single base at the locus is repeated 6 or more times;

2) a unit of 2-4 bases at the locus is repeated 5 or more times;

and/or, in the secondary microsatellite locus screening step, the first threshold is 0.05.
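
As an illustration of the two repeat conditions in claim 2, the following is a minimal sketch that scans a reference sequence for candidate loci with regular expressions. The function name `find_microsatellites`, the 0-based half-open coordinates, and the choice to discard 2-4 base units made of a single repeated base are assumptions introduced here; the claims do not specify how loci are located.

```python
import re

# Assumed encoding of the two conditions in claim 2:
#   a single base repeated >= 6 times, or a 2-4 base unit repeated >= 5 times.
MONO = re.compile(r"([ACGT])\1{5,}")
MULTI = re.compile(r"([ACGT]{2,4}?)\1{4,}")

def find_microsatellites(seq):
    """Return sorted (start, end, unit, repeat_count) tuples, 0-based half-open."""
    hits = []
    for pattern in (MONO, MULTI):
        for m in pattern.finditer(seq.upper()):
            unit = m.group(1)
            # A multi-base unit such as "AA" is really a homopolymer; the
            # single-base pattern already reports it, so skip it here.
            if len(unit) > 1 and len(set(unit)) == 1:
                continue
            repeats = (m.end() - m.start()) // len(unit)
            hits.append((m.start(), m.end(), unit, repeats))
    return sorted(hits)

if __name__ == "__main__":
    demo = "GGT" + "A" * 7 + "CG" + "CA" * 5 + "TT"
    for hit in find_microsatellites(demo):
        print(hit)
```

The lazy quantifier in `MULTI` makes the engine try the shortest unit first, so a run such as CACACACACA is reported with unit CA rather than CACA.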

3. The microsatellite instability analysis method according to claim 1, wherein the method for obtaining said set of candidate microsatellite loci S1 by primary screening comprises:

according to the sequencing data of the normal control samples, calculating, for each microsatellite locus, the mean R1 of the ratio of the effective depth of the locus to the average depth of the sample, and the mean R2 of the ratio of reads with an alignment quality value MAPQ less than X, or with multiple alignments, to the total reads covering the locus; and removing the microsatellite loci with R1 less than a second threshold or R2 greater than a third threshold to obtain the candidate microsatellite locus set S1.

4. The method of microsatellite instability analysis according to claim 3, wherein said X is 30;

and/or, the second threshold is 33%;

and/or, the third threshold is 5%.
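
The depth and alignment-quality filter of claims 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the input layout (one tuple of counts per normal control sample), the function name `primary_screen`, and the handling of samples with no reads at a locus are hypothetical; only the thresholds of 33% and 5% come from the claims.

```python
from statistics import mean

def primary_screen(locus_stats, r1_min=0.33, r2_max=0.05):
    """Return the loci that pass the depth (R1) and alignment-quality (R2) filters.

    locus_stats maps a locus name to one tuple per normal control sample:
    (effective_depth, sample_mean_depth, flagged_reads, total_reads), where
    flagged_reads counts reads with MAPQ < X or with multiple alignments.
    """
    kept = []
    for locus, rows in locus_stats.items():
        # R1: mean ratio of locus depth to the average depth of the sample.
        r1 = mean(depth / avg for depth, avg, _, _ in rows)
        # R2: mean fraction of flagged reads; a sample with no reads at the
        # locus is treated as fully flagged (assumption, not from the claims).
        r2 = mean((bad / total) if total else 1.0 for _, _, bad, total in rows)
        if r1 >= r1_min and r2 <= r2_max:
            kept.append(locus)
    return kept

if __name__ == "__main__":
    stats = {
        "locusA": [(100, 200, 2, 100), (120, 200, 3, 100)],
        "locusB": [(40, 200, 1, 40), (50, 200, 0, 50)],
        "locusC": [(150, 200, 15, 150), (160, 200, 16, 160)],
    }
    print(primary_screen(stats))
```

In this toy input, locusB fails on depth (R1 below 33%) and locusC fails on alignment quality (R2 above 5%).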

5. The microsatellite instability analysis method according to claim 1, wherein, after the candidate microsatellite locus set S1 is obtained by primary screening, a short sequence to the left of the start position of each microsatellite locus on the reference genome is taken as a left primer LPrimer, and a short sequence to the right of the end position is taken as a right primer RPrimer; preferably, the lengths of the left primer LPrimer and the right primer RPrimer are each greater than or equal to 5 bp, more preferably 5 bp;

and/or, in the sample training step, when a tumor sample with stable microsatellites and a tumor sample with unstable microsatellites are selected, the method for detecting whether the microsatellites of the sample are stable is selected from at least one of a multiplex fluorescence PCR-capillary electrophoresis method and an immunohistochemical staining method.

6. The microsatellite instability analysis method according to claim 1, wherein calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj of each microsatellite locus Si in the training set comprises:

extracting, from the sequencing data of each tumor sample, the reads covering the microsatellite locus Si together with the left primer LPrimer and the right primer RPrimer, and splicing them according to their alignment positions to obtain a restored DNA sequence SEQ; if the number of sequences SEQ for the microsatellite locus in the sample is less than a fourth threshold, the locus is excluded from the statistics and the insertion length distribution calculation is skipped; if the number of sequences SEQ for the microsatellite locus in the sample is greater than or equal to the fourth threshold, the insertion length distribution is calculated; preferably, the fourth threshold is 30.
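
The following minimal sketch shows one way the normalized insertion length distribution could be built from restored sequences SEQ. It is a simplification under stated assumptions: the primers are located by plain string search rather than by alignment position, the insertion length is taken as the number of bases between the two primers, and the function name `insertion_length_distribution` is hypothetical. Only the minimum of 30 sequences comes from the claims.

```python
from collections import Counter

def insertion_length_distribution(seqs, lprimer, rprimer, min_seqs=30):
    """Return {insertion_length: proportion}, or None if too few sequences span the locus."""
    lengths = []
    for seq in seqs:
        left = seq.find(lprimer)
        if left < 0:
            continue  # sequence does not contain the left primer
        start = left + len(lprimer)
        right = seq.find(rprimer, start)
        if right < 0:
            continue  # sequence does not reach the right primer
        lengths.append(right - start)
    if len(lengths) < min_seqs:
        return None  # locus excluded from the statistics for this sample
    counts = Counter(lengths)
    total = len(lengths)
    return {length: n / total for length, n in sorted(counts.items())}

if __name__ == "__main__":
    reads = ["GGTCA" + "A" * 10 + "CTGAC"] * 20 + ["GGTCA" + "A" * 9 + "CTGAC"] * 10
    print(insertion_length_distribution(reads, "GGTCA", "CTGAC"))
```

A 5 bp primer can occur by chance elsewhere in a read, so a production implementation would anchor the primers at their aligned reference coordinates instead of searching the read text.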

7. The microsatellite instability analysis method according to claim 6, wherein calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj of each microsatellite locus Si in the training set comprises:

calculating the distance between the left primer LPrimer and the right primer RPrimer in each spliced sequence SEQ to obtain the insertion length L, obtaining the normalized statistical distribution of insertion lengths, namely the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qi of the microsatellite locus Si, and calculating the earth mover's distance (EMD) value between the two distributions;

and/or, the earth mover's distance (EMD) value is calculated by the following formula:

Figure FDA0002566125500000021

$P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$, where P has m lengths and $w_{p_i}$ is the proportion of length $p_i$;

$Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$, where Q has n lengths and $w_{q_j}$ is the proportion of length $q_j$;

each entry $d_{ij}$ of the matrix $[d_{ij}]$ represents the length difference between $p_i$ and $q_j$, and each entry $f_{ij}$ of the matrix $[f_{ij}]$ represents the flow from $p_i$ to $q_j$; the optimal solution $[f^*_{ij}]$ of $[f_{ij}]$ is found, and the EMD value is then calculated;

and/or, the optimal solution $[f^*_{ij}]$ of $[f_{ij}]$ is found using the Hungarian algorithm;

and/or, before the significance value of the difference in EMD values of each microsatellite locus between the microsatellite-stable samples and the microsatellite-unstable samples in the training set is calculated, microsatellite loci for which the proportion of training-set samples failing the condition that the number of sequences SEQ is greater than or equal to the fourth threshold exceeds a fifth threshold are deleted;

and/or, the fifth threshold is 10%;

and/or, the significance value of the difference in EMD values of each microsatellite locus between the microsatellite-stable samples and the microsatellite-unstable samples in the training set is calculated using a t-test;

and/or, the significance value is a P value;

and/or, in the microsatellite-stable samples, the EMD values of the microsatellite loci in the candidate microsatellite locus set S2 are sorted in descending order, the EMD value at the Y percent position is taken as the Cut-off value for judging whether a locus is unstable, a microsatellite locus with an EMD value greater than the Cut-off value is determined to be an unstable microsatellite locus, and a microsatellite locus with an EMD value less than or equal to the Cut-off value is determined to be a stable microsatellite locus;

and/or, said Y is 5;

and/or, after the stable microsatellite loci and unstable microsatellite loci are obtained by screening, the unstable locus percentage of each sample in the training set, namely MSSAR, is calculated;

and/or, the unstable locus percentage is calculated by counting, for a single sample, the percentage of unstable loci among the loci of the candidate microsatellite locus set S2 that meet the condition that the number of sequences SEQ is greater than or equal to the fourth threshold;

and/or, an optimal Cut-off value for judging whether the microsatellites of a sample are stable is obtained by adjusting the unstable locus percentage;

and/or, the tumor sample comprises a tumor tissue sample or a body fluid sample; preferably, the body fluid sample comprises blood, pleural fluid, or cerebrospinal fluid;

and/or, the sample to be tested comprises a tumor tissue sample or a body fluid sample; preferably, the body fluid sample comprises blood, pleural fluid, or cerebrospinal fluid.
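
The Cut-off rule and the unstable locus percentage described in claim 7 can be sketched as below. Several points are assumptions made for illustration: the Cut-off is computed per locus from the EMD values of that locus across microsatellite-stable samples, the "Y percent" position is read as the index `int(n * Y / 100)` in the descending ranking, and the function names `locus_cutoff` and `unstable_fraction` are hypothetical. The claims leave these details open.

```python
def locus_cutoff(mss_emd_values, y_percent=5.0):
    """EMD value at the Y percent position of the descending ranking (assumed indexing)."""
    ranked = sorted(mss_emd_values, reverse=True)
    idx = min(int(len(ranked) * y_percent / 100.0), len(ranked) - 1)
    return ranked[idx]

def unstable_fraction(sample_emd, cutoffs):
    """Fraction of evaluable loci whose EMD exceeds the locus Cut-off.

    sample_emd maps locus -> EMD value, or None when the locus had fewer
    sequences than the fourth threshold and was therefore not evaluated.
    """
    evaluable = [loc for loc, emd in sample_emd.items()
                 if emd is not None and loc in cutoffs]
    if not evaluable:
        return None
    unstable = [loc for loc in evaluable if sample_emd[loc] > cutoffs[loc]]
    return len(unstable) / len(evaluable)

if __name__ == "__main__":
    print(locus_cutoff(list(range(1, 101))))
    print(unstable_fraction({"a": 0.5, "b": 0.1, "c": None},
                            {"a": 0.2, "b": 0.2, "c": 0.2}))
```

A locus whose EMD equals the Cut-off counts as stable, matching the "less than or equal to" wording of the claim.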

8. A microsatellite instability analysis apparatus comprising:

a data acquisition module, configured to acquire sequencing data of a tumor sample and a normal control sample of at least one individual;

a primary microsatellite locus screening module, configured to screen microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training module, configured to select microsatellite-stable tumor samples and microsatellite-unstable tumor samples as a training set and, for each microsatellite locus Si of the microsatellite locus set S1 in each sample of the training set, to calculate the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj, and to calculate the earth mover's distance (EMD) value between the two distributions;

a secondary microsatellite locus screening module, configured to calculate, for each microsatellite locus, the significance value of the difference in EMD values between the microsatellite-stable tumor samples and the microsatellite-unstable tumor samples in the training set, and to retain the microsatellite loci Si whose significance value is less than a first threshold to obtain a candidate microsatellite locus set S2;

a tertiary microsatellite locus screening module, configured to set, according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite-stable tumor samples, a Cut-off value for judging each microsatellite locus, and to screen to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judgment module, configured to acquire sequencing data of a sample to be tested and a normal control sample from the same subject, to calculate the EMD value of each microsatellite locus using the insertion length distribution calculation method applied to each microsatellite locus Si in the training set, and to judge each microsatellite locus to be stable or unstable according to the Cut-off value.

9. An apparatus, comprising:

a memory for storing a program;

a processor for implementing the method of any one of claims 1 to 7 by executing a program stored by the memory.

10. A computer-readable storage medium, characterized by comprising a program which is executable by a processor to implement the method of any one of claims 1-7.

Technical Field

The invention relates to the field of gene detection, in particular to a microsatellite instability analysis method and a microsatellite instability analysis device.

Background

Microsatellites are short tandem repeats distributed throughout the human genome, consisting of mono-, di-, or polynucleotide repeat units repeated 10-50 or more times. Microsatellite instability (MSI) refers to a change in microsatellite length in tumor cells, relative to normal cells, caused by the insertion or deletion of repeat units. Numerous studies have shown that MSI is caused by defects in mismatch repair (MMR) genes and is closely related to tumorigenesis. According to the literature, MSI-H (microsatellite instability-high) is present in about 15% of colorectal cancers, and these cancers differ from MSS (microsatellite stable) colorectal cancers in pathogenesis, prognosis, and drug sensitivity. MSI-H also occurs at varying proportions in solid tumors other than colorectal cancer; the response rate to Keytruda differs significantly among solid tumors with different MSI status, and MSI-H tumors are not suited to 5-FU chemotherapy alone. Clinically, MSI is used as an important molecular marker for prognosis and for formulating adjuvant treatment regimens for colorectal cancer and other solid tumors, and it is also used to assist in screening for Lynch syndrome.

At present, detection of MSI status by multiplex fluorescence PCR-capillary electrophoresis is the internationally accepted gold standard and is jointly recommended by well-known international bodies such as NCCN and ASCO. In this method, DNA is extracted from both normal tissue and tumor samples of the same patient, the detection loci are amplified by multiplex fluorescence PCR, the amplification products are detected by capillary electrophoresis, and the results from the two tissue sources are compared using dedicated software so that the MSI status of the patient can be accurately classified. The commonly used PCR method detects only 5 MSI loci: a sample is judged MSI-H when 2 or more loci are unstable, MSI-L when one locus is unstable, and MSS otherwise. However, because this method examines few loci and the judgment value has poor continuity, misjudgment between MSS and MSI-L and between MSI-H and MSI-L is likely.

With the rise of precision medicine, tumor patients often need testing for markers such as the relevant gene mutations during diagnosis and medication guidance. High-throughput sequencing (also called next-generation sequencing, NGS) has matured in the field of precision tumor therapy and is widely used because it can detect many gene loci in a single assay, can evaluate tumor mutational burden at the same time, and offers high throughput. When a patient needs gene mutations, tumor mutational burden, and MSI status evaluated together, collecting clinical samples for PCR detection in addition to NGS detection places higher demands on sample quantity, raises the cost, and adds experimental steps.

Disclosure of Invention

The invention mainly addresses problems of the prior art in detecting microsatellite instability, such as the need to collect additional clinical samples, the high cost, and the large number of experimental steps.

According to a first aspect, there is provided in one embodiment a method of microsatellite instability analysis, comprising:

a data acquisition step of acquiring sequencing data of a tumor sample and a normal control sample of at least one individual;

a primary microsatellite locus screening step of screening microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training step of selecting microsatellite-stable tumor samples and microsatellite-unstable tumor samples as a training set and, for each microsatellite locus Si of the microsatellite locus set S1 in each sample of the training set, calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj, and calculating the earth mover's distance (EMD) value between the two distributions;

a secondary microsatellite locus screening step of calculating, for each microsatellite locus, the significance value of the difference in EMD values between the microsatellite-stable tumor samples and the microsatellite-unstable tumor samples in the training set, and retaining the microsatellite loci Si whose significance value is less than a first threshold to obtain a candidate microsatellite locus set S2;

a tertiary microsatellite locus screening step of setting, according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite-stable tumor samples, a Cut-off value for judging each microsatellite locus, and screening to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judgment step of acquiring sequencing data of a sample to be tested and a normal control sample from the same individual, calculating the EMD value of each microsatellite locus using the insertion length distribution calculation method applied to each microsatellite locus Si in the training set, and judging each microsatellite locus to be stable or unstable according to the Cut-off value.

Compared with other NGS methods, the method requires no probes designed specifically for MSI loci, is broadly applicable, and relies only on the probes that already cover the detection range. The larger the detection range, the more MSI loci are covered and the higher the detection accuracy.

According to a second aspect, an embodiment provides a microsatellite instability analysis apparatus comprising:

a data acquisition module, configured to acquire sequencing data of a tumor sample and a normal control sample of at least one individual;

a primary microsatellite locus screening module, configured to screen microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training module, configured to select microsatellite-stable tumor samples and microsatellite-unstable tumor samples as a training set and, for each microsatellite locus Si of the microsatellite locus set S1 in each sample of the training set, to calculate the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj, and to calculate the earth mover's distance (EMD) value between the two distributions;

a secondary microsatellite locus screening module, configured to calculate, for each microsatellite locus, the significance value of the difference in EMD values between the microsatellite-stable tumor samples and the microsatellite-unstable tumor samples in the training set, and to retain the microsatellite loci Si whose significance value is less than a first threshold to obtain a candidate microsatellite locus set S2;

a tertiary microsatellite locus screening module, configured to set, according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite-stable tumor samples, a Cut-off value for judging each microsatellite locus, and to screen to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judgment module, configured to acquire sequencing data of a sample to be tested and a normal control sample from the same subject, to calculate the EMD value of each microsatellite locus using the insertion length distribution calculation method applied to each microsatellite locus Si in the training set, and to judge each microsatellite locus to be stable or unstable according to the Cut-off value.

According to a third aspect, the invention provides, in one embodiment, an apparatus comprising:

a memory for storing a program;

a processor for implementing the method as described in the first aspect by executing the program stored by the memory.

In a fourth aspect, the invention provides a computer readable storage medium comprising a program executable by a processor to implement the method according to the first aspect.

The invention selects microsatellite loci based on the analysis results of existing NGS detection data and directly analyzes the MSI status of a sample. It also uses multiple microsatellite loci to analyze MSI status, which overcomes the poor continuity of the judgment value and the tendency toward misjudgment of the PCR method. The method offers high sensitivity and high specificity, is simple and convenient, and requires no extra experimental steps. Compared with existing NGS-based MSI detection methods, it needs no specially designed capture probes for microsatellite loci and shows markedly improved sensitivity and specificity; compared with MSISensor, it has higher analysis accuracy.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is an ROC curve comparing the results of the method of Example 1 with those of the gold standard method;

FIG. 3 is an ROC curve comparing the results of the NGS-based MSI analysis method MSISensor with those of the gold standard method.

Detailed Description

The present invention is described in further detail below with reference to the detailed description and the accompanying drawings, in which like elements in different embodiments are given like reference numbers. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials, or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail, to avoid obscuring the core of the present application with excessive description. A detailed description of these operations is unnecessary for those skilled in the art, who can fully understand them from the specification and the general knowledge in the art.

Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the steps or actions in the method descriptions may be reordered or interchanged, as will be apparent to one of ordinary skill in the art. The sequences in the specification and drawings therefore serve only to describe particular embodiments and do not imply a required order, unless it is otherwise stated that a given order must be followed.

Because NGS detection has wide coverage, especially in large-panel capture sequencing, whole-exome sequencing, whole-genome sequencing, and the like, it can cover hundreds to millions of microsatellite loci, so MSI status can be analyzed directly from NGS result data.

The invention provides a new method that selects microsatellite loci based on the analysis results of existing NGS detection data and directly analyzes the MSI status of a sample. It also uses multiple microsatellite loci to analyze MSI status, which overcomes the poor continuity of the judgment value and the tendency toward misjudgment of the PCR method. The method offers high sensitivity and high specificity, is simple and convenient, and requires no extra experimental steps. Compared with existing NGS-based MSI detection methods, it needs no specially designed capture probes for microsatellite loci and shows markedly improved sensitivity and specificity.

Interpretation of terms

MS: microsatellites (microsatellites) are short tandem repeats distributed throughout the human genome, with single, double or multiple nucleotide repeats, repeating 10-50 or more times.

MSI: namely, microsatellite instability.

MSI-H: microsatellite instability-high.

MSS: microsatellite stable.

MSI-L: microsatellite instability-low.

Panel: refers to a collection of genes and sites tested.

BAM: a file format that stores the alignment information of each read after NGS sequencing data are aligned to the reference genome.

PCR: polymerase chain reaction, a molecular biology technique for amplifying specific DNA fragments. The PCR process requires primers.

NGS: Next-Generation Sequencing, i.e., second-generation sequencing or high-throughput sequencing.

Read: a contiguous DNA sequence generated by sequencing, composed of the four bases A, T, C, and G, e.g., ATCCGTAGCTCACGGACG. In the paired-end mode of second-generation sequencing, a DNA fragment is sequenced from both ends, and the two resulting reads are paired with each other.

EMD: Earth Mover's Distance. If a probability distribution is pictured as a mound of earth, the distance indicates how much work is needed to reshape probability distribution P1 into probability distribution P2. The measure can be used to quantify the difference between two probability distributions.

FDR (false discovery rate): a measure of false discoveries, namely the expected proportion of false positives among all results called positive.

Cut-off: the critical value or the positive judgment value is the standard for judging whether the detection result is positive or negative.
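
To make the EMD entry above concrete, the following is a minimal sketch for two normalized insertion length distributions. It does not use the Hungarian algorithm named in the claims. Instead it uses a known shortcut for one-dimensional distributions of equal total mass with absolute length difference as the ground distance: the EMD equals the area between the two cumulative distributions. The function name `emd_1d` and the dictionary input format are assumptions made for illustration.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1-D distributions.

    p and q map insertion length -> proportion, each summing to 1.
    Uses the closed form for one dimension: the sum over adjacent support
    points of |CDF_p - CDF_q| times the gap between them.
    """
    support = sorted(set(p) | set(q))
    emd = 0.0
    cdf_gap = 0.0  # running difference between the two cumulative sums
    for left, right in zip(support, support[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        emd += abs(cdf_gap) * (right - left)
    return emd

if __name__ == "__main__":
    normal = {10: 1.0}
    tumor = {9: 0.25, 10: 0.5, 11: 0.25}
    print(emd_1d(tumor, normal))
```

In the demo, a quarter of the tumor reads are one base shorter and a quarter one base longer than the normal length, so half of the mass has to move by one base.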

In a first aspect, the present invention provides, in some embodiments, a method of microsatellite instability analysis comprising:

a data acquisition step of acquiring sequencing data of a tumor sample and a normal control sample of at least one individual;

a primary microsatellite locus screening step of screening microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training step of selecting microsatellite-stable tumor samples and microsatellite-unstable tumor samples as a training set and, for each microsatellite locus Si of the microsatellite locus set S1 in each sample of the training set, calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj, and calculating the earth mover's distance (EMD) value between the two distributions;

a secondary microsatellite locus screening step of calculating, for each microsatellite locus, the significance value of the difference in EMD values between the microsatellite-stable tumor samples and the microsatellite-unstable tumor samples in the training set, and retaining the microsatellite loci Si whose significance value is less than a first threshold to obtain a candidate microsatellite locus set S2;

a tertiary microsatellite locus screening step of setting, according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite-stable tumor samples, a Cut-off value for judging each microsatellite locus, and screening to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judgment step of acquiring sequencing data of a tumor sample to be tested and a normal control sample of the same subject, calculating the EMD value of each microsatellite locus using the insertion length distribution calculation method applied to each microsatellite locus Si in the training set, and judging each microsatellite locus to be stable or unstable according to the Cut-off value.

In some embodiments, the sequencing data of the tumor sample and the normal control sample in the data acquisition step, and the sequencing data of the sample to be tested and the normal control sample in the microsatellite stability judgment step, can be obtained by existing sequencing methods, including but not limited to whole-genome sequencing, exome sequencing, targeted capture sequencing, and amplicon sequencing.

In some embodiments, the EMD value of each microsatellite locus is calculated in the same way as for the training samples.

In some embodiments, in the microsatellite stability judgment step, the sample to be tested and the normal control sample are derived from the same subject. Where the subject differs from the individual in the data acquisition step that precedes the primary screening of the microsatellite loci (that is, where the sample to be tested does not come from the same individual as the training samples), the subject may be an individual who has been clinically diagnosed as a tumor patient or an individual who has not been clinically diagnosed. A tumor sample generally refers to a sample from a diseased site, tissue, or body fluid of a tumor patient, such as a cancerous tissue sample from a colorectal cancer patient. A normal control sample, which may also be called a control sample, generally refers to a sample from a non-diseased site, tissue, or body fluid of the same tumor patient, for example a leukocyte sample isolated from peripheral blood, paracancerous normal tissue, or saliva.

In some embodiments, the invention is applicable to the detection of all cancers. Preferably, the types of cancer to which the present invention is applied include, but are not limited to, intestinal cancer, lung cancer, stomach cancer, esophageal cancer, liver cancer, kidney cancer, melanoma, brain cancer, pancreatic cancer, urinary system tumor, leukemia, and lymphoma.

In some embodiments, the invention is applicable to circulating tumor DNA, hematologic tumor MSI detection, and also to detection of pleural fluid, cerebrospinal fluid, and like samples.

In some embodiments, the invention is applicable to high throughput sequencing data analysis of microsatellite instability in cancer-related gene detection for clinical diagnosis, prognosis, clinical treatment guidance, and the like.

The samples addressed by the invention are isolated samples; the method is not practiced directly on a living human or animal body, and the microsatellite instability analysis result is an intermediate result rather than a final disease diagnosis. The invention therefore does not fall within methods for the diagnosis or treatment of disease.

In some embodiments, the invention may also be used for other non-diagnostic, non-therapeutic purposes, e.g., for screening existing drugs, new drug candidates, etc. for the treatment of cancer in scientific experiments.

In some embodiments, the normal control sample can be whole blood, more preferably peripheral blood or a peripheral blood cell fraction. As will be understood by those skilled in the art, a blood sample may include, but is not limited to, any portion or component of blood, such as T cells, monocytes, neutrophils, erythrocytes, platelets, and microvesicles (e.g., exosomes and exosome-like vesicles). In the context of the present disclosure, the blood cells contained in the blood sample encompass any nucleated cells and are not limited to components of whole blood. Thus, blood cells comprise, for example, white blood cells (WBCs). In some embodiments, a normal control sample may also be referred to as a normal sample or a control sample.

In some embodiments, the sequencing methods for each sample include, but are not limited to, high throughput sequencing methods such as whole genome sequencing, whole exome sequencing, or capture probe sequencing. In a preferred embodiment, the sequencing data for all samples may be obtained by whole exome sequencing.

In some embodiments, the genomic second-generation sequencing data of the tumor sample and the normal control sample are generally first aligned to the reference genome. Thus, in a preferred embodiment, the data acquisition step acquires an alignment file of the genomic second-generation sequencing data of the tumor sample and the normal control sample aligned to the reference genome.

The reference genome can be, for example, a standard reference genomic sequence of a species (e.g., human). In one embodiment it is hg19, one version of the human reference genome; in another embodiment it is hg38, another version of the human reference genome.

In some embodiments, the tumor sample has a sequencing depth of > 200 ×, in other embodiments, the tumor sample has a sequencing depth of > 300 ×, in other embodiments, the tumor sample has a sequencing depth of > 400 ×, and in other embodiments, the tumor sample has a sequencing depth of > 500 ×.

In some embodiments, the sequencing depth of the normal control sample is > 50 x, in other embodiments the sequencing depth of the normal control sample is > 100 x, and in other embodiments the sequencing depth of the normal control sample is > 200 x.

In some embodiments, a microsatellite locus passes the primary screening if it satisfies at least one of the following conditions:

1) a single base is repeated at the locus 6 or more times;

2) a unit of 2-4 bases is repeated at the locus 5 or more times.
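The two conditions above can be illustrated with a short sketch. The function name `find_microsatellites`, the use of regular expressions, and the 0-based half-open coordinates are assumptions made for illustration; the source does not say how the loci are searched for.

```python
import re

# Condition 1: one base repeated 6 or more times.
MONO = re.compile(r"([ACGT])\1{5,}")
# Condition 2: a 2-4 base unit repeated 5 or more times (shortest unit preferred).
MULTI = re.compile(r"([ACGT]{2,4}?)\1{4,}")


def find_microsatellites(seq):
    """Return sorted (start, end, unit, copies) tuples, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for m in MONO.finditer(seq):
        hits.append((m.start(), m.end(), m.group(1), len(m.group(0))))
    for m in MULTI.finditer(seq):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # a homopolymer such as "AA" x 5 is already covered by MONO
        hits.append((m.start(), m.end(), unit, len(m.group(0)) // len(unit)))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGT" + "A" * 6 + "CT" + "CA" * 5 + "GG"
    for hit in find_microsatellites(example):
        print(hit)
```

In this sketch a run of five identical bases, or four copies of a dinucleotide, is not reported, which mirrors the thresholds of 6 and 5 stated above.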

The set of microsatellite loci used is not limited herein; loci can be searched for according to the detection range concerned.

In some embodiments, the method of screening said set of candidate microsatellite loci S1 comprises:

according to the sequencing data of the tumor sample and the normal control sample, calculating, for each microsatellite locus, the mean value R1 of the ratio of the effective depth of the locus to the average depth of the sample, and the mean value R2 of the ratio of reads with a mapping quality value MAPQ less than X, or reads that align to multiple positions, to the total reads covering the locus; and removing the microsatellite loci with R1 less than a second threshold or R2 greater than a third threshold to obtain the candidate microsatellite locus set S1.

The effective depth of each microsatellite locus is the number of reads covering the locus after PCR duplicates have been removed.

The mean value R1 is the arithmetic mean, over the samples, of the ratio of the effective depth of the locus to the average depth of the sample.

The mapping quality value MAPQ is a quality value used in NGS alignment to measure the accuracy of an alignment result; the larger the value, the more reliable the alignment. It is calculated by the alignment software, such as BWA or Bowtie.

In some embodiments, said X is 30.

In some embodiments, the second threshold is 33%.

In some embodiments, the third threshold is 5%.
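The R1 and R2 filter with the thresholds given above (33% and 5%) can be sketched as follows. The function name `screen_loci` and the input layout, one tuple of (effective depth, sample average depth, low-MAPQ or multi-aligned reads, total covering reads) per sample, are hypothetical; in practice these counts would be derived from the alignment files. Treating a sample with no covering reads as fully unreliable is also an assumption.

```python
from statistics import mean


def screen_loci(per_locus_stats, min_r1=0.33, max_r2=0.05):
    """Keep loci whose mean depth ratio R1 and mean unreliable-read ratio R2 pass.

    per_locus_stats maps a locus name to a list of per-sample tuples:
    (effective_depth, sample_average_depth, unreliable_reads, total_reads).
    """
    kept = []
    for locus, samples in per_locus_stats.items():
        r1 = mean([eff / avg for eff, avg, _, _ in samples])
        # A sample with no covering reads counts as fully unreliable (assumption).
        r2 = mean([bad / total if total else 1.0 for _, _, bad, total in samples])
        if r1 >= min_r1 and r2 <= max_r2:
            kept.append(locus)
    return kept


if __name__ == "__main__":
    stats = {
        "locusA": [(100, 200, 2, 100), (120, 200, 1, 100)],  # R1 0.55, R2 0.015
        "locusB": [(40, 200, 1, 100)],                       # R1 0.20, removed
        "locusC": [(150, 200, 10, 100)],                     # R2 0.10, removed
    }
    print(screen_loci(stats))
```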

In some embodiments, after the initial screening to obtain the candidate microsatellite locus set S1, on the reference genome, the short sequence to the left of the starting position of the microsatellite locus is taken as the left primer LPrimer, and the short sequence to the right of the ending position is taken as the right primer RPrimer.

In a preferred embodiment, the lengths of the left primer LPrimer and the right primer RPrimer are not less than 5bp, and can include, but are not limited to, 6bp or more, 8bp or more, 10bp or more, 15bp or more, 20bp or more, 25bp or more, 30bp or more, 40bp or more, 50bp or more, and more preferably 5 bp.

In some embodiments, the lengths of the left primer LPrimer and the right primer RPrimer may be the same, may be different, and are preferably the same.
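Taking the flanking sequences from the reference can be sketched as below. The function name `flanking_primers` and the 0-based half-open coordinate convention are assumptions; the default of 5 bp follows the preferred length stated above.

```python
def flanking_primers(reference, start, end, flank=5):
    """Return (LPrimer, RPrimer) for a repeat tract at reference[start:end].

    start and end are 0-based half-open coordinates of the repeat tract.
    """
    if start - flank < 0 or end + flank > len(reference):
        raise ValueError("not enough reference sequence around the locus")
    lprimer = reference[start - flank:start]
    rprimer = reference[end:end + flank]
    return lprimer, rprimer


if __name__ == "__main__":
    ref = "GATTACAGG" + "A" * 8 + "CTGACTTT"
    print(flanking_primers(ref, 9, 17))  # flanks of the 8-base poly-A tract
```

Using the same length on both sides, as preferred above, keeps the later insertion length calculation symmetric.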

In some embodiments, when microsatellite stable tumor samples and microsatellite unstable tumor samples are selected for use with the candidate microsatellite locus set S1, the microsatellite status of a sample may be determined by an existing method including, but not limited to, multiplex fluorescence PCR-capillary electrophoresis, immunohistochemical (IHC) staining, and the like.

In some embodiments, when the microsatellite status of a sample is detected by multiplex fluorescence PCR-capillary electrophoresis, a sample with a PCR result of MSS (microsatellite stable) is used as a microsatellite stable sample, and a sample with a PCR result of MSI-H (high microsatellite instability) is used as a microsatellite unstable sample. MSI-L samples are not included in the training set; samples that are MSI-L by PCR are reported as MSS in NGS detection.

In other embodiments, when the microsatellite status of a sample is detected by immunohistochemical (IHC) staining, a sample with a pMMR result (MMR genes normal) is used as a microsatellite stable sample, and a sample with a dMMR result (MMR gene loss) is used as a microsatellite unstable sample.

In some embodiments, calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj of each microsatellite locus Si in the training set comprises:

extracting, from the sequencing data of each tumor sample, the reads covering the microsatellite locus Si together with its left primer LPrimer and right primer RPrimer, and splicing the paired reads according to their alignment positions to restore the DNA sequence SEQ; if the number of sequences SEQ for the microsatellite locus in the tumor sample is less than a fourth threshold, the locus is excluded from subsequent statistics for that sample and the step of calculating the insertion length distribution is skipped.

In some embodiments, the fourth threshold is 30. That is, the number of DNA molecules covering the microsatellite locus must reach the fourth threshold, which in some embodiments may be higher than 30. If the number of SEQ is below the fourth threshold, the locus is not covered deeply enough to be included in the calculation.

In some embodiments, calculating the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj of each microsatellite locus Si in the training set comprises:

calculating, for each spliced sequence SEQ, the distance between the left primer LPrimer and the right primer RPrimer to obtain the insertion length L, and obtaining the statistical distribution of the normalized insertion lengths, namely the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qi of the microsatellite locus Si; and calculating the earth mover's distance (EMD) value between the two distributions, wherein the larger the EMD value, the greater the probability that the locus is unstable.
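The insertion length and its normalized distribution can be sketched as follows. The function names are hypothetical, and locating the primers by plain substring search is a simplification assumed here; the source splices reads by alignment position, which avoids ambiguity when a 5 bp primer occurs more than once. The minimum count is applied to the sequences in which both primers were found, which is likewise an assumption.

```python
from collections import Counter


def insertion_length(seq, lprimer, rprimer):
    """Bases between the end of LPrimer and the start of RPrimer, or None."""
    left = seq.find(lprimer)
    if left < 0:
        return None
    start = left + len(lprimer)
    right = seq.find(rprimer, start)
    if right < 0:
        return None
    return right - start


def length_distribution(seqs, lprimer, rprimer, min_seqs=30):
    """Normalized insertion length distribution, or None if coverage is too low."""
    lengths = []
    for seq in seqs:
        length = insertion_length(seq, lprimer, rprimer)
        if length is not None:
            lengths.append(length)
    if len(lengths) < min_seqs:
        return None  # locus skipped for this sample
    counts = Counter(lengths)
    return {length: n / len(lengths) for length, n in sorted(counts.items())}


if __name__ == "__main__":
    reads = ["ACAGG" + "A" * 8 + "CTGAC"] * 3 + ["ACAGG" + "A" * 7 + "CTGAC"]
    print(length_distribution(reads, "ACAGG", "CTGAC", min_seqs=4))
```

The returned weights sum to 1, which is the normalization the text refers to.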

In some embodiments, the earth mover's distance (EMD) value is calculated as follows:

Figure BDA0002566125510000081

P=(p1,wp1),...,(pm,wpm): P has m lengths, and wpi is the proportion of length pi;

Q=(q1,wq1),...,(qn,wqn): Q has n lengths, and wqj is the proportion of length qj;

Each entry dij of the matrix [dij] represents the length difference between pi and qj, and each entry fij of the matrix [fij] represents the amount moved from pi to qj. The optimal solution [f*ij] of [fij] is found, and then the EMD value is calculated.

In some embodiments, the optimal solution [f*ij] of [fij] is found using the Hungarian algorithm.
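The figure holding the formula is not reproduced in this text. Assuming it shows the standard earth mover's distance, which agrees with the definitions of dij, fij and the optimal solution f*ij given here, it can be written as below. The constraint set is part of that standard definition and is stated as an assumption, not as text recovered from the figure.

```latex
\mathrm{EMD}(P,Q)
  = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f^{*}_{ij}\, d_{ij}}
         {\sum_{i=1}^{m}\sum_{j=1}^{n} f^{*}_{ij}},
\qquad d_{ij} = \lvert p_i - q_j \rvert ,

[f^{*}_{ij}] = \arg\min_{[f_{ij}]} \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
f_{ij} \ge 0,\;
\sum_{j=1}^{n} f_{ij} \le w_{p_i},\;
\sum_{i=1}^{m} f_{ij} \le w_{q_j},\;
\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
  = \min\!\Bigl(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j}\Bigr).
```

When both distributions are normalized to a total weight of 1, the denominator equals 1 and the EMD is simply the minimum total cost of moving P onto Q.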

In some embodiments, before the significance (P value) of the difference between the EMD values of each microsatellite locus in the microsatellite stable samples and in the microsatellite unstable samples of the training set is calculated, microsatellite loci for which the proportion of training set samples not satisfying the fourth threshold for SEQ is larger than a fifth threshold are deleted.

In some embodiments, the fifth threshold is 10%. That is, loci that have low coverage in more than 10% of the samples are deleted.
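A minimal sketch of this low-coverage filter follows, using the fourth threshold of 30 and the fifth threshold of 10% as defaults. The function name `passes_coverage_filter` and the input, one SEQ count per training sample for a given locus, are assumptions for illustration.

```python
def passes_coverage_filter(seq_counts, min_seqs=30, max_low_fraction=0.10):
    """True if at most max_low_fraction of samples have fewer than min_seqs SEQ."""
    if not seq_counts:
        raise ValueError("no samples given")
    low = sum(1 for n in seq_counts if n < min_seqs)
    return low / len(seq_counts) <= max_low_fraction


if __name__ == "__main__":
    # 5 of 130 samples below the threshold: about 3.8%, so the locus is kept.
    print(passes_coverage_filter([50] * 125 + [10] * 5))
    # 30 of 130 samples below the threshold: about 23%, so the locus is deleted.
    print(passes_coverage_filter([50] * 100 + [10] * 30))
```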

In some embodiments, the significance of the difference between the EMD values of each microsatellite locus in the microsatellite stable samples and in the microsatellite unstable samples of the training set is calculated using a t-test, which selects the loci that effectively identify MSI-H samples. In some embodiments, the significance value is calculated using a paired t-test.

In some embodiments, the significance value may be a P value.

In some embodiments, the first threshold is 0.05.
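The locus selection by significance can be sketched as follows. The source names a t-test without giving its variant, so this sketch assumes an unpaired Welch statistic between the two sample groups, and it approximates the two-sided P value with the standard normal distribution to stay within the Python standard library. A real analysis would use the t distribution from a statistics package; the function names and the example EMD values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(stable, unstable):
    """Welch t statistic and a normal-approximation two-sided P value."""
    se = sqrt(variance(stable) / len(stable) + variance(unstable) / len(unstable))
    if se == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(stable) - mean(unstable)) / se
    # Normal approximation (assumption); exact values need the t distribution.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


def select_loci(emd_by_locus, first_threshold=0.05):
    """Keep loci whose P value is below the first threshold.

    emd_by_locus maps a locus name to (EMD values in stable samples,
    EMD values in unstable samples).
    """
    return [
        locus
        for locus, (stable, unstable) in emd_by_locus.items()
        if welch_test(stable, unstable)[1] < first_threshold
    ]


if __name__ == "__main__":
    data = {
        "informative": ([0.10, 0.12, 0.11, 0.09, 0.13, 0.10],
                        [0.90, 1.10, 0.95, 1.05, 1.00, 0.98]),
        "uninformative": ([0.10, 0.12, 0.11, 0.09],
                          [0.11, 0.10, 0.12, 0.09]),
    }
    print(select_loci(data))
```

The normal approximation is reasonable only for group sizes on the order of the 100 and 30 samples used in the worked embodiment; for small groups it understates the P value.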

In some embodiments, for each microsatellite locus in the candidate microsatellite locus set S2, the EMD values of the locus in the microsatellite stable samples are sorted in descending order, and the EMD value at the Y% position (FDR < Y%) is used as the Cut-off value for judging whether the locus is unstable; a microsatellite locus with an EMD value greater than the Cut-off value is judged to be a microsatellite unstable locus, and a microsatellite locus with an EMD value less than or equal to the Cut-off value is judged to be a microsatellite stable locus.

In some embodiments, the Y is 5.
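The per-locus Cut-off can be sketched as below. How the Y% position is rounded is not stated in the source; this sketch assumes the value at rank ceil(n × Y / 100) in descending order, so that strictly fewer than Y% of the stable samples exceed the Cut-off. The function names are hypothetical.

```python
from math import ceil


def locus_cutoff(stable_emd_values, top_percent=5.0):
    """EMD value at the top_percent position of the stable samples, descending."""
    if not stable_emd_values:
        raise ValueError("no EMD values given")
    ordered = sorted(stable_emd_values, reverse=True)
    rank = max(ceil(len(ordered) * top_percent / 100.0), 1)  # 1-based rank
    return ordered[rank - 1]


def is_unstable(emd, cutoff):
    """A locus is unstable only when its EMD is strictly above the Cut-off."""
    return emd > cutoff


if __name__ == "__main__":
    stable = [i / 100 for i in range(1, 101)]  # 0.01 .. 1.00, 100 stable samples
    cutoff = locus_cutoff(stable)
    print(cutoff, is_unstable(0.97, cutoff), is_unstable(cutoff, cutoff))
```

With 100 stable samples the Cut-off is the fifth largest EMD value, so at most four stable samples would be called unstable at that locus.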

In some embodiments, after the stable and unstable microsatellite loci are screened, the percentage of unstable loci, i.e., the MSIratio, is calculated for each sample in the training set.

In some embodiments, the method of calculating the MSIratio comprises: counting, for a single sample, the percentage of unstable loci among the loci of the candidate microsatellite locus set S2 that satisfy SEQ greater than or equal to the fourth threshold.

SEQ is the DNA template sequence obtained, after de-duplication, by splicing a read pair according to the genomic positions and sequences of its two reads.
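The MSIratio calculation can be sketched as follows. The function name `msi_ratio` and the input layout, one (SEQ count, EMD value, locus Cut-off) tuple per locus of S2, are assumptions for illustration; returning None when no locus has enough coverage is also an assumption.

```python
def msi_ratio(loci, min_seqs=30):
    """Percentage of unstable loci among loci with at least min_seqs SEQ.

    loci is an iterable of (seq_count, emd, cutoff) tuples for one sample.
    """
    evaluable = [(emd, cutoff) for n, emd, cutoff in loci if n >= min_seqs]
    if not evaluable:
        return None  # no locus covered deeply enough in this sample
    unstable = sum(1 for emd, cutoff in evaluable if emd > cutoff)
    return 100.0 * unstable / len(evaluable)


if __name__ == "__main__":
    sample = [
        (50, 0.90, 0.64),  # unstable
        (40, 0.20, 0.64),  # stable
        (10, 0.90, 0.64),  # ignored: fewer than 30 SEQ
        (35, 0.70, 0.50),  # unstable
        (60, 0.10, 0.30),  # stable
    ]
    print(msi_ratio(sample))
```

Loci below the coverage threshold are left out of both the numerator and the denominator, so a poorly covered locus neither raises nor lowers the ratio.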

In some embodiments, the optimal Cut-off value for determining the instability of a sample microsatellite is obtained by adjusting the percentage of unstable sites.

In some embodiments, after the stable or unstable state of each microsatellite locus in the sample to be tested is determined, the percentage of unstable loci of the sample to be tested, i.e., the MSIratio, is calculated; it may also be referred to as the total MSIratio value.

In some embodiments, the optimal Cut-off value for determining the instability of the sample microsatellite is obtained by adjusting the percentage of unstable sites in the sample to be tested. Here, no fixed cut-off value is given; instead, the cut-off value for the MSIratio is derived from the training set.

In some embodiments, the tumor sample comprises a tumor tissue sample or a body fluid sample; preferably, the body fluid sample comprises, but is not limited to, blood, pleural fluid, or cerebrospinal fluid.

In some embodiments, the sample to be tested comprises a tumor tissue sample or a body fluid sample; preferably, the body fluid sample comprises, but is not limited to, blood, pleural fluid, or cerebrospinal fluid.

In a second aspect, the present invention provides, in some embodiments, a microsatellite instability analysis apparatus comprising:

a data acquisition module, used for respectively acquiring sequencing data of a tumor sample and a normal control sample of at least one individual;

a microsatellite locus primary screening module, used for screening the microsatellite loci from the sequencing data of at least one normal control sample to obtain a candidate microsatellite locus set S1;

a sample training module, used for selecting microsatellite stable tumor samples and microsatellite unstable tumor samples as a training set, respectively calculating, for each sample in the training set, the tumor sample insertion length distribution Pi and the normal control sample insertion length distribution Qj of each microsatellite locus of the microsatellite locus set S1, and calculating the earth mover's distance (EMD) value between the two distributions;

a microsatellite locus re-screening module, used for calculating the significance value of the difference between the EMD values of each microsatellite locus in the microsatellite stable tumor samples and in the microsatellite unstable tumor samples of the training set, and retaining the microsatellite loci Si with a significance value smaller than a first threshold to obtain a candidate microsatellite locus set S2;

a microsatellite locus tertiary screening module, used for setting a Cut-off value for judging a microsatellite locus according to the EMD values of the microsatellite loci of the candidate microsatellite locus set S2 in the microsatellite stable tumor samples, and screening to obtain stable microsatellite loci and unstable microsatellite loci;

a microsatellite stability judging module, used for obtaining the sequencing data of a sample to be tested and a normal control sample from the same subject, calculating the EMD value of each microsatellite locus with reference to the insertion length distribution calculation method used for each microsatellite locus Si in the training set, and judging whether each microsatellite locus is stable or unstable according to the Cut-off value.

In a third aspect, the present invention provides an apparatus comprising:

a memory for storing a program;

a processor for implementing the method as described in the first aspect by executing the program stored by the memory.

In a fourth aspect, the invention provides a computer readable storage medium comprising a program executable by a processor to implement the method according to the first aspect.

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by way of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated on a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.

In some embodiments, the flow chart of the present invention is shown in fig. 1, and the method for microsatellite instability analysis based on high throughput sequencing of the present invention comprises the steps of:

1. Screen the detection interval for MS candidate sites with 6 or more single-base repeats, or 5 or more repeats of a 2 to 4 base unit, to obtain a candidate MS site set S1.

In some embodiments, the specific method of screening the set of MS sites S1 comprises:

The mean value R1 of the ratio of the effective depth of each MS site to the mean depth of the sample is calculated from the sequencing data (BAM files) of 10 normal control samples. The mean value R2 of the ratio of reads with a mapping quality value MAPQ < 30, or reads that align to multiple positions, to the total reads covering the site is also calculated for each MS site. MS sites with R1 < 33% or R2 > 5% are removed to obtain the candidate MS site set S1. On the reference genome, the 5 bp short sequence to the left of the start position of the MS site is taken as the left primer LPrimer, and the 5 bp short sequence to the right of the end position is taken as the right primer RPrimer. The reference genome may be hg19.

2. 100 microsatellite stable (PCR MSS or IHC pMMR) and 30 microsatellite unstable (PCR MSI-H or IHC dMMR) tumor samples were used as the training set.

For each sample in the training set, the following is performed for each site Si of the MS site set S1:

2.1 From the input NGS alignment result file (usually in BAM format), extract the reads covering the site Si together with Si LPrimer and Si RPrimer. Splice the paired reads among them according to their aligned positions to restore the DNA sequence SEQ. If the number of SEQ for the site in the sample is less than 30, the site is not counted for that sample and steps 2.2 to 2.3 are skipped.

2.2 For each sequence SEQ spliced in the above manner, calculate the distance between the positions of LPrimer and RPrimer, called the insertion length L, and obtain the statistical distribution of the normalized insertion lengths, called PL; the total probability of PL is 1.

2.3 For each tumor sample and its paired normal sample, use the above method to obtain the tumor sample insertion length distribution P and the normal control sample insertion length distribution Q of the site, and calculate the earth mover's distance (EMD) value between the two distributions; the larger the EMD value, the greater the probability that the site is unstable.

The EMD calculation method is as follows:

P=(p1,wp1),...,(pm,wpm): P has m lengths, and wpi is the proportion of length pi;

Q=(q1,wq1),...,(qn,wqn): Q has n lengths, and wqj is the proportion of length qj;

Each entry dij of the matrix [dij] represents the length difference between pi and qj, and each entry fij of the matrix [fij] represents the amount moved from pi to qj.

Using the Hungarian algorithm, the optimal solution [f*ij] of [fij] is found, and then the EMD value is calculated:

Figure BDA0002566125510000111

For example, at the BAT26 MS site, the insertion length distribution P of the tumor sample and the insertion length distribution Q of the normal control sample are:

P=(16,0.0056),(17,0),(18,0.0056),(19,0.0056),(20,0.0281),(21,0.0225),(22,0.1517),(23,0.2640),(25,0.3876),(26,0.0899),(27,0.0225),(28,0.0112),(29,0.0056);

Q=(16,0),(17,0.0015),(18,0.0104),(19,0.0074),(20,0.0357),(21,0.0610),(22,0.1161),(23,0.2560),(25,0.3631),(26,0.1116),(27,0.0312),(28,0.0060),(29,0). Substituting these into the above equation gives an EMD value of 0.1279.
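The worked example can be checked with a short sketch. Instead of the Hungarian algorithm named above, this sketch uses a closed form that holds for one-dimensional distributions of equal total weight with the absolute length difference as cost: the minimum transport cost equals the area between the two cumulative distributions. The function name `emd_1d` is hypothetical, and rescaling both inputs to a total weight of 1 is an assumption.

```python
def emd_1d(p, q):
    """EMD between two 1-D distributions given as {length: weight} dicts."""
    total_p, total_q = sum(p.values()), sum(q.values())
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total weight")
    support = sorted(set(p) | set(q))
    cdf_p = cdf_q = distance = 0.0
    for here, nxt in zip(support, support[1:]):
        cdf_p += p.get(here, 0.0) / total_p
        cdf_q += q.get(here, 0.0) / total_q
        # Weight that still has to cross the gap between `here` and `nxt`.
        distance += abs(cdf_p - cdf_q) * (nxt - here)
    return distance


P = [(16, 0.0056), (17, 0), (18, 0.0056), (19, 0.0056), (20, 0.0281),
     (21, 0.0225), (22, 0.1517), (23, 0.2640), (25, 0.3876), (26, 0.0899),
     (27, 0.0225), (28, 0.0112), (29, 0.0056)]
Q = [(16, 0), (17, 0.0015), (18, 0.0104), (19, 0.0074), (20, 0.0357),
     (21, 0.0610), (22, 0.1161), (23, 0.2560), (25, 0.3631), (26, 0.1116),
     (27, 0.0312), (28, 0.0060), (29, 0)]

# Tabulated lengths used as base-pair values (the table has no entry for 24).
emd_bp = emd_1d(dict(P), dict(Q))
# Same weights treated as consecutive bins, one unit apart.
emd_bins = emd_1d({i: w for i, (_, w) in enumerate(P)},
                  {i: w for i, (_, w) in enumerate(Q)})

if __name__ == "__main__":
    print(round(emd_bp, 4), round(emd_bins, 4))
```

Treating the listed entries as consecutive bins gives a value close to the 0.1279 quoted above, while using the listed lengths as base-pair values gives a slightly larger result because the step from 23 to 25 then counts as two units.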

3. Delete the MS sites for which more than 10% of the training set samples do not satisfy SEQ ≥ 30. For each remaining MS site, calculate the significance (P value) of the difference between its EMD values in the stable samples and in the unstable samples using the t-test. MS sites with P value < 0.05 are retained, resulting in the candidate microsatellite site set S2.

For example, at the BAT25 MS site, only 5 of the 130 training set tumor samples do not satisfy SEQ ≥ 30, a proportion of 3.8%, which is less than 10%, so the site is retained. The P value between the EMD values of the microsatellite stable and unstable samples of the training set is calculated as 3.747e-13, which is less than 0.05, so the BAT25 site is retained in S2.

4. For each site of the candidate microsatellite site set S2, sort the EMD values of the site in the MSS samples in descending order, and use the EMD value at the 5% position (FDR < 5%) as the Cut-off value (i.e., the critical value, also called the threshold) for judging whether the site is unstable. If the EMD value is greater than the Cut-off value, the site is judged to be unstable; if it is less than or equal to the Cut-off value, the microsatellite site is judged to be stable.

For example, the EMD values of the BAT25 site in the microsatellite stable samples of the training set are sorted in descending order, and the EMD value at the 5% position is 0.64; the BAT25 site is therefore unstable when its EMD value is greater than 0.64 and stable when its EMD value is less than or equal to 0.64.

5. Calculate the MSIratio (percentage of unstable sites) of each sample in the training set, i.e., the percentage of unstable sites among the sites of set S2 that satisfy SEQ ≥ 30 in that single sample.

6. Obtain the optimal Cut-off value for judging sample microsatellite instability by adjusting the MSIratio.

In the training set samples, the MSIratio value used as the cut-off for judging sample microsatellite instability is varied.

For example, when a sample with an MSIratio greater than or equal to 20% is considered microsatellite unstable and a sample with an MSIratio of less than 20% is considered stable, the sensitivity is 96.6% and the specificity is 100% compared with the known status, which is the optimal value.
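Choosing the sample-level cut-off can be sketched as a scan over candidate values. The source does not state which criterion defines the optimum, so this sketch assumes the candidate that maximizes sensitivity plus specificity; the function names and the example ratios are hypothetical.

```python
def sensitivity_specificity(ratios, is_msi, cutoff):
    """Sensitivity and specificity when ratio >= cutoff is called unstable."""
    pairs = list(zip(ratios, is_msi))
    tp = sum(1 for r, msi in pairs if msi and r >= cutoff)
    fn = sum(1 for r, msi in pairs if msi and r < cutoff)
    tn = sum(1 for r, msi in pairs if not msi and r < cutoff)
    fp = sum(1 for r, msi in pairs if not msi and r >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both stable and unstable samples are required")
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(ratios, is_msi, candidates):
    """Candidate with the largest sensitivity + specificity (assumed criterion)."""
    return max(candidates,
               key=lambda c: sum(sensitivity_specificity(ratios, is_msi, c)))


if __name__ == "__main__":
    ratios = [2, 5, 8, 12, 25, 40, 60, 22]  # MSIratio in percent, per sample
    is_msi = [False, False, False, False, True, True, True, True]
    cutoff = best_cutoff(ratios, is_msi, [10, 20, 30])
    print(cutoff, sensitivity_specificity(ratios, is_msi, cutoff))
```

When several candidates tie, `max` returns the first one listed, so the order of the candidate list decides between them.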

7. For a new sample to be tested, repeat steps 2.1 to 2.3 for the site set S2 to obtain the EMD value of each site, then judge whether the microsatellite at each site is stable or unstable according to the per-site EMD cut-off value obtained in step 4, and calculate the total MSIratio value of the sample according to step 5. Judge the MSI state of the sample using the cut-off value obtained in step 6.
