Method and system for determining mutation rate of nucleic acid sample to be detected

Document No.: 36681. Publication date: 2021-09-24.

Abstract: The present technology, a method and system for determining the mutation rate of a nucleic acid sample to be detected, was devised by 谢震, 黄慧雅, 廖微曦, 曹玉冰 and 郭亚琨 on 2020-03-23. The invention provides a method for determining the mutation rate of a nucleic acid sample to be detected. The method comprises: sequencing the nucleic acid sample to be detected so as to obtain a sequencing result; aligning the sequencing result with a reference genome sequence of the nucleic acid sample so as to obtain an alignment result; determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments; aligning the unmatched sequencing reads with genome reference sequences of other species, and determining the proportion of sequencing reads from each of the other species and from unknown sources; splicing the unmatched sequencing reads, aligning the splicing result with the reference genome sequence of the nucleic acid sample so as to determine exogenous variation, and determining possible source species based on the splicing result; and summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be detected.

1. A method for determining the rate of variation of a nucleic acid sample to be tested, comprising:

(1) sequencing a nucleic acid sample to be detected so as to obtain a sequencing result, wherein the minimum effective sequencing depth is 10-100, the data volume of the sequencing result is determined based on the length of a reference genome, the minimum effective sequencing depth and a preset lowest variation rate of detectable variation, and the sequencing result is composed of a plurality of sequencing reads;

(2) comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be detected so as to obtain a comparison result, wherein the comparison result comprises a matched sequencing read and an unmatched sequencing read, and determining the average length of the sequenced library building fragment based on the matched sequencing read;

(3) determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments;

(4) splicing the unmatched sequencing reads, and comparing the splicing result with a reference genome sequence of the nucleic acid sample to be detected so as to determine exogenous variation;

(5) summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be detected;

optionally, the test nucleic acid sample comprises a viral genome.

2. The method according to claim 1, wherein before performing step (2), the sequencing results are subjected to quality evaluation and screening in advance, and based on the screening results, a lowest variation rate of detectable variation is re-determined, and if the lowest variation rate of detectable variation is lower than a predetermined threshold value, the amount of the nucleic acid sample is increased in step (1).

3. The method of claim 1, wherein in step (3), the structural variation is determined using Pindel, and half of the predetermined lowest detectable variation rate is used as a variation rate screening threshold.

4. The method of claim 1, wherein after determining the structural, single nucleotide and/or small fragment variations, the sequencing reads involved in the variations are subjected to a second alignment, the detected variations of the same type are merged, and false positive detections caused by low-quality bases or errors in the alignment results are corrected.

5. The method of claim 1, wherein in step (3), the second alignment is performed using different software than the alignment in step (2).

6. The method of claim 1, wherein in step (3), common types of variants are excluded based on public data and historical detection data.

7. The method according to claim 1, wherein in step (3) the detection of single nucleotide and/or small fragment variations is performed using Mutect2.

8. The method of claim 1, wherein the unmatched sequencing reads are further aligned to genome reference sequences of other species, and the proportion of sequencing reads from each of the other species and from unknown sources is determined.

9. The method of claim 1, wherein the genome of the other species comprises a human genome and/or a mycoplasma genome.

10. The method according to claim 1, wherein in step (4), a possible source species is determined based on the splicing result.

11. The method of claim 1, further comprising performing PCR validation of the structural and/or exogenous variation.

12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for determining the mutation rate of a nucleic acid sample according to any one of claims 1 to 11.

13. An electronic device comprising a memory, a processor;

wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the method for determining the variation rate of the nucleic acid sample to be tested according to any one of claims 1 to 11.

14. A system for determining the rate of variation of a nucleic acid sample to be tested, comprising:

a sequencing device for sequencing a nucleic acid sample to be detected so as to obtain a sequencing result, wherein the minimum effective sequencing depth is 10-100, the data volume of the sequencing result is determined based on the length of a reference genome, the minimum effective sequencing depth and a predetermined lowest variation rate of detectable variation, and the sequencing result is composed of a plurality of sequencing reads;

an alignment device, connected to the sequencing device, for comparing the sequencing result with the reference genome sequence of the nucleic acid sample to be detected so as to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library building fragments is determined based on the matched sequencing reads;

a matched sequencing read analysis device, connected to the alignment device, for determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments;

an unmatched sequencing read analysis device, connected to the alignment device, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be detected so as to determine exogenous variation;

an output device, connected to the matched sequencing read analysis device and the unmatched sequencing read analysis device, for summarizing the structural variation, the single nucleotide and/or small fragment variation, and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be tested;

optionally, the test nucleic acid sample comprises a viral genome.

15. The system of claim 14, further comprising:

and a lowest variation rate determining device, connected to the sequencing device and the alignment device, for performing quality evaluation and screening on the sequencing result in advance, and re-determining a detectable variation lowest variation rate based on the screening result, wherein if the detectable variation lowest variation rate is lower than a predetermined threshold, the amount of the nucleic acid sample is increased at the sequencing device, and if the detectable variation lowest variation rate is not lower than the predetermined threshold, the sequencing result is input to the alignment device.

16. The system of claim 14, further comprising an unmatched sequencing read source analysis device, said unmatched sequencing read source analysis device being connected to said alignment device for aligning said unmatched sequencing reads with genome reference sequences of other species and determining the proportion of sequencing reads from each of the other species and from unknown sources, the results being input to said output device.

17. The system of claim 14, further comprising a splicing result source analysis device connected to the unmatched sequencing read analysis device for determining possible source species based on the splicing result, the result being input to the output device.

18. The system of claim 16, further comprising a PCR device, wherein the PCR device is connected to the matching sequencing read analysis device, the unmatched sequencing read analysis device, and the unmatched sequencing read source analysis device, and is configured to perform PCR verification on the structural variation and/or the exogenous variation, and the result is input to the output device.

Technical Field

The invention relates to the field of biological information, in particular to a method and a system for determining the mutation rate of a nucleic acid sample to be detected, a computer-readable storage medium and an electronic device.

Background

Viruses often have stable structures, simple genomes, broad-spectrum infectivity and high packaging efficiency, and have therefore become widely used engineered vectors for DNA delivery and expression. In addition, researchers have used inactivated, attenuated, or engineered viruses as effective vaccines, taking advantage of the immunogenic properties of the virus itself. Furthermore, by exploiting the ability of viruses to lyse host cells during amplification, researchers have engineered viruses into oncolytic viruses that retain the ability to replicate and package and that specifically kill tumor cells. As virus research has progressed, various viruses such as adenovirus, lentivirus and herpes simplex virus-1 have become targets of engineering modification, and various virus products are applied in clinical treatment.

Although viruses have the above advantages as engineering vectors, their pathogenicity and susceptibility to mutation also increase safety risks. Detecting exogenous contamination and self-variation is therefore an important part of quality control during modification and production. In the case of adenovirus, the FDA requires that the level of replication-competent adenovirus (RCA) in non-replicating adenovirus be less than 1 RCA per 3e10 VP. At present, exogenous and variant fragments in a virus sample are mainly detected by low-throughput methods, namely PCR detection and first-generation sequencing of specific regions. Primers must be designed for the fragments to be detected according to the possible variant types, so it is difficult to cover all exogenous and variant fragments in the sample; moreover, because of the limited specificity and amplicon length of PCR, highly homologous fragments and long fragments are difficult to detect. Deep sequencing, by randomly fragmenting the sample to build a library, can detect all fragments in a sample at high throughput and captures rich neighborhood information around the fragment to be detected; combined with related analysis techniques, it can effectively detect exogenous and variant fragments in a virus sample.

In summary, detecting exogenous and variant fragments in a virus sample is an important part of quality control, but conventional detection methods suffer from low throughput, incomplete coverage, and difficulty in detecting highly homologous fragments and long fragments. Based on high-throughput deep sequencing and related analysis techniques, the inventors have established a detection and analysis process, running from the sample to be tested to an analysis report, that effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

Disclosure of Invention

The present application is based on the discovery and recognition by the inventors of the following facts and problems:

in the quality control of engineered virus modification and production, in order to detect exogenous contamination and self-variation, the inventors established a detection and analysis process, running from the sample to be tested to an analysis report, based on high-throughput deep sequencing and related analysis techniques; the process effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

In a first aspect, the present invention provides a method for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the method comprises: (1) sequencing a nucleic acid sample to be detected so as to obtain a sequencing result, wherein the minimum effective sequencing depth is 10-100, the data volume of the sequencing result is determined based on the length of a reference genome, the minimum effective sequencing depth and a preset lowest variation rate of detectable variation, and the sequencing result is composed of a plurality of sequencing reads; (2) comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be detected so as to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and determining the average length of the sequenced library building fragments based on the matched sequencing reads; (3) determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments; (4) splicing the unmatched sequencing reads, and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be detected so as to determine exogenous variation; (5) summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be detected. According to an embodiment of the invention, the test nucleic acid sample comprises a viral genome. The method provided by the embodiment of the invention can effectively and comprehensively detect and analyze exogenous contamination and self-variation in the nucleic acid sample to be detected.
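As a rough illustration of how the data volume in step (1) might be estimated from the three stated quantities, the following Python sketch assumes that the minimum effective depth must be reached by the reads that carry a variant present at the lowest detectable rate, so that the required mean depth is the minimum effective depth divided by that rate. This relation, the function names and the example numbers are assumptions made for illustration; the text does not give an explicit formula.

```python
import math


def required_bases(genome_length: int,
                   min_effective_depth: float,
                   min_variant_rate: float) -> float:
    """Estimate the total number of sequenced bases needed.

    Assumed relation (not stated in the source): a variant present at
    min_variant_rate should still be covered min_effective_depth times,
    so the mean depth must be min_effective_depth / min_variant_rate.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0 < min_variant_rate <= 1:
        raise ValueError("min_variant_rate must be in (0, 1]")
    return genome_length * min_effective_depth / min_variant_rate


def required_reads(total_bases: float, read_length: int) -> int:
    """Convert a base count into a number of reads of fixed length."""
    return math.ceil(total_bases / read_length)


if __name__ == "__main__":
    # Hypothetical example: a 36 kb genome, depth 10, 1% detectable rate.
    bases = required_bases(36_000, 10, 0.01)
    print(f"bases needed: {bases:.0f}")
    print(f"150 bp reads needed: {required_reads(bases, 150)}")
```

Under this assumption, halving the lowest detectable variation rate doubles the amount of data required, which is why the preset rate has to be fixed before sequencing.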

According to an embodiment of the present invention, the method may further include at least one of the following additional technical features:

according to the embodiment of the invention, before the step (2) is carried out, the quality evaluation and the screening are carried out on the sequencing result in advance, the lowest variation rate of the detectable variation is determined again based on the screening result, and if the lowest variation rate of the detectable variation is lower than a predetermined threshold value, the amount of the nucleic acid sample is increased in the step (1).
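The re-determination described above can be sketched as follows. The sketch assumes the same depth relation as a simple planning rule (lowest detectable rate = minimum effective depth divided by the mean depth that remains after screening), and it interprets "the preset rate is not reached" as the re-estimated rate being numerically higher than the preset target. Both the relation and this interpretation are assumptions; the function names are hypothetical.

```python
def detectable_min_rate(genome_length: int,
                        min_effective_depth: float,
                        retained_bases: float) -> float:
    """Re-estimate the lowest detectable variation rate after screening.

    Assumption: rate = min_effective_depth / mean depth, capped at 1.0,
    where mean depth = retained_bases / genome_length.
    """
    if genome_length <= 0 or retained_bases <= 0:
        raise ValueError("genome_length and retained_bases must be positive")
    mean_depth = retained_bases / genome_length
    return min(1.0, min_effective_depth / mean_depth)


def target_reached(achievable_rate: float, target_rate: float) -> bool:
    """True if the data still support the preset lowest detectable rate.

    Assumed interpretation: the target is reached when the achievable
    rate is not larger than the preset target rate.
    """
    return achievable_rate <= target_rate


if __name__ == "__main__":
    # Hypothetical numbers: 20% of the planned data were screened out.
    rate = detectable_min_rate(36_000, 10, 0.8 * 36_000_000)
    if target_reached(rate, 0.01):
        print("proceed to alignment")
    else:
        print(f"achievable rate {rate:.4f}: add sample and sequence more")
```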

According to an embodiment of the present invention, in step (3), the structural variation is determined using Pindel, and half of the predetermined lowest variation rate of detectable variation is used as a variation rate screening threshold.
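A minimal sketch of the screening rule follows. It does not parse real Pindel output; the `SvCall` record, its fields, and the definition of the variation rate as supporting reads divided by local depth are hypothetical choices made for illustration. Only the threshold itself (half of the preset lowest detectable variation rate) comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SvCall:
    """Hypothetical record for one detected structural variation."""
    sv_type: str           # e.g. "DEL", "INV", "INS"
    start: int
    end: int
    supporting_reads: int  # reads supporting the variation
    depth: int             # total reads covering the breakpoint

    @property
    def rate(self) -> float:
        # Assumed definition of the variation rate of a single call.
        return self.supporting_reads / self.depth if self.depth else 0.0


def screen_calls(calls: Iterable[SvCall], preset_min_rate: float) -> List[SvCall]:
    """Keep calls whose rate is at least half of the preset lowest rate."""
    threshold = preset_min_rate / 2
    return [call for call in calls if call.rate >= threshold]


if __name__ == "__main__":
    calls = [
        SvCall("DEL", 100, 400, 4, 1000),
        SvCall("DEL", 900, 1500, 6, 1000),
        SvCall("INV", 2000, 2600, 20, 1000),
    ]
    for call in screen_calls(calls, preset_min_rate=0.01):
        print(call.sv_type, call.start, call.end, call.rate)
```

Using a threshold below the preset rate keeps borderline calls in play until the later correction step, at the cost of more candidates to re-examine.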

According to an embodiment of the present invention, after determining the structural variation, the single nucleotide variation and/or the small fragment variation, the sequencing reads involved in the variation are subjected to a second alignment, detected variations of the same type are merged, and false positive detection results caused by low-quality bases, errors in alignment results, etc. are corrected.
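One way the merging of same-type detections could look is sketched below. The text does not specify the merging criterion, so the rule used here (same type, start and end each within a fixed tolerance of an already accepted call, supporting reads summed) is an assumption, as are the `Call` record and the default tolerance of 10 bases.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Call:
    """Hypothetical record for one detected variation."""
    sv_type: str
    start: int
    end: int
    supporting_reads: int


def merge_same_type(calls: Iterable[Call], tolerance: int = 10) -> List[Call]:
    """Merge calls of the same type with near-identical breakpoints.

    Calls are sorted by type and position; a call is folded into the
    previously accepted call when type matches and both breakpoints
    differ by at most `tolerance`. The first call's coordinates are
    kept and the supporting reads are added up.
    """
    merged: List[Call] = []
    for call in sorted(calls, key=lambda c: (c.sv_type, c.start, c.end)):
        last = merged[-1] if merged else None
        if (last is not None
                and last.sv_type == call.sv_type
                and abs(call.start - last.start) <= tolerance
                and abs(call.end - last.end) <= tolerance):
            last.supporting_reads += call.supporting_reads
        else:
            # Copy so that the caller's objects are never modified.
            merged.append(Call(call.sv_type, call.start, call.end,
                               call.supporting_reads))
    return merged


if __name__ == "__main__":
    raw = [Call("DEL", 100, 200, 5), Call("DEL", 103, 198, 3),
           Call("DEL", 500, 600, 2), Call("INV", 100, 200, 4)]
    for call in merge_same_type(raw):
        print(call)
```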

According to an embodiment of the present invention, in step (3), different software is used for the second alignment and the alignment in step (2).

According to the embodiment of the invention, in the step (3), common variation types are excluded according to the public data and the historical detection data.

According to an embodiment of the invention, in step (3), the detection of single nucleotide and/or small fragment variations is performed using Mutect2.

According to an embodiment of the invention, in step (4), possible source species are determined based on the splicing result.

According to embodiments of the invention, the unmatched sequencing reads are further aligned to genome reference sequences of other species, and the proportion of sequencing reads from each of the other species and from unknown sources is determined.
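The proportion calculation can be sketched as a simple tally. The sketch assumes that an upstream alignment has already assigned each unmatched read either a species label or nothing; that input format, the label strings and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def source_proportions(assignments: Iterable[Optional[str]]) -> Dict[str, float]:
    """Return the fraction of unmatched reads per source.

    `assignments` holds one entry per unmatched read: a species label
    if the read aligned to that species' reference, or None if it
    aligned to none of the references (counted as "unknown").
    """
    counts = Counter(label if label is not None else "unknown"
                     for label in assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


if __name__ == "__main__":
    reads = ["human", "human", "mycoplasma", None]
    for source, fraction in source_proportions(reads).items():
        print(f"{source}: {fraction:.2%}")
```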

According to an embodiment of the invention, the genome of said other species comprises the human genome and/or the mycoplasma genome.

According to the embodiment of the invention, PCR verification is further carried out on the structural variation and/or the exogenous variation.

In a second aspect of the invention, the invention proposes a computer-readable storage medium having a computer program stored thereon. According to an embodiment of the present invention, the program is executed by a processor to implement the method for determining the mutation rate of a nucleic acid sample to be tested.

In a third aspect of the invention, an electronic device is presented. According to an embodiment of the present invention, the electronic device includes a memory, a processor; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the method for determining the variation rate of the nucleic acid sample to be detected.

In a fourth aspect, the present invention provides a system for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the system comprises: a sequencing device for sequencing a nucleic acid sample to be tested so as to obtain a sequencing result, wherein the minimum effective sequencing depth is 10-100, the data volume of the sequencing result is determined based on the length of a reference genome, the minimum effective sequencing depth and a predetermined lowest variation rate of detectable variation, and the sequencing result is composed of a plurality of sequencing reads; an alignment device, connected to the sequencing device, for comparing the sequencing result with the reference genome sequence of the nucleic acid sample to be tested so as to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library building fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device, connected to the alignment device, for determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments; an unmatched sequencing read analysis device, connected to the alignment device, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested so as to determine exogenous variation; and an output device, connected to the matched sequencing read analysis device and the unmatched sequencing read analysis device, for summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be tested. According to an embodiment of the present invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the method for determining the mutation rate of the nucleic acid sample to be tested, and effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

According to an embodiment of the present invention, the system may further include at least one of the following technical features:

according to an embodiment of the invention, the system further comprises:

and a lowest variation rate determining device, connected to the sequencing device and the alignment device, for performing quality evaluation and screening on the sequencing result in advance, and re-determining a detectable variation lowest variation rate based on the screening result, wherein if the detectable variation lowest variation rate is lower than a predetermined threshold, the amount of the nucleic acid sample is increased at the sequencing device, and if the detectable variation lowest variation rate is not lower than the predetermined threshold, the sequencing result is input to the alignment device.

According to an embodiment of the present invention, the system further includes an unmatched sequencing read source analysis device, connected to the alignment device, for aligning the unmatched sequencing reads with genome reference sequences of other species, determining the proportion of sequencing reads from each of the other species and from unknown sources, and inputting the result to the output device.

According to an embodiment of the present invention, the system further comprises a splicing result source analyzing device, the splicing result source analyzing device is connected to the unmatched sequencing read analyzing device, and is configured to determine a possible source species based on the splicing result, and the result is input to the output device.

According to an embodiment of the present invention, the system further comprises a PCR device, the PCR device is connected to the matching sequencing read analysis device, the unmatched sequencing read source analysis device and the unmatched sequencing read analysis device, and is configured to perform PCR verification on the structural variation and/or the exogenous variation, and the result is input to the output device.

Drawings

FIG. 1 is a schematic view of a virus variation detection assay;

FIGS. 2A-2F are simulation tests of the accuracy of different tools to detect deletion variations of different proportions and lengths;

FIGS. 3A-3F are simulation tests of the accuracy of different tools to detect inversion variations of different proportions and different lengths;

FIGS. 4A-4F are simulation tests of the accuracy of different tools to detect insertion variations of different proportions and lengths;

FIGS. 5A-5F are simulation tests of the accuracy of different tools to detect copy number variations of different proportions and different lengths;

FIG. 6 is a process of detecting exogenous insertion/replacement mutation;

FIGS. 7A-7D are simulation tests based on the accuracy of splice detection of exogenous replacement variations of different proportions and different lengths;

FIGS. 8A-8C are experimental tests for the accuracy of detecting deletions and inversion variations of different lengths in different proportions;

FIG. 9 shows experimental testing and PCR validation of adenovirus samples;

FIG. 10 is a high-resolution correction procedure for variations detected by Pindel;

FIG. 11 is a schematic diagram of a system for determining a variation rate of a nucleic acid sample according to an embodiment of the present invention;

FIG. 12 is a schematic diagram illustrating a system for determining a variation rate of a nucleic acid sample according to another embodiment of the present invention;

FIG. 13 is a schematic diagram illustrating a system for determining a variation rate of a nucleic acid sample according to another embodiment of the present invention;

FIG. 14 is a schematic diagram illustrating a system for determining a variation rate of a nucleic acid sample according to another embodiment of the present invention;

FIG. 15 is a block diagram of a system for determining a mutation rate of a test nucleic acid sample according to another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended to illustrate the invention, and are not to be construed as limiting the invention.

The invention provides a system for determining the mutation rate of a nucleic acid sample to be detected. According to an embodiment of the invention, with reference to fig. 11, the system comprises: a sequencing device 100 for sequencing a nucleic acid sample to be detected so as to obtain a sequencing result, wherein the minimum effective sequencing depth is 10-100, the data volume of the sequencing result is determined based on the length of a reference genome, the minimum effective sequencing depth and a preset lowest variation rate of detectable variation, and the sequencing result is composed of a plurality of sequencing reads; an alignment device 200, connected to the sequencing device 100, for aligning the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain an alignment result, wherein the alignment result includes matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library building fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device 300, connected to the alignment device 200, for determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library building fragments; an unmatched sequencing read analysis device 400, connected to the alignment device 200, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be detected so as to determine exogenous variation; and an output device 500, connected to the matched sequencing read analysis device 300 and the unmatched sequencing read analysis device 400, for summarizing the structural variation, the single nucleotide and/or small fragment variation, and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be tested. According to an embodiment of the present invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the method for determining the mutation rate of the nucleic acid sample to be detected, and effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.
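To make the fragment-length determination performed by the alignment device concrete, the following sketch averages the template length (the TLEN column of the SAM format) over properly paired reads in SAM-formatted text. It is an illustration only and is independent of any particular alignment tool; the function name and the filtering choices (skipping unmapped, secondary, supplementary and duplicate records, and counting each pair once via the positive TLEN) are assumptions.

```python
from typing import Iterable, Optional

# SAM FLAG bits used below (from the SAM format specification).
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def mean_fragment_length(sam_lines: Iterable[str]) -> Optional[float]:
    """Average the template length over properly paired reads.

    Each pair is counted once by using only the mate that reports a
    positive TLEN. Returns None when no usable pair is found.
    """
    total = 0
    pairs = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        if not flag & PROPER_PAIR:
            continue
        if flag & (UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY):
            continue
        if tlen > 0:
            total += tlen
            pairs += 1
    return total / pairs if pairs else None


if __name__ == "__main__":
    example = [
        "@SQ\tSN:ref\tLN:1000",
        "r1\t99\tref\t100\t60\t50M\t=\t250\t200\tACGT\tIIII",
        "r1\t147\tref\t250\t60\t50M\t=\t100\t-200\tACGT\tIIII",
    ]
    print(mean_fragment_length(example))
```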

According to an embodiment of the present invention, the system may further include at least one of the following technical features:

according to an embodiment of the present invention, referring to fig. 12, the system further includes: a lowest variation rate determining device 600, wherein the lowest variation rate determining device 600 is connected to the sequencing device 100 and the alignment device 200, and is configured to perform quality evaluation and screening on the sequencing result in advance, and re-determine a detectable variation lowest variation rate based on the screening result, increase the amount of the nucleic acid sample at the sequencing device if the detectable variation lowest variation rate is lower than a predetermined threshold, and input the sequencing result into the alignment device if the detectable variation lowest variation rate is not lower than the predetermined threshold.

According to an embodiment of the present invention, referring to fig. 13, the system further includes an unmatched sequencing read source analysis device 700, connected to the alignment device 200, for aligning the unmatched sequencing reads with genome reference sequences of other species, determining the proportion of sequencing reads from each of the other species and from unknown sources, and inputting the result to the output device 500.

According to an embodiment of the present invention, referring to fig. 14, the system further comprises a splicing result source analysis device 800, the splicing result source analysis device 800 is connected to the unmatched sequencing read analysis device 400, and is configured to determine a possible source species based on the splicing result, and the result is input to the output device 500.

According to an embodiment of the present invention, referring to fig. 15, the system further includes a PCR device 900, the PCR device 900 is connected to the matching sequencing read analysis device 300, the unmatched sequencing read source analysis device 700 and the unmatched sequencing read analysis device 400, and is configured to perform PCR verification on the structural variation and/or the exogenous variation, and the result is input to the output device 500.

In the quality control of engineered virus modification and production, in order to detect exogenous contamination and self-variation, the inventors established a detection and analysis process, running from the sample to be tested to an analysis report, based on high-throughput deep sequencing and related analysis techniques; the process effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

The specific process is as follows:

1) obtaining a reference genome sequence of the detected virus by means of literature research, first-generation sequencing and the like; after virus purification, high-quality virus genome DNA is extracted.

2) Performing library construction and sequencing on the extracted viral genome by deep sequencing to obtain high-throughput sequencing data of sufficient depth. The minimum effective sequencing depth is an empirical value of 10-100; the total data amount required can be estimated from the length of the reference genome sequence, the minimum effective sequencing depth and the preset lowest variation rate of detectable variation.

3) Evaluating sequencing quality using sequencing data quality evaluation software (such as FastQC), mainly assessing the base quality distribution and adapter contamination; preprocessing the data with sequencing data preprocessing software (such as Cutadapt), selecting the adapter type, base quality threshold and sequence length threshold according to the quality evaluation result; after preprocessing, evaluating quality again to confirm the preprocessing effect; and re-estimating the lowest variation rate of detectable variation according to the total data amount after preprocessing, and adding sample if the preset lowest variation rate of detectable variation is not reached.

4) Align the preprocessed data to the reference genome with sequence alignment software (such as BWA) to obtain the alignment result, retaining suboptimal alignments; use alignment-result processing software (such as samblaster) to remove duplicates and extract the unmatched sequencing reads; sort the results and estimate the average length of the library fragments with alignment-result processing software (such as sambamba).

5) Based on the alignment result and the average length of the library fragments, perform structural variation analysis with structural variation detection software (such as Pindel), choosing a suitable range of detectable variation lengths according to the length of the reference genome sequence and using half of the preset lowest variation rate of detectable variation as the variation-rate screening threshold.

6) Correct the variation detection result and the detected variation rates with high-resolution variation detection and correction software: merge detected variations of the same type on the basis of re-alignment of the data associated with each detected variation, eliminate false positives caused by low-quality bases, alignment errors and the like, and exclude common variation types according to public data and historical detection data.

7) Based on the alignment result, perform single nucleotide and small fragment variation analysis with single nucleotide and small fragment variation detection software (such as Mutect2), and exclude common variation types according to published virus polymorphism data and historical detection data. Compare the result with the structural variation detection result: to reduce false negatives, add the single nucleotide and small fragment variations that structural variation detection missed, and correct the estimated variation rates of those that structural variation detection also found.

8) Align the unmatched sequencing reads to the genome of each possible contamination source (such as the human genome and mycoplasma genomes), and count the proportion of each contamination source and of unknown sources in the sample; re-estimate, according to the proportion of contaminant sequences, the total amount of data derived from the virus under test and the lowest detectable variation rate, and sequence additional sample if the preset lowest detectable variation rate is not reached.

9) Assemble (splice) the unmatched sequencing reads with splicing software (such as SPAdes), adjusting the k-mer parameter to obtain the best splicing rate and spliced length. Using replacement variation detection software, detect exogenous replacement variation from the spliced exogenous fragments: align the spliced fragments to the virus reference genome, select the spliced fragments in which half a read length at each end can be matched to the reference genome, analyze possible exogenous insertion and replacement variations from the matching positions, and estimate the variation rate from the data depth of the reference genome and the data depth of the spliced fragment.

10) Search the spliced fragments with sequence similarity search software (such as BLAST) to analyze their possible source species and gene information.

11) Design PCR experiments to verify the detected structural variations and exogenous fragment variations, and recover the PCR fragments for verification by first-generation sequencing.

12) Combine the analysis results to generate the final analysis report.
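The data-volume estimate of step 2) and the screening threshold of step 5) can be sketched as follows. The text gives no explicit formula, so this is only a minimal sketch under one assumption: reads carrying a variant at the lowest detectable variation rate must by themselves reach the minimum effective depth. The function names are hypothetical.

```python
def required_bases(genome_length_bp, min_effective_depth, min_variation_rate):
    """Total sequenced bases needed so that a variant present at
    min_variation_rate is still covered at min_effective_depth.
    Assumed relation; the source does not state the formula."""
    if not 0 < min_variation_rate <= 1:
        raise ValueError("min_variation_rate must be in (0, 1]")
    total_depth = min_effective_depth / min_variation_rate
    return genome_length_bp * total_depth


def screening_threshold(min_variation_rate):
    """Step 5): half of the preset lowest detectable variation rate."""
    return min_variation_rate / 2


# Example: 40000 bp genome, minimum effective depth 30, lowest rate 0.001
print(f"{required_bases(40_000, 30, 0.001):.3g} bases")
print(screening_threshold(0.001))
```

With these inputs the estimate is about 1.2 × 10^9 bases, which is on the order of the 1G of data used in the adenovirus example if 1G is read as 10^9 bases (also an assumption).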

The flow diagram of the invention is shown in figure 1.
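The bookkeeping of step 8) can be sketched as follows: count the share of reads from each source, then re-estimate the lowest detectable variation rate from the virus-derived data only. This is a hedged illustration, not the inventors' implementation. The read labels, the function names and the relation "lowest rate = minimum effective depth / viral depth" are assumptions.

```python
from collections import Counter


def source_proportions(read_labels):
    """Fraction of reads per source label, e.g. 'virus', 'human',
    'mycoplasma', 'unknown' (labels are hypothetical)."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    return {label: n / total for label, n in counts.items()}


def reestimate_lowest_rate(total_bases, viral_fraction,
                           genome_length_bp, min_effective_depth):
    """Lowest detectable variation rate given only the virus-derived data.
    Assumed relation: variant reads must reach min_effective_depth."""
    viral_bases = total_bases * viral_fraction
    if viral_bases <= 0:
        raise ValueError("no virus-derived data")
    viral_depth = viral_bases / genome_length_bp
    return min(1.0, min_effective_depth / viral_depth)


labels = ["virus"] * 90 + ["human"] * 6 + ["mycoplasma"] * 3 + ["unknown"]
props = source_proportions(labels)
rate = reestimate_lowest_rate(1e9, props["virus"], 40_000, 30)
print(props, rate)
```

If the re-estimated rate is higher than the preset lowest detectable variation rate, the workflow calls for additional sequencing.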

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.

Example 1 Simulation test of the virus variation detection and analysis procedure

Experiment one

Simulation test of the accuracy with which various tools detect deletion, inversion, insertion and copy number variations of various lengths

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and the tool ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-variant sequencing data. Deletion variations of 1 bp, 10 bp, 100 bp, 200 bp and 1000 bp were each randomly generated in the reference genome sequence, and ART was used to generate 40000 pairs of Illumina PE150 reads as the variant sequencing data. The non-variant and variant sequencing data were mixed at variation rates of 0.1, 0.2 and 0.5, and the resulting variations were detected with the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting deletion variations of different lengths.

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-variant sequencing data. Inversion variations of 10 bp, 100 bp, 200 bp and 1000 bp were each randomly generated in the reference genome sequence, and ART was used to generate 40000 pairs of Illumina PE150 reads as the variant sequencing data. The non-variant and variant sequencing data were mixed at variation rates of 0.1, 0.2 and 0.5, and the resulting variations were detected with the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting inversion variations of different lengths.

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-variant sequencing data. Insertion variations of 1 bp, 10 bp, 100 bp and 200 bp were each randomly generated in the reference genome sequence, and ART was used to generate 40000 pairs of Illumina PE150 reads as the variant sequencing data. The non-variant and variant sequencing data were mixed at variation rates of 0.1, 0.2 and 0.5, and the resulting variations were detected with the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting insertion variations of different lengths.

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-variant sequencing data. Copy number variations of 2X and 3X with lengths of 25 bp, 50 bp, 100 bp, 200 bp and 1000 bp were randomly generated in the reference genome sequence, and ART was used to generate 40000 pairs of Illumina PE150 reads as the variant sequencing data. The non-variant and variant sequencing data were mixed at variation rates of 0.1, 0.2 and 0.5, and the resulting variations were detected with the tools Mutect2, Pindel, Delly and Gridss. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting copy number variations of different lengths.
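The four kinds of simulated variation described above can be illustrated with a short sketch. It only covers the sequence manipulation; read simulation with ART is not reproduced. Several points are assumptions rather than statements from the text: positions are 0-based, an inversion is modelled as an in-place reverse complement, a copy number variation is modelled as a tandem repeat, and mixing is modelled by splitting a fixed number of read pairs between the two data sets. All function names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_genome(length, rng):
    """Random reference sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def delete(seq, start, length):
    """Remove `length` bases beginning at `start`."""
    return seq[:start] + seq[start + length:]


def invert(seq, start, length):
    """Replace a segment by its reverse complement (assumed inversion model)."""
    segment = seq[start:start + length]
    return seq[:start] + segment.translate(COMPLEMENT)[::-1] + seq[start + length:]


def insert(seq, start, fragment):
    """Insert `fragment` before position `start`."""
    return seq[:start] + fragment + seq[start:]


def duplicate(seq, start, length, copies):
    """Tandem copy number variation: the segment occurs `copies` times in total."""
    segment = seq[start:start + length]
    return seq[:start] + segment * copies + seq[start + length:]


def mix_counts(total_pairs, variation_rate):
    """Read pairs to draw from (non-variant, variant) data for a target rate."""
    n_variant = round(total_pairs * variation_rate)
    return total_pairs - n_variant, n_variant


rng = random.Random(0)
genome = random_genome(40_000, rng)
variant = delete(genome, 1_000, 100)
print(len(genome), len(variant), mix_counts(40_000, 0.1))
```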

The detection results are shown in FIGS. 2A-2F, 3A-3F, 4A-4F, and 5A-5F. Mutect2 and FreeBayes are tools for detecting single nucleotide variations and small fragment insertion/deletion variations. In the test they detected deletion, insertion and inversion variations with a maximum length of 10 bp, Mutect2 also detected 2X copy number variations with a maximum length of 50 bp, and the variation rates estimated by these tools were close to the actual variation rates. Delly, Lumpy and Gridss are tools for detecting structural variation. In the test they detected deletion and inversion variations with a minimum length of 100 bp; Gridss, owing to its local splicing (assembly) function, also detected some insertion variations of 100 bp and 200 bp. Delly detected 2X copy number variations with a minimum length of 200 bp, and Gridss detected 3X copy number variations with a minimum length of 25 bp. None of Delly, Lumpy and Gridss was able to estimate the variation rate. Pindel detected variations of all the tested types and lengths, and the variation rate it estimated was close to the actual variation rate. Taking as the detection limit the 30X coverage that corresponds to a variation rate of 0.1 in the simulated data, the performance of each tool was consistent with its performance at variation rates of 0.2 and 0.5.

In conclusion, Pindel can comprehensively detect variations of various types and lengths and can serve as the main tool for virus variation detection; Mutect2 and FreeBayes can supplement it for single nucleotide variations and small fragment insertion/deletion variations; longer exogenous insertion variations cannot be detected by tools that rely on alignment to a known reference and need to be detected by splicing-based tools.
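The 30X figure can be checked against the simulation settings with a short calculation. The sketch assumes that each data set consists of 40000 read pairs of 2 × 150 bp, as stated for the copy number simulation, and that the variant-supporting coverage is the mean depth multiplied by the variation rate; the function names are hypothetical.

```python
def mean_depth(read_pairs, read_length_bp, genome_length_bp):
    """Mean sequencing depth of a paired-end data set."""
    return read_pairs * 2 * read_length_bp / genome_length_bp


def variant_depth(total_depth, variation_rate):
    """Expected depth of reads carrying the variant (assumed proportional)."""
    return total_depth * variation_rate


total = mean_depth(40_000, 150, 40_000)  # 300.0
for rate in (0.1, 0.2, 0.5):
    print(rate, round(variant_depth(total, rate), 1))
```

Under these assumptions the mean depth is 300X, so a variation rate of 0.1 corresponds to roughly 30X of variant-supporting coverage, in line with the detection limit quoted above.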

Experiment two

Simulation test of the accuracy of detecting exogenous fragment variations of various lengths

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-variant sequencing data. Replacement variations with deletion lengths of 0, 200, 500, 1000 and 10000 bp and insertion lengths of 200, 500, 1000 and 10000 bp were each randomly generated in the reference genome sequence, and ART was used to generate 40000 pairs of Illumina PE150 reads as the variant sequencing data. The non-variant and variant sequencing data were mixed at variation rates of 0.1, 0.2 and 0.5, and the unmatched sequencing reads were spliced with the tool SPAdes to detect the resulting variations. The exogenous insertion/replacement variation detection process is shown in FIG. 6. The simulation was repeated 200 times to evaluate the accuracy with which the SPAdes-based splicing approach detects replacement variations with different deletion and insertion lengths.
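The idea of screening spliced fragments whose two ends match the reference, and reading the variation off the matching positions, can be illustrated with a toy example. This sketch is a deliberate simplification and not the detection software used here: it uses exact string matching instead of an aligner, `locate_replacement` and its parameters are hypothetical, and the default flank of 75 bp is half of the PE150 read length.

```python
import random


def locate_replacement(reference, contig, flank=75):
    """Return (ref_start, deleted_length, inserted_sequence) for a contig
    whose two ends match the reference exactly, or None otherwise."""
    n = len(contig)
    if n < 2 * flank:
        return None
    left = reference.find(contig[:flank])
    right = reference.rfind(contig[-flank:])
    if left < 0 or right < 0:
        return None
    # Extend the left anchor forward while contig and reference agree.
    i = flank
    while (i < n - flank and left + i < len(reference)
           and contig[i] == reference[left + i]):
        i += 1
    ref_start = left + i
    # Extend the right anchor backward without crossing the left breakpoint.
    j = flank
    while (j < n - i and right + flank - j > ref_start
           and contig[n - j - 1] == reference[right + flank - j - 1]):
        j += 1
    ref_end = right + flank - j
    if ref_end < ref_start:
        return None
    inserted = contig[i:n - j]
    deleted_length = ref_end - ref_start
    if not inserted and deleted_length == 0:
        return None  # contig is plain reference sequence
    return ref_start, deleted_length, inserted


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
foreign = "".join(rng.choice("ACGT") for _ in range(300))
variant = reference[:800] + foreign + reference[1000:]  # 200 bp replaced by 300 bp
contig = variant[600:1300]
start, deleted, inserted = locate_replacement(reference, contig)
print(start, deleted, len(inserted))
```

Because the first or last bases of the foreign segment can match the reference by chance, the reported breakpoint may be shifted by a few bases relative to the simulated one; applying the reported replacement to the reference still reproduces the variant sequence.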

The detection results are shown in FIGS. 7A to 7D. For replacement variations with different deletion and insertion lengths, SPAdes spliced out the exogenous fragments, the fragments were aligned accurately to the positions where the replacement variations occurred, and the estimated variation rates were close to the actual variation rates. Taking as the detection limit the 30X coverage that corresponds to a variation rate of 0.1 in the simulated data, the detection accuracy was consistent with that at variation rates of 0.2 and 0.5. This shows that, for insertion/replacement variations involving long exogenous fragments, the splicing-based approach using SPAdes can accurately detect the exogenous fragments and the positions at which the variations occur.

Example 2 Experimental test of the virus variation detection and analysis procedure on adenovirus

Experiment one

Experimental test of the accuracy of the virus variation detection and analysis procedure in detecting deletion and inversion variations of various lengths

An adenovirus packaging plasmid with a length of 40000 bp was constructed as the non-variant vector; deletions and inversions of different lengths were introduced at different positions in the non-variant vector, as shown in Table 1; the non-variant vector and the variant vectors were mixed at variation rates of 0.001, 0.01 and 0.1 and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G; the variations in each sample were detected by the virus variation detection and analysis procedure and compared with the known variant vectors and actual variation rates to evaluate the accuracy of the procedure.

The detection results are shown in FIGS. 8A to 8C. The virus variation detection and analysis procedure accurately detected the deletion and inversion variations at different positions and of different lengths. Taking as the detection limit the coverage of about 30X that corresponds to the actual variation rate of 0.001, the detection accuracy was consistent with that at variation rates of 0.01 and 0.1.

Table 1:

Variation     Type        Length (bp)   Starting position (bp)
Variation A   Deletion    15            1177
Variation B   Deletion    347           1837
Variation C   Deletion    1042          2907
Variation D   Inversion   12            6084
Variation E   Inversion   320           5047
Variation F   Inversion   2079          4504
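Comparing detected variations with the known variant vectors can be sketched as a simple matching step. The truth set below is taken from Table 1; everything else is an assumption made for illustration, including the `Variant` structure, the 5 bp tolerance and the example calls, which are invented inputs and not results from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    kind: str    # "deletion" or "inversion"
    start: int   # starting position (bp)
    length: int  # length (bp)


# Known variations from Table 1.
TABLE_1 = {
    "A": Variant("deletion", 1177, 15),
    "B": Variant("deletion", 1837, 347),
    "C": Variant("deletion", 2907, 1042),
    "D": Variant("inversion", 6084, 12),
    "E": Variant("inversion", 5047, 320),
    "F": Variant("inversion", 4504, 2079),
}


def matches(truth, call, tolerance_bp=5):
    """Same type, with start and length within a tolerance (assumed criterion)."""
    return (truth.kind == call.kind
            and abs(truth.start - call.start) <= tolerance_bp
            and abs(truth.length - call.length) <= tolerance_bp)


def evaluate(truth_set, calls, tolerance_bp=5):
    """Return (names of detected known variations, unmatched calls)."""
    detected = {name for name, truth in truth_set.items()
                if any(matches(truth, c, tolerance_bp) for c in calls)}
    unmatched = [c for c in calls
                 if not any(matches(t, c, tolerance_bp) for t in truth_set.values())]
    return detected, unmatched


# Hypothetical calls, for illustration only.
calls = [Variant("deletion", 1176, 15),
         Variant("inversion", 4504, 2080),
         Variant("deletion", 30000, 50)]
detected, unmatched = evaluate(TABLE_1, calls)
print(sorted(detected), unmatched)
```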

Experiment two

Experimental test of the virus variation detection and analysis procedure on virus sample SYN

The genome of virus sample SYN was extracted and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G; the variations in the sample were detected by the virus variation detection and analysis procedure, and the detection results were verified by PCR and first-generation sequencing.

Analysis of the sequencing results detected exogenous fragment 1 and exogenous fragment 2. PCR primers were designed for each fragment; the gel electrophoresis result is shown in FIG. 9 and is consistent with the length and position of the detected exogenous fragments. The PCR fragments were recovered for first-generation sequencing, and their sequences were identical to those of the detected exogenous fragments.

Example 3 Test of the high-resolution correction procedure for variations detected by Pindel

Experiment one

Experimental test of correcting the variation detection results of virus sample SYN2 with the high-resolution correction procedure for variations detected by Pindel

The high-resolution correction procedure for variations detected by Pindel is shown in FIG. 10. Screening the Pindel results for variations with a depth greater than 10 gave 254 variations in total. After high-resolution correction, 34 microsatellite instability variations and 8 other variations remained, 2 of which were merged variations. Compared with the historical detection results, all of the microsatellite instability variations were already known, 5 of the other variations were already known, and 3 variations were newly detected. The results are shown in Table 2. It can be seen that the correction procedure effectively improves the accuracy of the Pindel detection results.
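Two of the operations mentioned above, screening by depth and merging variations of the same type, can be sketched as follows. The actual correction procedure is the one shown in FIG. 10 and is not reproduced here. The record layout, the 5 bp merge window and the rule of summing depths are all assumptions chosen for illustration, and the example calls are invented.

```python
def filter_by_depth(calls, min_depth=10):
    """Keep calls whose supporting depth is greater than min_depth."""
    return [c for c in calls if c["depth"] > min_depth]


def merge_same_type(calls, window_bp=5):
    """Merge calls of the same type whose start and end both lie within
    window_bp of the previous merged call (assumed merge rule)."""
    merged = []
    for call in sorted(calls, key=lambda c: (c["kind"], c["start"], c["end"])):
        last = merged[-1] if merged else None
        if (last is not None and last["kind"] == call["kind"]
                and abs(call["start"] - last["start"]) <= window_bp
                and abs(call["end"] - last["end"]) <= window_bp):
            last["depth"] += call["depth"]  # pool the supporting reads
        else:
            merged.append(dict(call))       # copy so inputs stay unchanged
    return merged


# Hypothetical calls, for illustration only.
raw_calls = [
    {"kind": "deletion", "start": 100, "end": 120, "depth": 12},
    {"kind": "deletion", "start": 101, "end": 121, "depth": 15},
    {"kind": "inversion", "start": 100, "end": 120, "depth": 3},
]
corrected = merge_same_type(filter_by_depth(raw_calls))
print(corrected)
```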

Table 2:

Example 4 Experimental test of the virus variation detection and analysis procedure on lentivirus

Experiment one

Experimental test of the virus variation detection and analysis procedure on lentivirus samples

A lentiviral vector of a given length was constructed as the non-variant vector; deletion and inversion variations of different lengths were introduced at different positions in the non-variant vector; the non-variant vector and each variant vector were mixed at a variation rate of 0.01 and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G; the variations in each sample were detected by the virus variation detection and analysis procedure and compared with the known variant vectors and actual variation rates to evaluate the accuracy of the procedure.

The virus variation detection and analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the variation rate of 0.01.

Example 5 Experimental test of the virus variation detection and analysis procedure on adeno-associated virus

Experiment one

Experimental test of the virus variation detection and analysis procedure on adeno-associated virus samples

An adeno-associated virus vector of a given length was constructed as the non-variant vector; deletion and inversion variations of different lengths were introduced at different positions in the non-variant vector; the non-variant vector and each variant vector were mixed at a variation rate of 0.01 and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G; the variations in each sample were detected by the virus variation detection and analysis procedure and compared with the known variant vectors and actual variation rates to evaluate the accuracy of the procedure.

The virus variation detection and analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the variation rate of 0.01.

Example 6 Experimental test of the virus variation detection and analysis procedure on herpes simplex virus

Experiment one

Experimental test of the virus variation detection and analysis procedure on herpes simplex virus samples

A herpes simplex virus vector of a given length was constructed as the non-variant vector; deletion and inversion variations of different lengths were introduced at different positions in the non-variant vector; the non-variant vector and each variant vector were mixed at a variation rate of 0.01 and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G; the variations in each sample were detected by the virus variation detection and analysis procedure and compared with the known variant vectors and actual variation rates to evaluate the accuracy of the procedure.

The virus variation detection and analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the variation rate of 0.01.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
