Classification unit component calculation method of sequencing data

Document No.: 1289193    Publication date: 2020-08-28

Reading note: This technology, a method for calculating the taxon composition of sequencing data (Classification unit component calculation method of sequencing data), was created by 梁忱胡龙吴苏生杨帆肖念清任用 on 2020-05-12. Its main content is as follows: The invention relates to a method for calculating the taxon (classification unit) composition of sequencing data. The method is based on the "secondary taxon frequency of sequencing reads" index and its calculation framework, which measure the extent of taxon misalignment in sequence alignment results. It can effectively remove false positive results from taxon composition calculation and improves the specificity and accuracy of that calculation. In addition, by re-counting after the abnormal taxa are removed, the invention returns misaligned reads to the true composition result and thereby corrects the quantitative taxon abundance results.

1. A bioinformatics analysis method for sequencing data, characterized by comprising the following steps:

step 1), sequencing data comparison;

step 2), grouping according to classification units;

step 3), counting the frequency of the secondary classification unit of the sequencing read sequence;

wherein in step 2), based on the alignment results of step 1), the sequencing reads are grouped according to the taxon preferentially supported by their alignment results;

in step 3), for the read groups obtained in step 2), the secondary taxon frequency of each group of sequencing reads is counted;

preferably, the secondary taxon frequency is counted as follows: for each group of sequencing reads, all mutually exclusive taxa to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group.

2. The bioinformatics analysis method of claim 1, wherein in step 1), the sequencing reads are aligned using alignment software that retains non-specific alignment results; preferably, the alignment software is BLASTN.

3. A method for reducing false positive taxon results in bioinformatics analysis of sequencing data, said method comprising the steps of claim 1 or 2 and further comprising:

step 4), a false positive taxon elimination step;

wherein in step 4), the secondary taxon frequency of each group is compared with a secondary taxon frequency threshold; if the value is greater than the threshold, the taxon preferentially supported by the sequencing reads of that group is determined to be an abnormal taxon and is eliminated; the elimination consists of removing, after all abnormal taxa of the sample have been identified, every alignment to an abnormal taxon from the original alignment result file.

4. A method of taxon component computation of sequencing data, the method comprising the steps of any of claims 1 to 3, and further comprising:

step 5), carrying out abundance statistics on the classification units;

wherein in step 5), the alignment results remaining after the abnormal taxa are removed in step 4) are regrouped according to the taxon preferentially supported by the alignment results, and the ratio of the number of reads in each group to the total number of reads is counted.

5. The method of any one of claims 1-4, wherein the sequencing data is from a second generation sequencing platform or a third generation sequencing platform; preferably, from Illumina, Ion Torrent, PacBio, Roche, Helicos, ABI or nanopore sequencing platforms; more preferably, from a nanopore sequencing platform.

6. The method of any one of claims 1 to 5, wherein the sequencing data is genomic sequencing data; preferably, metagenomic sequencing data; more preferably, it is urinary infection metagenomic sequencing data.

7. The method of any of claims 1-6, wherein the secondary taxon frequency threshold is 15-30%; preferably, it is 20%.

8. A system for reducing false positive taxon results in bioinformatics analysis of sequencing data, the system comprising:

module 1), a sequencing data alignment module;

module 2), a module for grouping by taxon;

module 3), a sub-taxon frequency statistics module for sequencing read sequences;

module 4), false positive classification unit exclusion module;

the module 1) performs sequence comparison on the sequencing read sequence by adopting comparison software which retains the non-specific comparison result; preferably, the software is BLASTN software;

the module 2) performs sequencing read sequence grouping according to the classification units preferentially supported by the comparison result based on the comparison result obtained by the module 1);

the module 3) counts the secondary taxon frequency of each group of sequencing reads for the read groups produced by module 2); preferably, the secondary taxon frequency is counted as follows: for each group of sequencing reads, all mutually exclusive taxa to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group;

the module 4) compares the secondary taxon frequency of each group with a secondary taxon frequency threshold; if the value is greater than the threshold, the taxon preferentially supported by the sequencing reads of that group is determined to be an abnormal taxon and is eliminated; the elimination consists of removing, after all abnormal taxa of the sample have been identified, every alignment to an abnormal taxon from the original alignment result file.

9. A taxon component computing system for sequencing data, the system comprising the system of claim 8 and further comprising:

module 5), a taxon abundance statistics module;

the module 5) groups the sequencing read sequence according to the classification unit preferentially supported by the comparison result again for the comparison result after the abnormal classification unit is removed, and counts the sequence number of each group and the proportion of the sequence number occupied in the total read sequence number.

10. The system of claim 8 or 9, wherein the sequencing data is derived from urinary infection metagenomic sequencing data.

Technical Field

The invention relates to the field of bioinformatics analysis, in particular to a method for calculating the taxon (classification unit) composition of sequencing data.

Background Art

Infectious diseases are caused by pathogenic microorganisms. They have many sources of infection and take many forms, and they have a major impact on public health worldwide. According to World Health Organization data, in 2016 lower respiratory tract infections alone caused about 3 million deaths worldwide. At the same time, antibiotic abuse resulting from the blind treatment of infectious diseases is becoming increasingly serious. Accurate detection of infectious pathogens is one of the most important ways to address these problems.

The traditional means of detecting infectious disease pathogens is microbial culture, but culture is slow and has low sensitivity. The polymerase chain reaction (hereinafter PCR) method is fast and highly sensitive, but can detect only one pathogen at a time. Sequencing-based pathogen detection sequences all DNA in a sample directly, and offers both a wide detection range and high sensitivity.

Nanopore sequencing is a new generation of sequencing technology that has emerged in recent years. It makes up for the disadvantages of second-generation sequencing platforms: its read length is one to two orders of magnitude greater than that of second-generation sequencing, and library preparation and sequencing times are short. In addition, the sequencing device is small and portable, and data can be obtained in real time and analyzed immediately, which removes the constraints of a fixed sequencing site and the delay in reporting results. The technology is therefore well suited to the detection of infectious microbial pathogens. The conventional species composition calculation workflow for nanopore sequencing in this field is as follows:

1. using ONT MinKNOW software to collect original sequencing data in real time in the sequencing operation process;

2. converting the original electric signal data by using ONT Albacore or ONT Guppy software to generate a base sequence;

3. host sequence removal based on hg38 human reference genome was performed using Minimap2 software;

4. calculating the species composition using What's In My Pot? (WIMP) software, and finally performing species abundance filtering.

The species component calculation process using WIMP software comprises the following steps:

1. sequence alignment was performed using Centrifuge software;

2. the species of each sequencing read is determined from the alignments of that read;

3. the number of sequencing reads supporting each species is counted, and the absolute abundance and relative abundance of the species are calculated;

4. the species results are filtered by a user-defined abundance threshold (e.g., a relative abundance threshold of 1%).
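The counting and filtering in steps 3 and 4 can be illustrated with a minimal sketch. This is not WIMP's actual code; the function name `abundance_filter`, the read-to-species dictionary and the example data are hypothetical and serve only to show what a relative abundance threshold of 1% does.

```python
from collections import Counter

def abundance_filter(read_species, min_relative_abundance=0.01):
    """Count reads per species and keep species at or above the threshold.

    read_species: dict mapping read ID -> assigned species (hypothetical input).
    Returns {species: (absolute_abundance, relative_abundance)}.
    """
    counts = Counter(read_species.values())   # absolute abundance per species
    total = sum(counts.values())
    kept = {}
    for species, n in counts.items():
        relative = n / total                  # relative abundance
        if relative >= min_relative_abundance:
            kept[species] = (n, relative)
    return kept

# Illustrative data: 995 reads assigned to one species, 5 to a close relative.
reads = {f"r{i}": "Escherichia coli" for i in range(995)}
reads.update({f"x{i}": "Shigella flexneri" for i in range(5)})
kept = abundance_filter(reads)
```

In this example the low-abundance species falls below 1% and is dropped. A misaligned species that happens to exceed the abundance threshold would pass such a filter unchanged.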

However, conventional analysis methods for sequencing data suffer from a high false positive rate (low specificity) in their species results, which greatly affects the accuracy of pathogen results. How to reasonably remove the species false positives introduced during sequence alignment is a technical problem that the prior art urgently needs to solve.

The invention is provided in view of the above.

Disclosure of Invention

The core problem to be solved by the invention is how to remove, as far as possible and by means of data analysis, the false positive taxon results introduced during sequence alignment. Because the genomes of closely related taxa share a certain proportion of similar sequences, a sequencing read derived from one taxon may be misaligned to the genomes of neighboring taxa during alignment, causing errors in the taxon composition calculation. If, in the face of such misalignment, the presence of a taxon is decided only by evaluating the alignments of each sequencing read one at a time, some false positive results are retained. The invention is the first to adopt a calculation framework that includes an analysis of the alignments as a whole in order to determine whether a taxon composition result is genuine.

Existing taxon composition calculation methods remove false positives only by abundance screening (for example, removing taxa with a relative abundance below 1%). They do not construct an active strategy that identifies the false positive taxa introduced by misalignment by evaluating the overall distribution of the alignment results.

The invention starts from the observation that misalignment arises from similar sequences shared between the genomes of closely related taxa. True positive alignments therefore mostly do not come from sequences shared between taxa, whereas false positive alignments mainly do. The overall statistical difference between true positive and false positive alignments can then be captured by some index or combination of indexes.

Based on this principle, the invention first found that if the sequencing reads are divided into groups according to their preferentially aligned taxon, the specific alignment ratio of the reads in a true positive taxon group is relatively high, while that of the reads in a false positive taxon group is relatively low. Further data exploration then showed that the secondary taxon frequency of the sequencing reads, an index built on the same principle as the specific alignment ratio, discriminates better (see FIG. 2). The invention therefore constructs, on the basis of the secondary taxon frequency index, a calculation method for the secondary taxon frequency of sequencing reads and a quantitative method for measuring false positive taxon results in sequence alignment analysis. This taxon-level screening method can effectively remove false positive taxon results from metagenomic taxon composition calculation and improves specificity and accuracy. Finally, through the strategy of "removing the abnormal taxa and then counting again", the invention returns misaligned sequencing reads to the true composition result, so that the quantitative taxon abundance results are corrected while the specificity of the taxon results is improved.

Therefore, a first object of the present invention is to provide a taxon component calculation method of sequencing data and a system thereof.

The second object of the invention is to provide a method and a system for reducing false positive taxon results in bioinformatics analysis of sequencing data.

Based on the above purpose, the invention provides the following technical scheme:

The invention provides a bioinformatics analysis method for sequencing data, characterized by comprising the following steps:

step 1) sequencing data comparison;

step 2) grouping according to classification units;

step 3) counting the frequency of the secondary classification unit of the sequencing read sequence;

In some embodiments, step 1) comprises aligning the sequencing reads using alignment software that retains non-specific alignment results; preferably, the software is BLASTN.

In some embodiments, in step 2), based on the alignment results of step 1), the sequencing reads are grouped according to the taxon preferentially supported by their alignment results, i.e., all reads in a group preferentially support the same taxon.
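A minimal sketch of this grouping is given here. It rests on an assumption the text does not state: the "preferentially supported" taxon of a read is taken to be the taxon of its highest-scoring alignment, with ties resolved in favor of the first alignment seen. The `(read_id, taxon, score)` tuple layout and the function name are likewise hypothetical.

```python
from collections import defaultdict

def group_reads_by_preferred_taxon(alignments):
    """Group reads by the taxon of their best alignment.

    alignments: iterable of (read_id, taxon, score) tuples; a read may appear
    several times because non-specific alignments are retained.
    Returns {preferred_taxon: [read_id, ...]}.
    """
    best = {}  # read_id -> (score, taxon) of the best alignment so far
    for read_id, taxon, score in alignments:
        # Assumption: a higher score means stronger support; first hit wins ties.
        if read_id not in best or score > best[read_id][0]:
            best[read_id] = (score, taxon)
    groups = defaultdict(list)
    for read_id, (_, taxon) in best.items():
        groups[taxon].append(read_id)
    return dict(groups)

# Illustrative alignments (made-up scores).
alignments = [
    ("read1", "E. coli", 950.0),
    ("read1", "S. flexneri", 900.0),
    ("read2", "E. coli", 800.0),
    ("read3", "S. flexneri", 700.0),
    ("read3", "E. coli", 650.0),
]
groups = group_reads_by_preferred_taxon(alignments)
```

Each read lands in exactly one group, while its non-preferred alignments remain available for the frequency statistics of the next step.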

In some embodiments, in step 3), the secondary taxon frequency is counted for each group of sequencing reads obtained in step 2). In some embodiments, the specific steps of step 3) are as follows: for each group of sequencing reads, all mutually exclusive taxa (such as the set of species aligned to) to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group.
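The counting procedure above can be written compactly. The symbols $R_g$, $T_g$, $f_g$ and $F_g^{(2)}$ are notation introduced here for illustration and do not appear in the original text. Let $R_g$ be the reads of group $g$ and $T_g$ the set of mutually exclusive taxa to which at least one read of $R_g$ is aligned.

```latex
f_g(t) = \frac{\bigl|\{\, r \in R_g : r \text{ has an alignment to } t \,\}\bigr|}{\lvert R_g \rvert},
\qquad t \in T_g

F_g^{(2)} = \text{the second largest value in } \{\, f_g(t) : t \in T_g \,\}
```

As a made-up example, take a group of 100 reads of which 100 align to taxon A, 35 to taxon B and 10 to taxon C. The percentages are 100%, 35% and 10%, so $F_g^{(2)} = 35\%$. The text does not say what value to use when $T_g$ contains a single taxon; treating $F_g^{(2)}$ as 0 in that case is an assumption.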

The invention provides a method for reducing false positive taxon results in bioinformatics analysis of sequencing data, characterized by comprising the following steps:

step 1) sequencing data comparison;

step 2) grouping according to classification units;

step 3) counting the frequency of the secondary classification unit of the sequencing read sequence;

step 4), false positive classification unit elimination step;

In some embodiments, in step 4), the secondary taxon frequency of each group is compared with a secondary taxon frequency threshold. If the value is greater than the threshold, the taxon preferentially supported by the sequencing reads of that group is determined to be an abnormal taxon and is eliminated. The elimination consists of removing, after all abnormal taxa of the sample have been identified, every alignment to an abnormal taxon from the original alignment result file.
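The elimination step might be sketched as follows, assuming the secondary taxon frequency of each group has already been computed. The function name, the `(read_id, taxon, score)` tuple layout and the example values are hypothetical; only the "greater than the threshold" rule and the removal of all alignments to abnormal taxa come from the text.

```python
def remove_abnormal_taxa(alignments, secondary_freq, threshold=0.20):
    """Flag abnormal taxa and drop every alignment that hits one of them.

    alignments: list of (read_id, taxon, score) tuples (hypothetical layout).
    secondary_freq: {preferred_taxon: secondary taxon frequency as a fraction}.
    Returns (set_of_abnormal_taxa, filtered_alignments).
    """
    # First collect all abnormal taxa of the sample ...
    abnormal = {taxon for taxon, freq in secondary_freq.items() if freq > threshold}
    # ... then remove every alignment to any of them from the original results.
    kept = [a for a in alignments if a[1] not in abnormal]
    return abnormal, kept

# Illustrative input (made-up numbers).
alignments = [
    ("r1", "E. coli", 950.0),
    ("r1", "S. flexneri", 900.0),
    ("r2", "S. flexneri", 700.0),
    ("r2", "E. coli", 690.0),
]
secondary_freq = {"E. coli": 0.08, "S. flexneri": 0.65}
abnormal, kept = remove_abnormal_taxa(alignments, secondary_freq)
```

In this example read `r2` originally preferred the flagged taxon. Once those alignments are gone, its remaining alignment points to the other taxon, which is the "regression" of misaligned reads that the re-counting strategy relies on.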

In some embodiments, the sequencing data is derived from urinary infection metagenomic sequencing data.

In some embodiments, the secondary taxon frequency threshold is 15-30%, preferably 20%.

In some embodiments, the secondary taxon frequency threshold may also be calculated as follows: a number of samples are used as a training set, and the true positive and false positive results of the conventional bioinformatics analysis are confirmed against traditional culture and/or PCR identification results; the bioinformatics analysis is then repeated, dividing the sequencing reads of each sample into groups according to the taxon preferentially supported by the alignment results, so that all reads in a group preferentially support the same taxon; the frequency of each taxon to which the reads of a group are aligned is counted, giving the secondary taxon frequency of that group; finally, the secondary taxon frequencies are collected separately for groups whose preferentially supported taxon is a true positive according to the culture and/or qPCR results and for groups whose preferentially supported taxon is a false positive, yielding a threshold that distinguishes the two.
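The text does not specify how the distinguishing threshold is derived from the two sets of frequencies. One simple possibility, shown here purely as an assumption, is to scan the midpoints between adjacent observed values and keep the one that classifies the most training groups correctly. The function name and the training values are hypothetical.

```python
def choose_threshold(tp_freqs, fp_freqs):
    """Pick a cut-off separating true positive from false positive groups.

    tp_freqs / fp_freqs: secondary taxon frequencies (as fractions) of groups
    whose preferred taxon was confirmed / refuted by culture or qPCR.
    Assumption: a group is called abnormal when its frequency exceeds the cut-off.
    """
    values = sorted(set(tp_freqs) | set(fp_freqs))
    if len(values) < 2:
        raise ValueError("need at least two distinct frequency values")
    # Candidate cut-offs: midpoints between adjacent observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def n_correct(cutoff):
        return (sum(f <= cutoff for f in tp_freqs)
                + sum(f > cutoff for f in fp_freqs))

    # max() returns the first candidate with the highest score.
    return max(candidates, key=n_correct)

# Illustrative training data (made-up numbers).
tp = [0.02, 0.05, 0.10, 0.12]
fp = [0.35, 0.50, 0.28, 0.70]
threshold = choose_threshold(tp, fp)
```

With these made-up values the two classes are perfectly separable and the chosen cut-off lies midway between 0.12 and 0.28. Real training sets may overlap, in which case a criterion that weighs sensitivity against specificity would be needed.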

The invention also provides a system for reducing false positive taxon results in bioinformatics analysis of sequencing data, characterized by comprising the following modules:

module 1) a sequencing data alignment module;

module 2) a module for grouping by taxon;

module 3) a sub-taxon frequency statistics module for sequencing read sequences;

module 4) false positive classification unit exclusion module;

in some embodiments, the module 1) performs sequence alignment on the sequencing reads using alignment software that retains the results of the non-specific alignment, preferably the software is BLASTN software.

In some embodiments, the module 2) groups the sequencing reads according to the taxa preferentially supported by the alignment result based on the alignment result obtained in module 1), i.e., the taxa preferentially supported by each group of sequencing reads are the same.

In some embodiments, the module 3) counts the sub-taxon frequency for each group of sequencing read sequences of the sequencing read grouping of module 2).

In some embodiments, module 3) performs the following specific steps: for each group of sequencing reads, all mutually exclusive taxa to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group.
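The statistic computed by module 3) can be sketched in a few lines. The function name, the `hits_by_read` mapping and the example data are hypothetical, and returning 0.0 for a group whose reads hit only one taxon is an assumption, since the text does not cover that case.

```python
from collections import Counter

def secondary_taxon_frequency(group_reads, hits_by_read):
    """Return the second largest per-taxon support fraction within one group.

    group_reads: read IDs that preferentially support the same taxon.
    hits_by_read: {read_id: iterable of taxa the read is aligned to}.
    """
    support = Counter()
    for read_id in group_reads:
        # set() so that a read counts once per taxon, however many hits it has.
        support.update(set(hits_by_read[read_id]))
    fractions = sorted((n / len(group_reads) for n in support.values()),
                       reverse=True)
    # Assumption: with fewer than two taxa there is no secondary taxon.
    return fractions[1] if len(fractions) > 1 else 0.0

# Illustrative group of four reads that all prefer taxon "A".
group_reads = ["r1", "r2", "r3", "r4"]
hits_by_read = {
    "r1": ["A", "B"],
    "r2": ["A"],
    "r3": ["A", "B", "C"],
    "r4": ["A"],
}
freq = secondary_taxon_frequency(group_reads, hits_by_read)
```

Here taxon A is supported by all four reads, B by two and C by one, so the second largest fraction is the one for B.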

In some embodiments, the module 4) compares the secondary taxon frequency of each group with a secondary taxon frequency threshold. If the value is greater than the threshold, the taxon preferentially supported by the sequencing reads of that group is determined to be an abnormal taxon and is eliminated. The elimination consists of removing, after all abnormal taxa of the sample have been identified, every alignment to an abnormal taxon from the original alignment result file.

In some embodiments, the sequencing data is derived from urinary infection metagenomic sequencing data.

In some embodiments, the secondary taxon frequency threshold is 15-30%, preferably 20%.

In some embodiments, the sequencing data used by the method or module for reducing false positive taxon results in bioinformatics analysis of sequencing data is from a second generation sequencing platform or a third generation sequencing platform; preferably, from Illumina, Ion Torrent, PacBio, Roche, Helicos, ABI or nanopore sequencing platforms; more preferably, from a nanopore sequencing platform.

In some embodiments, the sequencing data used by the method or module for reducing false positive taxon results in bioinformatics analysis of sequencing data is genome sequencing data; preferably, metagenomic sequencing data; more preferably, urinary infection metagenomic sequencing data.

The invention also provides a method for calculating the classification unit components of sequencing data, which comprises the following steps:

step 1) sequencing data comparison;

step 2) grouping according to classification units;

step 3) counting the frequency of the secondary classification unit of the sequencing read sequence;

step 4), false positive classification unit elimination step;

step 5) carrying out abundance statistics on the classification units.

In some embodiments, step 1) comprises performing a sequence alignment on the sequencing reads using alignment software that retains the results of the non-specific alignment, preferably the software is BLASTN software.

In some embodiments, in step 3), the secondary taxon frequency is counted for each group of sequencing reads obtained in step 2). In some embodiments, the specific steps of step 3) are as follows: for each group of sequencing reads, all mutually exclusive taxa to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group.

In some embodiments, step 5), the alignment results after the abnormal taxon is eliminated are regrouped according to the taxon preferentially supported by the alignment results, and the sequence number of each subgroup (i.e., the absolute abundance of the taxon) and the proportion of the total read sequence number occupied by the subgroup (i.e., the relative abundance of the taxon) are counted.
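The abundance statistics of step 5) could look like the sketch below. It assumes, as elsewhere in these illustrations, that the preferentially supported taxon is the one with the highest-scoring remaining alignment. The text also leaves open whether the "total read sequence number" counts all reads or only reads that still have an alignment, so the total is exposed as an optional parameter. All names and data are hypothetical.

```python
from collections import Counter

def taxon_abundance(filtered_alignments, total_reads=None):
    """Regroup reads after filtering and compute abundances.

    filtered_alignments: (read_id, taxon, score) tuples that remain after the
    alignments to abnormal taxa have been removed.
    total_reads: denominator for relative abundance; defaults to the number of
    reads that still have at least one alignment (an assumption).
    Returns {taxon: (absolute_abundance, relative_abundance)}.
    """
    best = {}  # read_id -> (score, taxon) of the best remaining alignment
    for read_id, taxon, score in filtered_alignments:
        if read_id not in best or score > best[read_id][0]:
            best[read_id] = (score, taxon)
    counts = Counter(taxon for _, taxon in best.values())
    if total_reads is None:
        total_reads = len(best)
    return {taxon: (n, n / total_reads) for taxon, n in counts.items()}

# Illustrative filtered alignments (made-up scores).
filtered = [
    ("r1", "E. coli", 950.0),
    ("r2", "E. coli", 690.0),
    ("r3", "K. pneumoniae", 800.0),
    ("r4", "E. coli", 500.0),
]
abundance = taxon_abundance(filtered)
```

Because the regrouping runs on the filtered alignments, reads that used to prefer an eliminated taxon are now counted toward whichever remaining taxon they align to best.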

In some embodiments, the sequencing data is derived from urinary infection metagenomic sequencing data.

In some embodiments, the secondary taxon frequency threshold in step 4) may be an empirical value known in the art for a particular sample type, typically 15-30%, preferably 20%.

In some embodiments, the secondary taxon frequency threshold may also be calculated as follows: a number of samples are used as a training set, and the true positive and false positive results of the conventional bioinformatics analysis are confirmed against traditional culture and/or PCR identification results; the bioinformatics analysis is then repeated, dividing the sequencing reads of each sample into groups according to the taxon preferentially supported by the alignment results, so that all reads in a group preferentially support the same taxon; the frequency of each taxon to which the reads of a group are aligned is counted, giving the secondary taxon frequency of that group; finally, the secondary taxon frequencies are collected separately for groups whose preferentially supported taxon is a true positive according to the culture and/or qPCR results and for groups whose preferentially supported taxon is a false positive, yielding a threshold that distinguishes the two.

The invention also provides a system for calculating the classification unit components of sequencing data, which is characterized by comprising the following modules:

module 1) a sequencing data alignment module;

module 2) a module for grouping by taxon;

module 3) a sub-taxon frequency statistics module for sequencing read sequences;

module 4) false positive classification unit exclusion module;

module 5) a taxon abundance statistics module.

In some embodiments, the module 1) performs sequence alignment on the sequencing reads using alignment software that retains the results of the non-specific alignment, preferably the software is BLASTN software.

In some embodiments, the module 3) counts the sub-taxon frequency for each group of sequencing read sequences of the sequencing read grouping of module 2).

In some embodiments, the module 3) performs the following specific steps: for each group of sequencing reads, all mutually exclusive taxa to which at least one read of the group is aligned are found; for each taxon in that set, the number of reads in the group supporting the taxon is calculated as a percentage of the total number of reads in the group; the second largest of these percentages is the secondary taxon frequency of the group.

In some embodiments, the module 4) compares the secondary taxon frequency of each group with a secondary taxon frequency threshold, and if the value is greater than the threshold, determines the taxon preferentially supported by the sequencing reads of that group to be an abnormal taxon and eliminates it; preferably, the elimination consists of removing, after all abnormal taxa of the sample have been identified, every alignment to an abnormal taxon from the original alignment result file.

In some embodiments, the sequencing data is derived from urinary infection metagenomic sequencing data.

In some embodiments, the secondary taxon frequency threshold used by module 4) may be an empirical value known in the art for a specific sample type, typically 15-30%, preferably 20%.

In some embodiments, the secondary taxon frequency threshold may also be calculated as follows: a number of samples are used as a training set, and the true positive and false positive results of the conventional bioinformatics analysis are confirmed against traditional culture and/or PCR identification results; the bioinformatics analysis is then repeated, dividing the sequencing reads of each sample into groups according to the taxon preferentially supported by the alignment results, so that all reads in a group preferentially support the same taxon; the frequency of each taxon to which the reads of a group are aligned is counted, giving the secondary taxon frequency of that group; finally, the secondary taxon frequencies are collected separately for groups whose preferentially supported taxon is a true positive according to the culture and/or qPCR results and for groups whose preferentially supported taxon is a false positive, yielding a threshold that distinguishes the two.

In some embodiments, the module 5) groups the sequencing read sequences according to the taxa preferentially supported by the alignment result again for the alignment result after the abnormal taxa is eliminated, and counts the sequence number of each group (i.e., the absolute abundance of the taxa) and the proportion of the total read sequences occupied by the group (i.e., the relative abundance of the taxa).

In some embodiments, the sequencing data in the taxon composition calculation methods or modules described above is from a second generation sequencing platform or a third generation sequencing platform; preferably, from Illumina, Ion Torrent, PacBio, Roche, Helicos, ABI or nanopore sequencing platforms; more preferably, from a nanopore sequencing platform.

In some embodiments, the sequencing data used by the taxon composition calculation method or module is genome sequencing data; preferably, metagenomic sequencing data; more preferably, urinary infection metagenomic sequencing data.

The invention has the beneficial technical effects that:

1. The invention provides a novel bioinformatics analysis method that improves on the conventional species component calculation method by screening species on the basis of overall statistics of the sequence alignment results.

2. By introducing the calculation of the secondary taxon frequency of sequencing reads, the method solves, for the first time, the problem of removing false-positive taxon results introduced by mis-alignment, which conventional species component calculation methods have difficulty handling, and effectively improves the accuracy and specificity of pathogen detection.

3. The calculation framework does not depend on the choice of a specific sequencing platform, is suitable for sequencing data from multiple platforms such as second-generation and third-generation sequencing technologies, and can be applied to test samples from different sources or different species.

Drawings

FIG. 1: Ideal versus actual situation when calculating taxon components by sequence alignment;

FIG. 2: Ability of the "specific alignment ratio" and the "secondary taxon frequency" to discriminate true-positive from false-positive taxon results, explored using 36 urinary test samples;

FIG. 3: Consistency of the species detection results of the conventional method with culture and qPCR verification (absolute abundance threshold: 100 sequences);

FIG. 4: Consistency of the species detection results of the method of the invention with culture and qPCR verification (absolute abundance threshold: 100 sequences);

FIG. 5: Consistency of the species detection results of the conventional method with culture and qPCR verification (absolute abundance threshold: 200 sequences);

FIG. 6: Consistency of the species detection results of the method of the invention with culture and qPCR verification (absolute abundance threshold: 200 sequences).

Detailed Description

Embodiments of the present invention are described in detail below with reference to examples, but those skilled in the art will understand that the following examples merely illustrate the present invention and should not be construed as limiting its scope. Where specific conditions are not specified in the examples, conventional conditions or the conditions recommended by the manufacturer were used. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products.

Definition of partial terms

Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present invention are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.

As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.

Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.

The term "about" in the present invention denotes an interval of accuracy that a person skilled in the art will understand still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ±10%, preferably ±5%, from the indicated value.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The following terms or definitions are provided only to aid in understanding the present invention. These definitions should not be construed to have a scope less than understood by those skilled in the art.

The term "sequencing read sequence" in the present invention (in English, "read" or "reads") refers to a nucleic acid sequence, or a set of nucleic acid sequences, read out by a sequencing platform.

The term "alignment result" in the present invention (in English, "alignment") refers to the correspondence between a sequencing read and a reference sequence; a single sequencing read can have multiple alignment results at the same time.

The term "classification unit" in the present invention (in English, "taxon") refers to a group of organisms sharing some common characteristics, such as Protozoa, Primates, Staphylococcus aureus, or Salmonella enterica subsp. Different taxa may have different taxonomic ranks (e.g., Protozoa corresponds to the rank "phylum", Primates to the rank "order", Staphylococcus aureus to the rank "species", and a Salmonella enterica subspecies to the rank "subspecies"), or may have the same rank (e.g., Staphylococcus aureus, Staphylococcus epidermidis, Staphylococcus haemolyticus, and Staphylococcus hominis all correspond to the rank "species"). Species and strains are two specific classes of taxa, and they are also the classes of primary interest for the taxon component calculation method of the present invention.

The term "species" in the present invention (in English, "species"): a species is a special class of taxon that refers to a group of organisms that can interbreed and produce offspring.

The term "mutually exclusive" in the present invention means that, for any two taxa A and B selected from a group of taxa, taxon A neither contains nor is contained by taxon B. For example, the three taxa E. coli, Salmonella enterica, and Klebsiella pneumoniae are mutually exclusive; a pair in which one taxon contains the other, such as the genus Klebsiella and the species Klebsiella pneumoniae, is not mutually exclusive.
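The containment test behind "mutually exclusive" can be sketched as below. The `LINEAGE` table is a small hypothetical stand-in for a real taxonomy database, and the function names are illustrative only.

```python
from itertools import combinations

# Hypothetical lineage table: each taxon maps to the set of its ancestors.
LINEAGE = {
    "Escherichia coli": {"Escherichia", "Enterobacteriaceae"},
    "Salmonella enterica": {"Salmonella", "Enterobacteriaceae"},
    "Klebsiella pneumoniae": {"Klebsiella", "Enterobacteriaceae"},
    "Klebsiella": {"Enterobacteriaceae"},
}

def contains(a, b, lineage=LINEAGE):
    """True if taxon a is an ancestor of (i.e. contains) taxon b."""
    return a in lineage.get(b, set())

def mutually_exclusive(taxa, lineage=LINEAGE):
    """True if no taxon in the collection contains or is contained by another."""
    return all(
        not contains(a, b, lineage) and not contains(b, a, lineage)
        for a, b in combinations(taxa, 2)
    )
```

With this table, three species from different genera are mutually exclusive, whereas a genus paired with one of its own species is not.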

The terms "a sequencing read sequence aligned to a taxon" and "a sequencing read sequence supports a taxon" in the present invention mean that the alignment results of the sequencing read include a reference sequence from that taxon.

The terms "a sequencing read sequence preferentially aligns to a taxon" and "a sequencing read sequence preferentially supports a taxon" mean that the taxon has the highest alignment score among all mutually exclusive taxa corresponding to the alignment results of the sequencing read. The taxon with the highest alignment score is determined by grouping the alignment results of the sequencing read by their corresponding taxa, summing the alignment scores within each group, and selecting the taxon with the highest score sum.
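The score-summing rule can be sketched as follows. The input layout, a list of (taxon, score) pairs for one read, and the alphabetical tie-break are assumptions of this sketch; the text does not say how ties are resolved.

```python
from collections import defaultdict

def preferred_taxon(alignments):
    """Return the taxon with the highest summed alignment score for one read.

    alignments is a list of (taxon, score) pairs for a single read; the taxa
    are assumed to be mutually exclusive. Ties are broken alphabetically,
    which is an assumption of this sketch. Returns None for an unaligned read.
    """
    totals = defaultdict(float)
    for taxon, score in alignments:
        totals[taxon] += score
    if not totals:
        return None
    return min(totals, key=lambda t: (-totals[t], t))
```

Because scores are summed per taxon, a taxon with several moderate hits can outrank a taxon with the single best hit.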

The term "specific alignment" in the present invention means that the reference sequences corresponding to all alignment results of a sequencing read are from the same taxon.

The term "non-specific alignment" in the present invention means that the alignment results of a sequencing read simultaneously contain reference sequences from two or more mutually exclusive taxa.
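The distinction between specific and non-specific alignment can be illustrated with a short sketch. It assumes that all taxon names passed in are at the same rank (for example, all species), so that distinct names are automatically mutually exclusive; the function name and return labels are illustrative.

```python
def alignment_specificity(taxa_hit):
    """Classify one read as 'specific', 'non-specific', or 'unaligned'.

    taxa_hit is an iterable of taxon names, one per alignment result of the
    read. All names are assumed to be at the same rank, so that distinct
    names are mutually exclusive.
    """
    distinct = set(taxa_hit)
    if not distinct:
        return "unaligned"
    return "specific" if len(distinct) == 1 else "non-specific"
```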

The term "mis-alignment" in the present invention means that the alignment results of a sequencing read that originates from one taxon contain a reference sequence from another taxon that neither contains nor is contained by the source taxon. Note that such an alignment result is computationally correct, and its base identity can be very high, but the aligned taxon does not match the taxon from which the read actually originates in the sample. Mis-alignments of this kind generally occur between taxa with closely related genomes.

The term "taxon frequency of a taxon in a set of sequencing reads" in the present invention refers to the proportion of reads supporting the taxon among the total number of reads in the given set.

The term "minor taxon frequency of a set of sequencing reads" (also called the secondary taxon frequency) in the present invention refers to the second-highest taxon frequency in a group of sequencing reads that all preferentially support the same taxon. That is, given such a group, all taxa aligned by at least one sequencing read in the group are identified, and for each taxon found, the percentage of reads in the group that support it is calculated. The highest value is necessarily the taxon frequency (100%) of the taxon preferentially supported by the group, and the second-highest of all the percentages is the minor taxon frequency of the group.
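A minimal sketch of this definition follows. The input layout (one set of aligned taxa per read) and the convention of returning 0.0 when only one taxon is hit are assumptions of the sketch, not statements from the text.

```python
from collections import Counter

def secondary_taxon_frequency(group):
    """Secondary (minor) taxon frequency of a group of reads.

    group is a list with one entry per read: the set of mutually exclusive
    taxa that the read aligns to. All reads are assumed to share the same
    preferentially supported taxon, which therefore appears in every set.
    Returns 0.0 when no second taxon is hit (an assumption of this sketch).
    """
    if not group:
        raise ValueError("group must contain at least one read")
    support = Counter(taxon for taxa in group for taxon in set(taxa))
    frequencies = sorted((n / len(group) for n in support.values()), reverse=True)
    return frequencies[1] if len(frequencies) > 1 else 0.0
```

For a group of four reads that all support E. coli, of which two also hit Klebsiella pneumoniae and one also hits Klebsiella aerogenes, the frequencies are 100%, 50% and 25%, so the function returns 0.5.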

The term "false positive classification unit" (false-positive taxon) in the present invention refers to a taxon that is reported as positive by the taxon component calculation but is not actually present in the sample. One special case of a "false positive taxon" is a "false positive species".

The term "abnormal classification unit" (abnormal taxon) in the present invention refers to a taxon that is identified as abnormal by the "secondary taxon frequency of sequencing reads" index in the method of the invention. One special case of an "abnormal taxon" is an "abnormal species".

The "method for calculating the taxon component of sequencing data" in the present invention preferably uses the secondary taxon frequency of sequencing reads as an index for measuring mis-alignment in the sequence alignment results, so as to obtain the taxon composition of the sequencing data while effectively removing false-positive results from the taxon component calculation. By introducing the calculation of the secondary taxon frequency of sequencing reads, the invention solves the problem of removing false-positive taxa introduced by mis-alignment, which conventional taxon component calculation methods have difficulty handling, and effectively improves the specificity and accuracy of pathogen detection. The calculation framework does not depend on the choice of a specific sequencing platform and is suitable for sequencing data from a variety of platforms, such as second-generation and third-generation sequencing technologies; the nanopore sequencing platform is merely preferred. Because the framework addresses the mis-alignment of any homologous sequences, the source of the sequences does not limit the application of the invention: in addition to the preferred metagenomic data, other sources of genomic or genetic data are equally suitable, as will be understood in the art.

The technical idea of the invention as a whole is explained as follows by way of example, but not by way of limitation:

1) Sequence alignment: the original alignment results of the sequencing reads are obtained using alignment software that retains non-specific alignment results:

To meet the need to retain all non-specific alignment results, the present invention performs sequence alignment with a Megablast method, for example using the BLASTN software.
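As a sketch of how such alignment output might be loaded while keeping every hit per read, the snippet below parses BLAST+ tabular output. It assumes the alignments were written in the default 12-column tabular format (`-outfmt 6`) from a Megablast run (`-task megablast`); the text does not specify an output format. Using the bit score as the alignment score, and the `taxon_of_reference` lookup from reference ID to taxon, are likewise assumptions.

```python
import csv
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def load_alignments(lines, taxon_of_reference):
    """Group all alignment results by read, keeping every hit.

    lines is any iterable of tab-separated text lines (e.g. an open file).
    taxon_of_reference is a hypothetical mapping from reference sequence ID
    to taxon name. Returns {read_id: [(taxon, bitscore), ...]}.
    """
    hits = defaultdict(list)
    for row in csv.DictReader(lines, fieldnames=BLAST_COLUMNS, delimiter="\t"):
        taxon = taxon_of_reference.get(row["sseqid"])
        if taxon is not None:  # skip references with no known taxon
            hits[row["qseqid"]].append((taxon, float(row["bitscore"])))
    return dict(hits)
```

Nothing is filtered to the best hit here, which is the point: non-specific alignments must survive for the later frequency statistics.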

2) Calculation of taxon frequency for sequencing reads:

After sequence alignment, for a group of sequencing reads, the proportion of reads aligned to each taxon, among all mutually exclusive taxa aligned by the group, is calculated relative to the total number of reads in the group.

Example: suppose there are 4 sequencing reads whose original alignment results are as follows: reads 1 and 2 align specifically to E. coli, read 3 aligns specifically to Klebsiella pneumoniae, and read 4 aligns simultaneously to E. coli, Klebsiella pneumoniae, and Klebsiella aerogenes. The taxon frequencies of E. coli, Klebsiella pneumoniae, and Klebsiella aerogenes in these 4 sequencing reads are then 75%, 50%, and 25%, respectively.
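The four-read example can be reproduced with a few lines of code. The sketch reads the example as reads 1 and 2 aligning only to E. coli, read 3 only to Klebsiella pneumoniae, and read 4 to all three species, which is the assignment consistent with the stated 75%, 50% and 25%. The function name and data layout are illustrative.

```python
from collections import Counter

def taxon_frequencies(group):
    """Fraction of reads in the group that support each taxon."""
    support = Counter(taxon for taxa in group for taxon in set(taxa))
    return {taxon: n / len(group) for taxon, n in support.items()}

reads = [
    {"Escherichia coli"},                      # read 1: specific alignment
    {"Escherichia coli"},                      # read 2: specific alignment
    {"Klebsiella pneumoniae"},                 # read 3: specific alignment
    {"Escherichia coli", "Klebsiella pneumoniae",
     "Klebsiella aerogenes"},                  # read 4: non-specific alignment
]
freqs = taxon_frequencies(reads)
```

E. coli is supported by reads 1, 2 and 4 (3 of 4), Klebsiella pneumoniae by reads 3 and 4 (2 of 4), and Klebsiella aerogenes by read 4 only (1 of 4).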

3) Calculation of the secondary taxon frequency of the sequencing reads:

Given a group of sequencing reads that share the same preferentially supported taxon, the highest taxon frequency among all mutually exclusive taxa aligned by the group is that of the preferentially supported taxon, which is 100%; the second-highest of these taxon frequencies is the secondary taxon frequency of the group. Practical experience shows that if the taxon preferentially supported by the group is a true positive, the specific alignment ratio is generally higher and the secondary taxon frequency is generally lower.
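The definitions in steps 2) and 3) can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the original text: G is a group of reads sharing the preferentially supported taxon t*, T(r) is the set of mutually exclusive taxa aligned by read r, f_G(t) is the taxon frequency of t in G, and s_G is the secondary taxon frequency of G.

```latex
f_G(t) = \frac{\lvert \{\, r \in G : t \in T(r) \,\} \rvert}{\lvert G \rvert},
\qquad
f_G(t^{*}) = 1,
\qquad
s_G = \max_{t \neq t^{*}} f_G(t)
```

Since every read in G supports t*, its frequency is always 1, so the second-highest frequency equals the maximum over all other taxa.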

4) Judging false-positive taxa based on the "secondary taxon frequency of sequencing reads" index:

The sequencing reads of the sample are divided into groups according to their preferentially supported taxa, so that all reads in a group preferentially support the same taxon. The secondary taxon frequency is then calculated for each group. If the secondary taxon frequency of a group exceeds the threshold (the threshold calculation method is described below), the taxon preferentially supported by that group is judged to be an abnormal taxon.
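The grouping and thresholding in step 4) can be sketched as below. The input layout and function name are assumptions; the default threshold of 0.20 simply reflects the empirical value of around 20% mentioned in the threshold discussion, and is not a recommendation.

```python
from collections import Counter, defaultdict

def abnormal_taxa(reads, threshold=0.20):
    """Return the preferred taxa whose read group exceeds the threshold.

    reads is a list of (preferred_taxon, taxa_hit) pairs, one per read, where
    taxa_hit is the set of mutually exclusive taxa the read aligns to
    (including the preferred taxon itself).
    """
    # Group reads by their preferentially supported taxon.
    groups = defaultdict(list)
    for preferred, taxa_hit in reads:
        groups[preferred].append(set(taxa_hit))

    flagged = set()
    for preferred, group in groups.items():
        support = Counter(t for taxa in group for t in taxa)
        ranked = sorted((n / len(group) for n in support.values()), reverse=True)
        secondary = ranked[1] if len(ranked) > 1 else 0.0
        if secondary > threshold:
            flagged.add(preferred)
    return flagged
```

A group in which almost every read also hits a second taxon is flagged, while a group dominated by specific alignments is kept.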

5) Calculation of the secondary taxon frequency threshold:

The secondary taxon frequency threshold may be an empirical value known in the art, such as 15-30%, preferably around 20%. The present invention recognizes that, for different types of infectious diseases, the secondary taxon frequency threshold may vary somewhat because the pathogen classes differ, and the method described here may be used to determine the threshold in advance for each type of disease. Illustratively, the calculation may be performed as follows. A certain number of samples are used as a training set, and the taxon identification results of the samples are obtained by a culture method. The taxon component results of the samples are obtained using a conventional component calculation method. Taxa for which the culture results disagree with the conventional bioinformatics results are verified by qPCR, so as to identify the true-positive and false-positive results among the bioinformatics results. The data are then analyzed again, and the sequencing reads of each sample are divided into groups according to the taxon preferentially supported by the alignment results, so that all reads in a group preferentially support the same taxon. Next, the taxon frequency of each taxon aligned by each group of reads is counted, and the secondary taxon frequency of the group is obtained. Finally, the secondary taxon frequencies of the groups whose preferentially supported taxon is a true positive in the culture or qPCR results, and of the groups whose preferentially supported taxon is a false positive in the culture or qPCR results, are tabulated separately, and the threshold that best distinguishes the two is obtained.
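One way to pick a cut-off from the two labelled sets of frequencies is sketched below. The selection criterion used here, maximising the number of correctly called groups, is an assumption: the text only asks for a threshold that best distinguishes the two sets and does not name a criterion. The function name is illustrative.

```python
def best_threshold(true_positive_freqs, false_positive_freqs):
    """Pick the cut-off that best separates the two labelled sets.

    Groups with a secondary taxon frequency <= cut-off are called normal and
    groups above it abnormal. Maximising the number of correct calls is an
    assumption of this sketch.
    """
    values = sorted(set(true_positive_freqs) | set(false_positive_freqs))
    if not values:
        raise ValueError("no training frequencies supplied")
    # Candidate cut-offs: midpoints between neighbouring observed values,
    # plus the largest value (which flags nothing).
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates = midpoints + [values[-1]]

    def correct_calls(cutoff):
        kept = sum(f <= cutoff for f in true_positive_freqs)
        removed = sum(f > cutoff for f in false_positive_freqs)
        return kept + removed

    return max(candidates, key=correct_calls)
```

When the two sets overlap, other criteria (for example, weighting missed pathogens more heavily than false alarms) may be more appropriate than plain accuracy.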

6) Re-counting after removal of the abnormal taxa, which returns mis-aligned sequences to the true component results and effectively corrects the quantitative taxon abundance results:

After the list of abnormal taxa of the sample has been obtained through the analysis, all alignment results for the abnormal taxa are removed from the original alignment result file. Note that removal is performed per alignment result (alignment), not per sequencing read. Abundance statistics are then recomputed for all taxa to obtain the taxon component calculation result.
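The effect of removing alignment results rather than whole reads can be sketched as follows. The data layout and function name are assumptions; the preferred taxon is recomputed with the score-sum rule from the definitions.

```python
from collections import Counter, defaultdict

def recount_after_removal(alignments, abnormal):
    """Drop alignment results of abnormal taxa, then recount abundance.

    alignments maps read ID -> list of (taxon, score) alignment results.
    Removal works per alignment result, so a read whose best hit was an
    abnormal taxon can fall back to another taxon instead of being discarded.
    Returns a Counter of reads per preferentially supported taxon.
    """
    abundance = Counter()
    for read_id, hits in alignments.items():
        totals = defaultdict(float)
        for taxon, score in hits:
            if taxon not in abnormal:
                totals[taxon] += score
        if totals:  # reads left with no alignment results are not counted
            abundance[max(totals, key=totals.get)] += 1
    return abundance
```

In the test case for this sketch, a read that originally preferred the abnormal taxon Klebsiella aerogenes but also aligned to Klebsiella pneumoniae is reassigned to Klebsiella pneumoniae, while a read that aligned only to the abnormal taxon drops out of the count.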

The invention is illustrated below with reference to specific examples.
