Metagenomics-based pathogenic microorganism detection method and device

Document No.: 1923599  Publication date: 2021-12-03

Abstract: This technique, a metagenomics-based pathogenic microorganism detection method and device, was devised by 盖伟, 李瑞琳, and 关尚京 on 2021-11-03. The invention discloses a method and a device for detecting pathogenic microorganisms based on metagenomics, comprising: acquiring metagenome sequencing data of a sample to be detected; preprocessing the metagenome sequencing data to obtain target data; screening the target data to obtain a target sequence; performing cluster analysis on the target sequence to obtain candidate species categories of the sample to be detected; comparing the target data with a non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected; comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected; and determining the target species classification information, the drug resistance gene and the toxic element information as the detection result of the sample to be detected. The invention broadens the range of applicability of pathogenic microorganism detection and improves its accuracy.

1. A pathogenic microorganism detection method based on metagenomics is characterized by comprising the following steps:

acquiring metagenome sequencing data of a sample to be detected;

preprocessing the metagenome sequencing data to obtain target data, wherein the target data is the metagenome sequencing data meeting target quality conditions;

screening the target data to obtain a target sequence;

performing cluster analysis on the target sequence to obtain a candidate species category of the sample to be detected;

comparing the target data with a non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected;

comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected;

and determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected.

2. The method of claim 1, wherein the pre-processing the metagenomic sequencing data to obtain target data comprises:

filtering the metagenome sequencing data to obtain a high-quality sequence;

removing the host sequencing sequence in the high-quality sequence, removing the redundant sequence and obtaining the removed sequence;

and comparing the removed sequence with a reference sequence to obtain target data.

3. The method of claim 2, further comprising:

and if the length of the removed sequence is smaller than a preset length threshold value, splicing the removed sequence to obtain a spliced sequence.

4. The method of claim 1, wherein said screening said target data to obtain target sequences comprises:

determining the length of an open reading frame, and identifying open reading frames of that length in the target data to obtain initial sequences;

filtering out, from the initial sequences, any initial sequence that has a stop codon in the middle of the sequence, and any overlapping initial sequences for which the difference between the translation start coordinates is not a multiple of three, to obtain a filtered sequence;

and removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

5. The method of claim 1, wherein the performing cluster analysis on the target sequence to obtain candidate species classes of the test sample comprises:

acquiring absolute position information of the reading frame of each target sequence;

splicing the target sequences based on the absolute position information, and combining the spliced target sequences into a gene vector matrix;

generating a gene characteristic self-learning solver according to the gene vector matrix, and obtaining an optimal solution of the learning rate;

and performing gene prediction according to the optimal learning rate solution to obtain the candidate species category of the sample to be detected.

6. A pathogenic microorganism detection apparatus based on metagenomics, comprising:

the acquisition unit is used for acquiring the metagenome sequencing data of the sample to be detected;

the preprocessing unit is used for preprocessing the metagenome sequencing data to obtain target data, and the target data is the metagenome sequencing data meeting target quality conditions;

the screening unit is used for screening the target data to obtain a target sequence;

the analysis unit is used for carrying out clustering analysis on the target sequence to obtain the candidate species category of the sample to be detected;

the calculation unit is used for comparing the target data with a non-redundant reference gene set and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected;

the comparison unit is used for comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected;

and the determining unit is used for determining the target species classification information, the drug resistance genes and the toxic element information as the detection result of the sample to be detected.

7. The apparatus of claim 6, wherein the pre-processing unit comprises:

the first filtering subunit is used for filtering the metagenome sequencing data to obtain a high-quality sequence;

a first removal subunit, configured to remove a host sequencing sequence from the high-quality sequence, remove a redundant sequence, and obtain a removed sequence;

and the comparison subunit is used for comparing the removed sequence with a reference sequence to obtain target data.

8. The apparatus of claim 7, further comprising:

and the splicing subunit is used for splicing the removed sequence to obtain a spliced sequence if the length of the removed sequence is smaller than a preset length threshold.

9. The apparatus of claim 6, wherein the screening unit comprises:

the identification subunit is used for determining the length of the open reading frame and identifying open reading frames of that length in the target data to obtain initial sequences;

a second filter subunit, configured to filter out, from the initial sequences, any initial sequence that has a stop codon in the middle of the sequence, and any overlapping initial sequences for which the difference between the translation start coordinates is not a multiple of three, so as to obtain a filtered sequence;

and the second removal subunit is used for removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

10. The apparatus of claim 6, wherein the analysis unit comprises:

the acquisition subunit is used for acquiring absolute position information of the reading frame of each target sequence;

the sequence splicing subunit is used for splicing the target sequences based on the absolute position information and combining the spliced target sequences into a gene vector matrix;

the generating subunit is used for generating a gene characteristic self-learning solver according to the gene vector matrix and obtaining an optimal solution of the learning rate;

and the prediction subunit is used for performing gene prediction according to the learning rate optimal solution to obtain the candidate species category of the sample to be detected.

Technical Field

The invention relates to the technical field of biology, in particular to a method and a device for detecting pathogenic microorganisms based on metagenomics.

Background

Microorganisms are widely present in nature, mostly unicellular organisms. Microorganisms typically include viruses, bacteria, fungi, protozoa, and certain algae, among others. In the aspect of medical application, the rapid detection of pathogenic microorganisms in clinical samples has important clinical significance for diagnosis, treatment and prognosis of infectious diseases.

Microbiology research has developed rapidly in the past decades, and the application of metagenomic next-generation sequencing (mNGS) to the detection of pathogenic microorganisms has played a major role. Next-generation sequencing supports the diagnosis of diseases and the tracing of infectious diseases. Nanopore sequencing currently has several disadvantages: more sequencing errors, lower throughput, and a higher average sequencing cost per base, all of which limit its application. Compared with nanopore sequencing, mNGS has clear advantages for pathogen detection.

However, existing methods for detecting pathogenic microorganisms by metagenome sequencing suffer from long detection times, low accuracy, a narrow range of applicability, and an inability to detect unknown infectious pathogens.

Disclosure of Invention

In order to solve the problems, the invention provides a method and a device for detecting pathogenic microorganisms based on metagenomics, which improve the detection applicability range and the detection accuracy of pathogenic microorganisms.

In order to achieve the purpose, the invention provides the following technical scheme:

a method for detecting pathogenic microorganisms based on metagenomics, comprising:

acquiring metagenome sequencing data of a sample to be detected;

preprocessing the metagenome sequencing data to obtain target data, wherein the target data is the metagenome sequencing data meeting target quality conditions;

screening the target data to obtain a target sequence;

performing cluster analysis on the target sequence to obtain a candidate species category of the sample to be detected;

comparing the target data with a non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected;

comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected;

and determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected.

Optionally, the preprocessing the metagenomic sequencing data to obtain target data includes:

filtering the metagenome sequencing data to obtain a high-quality sequence;

removing the host sequencing sequence in the high-quality sequence, removing the redundant sequence and obtaining the removed sequence;

and comparing the removed sequence with a reference sequence to obtain target data.

Optionally, the method further comprises:

and if the length of the removed sequence is smaller than a preset length threshold value, splicing the removed sequence to obtain a spliced sequence.

Optionally, the screening the target data to obtain a target sequence includes:

determining the length of an open reading frame, and identifying open reading frames of that length in the target data to obtain initial sequences;

filtering out, from the initial sequences, any initial sequence that has a stop codon in the middle of the sequence, and any overlapping initial sequences for which the difference between the translation start coordinates is not a multiple of three, to obtain a filtered sequence;

and removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

Optionally, the performing cluster analysis on the target sequence to obtain a candidate species category of the sample to be tested includes:

acquiring absolute position information of the reading frame of each target sequence;

splicing the target sequences based on the absolute position information, and combining the spliced target sequences into a gene vector matrix;

generating a gene characteristic self-learning solver according to the gene vector matrix, and obtaining an optimal solution of the learning rate;

and performing gene prediction according to the optimal learning rate solution to obtain the candidate species category of the sample to be detected.

A pathogenic microorganism detection apparatus based on metagenomics, comprising:

the acquisition unit is used for acquiring the metagenome sequencing data of the sample to be detected;

the preprocessing unit is used for preprocessing the metagenome sequencing data to obtain target data, and the target data is the metagenome sequencing data meeting target quality conditions;

the screening unit is used for screening the target data to obtain a target sequence;

the analysis unit is used for carrying out clustering analysis on the target sequence to obtain the candidate species category of the sample to be detected;

the calculation unit is used for comparing the target data with a non-redundant reference gene set and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected;

the comparison unit is used for comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected;

and the determining unit is used for determining the target species classification information, the drug resistance genes and the toxic element information as the detection result of the sample to be detected.

Optionally, the pre-processing unit comprises:

the first filtering subunit is used for filtering the metagenome sequencing data to obtain a high-quality sequence;

a first removal subunit, configured to remove a host sequencing sequence from the high-quality sequence, remove a redundant sequence, and obtain a removed sequence;

and the comparison subunit is used for comparing the removed sequence with a reference sequence to obtain target data.

Optionally, the apparatus further comprises:

and the splicing subunit is used for splicing the removed sequence to obtain a spliced sequence if the length of the removed sequence is smaller than a preset length threshold.

Optionally, the screening unit comprises:

the identification subunit is used for determining the length of the open reading frame and identifying open reading frames of that length in the target data to obtain initial sequences;

a second filter subunit, configured to filter out, from the initial sequences, any initial sequence that has a stop codon in the middle of the sequence, and any overlapping initial sequences for which the difference between the translation start coordinates is not a multiple of three, so as to obtain a filtered sequence;

and the second removal subunit is used for removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

Optionally, the analysis unit comprises:

the acquisition subunit is used for acquiring absolute position information of the reading frame of each target sequence;

the sequence splicing subunit is used for splicing the target sequences based on the absolute position information and combining the spliced target sequences into a gene vector matrix;

the generating subunit is used for generating a gene characteristic self-learning solver according to the gene vector matrix and obtaining an optimal solution of the learning rate;

and the prediction subunit is used for performing gene prediction according to the learning rate optimal solution to obtain the candidate species category of the sample to be detected.

Compared with the prior art, the invention provides a method and a device for detecting pathogenic microorganisms based on metagenomics, which comprises the following steps: acquiring metagenome sequencing data of a sample to be detected; preprocessing the metagenome sequencing data to obtain target data, wherein the target data is the metagenome sequencing data meeting target quality conditions; screening target data to obtain a target sequence; performing cluster analysis on the target sequence to obtain candidate species categories of the sample to be detected; comparing the target data with the non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected; comparing the target data with information in a pathogenic microorganism detectable database to obtain drug resistance genes and toxic element information of the sample to be detected; and determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected. The invention improves the detection applicability range and the detection accuracy of pathogenic microorganisms.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for detecting pathogenic microorganisms based on metagenomics according to an embodiment of the present invention;

FIG. 2 is a flow chart of a pathogenic microorganism self-learning detection system provided by an embodiment of the invention;

FIG. 3 is a schematic structural diagram of a pathogenic microorganism detection device based on metagenomics according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements but may include steps or elements not listed.

The embodiment of the invention provides a metagenomics-based pathogenic microorganism detection method, which belongs to the field of pathogenic microorganism screening and detection and mainly comprises the steps of obtaining raw sequencing data, quality control, host removal, gene annotation, annotation against a non-redundant sequence set, and output of a detection result. The method has a wide application range, screens a comprehensive set of species, and detects accurately; it can accurately identify the microbial composition and the pathogenic genes in a sample.

For the purpose of facilitating the description of the present invention, the pertinent terms will now be explained.

Raw sequencing data (Raw reads): the data taken directly from the sequencer, i.e., the unprocessed output of high-throughput sequencing.

Sequencing sequence (read): a piece of sequence information, consisting of bases, obtained by a sequencing technology.

Open Reading Frames (ORFs): a stretch of sequence that, in a given reading frame, does not contain a stop codon; it is a part of an organism's genome that may be a protein coding sequence.

Referring to fig. 1, a schematic flow chart of a method for detecting pathogenic microorganisms based on metagenomics according to an embodiment of the present invention may include the following steps:

S101, obtaining metagenome sequencing data of a sample to be detected.

The sample to be detected is a sample whose pathogenic microorganisms are unknown and are to be detected. The corresponding metagenome sequencing data is raw sequencing data (Raw reads), that is, sequencing data that has not yet undergone quality screening or similar processing.

S102, preprocessing the metagenome sequencing data to obtain target data.

In order to ensure the accuracy and the processing efficiency of subsequent data processing, in the embodiment of the present application, the raw sequencing data is preprocessed to obtain metagenome sequencing data meeting a target quality condition. The target quality condition is determined by the actual application scenario and may specify, for example, which sequences are filtered out and which sequences count as high-quality sequences.

In an implementation manner of the embodiment of the present invention, the preprocessing the metagenome sequencing data to obtain target data includes: filtering the metagenome sequencing data to obtain a high-quality sequence; removing the host sequencing sequence in the high-quality sequence, removing the redundant sequence and obtaining the removed sequence; and comparing the removed sequence with a reference sequence to obtain target data.

Specifically, the metagenome sequencing data (i.e., the raw sequencing data) of the sample to be detected is filtered to screen out the high-quality sequences, and the high-quality sequences are compared with a reference genome so that host genome sequences can be removed. The high-quality sequences are then compared with the reference genome in two parts: reference genome knowledge base construction and high-quality read alignment. Reference genome library construction: for the redundant reference genomes of pathogenic microorganisms, redundant sequences are removed. High-quality read alignment and analysis: the processed Clean Reads are compared with the metagenome reference sequence to obtain the aligned sequences.

It should be noted that the preprocessing further includes: if the length of the removed sequence is smaller than a preset length threshold value, splicing the removed sequence to obtain a spliced sequence. That is, for ease of processing, short sequences can be spliced into longer sequences (i.e., scaffolds), which makes the processing suitable for short reads.
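The preprocessing steps above (quality filtering, host removal, redundancy removal) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the embodiment: the function names, the Phred+33 quality encoding, the thresholds, and in particular the use of shared k-mers as a simple stand-in for aligning reads to a host reference genome.

```python
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, str]  # (base sequence, Phred+33 quality string); hypothetical layout


def mean_phred(quality: str) -> float:
    """Mean Phred score of a Phred+33 encoded quality string."""
    if not quality:
        return 0.0
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def preprocess(reads: Iterable[Read], host_kmers: Set[str], k: int = 8,
               min_quality: float = 20.0,
               max_host_fraction: float = 0.5) -> List[str]:
    """Return clean reads: high quality, not host-like, and de-duplicated."""
    seen: Set[str] = set()
    clean: List[str] = []
    for seq, qual in reads:
        # Step 1: quality filter (assumed criterion: mean Phred score).
        if mean_phred(qual) < min_quality:
            continue
        # Step 2: host removal (assumed criterion: fraction of k-mers shared
        # with the host reference; a real pipeline would use an aligner).
        read_kmers = kmers(seq, k)
        if read_kmers:
            shared = len(read_kmers & host_kmers) / len(read_kmers)
            if shared > max_host_fraction:
                continue
        # Step 3: redundancy removal (exact duplicates only).
        if seq in seen:
            continue
        seen.add(seq)
        clean.append(seq)
    return clean
```

A real workflow would replace the k-mer test with alignment against the host genome and would typically also trim adapters; the sketch only shows the order of the three filters.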

S103, screening the target data to obtain a target sequence.

After the target data, i.e., the Clean Reads, is obtained, Open Reading Frame (ORF) sets from a plurality of weak learners are acquired and the ORFs in the target data are identified, so that species annotation and functional annotation information can be obtained later; that is, the parts of an organism's genome that may be protein coding sequences are obtained. It should be noted that, in the embodiment of the present invention, the extraction length of the open reading frame may be determined according to actual requirements, that is, a sequence of any length meeting the actual requirements may be extracted. After extraction, the corresponding pseudogenes need to be filtered out, and the sequence containing the stop codon is removed according to the translated amino acids, so as to finally obtain the target sequence.

S104, performing cluster analysis on the target sequence to obtain the candidate species category of the sample to be detected.

After the target sequence is obtained, the genes are expanded based on the absolute position information of the target sequence, namely the target sequence meeting the training is spliced, then the spliced sequence is converted into a corresponding gene vector matrix, and self-learning solving is carried out to obtain the predicted new genes, namely the possible gene species types.

S105, comparing the target data with a non-redundant reference gene set, and calculating the abundance of each gene in each sample to obtain the target species classification information of the sample to be detected;

S106, comparing the target data with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected;

S107, determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected.

After the candidate species category is determined, the species classification needs to be further determined; that is, the species classification information of the sample to be detected can be determined based on the abundance calculation. The species classifications obtained by detecting pathogenic microorganisms in the embodiment of the present invention may include: bacteria, viruses, fungi, parasites, bifidobacteria, mycoplasma, chlamydia, rickettsia, archaea and the novel coronavirus (COVID-19). The target data is then compared with information in a pathogenic microorganism detectable database to obtain the drug resistance gene and toxic element information of the sample to be detected, and the target species classification information, the drug resistance gene and the toxic element information are output as the final detection result of the sample to be detected; for example, a detection report can be generated from this information.

It should be noted that, in the embodiment of the present invention, the process of obtaining the target species classification information, the drug resistance gene, and the toxic element information is a self-learning process, which may mainly adopt unsupervised learning or other self-learning modes. The data processing system learns the species classification information, the drug resistance genes, the resistance genes, and the virulence factors to build a corresponding original pathogenic microorganism knowledge base, and compares the obtained sample data with the data in that knowledge base to obtain the final detection result. The specific implementation process will be described in detail in the following examples of the present invention.

The invention provides a pathogenic microorganism detection method based on metagenomics, which comprises the following steps: acquiring metagenome sequencing data of a sample to be detected; preprocessing the metagenome sequencing data to obtain target data, wherein the target data is the metagenome sequencing data meeting target quality conditions; screening target data to obtain a target sequence; performing cluster analysis on the target sequence to obtain candidate species categories of the sample to be detected; comparing the target data with the non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected; comparing the target data with information in a pathogenic microorganism detectable database to obtain drug resistance genes and toxic element information of the sample to be detected; and determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected. The invention improves the detection applicability range and the detection accuracy of pathogenic microorganisms.

In an implementation manner of the embodiment of the present invention, the screening the target data to obtain a target sequence includes:

determining the length of an open reading frame, and identifying open reading frames of that length in the target data to obtain initial sequences;

filtering out, from the initial sequences, any initial sequence that has a stop codon in the middle of the sequence, and any overlapping initial sequences for which the difference between the translation start coordinates is not a multiple of three, to obtain a filtered sequence;

and removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

The length of the open reading frame is determined according to the actual detection requirements and the properties of the sample. Specifically, Open Reading Frame (ORF) sets from a plurality of weak learners are obtained, and the ORFs in the target data are identified. An ORF that has a stop codon in the middle of its sequence is not consistent with a true gene and is filtered out directly. In addition, the difference between the translation start coordinates of two overlapping ORFs must be a multiple of 3; ORFs that do not meet this requirement are judged to be pseudogenes and are filtered out. Finally, the stop codon is cut off according to the translated amino acids: in a real protein reference sequence the stop codon is not translated into an amino acid and does not appear, so the stop codons in the ORF set are cut off for later verification.
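The screening rules described above can be sketched as follows. This is a hedged illustration rather than the embodiment's implementation: `screen_orfs`, its half-open `(start, end)` coordinates, and the choice to drop the later-starting ORF of an out-of-frame overlapping pair are all assumptions, and the standard genetic code is used for translation.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def screen_orfs(sequence: str, candidates: List[Tuple[int, int]],
                min_length: int) -> List[Tuple[int, int, str]]:
    """Filter candidate ORFs given as half-open (start, end) intervals.

    Returns (start, end, protein) with the terminal stop codon cut off.
    """
    kept: List[Tuple[int, int, str]] = []
    for start, end in sorted(candidates):
        if end - start < min_length:
            continue
        protein = translate(sequence[start:end])
        # Cut off a terminal stop codon; it is not part of the protein.
        body = protein[:-1] if protein.endswith("*") else protein
        # A stop codon in the middle means the ORF is not a true gene.
        if "*" in body:
            continue
        # Overlapping ORFs must start a multiple of three apart; the
        # later-starting one is treated as a pseudogene (assumption).
        if any(start < k_end and (start - k_start) % 3 != 0
               for k_start, k_end, _ in kept):
            continue
        kept.append((start, end, body))
    return kept
```

For the sequence `ATGAAACCCGGGTTTTAAATGTAGCCCTAA`, the candidates `(0, 18)` and `(3, 18)` are kept, `(4, 16)` is dropped for being out of frame with an overlapping ORF, and `(18, 30)` is dropped because it contains an internal stop codon.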

In another embodiment of the present invention, the performing cluster analysis on the target sequence to obtain candidate species categories of the sample to be tested includes:

acquiring absolute position information of the reading frame of each target sequence;

splicing the target sequences based on the absolute position information, and combining the spliced target sequences into a gene vector matrix;

generating a gene characteristic self-learning solver according to the gene vector matrix, and obtaining an optimal solution of the learning rate;

and performing gene prediction according to the optimal learning rate solution to obtain the candidate species category of the sample to be detected.

In this embodiment, the output coordinates of the ORF sets need to be unified in order to expand the genes. The output coordinates are the coordinate parameters given by the start position and the end position of each ORF. Each ORF is compared with its corresponding DNA scaffold to find the absolute position of the reading frame. With the position of ORF1 defined as (x1, y1) and the position of ORF2 as (x2, y2), where x denotes the start position and y the end position of each ORF, three cases are handled: (1) if y2 <= y1 and (x2 - x1) % 3 == 0, ORF1 is retained; (2) if x2 <= y1 and y1 <= x2, and (x2 - x1) % 3 == 0, ORF1 and ORF2 are spliced to form a new ORF3 at (x1, y2); (3) if y1 <= x2, ORF1 and ORF2 are both retained.
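A minimal sketch of the three-case rule follows. Two points are assumptions rather than statements of the embodiment: case (2) is read here as a partial overlap (x2 <= y1 <= y2), since that is the situation in which splicing to (x1, y2) extends ORF1, and an overlapping pair that is out of frame is simply left unchanged. The function name `merge_orfs` is hypothetical.

```python
from typing import List, Tuple

Orf = Tuple[int, int]  # (start x, end y) in absolute scaffold coordinates


def merge_orfs(orf1: Orf, orf2: Orf) -> List[Orf]:
    """Apply the three-case rule to a pair of ORFs on the same scaffold."""
    # Order the pair so that ORF1 is the one that starts first.
    (x1, y1), (x2, y2) = sorted([orf1, orf2])
    in_frame = (x2 - x1) % 3 == 0

    # Case 1: ORF2 ends inside ORF1 and shares its frame, so keep ORF1.
    if y2 <= y1 and in_frame:
        return [(x1, y1)]
    # Case 2 (assumed reading): ORF2 starts inside ORF1, extends past it,
    # and shares its frame, so splice into a new ORF3 = (x1, y2).
    if x2 <= y1 <= y2 and in_frame:
        return [(x1, y2)]
    # Case 3: the ORFs do not overlap, so keep both. Out-of-frame overlaps
    # also fall through to here unchanged (assumption).
    return [(x1, y1), (x2, y2)]


print(merge_orfs((0, 30), (6, 21)))    # contained and in frame
print(merge_orfs((0, 30), (15, 60)))   # overlapping and in frame
print(merge_orfs((0, 30), (40, 70)))   # disjoint
```

Applying the function repeatedly over a start-sorted list of ORFs would extend it from pairs to a whole scaffold.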

The distribution of the ORFs across the learning methods is converted into a support row vector $g_i$ for each ORF, and all gene row vectors $\{g_1, g_2, \ldots, g_N\}$ are combined into a gene vector matrix $G$, where $i$ is a natural number between 1 and $N$. Whether each ORF is a true gene is used as the clustering label (1 if it is, 0 if it is not), which gives a label vector $h$. A gene feature self-learning solver $Gx = h$ is then generated, subject to the constraint that the learning rates $x$ sum to 1. From the solver, the optimal solution of the learning rate is $x = \max\{n/N\}$, where $n$ is the number of all correct ORFs and $N$ is the total number of genes. The optimal learning rate $x$ is used as the input of the gene prediction model to predict new genes, which yields the candidate species category of the sample to be detected.
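The description does not spell out how the solver is implemented. As one possible reading, offered only as an illustration, each column of the gene vector matrix can be treated as the calls of one learner, each learner can be scored by the fraction of ORFs it labels correctly (correct ORFs over all genes), and the scores can be rescaled so that they sum to 1. The function names and the 0/1 matrix layout below are assumptions, and this is not a general solver for $Gx = h$.

```python
def learner_scores(G, h):
    """Score each learner by the fraction of ORFs on which it agrees with the label.

    G: list of support row vectors, one per ORF; G[i][j] is 1 if learner j
       called ORF i a gene, else 0 (assumed layout).
    h: label vector; h[i] is 1 if ORF i is a true gene, else 0.
    """
    n_orfs = len(G)
    if n_orfs == 0 or n_orfs != len(h):
        raise ValueError("G and h must be non-empty and of equal length")
    n_learners = len(G[0])
    return [
        sum(1 for i in range(n_orfs) if G[i][j] == h[i]) / n_orfs
        for j in range(n_learners)
    ]


def normalized_weights(scores):
    """Rescale the scores so that they sum to 1 (the stated constraint on x)."""
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one learner must have a non-zero score")
    return [s / total for s in scores]


def best_ratio(scores):
    """Return the largest agreement ratio among the learners."""
    return max(scores)
```

With four ORFs and three learners, a learner that agrees with every label scores 1.0 and receives the largest weight after normalization.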

In the embodiments of the present invention, species are quantified by the relative abundance of genes. The target data are aligned to a non-redundant reference gene set, and the abundance of each gene in each sample is calculated. For any sample $S$, the relative abundance is calculated as follows. First, the copy number of each species is calculated: $C_i = S_i / L_i$. Then the relative abundance of species $i$ is calculated: $a_i = C_i / \sum_j C_j = (S_i / L_i) / \sum_j (S_j / L_j)$. Here $a_i$ is the relative abundance of species $i$ in sample $S$; $L_i$ is the sequence length of species $i$; $S_i$ is the total number of reads of species $i$ detected in sample $S$; $C_i$ is the copy number of species $i$ in sample $S$; and $\sum$ denotes summation.
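The calculation can be illustrated with a short Python sketch. The function name, the dictionary inputs and the example numbers are hypothetical; only the two formulas (copy number as reads divided by length, then division by the sum of copy numbers) come from the description.

```python
def relative_abundance(read_counts, lengths):
    """Compute a_i = (S_i / L_i) / sum_j (S_j / L_j) for every species.

    read_counts: mapping species -> number of detected reads (S_i)
    lengths:     mapping species -> sequence length (L_i)
    """
    # Copy number C_i = S_i / L_i for each species.
    copies = {sp: read_counts[sp] / lengths[sp] for sp in read_counts}
    total = sum(copies.values())
    if total == 0:
        # No reads at all: report zero abundance instead of dividing by zero.
        return {sp: 0.0 for sp in copies}
    return {sp: c / total for sp, c in copies.items()}


# Invented example: species A has 300 reads over 3000 bp, species B has 100 reads over 500 bp.
example = relative_abundance({"A": 300, "B": 100}, {"A": 3000, "B": 500})
```

In the example, species B has fewer reads but a higher copy number (0.2 versus 0.1), so it ends up with about two thirds of the relative abundance; this is the effect of normalizing by sequence length.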

In the embodiments of the present invention, drug resistance genes and toxic elements are screened by comparing the target data with the pathogenic microorganism detectable database. It should be noted that the self-learning detection process of the embodiments of the present invention can detect pathogens on a large scale: the knowledge base covers more than ten major groups of species, such as the viruses and bacteria that commonly infect humans, and also includes nucleic acid data of the novel coronavirus (COVID-19). The system can accurately detect the pathogenic microorganisms infecting a patient, help clinicians identify them quickly, and promote accurate mNGS-based detection of pathogenic microorganisms.

The following describes embodiments of the present invention in a specific application scenario.

The input is 75 bp paired-end read data obtained by mNGS sequencing of a standard mNGS sequencing sample (sample number: S1; sample type: swab). S1 is used below to test the pathogenic microorganism detection system of the present invention. The flow chart of the pathogenic microorganism self-learning detection system is shown in FIG. 2.

The invention adopts the following methods: unsupervised learning (GeneMarkS-2), hidden Markov learning (FragGeneScan), a scoring strategy (MetaGeneAnnotator), dynamic programming (Prodigal), neural network learning (Orphelia) and an interpolated Markov model (Glimmer3). The functions that the detection system can implement include, but are not limited to: data quality control and statistics, host removal and statistics, knowledge base comparison, species classification, statistics of comparison results, species and gene abundance statistics, database annotation and the like.

In the embodiment of the invention, nucleic acid sequence databases, drug resistance genes, virulence factors and the like can be downloaded from NCBI, GISAID and other sources to establish an original pathogenic microorganism knowledge base, and a non-redundant sequence set is built with the Gcluster algorithm. When assembly is required, assembly and genome prediction proceed as follows. Data filtering is performed in the paired-end mode of Trimmomatic (version 0.36; parameter settings: SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:90 MAXINFO:80:0.5). The metaSPAdes software is run with the parameters --meta --only-assembler, with default values for the remaining parameters. The assembly produces results for the k-mer sizes K21, K33 and K55, of which K55 is the standard output, and the resulting scaffolds are used as the input of the prediction software.
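As a sketch of how the filtering and assembly settings above might be turned into command lines, the following Python helpers build argument lists in the colon-separated step syntax of Trimmomatic's paired-end mode and with the `--meta` and `--only-assembler` flags of SPAdes' `spades.py`. The jar path, file names and output naming scheme are hypothetical placeholders, and the commands are only constructed here, not executed.

```python
def trimmomatic_pe_command(jar_path, reads_1, reads_2, out_prefix):
    """Build a Trimmomatic paired-end command with the stated trimming steps.

    The four output files follow Trimmomatic's PE order:
    forward paired, forward unpaired, reverse paired, reverse unpaired.
    """
    return [
        "java", "-jar", jar_path, "PE",
        reads_1, reads_2,
        f"{out_prefix}_1P.fq.gz", f"{out_prefix}_1U.fq.gz",
        f"{out_prefix}_2P.fq.gz", f"{out_prefix}_2U.fq.gz",
        "SLIDINGWINDOW:4:15", "LEADING:3", "TRAILING:3",
        "MINLEN:90", "MAXINFO:80:0.5",
    ]


def metaspades_command(reads_1, reads_2, out_dir):
    """Build a metaSPAdes command that skips read error correction."""
    return [
        "spades.py", "--meta", "--only-assembler",
        "-1", reads_1, "-2", reads_2,
        "-o", out_dir,
    ]


# Hypothetical file names for sample S1.
trim_cmd = trimmomatic_pe_command(
    "trimmomatic-0.36.jar", "S1_R1.fq.gz", "S1_R2.fq.gz", "S1_clean"
)
assembly_cmd = metaspades_command("S1_clean_1P.fq.gz", "S1_clean_2P.fq.gz", "S1_assembly")
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are installed; keeping the commands as lists avoids shell quoting problems.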

Table 1 shows the read statistics before and after filtering: the total number of reads before filtering is 22,665,207 and the number of clean reads after filtering is 22,609,981; according to the per-position base quality of the raw and filtered reads, the Q30 ratio is 96.015%. The qualitative results of the species-level distribution of pathogenic microorganisms are shown in Table 2; Staphylococcus epidermidis is detected at the highest abundance, and the detection abundances of the other species are listed in the table. Table 3 lists the abundances of the specific pathogenic microorganism species detected in sample S1. The gene and pathway analyses are shown in Tables 4 to 6. The drug resistance gene and resistance gene results are shown in Tables 7 and 8. Annotation against the CARD database provides information such as antibiotic resistance genes and their mechanisms of action; the total number of reads aligned to resistance genes is 262, as shown in Table 8. The virulence element screening results are shown in Table 9.

TABLE 1 statistics before and after Reads filtration of sample S1

TABLE 2 qualitative results of detection of pathogenic microorganism of sample S1

TABLE 3 List of specific species detection abundances of detected pathogenic microorganisms of sample S1

TABLE 4 abundance List of Gene families for sample S1

A gene family is a group of evolutionarily related protein-coding sequences, usually with similar functions. Gene family abundances are stratified at the community level to show the contributions of known and unknown species. Gene family abundance is reported in RPK (reads per kilobase) units to normalize for gene length; RPK units represent the number of copies of a gene or transcript in the community. RPK values can be further normalized to adjust for differences in sequencing depth between samples. UNMAPPED indicates the number of reads that could not be aligned after the nucleic acid and protein searches. UniRef90_unknown represents reads that align to the ChocoPhlAn database but have no annotation. Note: only the first 5 gene families are listed in this table.
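The RPK unit and one common depth normalization can be sketched in Python as follows. The function names and the example counts are invented for illustration, and rescaling to copies per million is only one assumed choice of depth normalization; the description does not say which one is used.

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count divided by gene length in kilobases."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return read_count / (gene_length_bp / 1000)


def copies_per_million(rpk_values):
    """Rescale RPK values so that they sum to one million within a sample.

    This removes the effect of sequencing depth when samples are compared
    (assumed normalization, chosen for illustration).
    """
    total = sum(rpk_values.values())
    if total == 0:
        return {name: 0.0 for name in rpk_values}
    return {name: value / total * 1_000_000 for name, value in rpk_values.items()}


# Invented counts: (reads, gene length in bp) for two hypothetical gene families.
sample = {"family_A": (50, 2000), "family_B": (30, 1500)}
rpk_table = {name: rpk(reads, length) for name, (reads, length) in sample.items()}
cpm_table = copies_per_million(rpk_table)
```

Here `family_A` has an RPK of 25.0 and `family_B` an RPK of 20.0; after rescaling, the two values keep the same 5:4 ratio but sum to one million.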

TABLE 5 Pathway abundance results for sample S1

The abundance of a pathway represents how abundant the pathway is in the community, and it is reported at both the community level and the species level. The pathways are sorted by abundance, the species contributions are also sorted by abundance, and pathways whose abundances are all 0 are not output. Note: only the first 5 pathways are listed in this table.

TABLE 6 Pathway coverage results for sample S1

Pathway coverage reports whether each pathway is present (1) or absent (0) in the community, rather than its relative abundance. Only pathways with non-zero abundance are output, the community-level values are more reliable than the species-level values, and the pathways are listed in the same order as in the pathway abundance results.

TABLE 7 results of drug resistance genes for sample S1

TABLE 8 resistance Gene results for sample S1

Note: only the first 5 resistance genes are listed in this table.

TABLE 9 sample S1 virulence element screening results

Note: the table lists only the annotation results for the first 5 virulence genes.

The pathogenic microorganism self-learning detection system provided by the invention offers a method for rapidly detecting pathogenic microorganisms based on mNGS data, and can perform genome assembly, resistance gene annotation and the like for unknown microorganisms. In terms of detection range, the system can accurately and rapidly detect a variety of pathogenic microorganisms, including bacteria, viruses, fungi, parasites, mycobacteria, mycoplasma, chlamydia, rickettsia, archaea, protozoa and COVID-19, which greatly improves the efficiency of clinical diagnosis. In terms of accuracy, the self-learning analysis provided by the invention screens out the optimal solution of the learning rate by generating the gene feature self-learning solver and uses it as the input of the gene prediction model, which effectively improves the accuracy of gene prediction. For unknown microorganisms, the invention can assemble their genomes and annotate drug resistance genes, virulence factors and the like, providing a reliable basis for exploring the pathogenicity of these microorganisms.

The embodiment of the present invention further provides a pathogenic microorganism detection apparatus based on metagenomics, referring to fig. 3, including:

the acquisition unit 10 is used for acquiring metagenome sequencing data of a sample to be detected;

a preprocessing unit 20, configured to preprocess the metagenomic sequencing data to obtain target data, where the target data is metagenomic sequencing data meeting a target quality condition;

a screening unit 30, configured to screen the target data to obtain a target sequence;

the analysis unit 40 is configured to perform cluster analysis on the target sequence to obtain a candidate species category of the sample to be detected;

the calculating unit 50 is used for comparing the target data with a non-redundant reference gene set and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected;

a comparison unit 60, configured to compare the target data with information in a detectable pathogenic microorganism database, so as to obtain information of a drug resistance gene and a toxic element of the sample to be tested;

a determining unit 70, configured to determine the target species classification information, the drug resistance gene, and the toxic element information as a detection result of the sample to be detected.
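The way the seven units hand data to one another can be sketched as a small Python class. This is only a structural illustration: the class name, the callable fields and the dictionary keys are hypothetical, and each unit is passed in as a plain function so that real implementations can be substituted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class DetectionDevice:
    """Wires the units together; unit numbers follow the description."""
    acquire: Callable[[str], Any]     # acquisition unit 10
    preprocess: Callable[[Any], Any]  # preprocessing unit 20
    screen: Callable[[Any], Any]      # screening unit 30
    analyze: Callable[[Any], Any]     # analysis unit 40
    calculate: Callable[[Any], Any]   # calculating unit 50
    compare: Callable[[Any], Any]     # comparison unit 60

    def run(self, sample_id: str) -> Dict[str, Any]:
        """Play the role of determining unit 70 and assemble the detection result."""
        raw_data = self.acquire(sample_id)
        target_data = self.preprocess(raw_data)
        target_sequences = self.screen(target_data)
        candidate_species = self.analyze(target_sequences)
        # Units 50 and 60 both work on the target data, not on the screened sequences.
        species_classification = self.calculate(target_data)
        resistance_and_toxic = self.compare(target_data)
        return {
            "species_classification": species_classification,
            "resistance_and_toxic_elements": resistance_and_toxic,
            # Kept for inspection only; not part of the stated detection result.
            "candidate_species": candidate_species,
        }
```

Passing the units in as functions keeps the data flow visible in one place and makes it easy to test the wiring with stub implementations.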

Further, the preprocessing unit includes:

the first filtering subunit is used for filtering the metagenome sequencing data to obtain a high-quality sequence;

a first removal subunit, configured to remove a host sequencing sequence from the high-quality sequence, remove a redundant sequence, and obtain a removed sequence;

and the comparison subunit is used for comparing the removed sequence with a reference sequence to obtain target data.

Optionally, the preprocessing unit further includes:

and the splicing subunit is used for splicing the removed sequence to obtain a spliced sequence if the length of the removed sequence is smaller than a preset length threshold.

Optionally, the screening unit includes:

an identification subunit, configured to determine the length of the open reading frame and to identify the target data using open reading frames of that length to obtain an initial sequence;

a second filter subunit, configured to filter out initial sequences that have a stop codon in the middle of the sequence, and initial sequences for which the difference between the translation start coordinates of two overlapping initial sequences is not a multiple of three, to obtain a filtered sequence;

and the second removal subunit is used for removing the sequence containing the stop codon in the filtered sequence according to the translated amino acid to obtain the target sequence.

Further, the analysis unit includes:

the acquisition subunit is used for acquiring absolute position information of the reading frame of each target sequence;

the sequence splicing subunit is used for splicing the target sequences based on the absolute position information and combining the spliced target sequences into a gene vector matrix;

the generating subunit is used for generating a gene characteristic self-learning solver according to the gene vector matrix and obtaining an optimal solution of the learning rate;

and the prediction subunit is used for performing gene prediction according to the learning rate optimal solution to obtain the candidate species category of the sample to be detected.

The embodiment of the invention provides a pathogenic microorganism detection device based on metagenomics, which comprises: acquiring metagenome sequencing data of a sample to be detected; preprocessing the metagenome sequencing data to obtain target data, wherein the target data is the metagenome sequencing data meeting target quality conditions; screening target data to obtain a target sequence; performing cluster analysis on the target sequence to obtain candidate species categories of the sample to be detected; comparing the target data with the non-redundant reference gene set, and calculating the abundance of each gene in a single sample to obtain the target species classification information of the sample to be detected; comparing the target data with information in a pathogenic microorganism detectable database to obtain drug resistance genes and toxic element information of the sample to be detected; and determining the classification information of the target species, the drug resistance gene and the toxic element information as the detection result of the sample to be detected. The invention improves the detection applicability range and the detection accuracy of pathogenic microorganisms.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described briefly, and the relevant details can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
