Sample bacterial species detection methods and systems

Document No.: 1818158. Publication date: 2021-11-09.

Summary: This technology, Sample bacterial species detection methods and systems (样品细菌物种检测方法和系统), was created by 周哲敏 and 董少华 on 2021-08-06. The application discloses a method for detecting bacterial species in a sample. The method comprises: sequencing the sample bacteria and assembling the nucleic acid sequences obtained by sequencing, as input sequences, to form an assembly result; extracting the widely conserved bacterial genes from the assembly result; comparing the widely conserved bacterial genes with a bacterial conserved gene sequence data set to obtain a list of the most similar species for each conserved gene, wherein the bacterial conserved gene sequence data set is extracted in advance from a large number of bacterial genome sequences by applying a conserved gene identification method; and integrating the most-similar-species lists of the widely conserved bacterial genes. The method can effectively identify mixtures of multiple bacteria in a sample and, because it relies on stable conserved genes, ensures the specificity of the result. The application also discloses a corresponding system.

1. A method for detecting bacterial species in a sample, characterized in that the method comprises:

sequencing the sample bacteria, and assembling the nucleic acid sequences obtained by sequencing, as input sequences, to form an assembly result;

extracting widely conserved bacterial genes from the assembly result;

comparing the widely conserved bacterial genes with a bacterial conserved gene sequence data set to obtain, for each widely conserved gene, a list of the most similar species; wherein the bacterial conserved gene sequence data set is extracted in advance from a large number of bacterial genome sequences by applying a conserved gene identification method; and

integrating the most-similar-species lists of the widely conserved bacterial genes, so that alignments derived from the same species across the alignment results of different widely conserved genes are combined to calculate the genetic similarity of the sample to every bacterial species included in the bacterial conserved gene sequence data set, wherein a bacterial species with higher similarity is more likely to be present in the sample.

2. The method for detecting bacterial species according to claim 1, characterized in that: the input nucleic acid sequences are subjected to processing operations such as adapter removal, removal of low-quality regions, and merging of the overlapping regions of paired-end sequencing reads, so as to improve the overall reliability of the sequencing result.

3. The method for detecting bacterial species according to claim 1, characterized in that: the widely conserved genes are calculated based on a hidden Markov model.

4. The method for detecting bacterial species according to claim 1, characterized in that: the widely conserved genes are further screened based on alignment score, similarity, or aligned region length.

5. The method for detecting bacterial species according to claim 1, characterized in that: the bacterial conserved gene sequence data set records the species information of the source bacteria and their genetic similarity to the most closely related type strains.

6. The method for detecting bacterial species according to claim 1, characterized in that: the alignment search is performed via traditional local sequence alignment methods, or via efficient alignment methods based on short sequences.

7. The method for detecting bacterial species according to claim 1, characterized in that: the method further comprises extracting the plurality of mixed bacterial species most likely to be present in the sample bacteria using one or more methods including a greedy algorithm, a maximum likelihood method, a gradient descent method, Bayesian analysis, and the like.

8. The method for detecting bacterial species according to claim 7, characterized in that: in the greedy algorithm, the results are screened in multiple rounds; after each round yields the best alignment result, all widely conserved genes containing that alignment result are discarded, and screening is performed again to obtain the next-best alignment result; the screening step is repeated until a potential bacterial species source has been found for every predicted widely conserved gene; or, in the maximum likelihood method, a plurality of hypotheses are set, assuming respectively that one, two, or more bacterial species are present in the sample; the possible combinations are then enumerated, the widely conserved gene alignment results expected under each combination of bacterial species are predicted, the likelihood of each combination is calculated by comparing the predicted results with the actual alignment results, and the most likely bacterial species composition is selected.

9. A sample bacterial species detection system, characterized in that the system includes: one or more processors; and a memory storing one or more programs, wherein the memory and the one or more programs are configured to, with the one or more processors, cause the system at least to perform the sample bacterial species detection method of any one of the preceding claims.

10. A sample bacterial species detection system, characterized in that the system comprises:

a nucleic acid sequence assembly module for sequencing the sample bacteria and assembling the nucleic acid sequences obtained by sequencing, as input sequences, to form an assembly result;

a conserved gene identification module for extracting widely conserved bacterial genes from the assembly result;

a conserved gene kindred sequence search module for comparing the widely conserved bacterial genes with a bacterial conserved gene sequence data set to obtain, for each widely conserved gene, a list of the most similar species; wherein the bacterial conserved gene sequence data set is extracted in advance from a large number of bacterial genome sequences by applying a conserved gene identification method; and

a kindred sequence integration analysis module for integrating the most-similar-species lists of the widely conserved bacterial genes, so that alignments derived from the same species across the alignment results of different widely conserved genes are combined to calculate the genetic similarity of the sample to every bacterial species included in the bacterial conserved gene sequence data set, wherein a bacterial species with higher similarity is more likely to be present in the sample.

Technical Field

The present invention relates to the field of pathogen detection, and in particular to a method and system for detecting pathogenic bacteria based on genome sequencing.

Background

Mixtures of multiple bacteria are common in environmental and clinical samples, and pathogens are often genetically close to related non-pathogens and difficult to distinguish from them. Existing pathogen detection systems leave considerable room for improvement in accuracy and feasibility; specific shortcomings include:

(1) In detection methods based on 16S ribosomal RNA sequence diversity, because the 16S ribosomal RNA sequence is highly conserved, microorganisms can only be identified to the genus or species level and cannot be distinguished more finely.

(2) In detection methods based on non-conserved genes in the genome, because non-conserved gene regions are prone to horizontal gene transfer or random fragment loss, detection can produce false positive and false negative results.

(3) In detection methods based on conserved genes, because the same gene has diverse sequences in a multi-bacteria mixed sample, applying sequence alignment directly either yields no result or yields an analysis result for only a single species.

(4) Some methods assume that only a single species is present in the sample, which leads to other species being missed and to false negative results.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a system and a method, based on metagenomic sequences, for accurately distinguishing mixed pathogenic and non-pathogenic microorganisms in clinical and environmental samples, thereby addressing the low resolution and accuracy of existing detection methods.

To achieve the above object, some embodiments of the present application provide a method for detecting bacterial species in a sample. The method includes: sequencing the sample bacteria and assembling the nucleic acid sequences obtained by sequencing, as input sequences, to form an assembly result; extracting widely conserved bacterial genes from the assembly result; comparing the widely conserved bacterial genes with a bacterial conserved gene sequence data set to obtain, for each widely conserved gene, a list of the most similar species, wherein the bacterial conserved gene sequence data set is extracted in advance from a large number of bacterial genome sequences by applying a conserved gene identification method; and integrating the most-similar-species lists of the widely conserved bacterial genes, so that alignments derived from the same species across the alignment results of different widely conserved genes are combined to calculate the genetic similarity of the sample to every bacterial species included in the bacterial conserved gene sequence data set, wherein a bacterial species with higher similarity is more likely to be present in the sample.

Some embodiments of the present application provide a sample bacterial species detection system, the system comprising: a nucleic acid sequence assembly module for assembling the nucleic acid sequences obtained by sequencing, as input sequences, to form an assembly result; a conserved gene identification module for extracting widely conserved bacterial genes from the assembly result; a conserved gene kindred sequence search module for comparing the widely conserved bacterial genes with a bacterial conserved gene sequence data set to obtain, for each widely conserved gene, a list of the most similar species; and a kindred sequence integration analysis module for integrating the most-similar-species lists of the widely conserved bacterial genes, so that alignments derived from the same species across the alignment results of different widely conserved genes are combined to calculate the genetic similarity of the sample to every bacterial species included in the bacterial conserved gene sequence data set, wherein a bacterial species with higher similarity is more likely to be present in the sample.

Some embodiments of the present application provide a sample bacterial species detection system, the system comprising: one or more processors; and a memory storing one or more programs, wherein the memory and the one or more programs are configured to, with the one or more processors, cause the system at least to perform any of the sample bacterial species detection methods described above.

The bacterial species detection system provided by the invention can effectively identify mixtures of multiple bacteria in a sample and, because it relies on stable conserved genes, ensures the specificity of the result. It thereby reduces both false negative and false positive results in bacterial detection and improves the accuracy of pathogenic bacteria detection.

Drawings

FIG. 1 is a flow chart of a sample bacterial species detection method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a hardware environment of a sample bacterial species detection method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a sample bacterial species detection system according to an embodiment of the present application.

Detailed Description

The following detailed description of embodiments of the present application refers to the accompanying drawings.

It will be readily understood that the components of certain exemplary embodiments, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of some example embodiments of systems, methods, apparatuses, and computer program products related to sample bacterial species detection is not intended to limit the scope of some embodiments, but is representative of selected example embodiments.

The features, structures, or characteristics of the example embodiments described throughout the specification may be combined in any suitable manner in one or more example embodiments. For example, throughout the specification, use of the phrases "certain embodiments," "some embodiments," or other similar language refers to the fact that: a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment. Thus, appearances of the phrases "in certain embodiments," "in some embodiments," "in other embodiments," or other similar language throughout this specification are not necessarily all referring to the same group of embodiments, and the described features, structures, or characteristics may be combined in any suitable manner in one or more example embodiments. In addition, the phrase "a group" refers to a group that includes one or more of the referenced group members. Thus, the phrases "a group," "one or more," and "at least one," or equivalent terms, may be used interchangeably. In addition, "or" is intended to mean "and/or" unless explicitly stated otherwise.

In addition, the different functions or operations discussed below may be performed in a different order and/or concurrently with each other, if desired. Furthermore, if desired, one or more of the described functions or operations may be optional or may be combined. As such, the following description should be considered as merely illustrative of the principles and teachings of certain exemplary embodiments, and not in limitation thereof.

FIG. 1 is a flow chart of a sample bacterial species detection method provided by an embodiment of the present invention, which may be executed by a processor in the form of a computer program. The method specifically comprises the following operation steps:

Nucleic acid sequence assembly step S1. In this step, the input nucleic acid sequences are assembled using methods such as those based on the de Bruijn graph model (including the colored de Bruijn graph model). The input nucleic acid sequences are the sequencing results generated, for example, by sequencing the sample bacterial genomes on a second-generation sequencing-by-synthesis platform or a third-generation single-molecule sequencing platform. Optionally, the input nucleic acid sequences may first be subjected to processing operations such as adapter removal, removal of low-quality regions, and merging of the overlapping regions of paired-end sequencing reads, which improves the overall reliability of the sequencing result and thereby the accuracy of the assembly.
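As a rough illustration of the de Bruijn graph idea named in step S1, the following Python sketch builds a graph from reads and walks its non-branching paths into contigs. It is a toy under stated assumptions, not the assembler of the application: the function names and the value of k are hypothetical, and sequencing errors, reverse complements, and coverage are ignored.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def assemble_contigs(graph):
    """Walk maximal non-branching paths; each path becomes one contig."""
    indeg = defaultdict(int)
    for src, dsts in graph.items():
        for dst in dsts:
            indeg[dst] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # A simple node has exactly one incoming and one outgoing edge.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at branch points or path ends
        for nxt in sorted(graph.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs

# Hypothetical overlapping reads (illustrative data only).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
contigs = assemble_contigs(build_de_bruijn(reads, k=4))
```

With these three overlapping reads and k=4 the graph is a single unbranched path, so one contig spanning all reads is produced. Isolated cycles with no branch point are skipped by this sketch.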

Conserved gene identification step S2. Most genes in a bacterial genome may be lost under different conditions, but a small fraction of genes are present in almost all bacterial species and are essential for bacterial survival; these are called widely conserved bacterial genes, or conserved genes. Widely conserved genes computed in advance from a large number of bacterial genes can be stored, and the conserved nucleic acid sequence of each widely conserved gene is modeled with a hidden Markov model. The sample assembly result obtained in step S1 is compared against the hidden Markov models of the bacterial conserved genes to predict the nucleic acid sequences of all potential conserved genes in the sample bacteria. Optionally, step S2 may further screen the conserved genes by alignment score, similarity, or aligned region length according to additional screening mechanisms, thereby reducing errors in conserved gene prediction. These screening conditions are obtained in advance from a large number of example tests during construction of the conserved gene data set B5.
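The optional screening in step S2 can be pictured as a simple filter over candidate hits. The sketch below is only an illustration: the `GeneHit` record, its field names, and the threshold values are hypothetical placeholders, since the application states only that the screening conditions come from prior tests. Applying all three criteria together is one possible choice; an embodiment could use any subset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneHit:
    gene: str          # name of the conserved-gene model that matched
    contig: str        # contig in the assembly result where the match was found
    score: float       # alignment score reported by the model search
    identity: float    # fraction of identical positions, 0..1
    aligned_len: int   # length of the aligned region

def screen_hits(hits, min_score=50.0, min_identity=0.3, min_len=100):
    """Keep hits that pass all three thresholds (placeholder values)."""
    return [h for h in hits
            if h.score >= min_score
            and h.identity >= min_identity
            and h.aligned_len >= min_len]

# Hypothetical candidate hits (illustrative data only).
hits = [
    GeneHit("g1", "c1", 120.0, 0.90, 300),  # passes every threshold
    GeneHit("g2", "c2", 20.0, 0.95, 400),   # score too low
    GeneHit("g3", "c3", 200.0, 0.20, 500),  # identity too low
    GeneHit("g4", "c4", 90.0, 0.80, 50),    # aligned region too short
]
kept = screen_hits(hits)
```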

Conserved gene kindred sequence search step S3. As described above, because widely conserved bacterial genes are present in almost all bacterial species, all potential bacterial species components of a sample can be found by comparing each conserved gene identified in the sample bacteria with the bacterial conserved gene sequence data set B5 stored in advance in the bacterial species detection system A. The bacterial conserved gene sequence data set B5 is extracted in advance from a large number of bacterial genome sequences using the conserved gene identification of step S2. Optionally, the data set B5 records the species information of the source bacteria and their genetic similarity to the closest type strain. Optionally, the alignment search can be performed via traditional local sequence alignment methods, or via efficient alignment methods based on short sequences (k-mers).
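To make the k-mer based search option concrete, the following sketch ranks reference sequences by the Jaccard similarity of their k-mer sets and returns the most similar species for one query gene. Jaccard similarity is a stand-in chosen for brevity, not a method named in the application; the function names and the default k are hypothetical, and the limit of 1000 entries mirrors the figure reported in the experimental example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def nearest_species(query, references, k=4, top_n=1000):
    """Rank (species, sequence) references by k-mer Jaccard similarity to query."""
    q = kmers(query, k)
    scored = []
    for species, seq in references:
        r = kmers(seq, k)
        union = q | r
        sim = len(q & r) / len(union) if union else 0.0
        scored.append((species, sim))
    # Highest similarity first; ties broken by species name for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]

# Hypothetical reference entries (illustrative data only).
refs = [("sp_B", "TTTTTTTTTT"), ("sp_A", "ACGTACGTAC")]
ranking = nearest_species("ACGTACGTAC", refs)
```

A production search would index the reference k-mers once instead of recomputing them per query.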

Kindred sequence integration analysis step S4. Each conserved gene alignment result obtained contains a large number of highly similar bacterial reference sequences, which may derive from several different but closely related bacterial species and include many false negative and false positive results, so further screening is required. To this end, the kindred sequence integration analysis step combines the alignments derived from the same species across the alignment results of different conserved genes to calculate the genetic similarity of the sample to every bacterial species included in the bacterial conserved gene sequence data set B5. The higher the similarity, the more likely the bacterial species is to be present in the sample. Because the sample and the reference database are both incomplete, some conserved genes may be missing from the alignment results; accordingly, in different embodiments the screening index used in step S4 may be the total alignment score, the average amino acid identity, or the total length of the aligned regions.
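One way to picture the integration in step S4 is a per-species aggregation over all conserved genes. The sketch below assumes a hypothetical input layout (a dict from gene to a list of `(species, score, identity, aligned_len)` tuples) and computes the three screening indices mentioned above. Keeping only the best hit per species and gene is an assumption of this sketch.

```python
from collections import defaultdict

def integrate_hits(per_gene_hits):
    """Aggregate hits per species across genes, keeping the best hit per gene."""
    best = defaultdict(dict)  # species -> gene -> (score, identity, aligned_len)
    for gene, hits in per_gene_hits.items():
        for species, score, identity, length in hits:
            prev = best[species].get(gene)
            if prev is None or score > prev[0]:
                best[species][gene] = (score, identity, length)
    summary = {}
    for species, genes in best.items():
        rows = list(genes.values())
        summary[species] = {
            "genes": len(rows),
            "total_score": sum(r[0] for r in rows),
            "mean_identity": sum(r[1] for r in rows) / len(rows),
            "total_length": sum(r[2] for r in rows),
        }
    return summary

def rank_species(summary, key="total_score"):
    """Order species from most to least similar under the chosen index."""
    return sorted(summary, key=lambda s: (-summary[s][key], s))

# Hypothetical alignment results for two genes (illustrative data only).
per_gene = {
    "gene1": [("A", 100.0, 0.99, 300), ("B", 80.0, 0.90, 300)],
    "gene2": [("A", 90.0, 0.97, 200)],
}
summary = integrate_hits(per_gene)
order = rank_species(summary)
```

Switching `key` to `"mean_identity"` or `"total_length"` corresponds to the alternative screening indices.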

To obtain all bacterial species present from the alignment results, the kindred sequence integration analysis step S4 may further screen the integrated alignment results (step S5). In various examples, the available methods include one or more of a greedy algorithm, a maximum likelihood method, a gradient descent method, Bayesian analysis, and the like. For example, with the greedy algorithm, step S4 screens the results in multiple rounds: after each round yields the best alignment result, all conserved genes containing that alignment result are discarded, and screening is performed again to obtain the next-best alignment result. This screening step is repeated until a potential bacterial species source has been found for every conserved gene predicted in step S2. As another example, with the maximum likelihood method, step S4 sets multiple hypotheses, assuming respectively that one, two, or more bacterial species are present in the sample, then enumerates the possible combinations, predicts the conserved gene alignment results expected under each combination of bacterial species, compares the predicted results with the actual alignment results to calculate the likelihood of each combination, and selects the most likely bacterial species composition. Further, the bacterial species combinations obtained under the different assumed numbers of species are compared with one another, and the most likely number of bacterial species and the corresponding species combination are selected, optionally by applying the Akaike information criterion or the Bayesian information criterion.
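The greedy screening described above can be sketched as follows. The input layout (a dict from gene to a dict of species scores) and the use of the summed score as the measure of the "best" result are assumptions of this sketch; the application leaves the exact index open.

```python
def greedy_species(per_gene_hits):
    """Pick the best-supported species, drop the genes it explains, and repeat."""
    # Genes with no hits cannot be explained by any species, so skip them.
    remaining = {g: dict(h) for g, h in per_gene_hits.items() if h}
    chosen = []
    while remaining:
        totals = {}
        for hits in remaining.values():
            for species, score in hits.items():
                totals[species] = totals.get(species, 0.0) + score
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(totals), key=totals.get)
        chosen.append(best)
        # Discard every gene whose hit list contains the chosen species.
        remaining = {g: h for g, h in remaining.items() if best not in h}
    return chosen

# Hypothetical per-gene species scores (illustrative data only).
hits = {
    "g1": {"A": 100, "B": 60},
    "g2": {"A": 90},
    "g3": {"C": 70, "B": 65},
    "g4": {"C": 80},
}
selected = greedy_species(hits)
```

In this toy input, species A explains g1 and g2 in the first round and species C explains the remaining genes in the second, so B is never selected. Each round removes at least one gene, so the loop terminates.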

The sample bacterial species detection methods provided by embodiments of the present application may be based on a sample bacterial species detection system that may include one or more hardware platforms including a display module. In some embodiments the sample bacterial species detection system may be a general purpose computer, or a sequencing device with computational processing capabilities. The sequencing device may be a second-generation sequencing device based on sequencing by synthesis, or a third-generation sequencing device based on single-molecule sequencing. As shown in FIG. 2, the bacterial species detection system A includes an internal communication bus A1, a hard disk A2, a processor A3, a random access memory A4, an input/output component A5, a communication port A6, and a user interface A7. The internal communication bus A1 allows data to be communicated between the various components, and the hard disk A2 contains one or more program modules for bacterial species detection. The program in A2 is executed by the processor A3, which holds intermediate calculation results in the random access memory A4 and stores the final result in A2. In some examples, the bacterial species detection system A may receive and transmit information and data over a network through the communication port A6. The bacterial species detection system A may interact with the user via the user interface A7 or the communication port A6. In some examples, the various components of the bacterial species detection system A may be located in different hardware devices or geographic locations and interconnected via the internet, a corporate intranet, or a combination thereof.

In some embodiments, the nucleic acid sequences to be assembled may be introduced into the bacterial species detection system A directly through the input/output component A5 after being generated by a sequencing platform, or from a network via the communication port A6. A conserved gene database may be maintained in the bacterial species detection system A.

In some example embodiments, the functions of any of the methods, processes, signaling diagrams, algorithms, or flow diagrams described herein may be implemented by software and/or computer program code or portions of code stored in memory or other computer-readable or tangible media, and executed by a processor.

In some example embodiments, an apparatus may be included or associated with at least one software application, module, unit or entity configured as arithmetic operations, or as programs or portions thereof (including added or updated software routines), executed by at least one operating processor. Programs, also referred to as program products or computer programs, including software routines, applets and macros, may be stored in any device-readable data storage medium and may include program instructions for performing particular tasks.

A sequence is a unit of a data structure that may include strings, lists, tuples, and the like.

For example, as shown in FIG. 3, the nucleic acid sequence assembly step can be embodied as the nucleic acid sequence assembly module B1, the conserved gene identification step as the conserved gene identification module B2, the conserved gene kindred sequence search step as the conserved gene kindred sequence search module B3, and the kindred sequence integration analysis step as the kindred sequence integration analysis module B4.

As an experimental example of the present application, a workbench X and a pool Y in the clean area of a pharmaceutical factory were wiped with cotton swabs. After the samples were obtained, the total microorganisms in each sample were sequenced, yielding sequencing results of 351 MB and 326 MB, respectively.

The sequencing results of both samples were input to the bacterial species detection system A via the communication port A6, and its processor A3 ran the sample bacterial species detection method of the present application. After the nucleic acid sequence assembly module B1 assembled the sequencing results, sample X yielded an assembly result of 29,095,846 bases and sample Y an assembly result of 12,862,918 bases.

The conserved gene identification module B2 identified 217 widely conserved bacterial genes in sample X and 122 widely conserved bacterial genes in sample Y.

The conserved gene kindred sequence search module B3 reported up to 1000 of the most similar microbial species for each conserved gene.

The kindred sequence integration analysis module B4 predicted the bacterial composition of sample X using the greedy algorithm. The microbial species predicted to be present in the sample were "Stenotrophomonas maltophilia", "Delftia acidovorans", "Brevundimonas diminuta", "Comamonas testosteroni", "Brucella antrophila" and "Achromobacter pulmonis".

The kindred sequence integration analysis module B4 predicted the bacterial composition of sample Y using the maximum likelihood algorithm. Three microbial species were predicted to be present in the sample: "Bacillus cereus", "Pseudomonas stutzeri" and "Kocuria palustris".
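The maximum likelihood selection with an information criterion, as described for step S4, can be illustrated with a deliberately simple model. Everything in this sketch is an assumption rather than the model used in the application or in this experiment: a gene counts as explained when any species of the hypothesized set appears in its most-similar-species list, an explained gene is observed with probability 1 - eps and an unexplained one with probability eps, and hypotheses are compared with the Akaike information criterion, AIC = 2k - 2 ln L, where k is the number of species.

```python
import math
from itertools import combinations

def log_likelihood(species_set, gene_hits, eps=0.01):
    """Log-likelihood of the observed hit lists under a hypothesized species set."""
    ll = 0.0
    for hit_species in gene_hits.values():
        explained = not species_set.isdisjoint(hit_species)
        ll += math.log(1.0 - eps) if explained else math.log(eps)
    return ll

def best_composition(gene_hits, max_species=3, eps=0.01):
    """Enumerate species combinations and return the (AIC, combination) minimum."""
    candidates = sorted(set().union(*gene_hits.values()))
    best = None
    for k in range(1, min(max_species, len(candidates)) + 1):
        for combo in combinations(candidates, k):
            ll = log_likelihood(set(combo), gene_hits, eps)
            aic = 2 * k - 2 * ll  # penalize each additional species
            if best is None or aic < best[0]:
                best = (aic, combo)
    return best

# Hypothetical most-similar-species lists per gene (illustrative data only).
gene_hits = {
    "g1": {"A", "B"},
    "g2": {"A"},
    "g3": {"C"},
    "g4": {"C", "B"},
}
aic, composition = best_composition(gene_hits)
```

Here the pair (A, C) explains all four genes, so it beats every single species and also beats the three-species set, which fits equally well but pays a larger penalty. Replacing the penalty `2 * k` with `k * math.log(len(gene_hits))` would give the Bayesian information criterion instead. Exhaustive enumeration grows combinatorially, so it is only practical for small candidate lists.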

A computer program product may comprise one or more computer-executable components configured to perform some example embodiments when the program is run. The one or more computer-executable components may be at least one software code or code portion. Changes and configurations to implement the functions of the example embodiments may be performed as routines, which may be implemented as added or updated software routines. In an example, a software routine may be downloaded into the device.

By way of example, the software or computer program code or portions of code may be in source code form, object code form, or in some intermediate form, and may be stored on some type of carrier, distribution medium, or computer-readable medium, which may be any entity or device capable of carrying the program. Such a carrier may comprise, for example, a record medium, computer memory, read-only memory, an optical and/or electrical carrier signal, a telecommunication signal and/or a software distribution package. Depending on the required processing power, the computer program may be executed in a single electronic digital computer or may be distributed over a plurality of computers. The computer-readable medium or computer-readable storage medium may be a non-transitory medium.

In other example embodiments, the functions may be performed by a router, for example, using an Application Specific Integrated Circuit (ASIC), a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or any other hardware and software combination. In yet another example embodiment, the functionality may be implemented as a signal, such as a non-tangible means that may be carried by electromagnetic signals downloaded from the Internet or other networks.

According to example embodiments, an apparatus such as a node, device or response means may be configured as a circuit, a computer or a microprocessor (such as a single chip computer element) or a chipset, which may comprise at least a memory for providing storage capacity for arithmetic operations and/or an operation processor for performing arithmetic operations.

The example embodiments described herein are equally applicable to both singular and plural implementations, regardless of whether the language used to describe certain embodiments is in the singular or plural. For example, embodiments describing the operation of a single computing device are equally applicable to embodiments that include multiple instances of the computing device, and vice versa.

One of ordinary skill in the art will readily appreciate that the example embodiments as described above may be implemented with operations in a different order and/or with hardware elements in configurations different from those disclosed. Thus, while some embodiments have been described based upon these example embodiments, it would be apparent to those of ordinary skill in the art that certain modifications, variations, and alternative constructions would be apparent, while remaining within the spirit and scope of the example embodiments.
