Method for identifying Acinetobacter calcoaceticus-Acinetobacter baumannii complex species based on assembly-free WGS data

Document No.: 1639703 Publication date: 2019-12-20

Abstract: This technology, a method for identifying Acinetobacter calcoaceticus-Acinetobacter baumannii complex species based on assembly-free WGS data, was designed on 2019-09-20 by 靳远, 岳俊杰, 周江林, 任洪广, 梁龙, 黄志松, 周静, 胡明达, 彭小川, 王玉洁 and 张琪. The invention discloses a method for identifying species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex based on assembly-free WGS data, and provides a method for species identification of a bacterium to be identified based on such data. The basic principle is to build a complete database of species-specific genome fingerprint features, break the WGS sequencing reads of the strain to be identified directly into fragment sequences, and compare and score these fragments against each species in the fingerprint feature database, thereby identifying the species within the complex. The method does not require assembly of the sequencing reads and is therefore simple and fast, while still using the information of the whole genome. In addition, because the invention constructs a bacterial fingerprint feature database covering 2279 species, it can be used not only to identify species within the Acinetobacter calcoaceticus-Acinetobacter baumannii complex but also to identify species of other complexes or other bacterial species.

1. A method for species identification of a bacterium to be identified based on assembly-free WGS data, comprising the following steps:

(A) acquiring the genome data of all sequenced bacteria from a bacterial genome database, and establishing a link between the genome information and the taxonomic information of each species;

(B) constructing a species fingerprint feature database according to the following steps:

(b1) for the whole-genome nucleic acid sequence of each species obtained in step (A), segmenting the sequence into fragments of length k with a step size of 1, and retaining only one copy of any repeated fragment; the resulting fragment set is called set A, one set A is obtained for each species, and the fragments that make up set A are called fragments A;

(b2) converting the letter representation of the four bases into the numbers 00, 01, 10 and 11, respectively, for storage, so that each fragment A is converted into a 2k-bit number, i.e. a number between 0 and 2^(2k)-1;

(b3) traversing all numbers between 0 and 2^(2k)-1, recording all fragments A in the sets A of all species together, recording for each fragment A its 2k-bit number and the corresponding species information, selecting the 2k-bit numbers of those fragments A that are recorded for only one species, and storing them by species to obtain the species fingerprint feature database;

(C) cutting each sequencing read obtained by whole-genome sequencing of the strain to be identified according to the method of step (b1) to obtain set A of the strain to be identified; converting all fragments A in set A of the strain to be identified into 2k-bit numbers according to step (b2); then comparing them with the species fingerprint feature database obtained in step (b3), calculating a Score from the comparison result, and taking the species with the highest Score as the species to which the strain to be identified belongs;

wherein the larger the absolute number of elements in the intersection between the 2k-bit numbers representing fragments A of the strain to be identified and the 2k-bit numbers representing fragments A of a given species in the species fingerprint feature database, the higher the Score of that species for the strain to be identified;

and the higher the proportion of that intersection among the 2k-bit numbers representing fragments A of that species in the species fingerprint feature database, the higher the Score of that species for the strain to be identified.

2. The method of claim 1, wherein: in step (C), the Score is calculated according to the following formula:

Score=α*normalization(N)+(1-α)*normalization(P)

N=card(Sx∩Si)

M=card(Si)

P=N/M

wherein Sx is the set of 2k-bit numbers representing fragments A of the strain to be identified; Si is the set of 2k-bit numbers representing fragments A of a given species in the species fingerprint feature database; α is a weighting coefficient; card denotes the number of elements in a set; N is the number of elements in the intersection of Sx and Si; M is the number of elements in Si; P is the proportion of the intersection among the fingerprint fragments of that species;

N and P are normalized as follows:

normalization(N)=(N-Nmin)/(Nmax-Nmin)

normalization(P)=(P-Pmin)/(Pmax-Pmin)

wherein Nmin, Nmax, Pmin and Pmax are the minimum and maximum values of N and P obtained after calculation against all species in the species fingerprint feature database;

further, α is 0.48.

3. The method according to claim 1 or 2, characterized in that: step (A) is performed as follows: acquiring, from the Genome database of NCBI, the whole-genome sequence data of bacteria whose Status is Complete Genome; acquiring taxonomic metadata from the Taxonomy database of NCBI; and establishing the link between genome information and taxonomic information according to the TaxID.

4. A method according to any one of claims 1-3, characterized in that: in step (A), the organisms are bacteria.

5. A method according to any one of claims 1-3, characterized in that: the organism to be identified is a bacterium.

6. The method of claim 5, wherein: the bacterium is an Acinetobacter.

7. The method of claim 6, wherein: the Acinetobacter is an Acinetobacter belonging to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex.

8. A method of constructing a bacterial species fingerprint feature database, comprising steps (A) and (B) of the method of any one of claims 1-7.

9. A bacterial species fingerprint feature database constructed by the method of claim 8.

10. Use of the bacterial species fingerprint feature database of claim 9 for species identification of a bacterium to be identified based on assembly-free WGS data.

Technical Field

The invention relates to the field of biotechnology, in particular to a method for identifying species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex based on assembly-free WGS data.

Background

Acinetobacter is a genus of Gram-negative bacteria that currently contains 55 species. Pathogens of this genus are widely distributed in hospital environments, can survive for long periods, and are especially prevalent in ICU wards, where they readily infect critically ill patients and often cause bacteremia, pneumonia, meningitis, urinary tract infection, surgical site infection and the like. Among them, Acinetobacter baumannii is the most important and most prevalent pathogen of nosocomial infection. Acinetobacter species are genetically highly similar, and identification to the species level remains difficult. The hardest to distinguish is the Acinetobacter calcoaceticus-Acinetobacter baumannii (ACB) complex, which mainly comprises four species: Acinetobacter baumannii, Acinetobacter calcoaceticus, Acinetobacter pittii and Acinetobacter nosocomialis. These are very similar in phenotype and genetics and difficult to identify clinically, so results are often reported simply as the Acinetobacter calcoaceticus-Acinetobacter baumannii complex. Apart from Acinetobacter calcoaceticus, which is distributed in the environment, these species cause human infection; although Acinetobacter baumannii is considered the most prevalent and deadly member of the genus, Acinetobacter pittii and Acinetobacter nosocomialis also cause serious invasive disease.

Acinetobacter baumannii, Acinetobacter calcoaceticus, Acinetobacter pittii and Acinetobacter nosocomialis are similar in genetics and phenotype but differ considerably in epidemiology: Acinetobacter calcoaceticus is found mainly in environmental specimens, Acinetobacter pittii mainly on skin surfaces and in environmental specimens, and Acinetobacter nosocomialis mainly in clinical specimens, while Acinetobacter baumannii is one of the main pathogens of nosocomial infection. As the resistance of Acinetobacter to common antibiotics has risen in recent years, with multidrug-resistant and even pan-drug-resistant strains appearing, these bacteria have attracted growing attention from clinicians and microbiologists. Many studies have found that the members of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex differ greatly in their resistance characteristics, clinical manifestations and treatment. Because species identification has been inaccurate, much of the current drug resistance and epidemiological data on Acinetobacter baumannii is one-sided, which seriously hampers understanding of its current resistance situation and clinical distribution.

Clinically, species identification relies mainly on phenotypic differences. Because species within the genus Acinetobacter are extremely similar, traditional biochemical identification methods are of limited use for identifying Acinetobacter species, and the Acinetobacter calcoaceticus-Acinetobacter baumannii complex cannot be resolved by phenotype-based tests. At present, automated microbial analyzers such as VITEK 2 have become the most common tool for strain analysis in clinical hospitals because they are efficient and convenient, but they have shortcomings in the accurate identification of Acinetobacter: VITEK 2 can identify only a limited number of Acinetobacter species and has difficulty distinguishing Acinetobacter baumannii, Acinetobacter calcoaceticus, Acinetobacter pittii and Acinetobacter nosocomialis, whose biochemical characteristics are similar, so they are generally reported only as the Acinetobacter calcoaceticus-Acinetobacter baumannii complex.

In addition to biochemical methods, several molecular biology methods have been developed for identifying Acinetobacter. Researchers have developed PCR amplification of the gyrB gene, multiplex PCR amplifying the intergenic region of the 16S-23S rRNA genes, and the use of the gyrB and recA genes to identify and distinguish Acinetobacter species; with the rise of sequencing in recent years, sequencing-based methods targeting the 16S rRNA, rpoB, gyrB and recA genes have also been developed. All of these molecular methods use only partial gene sequences of the strain. Because the information they use is limited, they are deficient at identifying genetically similar species: although they can distinguish some Acinetobacter species, they cannot accurately and effectively distinguish the species within the Acinetobacter calcoaceticus-Acinetobacter baumannii complex.

In addition, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is currently one of the most widely used mass spectrometry techniques in clinical laboratories and is applied on a large scale in microbial detection and identification. Compared with biochemical phenotyping and molecular methods, MALDI-TOF MS is fast, reliable, simple and economical, but it still has shortcomings: its accuracy depends on the data analysis software and on how much information about each species the mass spectrum database contains. Many studies have pointed out that identifying species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex with the standard MALDI-TOF MS analysis procedure produces misidentifications, for example misidentifications involving Acinetobacter baumannii and Acinetobacter nosocomialis. This is mainly because domestic strain databases currently contain relatively few mass spectra, which affects accuracy.

The rapid development of whole genome sequencing (WGS) technology now makes it possible to identify microorganisms from whole-genome sequencing data. A method that identifies and analyzes microorganisms using the entire genomic information of the strain, including the non-coding regions, would have extremely high resolution. An identification method based on WGS data can accurately distinguish the species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex from one another and from other Acinetobacter species. Recent studies have shown that the species of the complex differ markedly in antibiotic susceptibility, pathogenicity and clinical presentation, so accurately identifying Acinetobacter baumannii and distinguishing it from the other species of the complex is important for the treatment, prognosis and surveillance of nosocomial infection.

Disclosure of Invention

The invention aims to provide a method for identifying species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex based on assembly-free WGS data.

In a first aspect, the invention claims a method for species identification of a bacterium to be identified based on assembly-free WGS data.

The claimed method for species identification of a bacterium to be identified based on assembly-free WGS data comprises the following steps:

(A) acquiring the genome data of all sequenced bacteria from a bacterial genome database, and establishing a link between the genome information and the taxonomic information of each species;

(B) constructing a species fingerprint feature database according to the following steps:

(b1) for the whole-genome nucleic acid sequence of each species obtained in step (A), segmenting the sequence into fragments of length k (i.e. k bp) with a step size of 1 (i.e. 1 bp), so that a genome of length L yields L-k+1 nucleic acid fragments, and retaining only one copy of any repeated fragment; the resulting fragment set is called set A, one set A is obtained for each species, and the fragments that make up set A are called fragments A; k and L are both positive integers greater than 1.

(b2) The letter representation of the four bases is converted into the numbers 00, 01, 10 and 11 for storage, so that each fragment A is converted into a 2k-bit number, i.e. a number between 0 and 2^(2k)-1;

stored this way, a fragment A occupies only k/4 bytes (2k bits), whereas with the original character storage (A, T, C, G for the 4 bases) a fragment A occupies k bytes, a compression ratio of 4; in addition, string comparison can be converted into numerical lookup in subsequent calculations, which is significantly faster.
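The 2-bit packing in step (b2) can be sketched as follows. This is a minimal illustration, not the inventors' program: the function names `encode_kmer` and `decode_kmer` are hypothetical, and the A/C/G/T to 00/01/10/11 assignment follows the mapping given in the detailed description.

```python
# Hypothetical helper names; the base-to-code mapping follows the
# detailed description (A: 00, C: 01, G: 10, T: 11).
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer between 0 and 2**(2k) - 1."""
    value = 0
    for base in kmer.upper():
        # Shift the previous bases left by two bits and append the new code.
        # A non-ACGT character raises KeyError (handling is left open here).
        value = (value << 2) | BASE_CODE[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Inverse of encode_kmer for a fragment of known length k."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


if __name__ == "__main__":
    code = encode_kmer("ACGT")        # bits 00 01 10 11
    print(code, decode_kmer(code, 4))
```

Because each fragment becomes a plain integer, membership tests reduce to integer hashing rather than string comparison.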

(b3) All numbers between 0 and 2^(2k)-1 are traversed, all fragments A in the sets A of all species are recorded together, the 2k-bit number of each fragment A and the corresponding species information are recorded, the 2k-bit numbers of those fragments A recorded for only one species are selected, and these are stored by species to obtain the species fingerprint feature database.

Further, step (b3) may be implemented by a computer program written in python or another language. All numbers between 0 and 2^(2k)-1 (i.e. all possible fragments A that segmentation can yield) are used as keys to build a dictionary; all fragments A (stored as numbers) in the sets A of all species are then recorded in the dictionary. When a number (i.e. a fragment A) is encountered, the TaxID of the corresponding species is recorded in the dictionary entry for that fragment A; if the same number is found in another species during the traversal, the key is deleted from the dictionary (i.e. the fragment A is removed). This continues until all species have been processed. Finally, the 2k-bit numbers of the fragments A recorded for only one species are selected and stored by species to obtain the species fingerprint feature database.
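A dictionary-based construction of this kind might look like the sketch below. It is an illustration under stated assumptions, not the inventors' code: `build_fingerprint_db` and the `SHARED` sentinel are hypothetical names, and the input is assumed to be a mapping from TaxID to the already encoded set A of that species. Instead of deleting a shared key outright, the sketch marks it with a sentinel, since a literal delete on the second sighting would let the same fragment be re-recorded if a third species also contains it.

```python
from collections import defaultdict

SHARED = -1  # hypothetical sentinel: fragment seen in more than one species


def build_fingerprint_db(species_kmers):
    """Keep, for each TaxID, only the encoded fragments unique to it.

    species_kmers: dict mapping TaxID (positive int) -> set of encoded
    fragments A (integers) of that species.
    """
    owner = {}  # encoded fragment -> TaxID, or SHARED
    for taxid, kmers in species_kmers.items():
        for kmer in kmers:
            previous = owner.get(kmer)
            if previous is None:
                owner[kmer] = taxid      # first species showing this fragment
            elif previous != taxid:
                owner[kmer] = SHARED     # also present in another species

    fingerprint_db = defaultdict(set)
    for kmer, taxid in owner.items():
        if taxid != SHARED:
            fingerprint_db[taxid].add(kmer)
    return dict(fingerprint_db)


if __name__ == "__main__":
    # Toy data with made-up TaxIDs 1, 2, 3; fragment 12 occurs in all three.
    toy = {1: {10, 11, 12}, 2: {12, 13}, 3: {12, 14}}
    print(build_fingerprint_db(toy))
```

A species with no unique fragment simply does not appear in the returned dictionary.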

(C) Each sequencing read obtained by whole-genome sequencing of the strain to be identified is cut according to the method of step (b1) to obtain set A of the strain to be identified; all fragments A in set A of the strain to be identified are converted into 2k-bit numbers according to step (b2); these are then compared with the species fingerprint feature database obtained in step (b3), a Score is calculated from the comparison result, and the species with the highest Score is taken as the species to which the strain to be identified belongs;

the larger the absolute number of elements in the intersection between the 2k-bit numbers representing fragments A of the strain to be identified and the 2k-bit numbers representing fragments A of a given species in the species fingerprint feature database, the higher the Score of that species for the strain to be identified;

the higher the proportion of that intersection among the 2k-bit numbers representing fragments A of that species in the species fingerprint feature database, the higher the Score of that species for the strain to be identified.

In practical applications, the Score combines two quantities by weighted summation: the absolute number of elements in the intersection between the 2k-bit numbers representing fragments A of the strain to be identified and those of a given species in the species fingerprint feature database, and the proportion of that intersection among the 2k-bit numbers representing fragments A of that species in the database.

Specifically, in step (C), the Score can be calculated according to the following formula:

Score=α*normalization(N)+(1-α)*normalization(P)

N=card(Sx∩Si)

M=card(Si)

P=N/M

wherein Sx is the set of 2k-bit numbers representing fragments A of the strain to be identified; Si is the set of 2k-bit numbers representing fragments A of a given species in the species fingerprint feature database; α is a weighting coefficient; card denotes the number of elements in a set; N is the number of elements in the intersection of Sx and Si; M is the number of elements in Si; P is the proportion of the intersection among the fingerprint fragments of that species.

According to our tests, the identification result is best when the parameter α is 0.48. N and P are normalized as follows:

normalization(N)=(N-Nmin)/(Nmax-Nmin)

normalization(P)=(P-Pmin)/(Pmax-Pmin)

wherein Nmin, Nmax, Pmin and Pmax are the minimum and maximum values of N and P obtained after calculation against all species in the species fingerprint feature database.

In the method, step (A) may be performed as follows: acquiring, from the Genome database of NCBI, the whole-genome sequence data of bacteria whose Status is Complete Genome; acquiring taxonomic metadata (Taxonomy) from the Taxonomy database of NCBI; and establishing the link between genome information and taxonomic information according to the TaxID.

In step (b1), k must be large enough for the fragments to be highly specific; taking storage space into account, a length of 16 is chosen because a fragment then fits conveniently in a 32-bit integer during programming. Therefore, in an embodiment of the present invention, k is specifically 16.
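As a quick check of this choice (plain arithmetic on the 2-bit encoding of step (b2), not an additional statement from the inventors):

```latex
2k = 2 \times 16 = 32 \text{ bits}, \qquad
4^{16} = 2^{32} = 4\,294\,967\,296 \text{ possible fragments}, \qquad
2^{2k} - 1 = 2^{32} - 1 = 4\,294\,967\,295
```

The largest fragment code therefore equals the largest value of an unsigned 32-bit integer, so every 16-base fragment fills exactly one such integer.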

In step (b1), the segmentation of the whole genome nucleic acid sequence may be specifically performed using a scripted program written in python or other language.

In step (A), the organisms may be bacteria.

In the method, the organism to be identified may be a bacterium. Further, the bacterium may be an Acinetobacter. Still further, the Acinetobacter may be an Acinetobacter belonging to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex.

In a second aspect, the invention claims a method of constructing a database of fingerprint characteristics of bacterial species.

The claimed method for constructing a bacterial species fingerprint feature database may comprise step (A) and step (B) of the method of the first aspect.

In a third aspect, the invention claims a strain fingerprint feature database constructed by the method of the second aspect.

In a fourth aspect, the invention claims the use of the species fingerprint feature database of the third aspect for species identification of a bacterium to be identified based on assembly-free WGS data.

The organism to be identified may be a bacterium. Further, the bacterium may be an Acinetobacter. Further, the Acinetobacter may be one belonging to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex, specifically Acinetobacter baumannii, Acinetobacter calcoaceticus, Acinetobacter pittii, or Acinetobacter nosocomialis.

In a specific embodiment of the present invention, in addition to the above 4 species belonging to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex, the Acinetobacter may be Acinetobacter haemolyticus, Acinetobacter johnsonii, Acinetobacter junii, Acinetobacter oleivorans, Acinetobacter schindleri, or Acinetobacter soli.

In a specific embodiment of the invention, in addition to the Acinetobacter species described above, the bacterium may be of another species, such as any of the 74 bacterial species shown in FIG. 3.

If the bacterium to be identified is already known to belong to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex, step (C) only needs to calculate and rank the Score values of the species of that complex in the database.

The method provided by the invention does not require assembly of the sequencing reads and is therefore simple and fast, while still using the information of the whole genome. In addition, because the invention constructs a bacterial fingerprint feature database covering 2279 species, it can be used not only to identify species within the Acinetobacter calcoaceticus-Acinetobacter baumannii complex but also to identify species of other complexes or other bacterial species.

Drawings

FIG. 1 is a schematic diagram of the identification method of the present invention.

FIG. 2 shows how the identification accuracy for species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex varies with the mapping ratio of the sequencing data.

FIG. 3 shows the identification accuracy for 74 other common species.

Detailed Description

Data, tools, and the like used in the following examples are commercially available unless otherwise specified.

The invention provides a method for identifying species of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex directly from whole-genome WGS sequencing reads, without assembly.

The basic principle of the invention is to build a complete database of species-specific genome fingerprint features, break the WGS sequencing reads of the strain to be identified directly into fragment sequences, and compare and score these fragments against each species in the fingerprint feature database, thereby identifying the species within the complex.

The principle and flow chart of the identification method of the invention are shown in figure 1.

The method designed by the invention specifically comprises the following steps:

1. Obtaining complete genome data of all sequenced bacteria

All available bacterial whole-genome sequence data are obtained from the Genome database of NCBI (National Center for Biotechnology Information). First, the meta-information of the sequencing data is obtained via ftp. Then the whole-genome nucleic acid sequence data of the strains are acquired.

Metadata of the biological classification (Taxonomy) is obtained from the Taxonomy database of NCBI, link: . These data provide the taxonomic information of each species, from which the species and the full taxonomic information of each strain can be obtained. The link between genome information and taxonomic information is then established according to the TaxID.
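The TaxID join described above could be sketched as follows. The record layout is an assumption made purely for illustration: the keys `accession`, `status` and `taxid`, the function name `link_genomes_to_taxonomy`, and the toy values are all hypothetical and do not reproduce any NCBI file format. The "Complete Genome" filter follows the status condition stated for step (A).

```python
def link_genomes_to_taxonomy(genome_records, taxonomy):
    """Join genome metadata with taxonomy metadata on TaxID.

    genome_records: iterable of dicts with the hypothetical keys
        'accession', 'status' and 'taxid'.
    taxonomy: dict mapping TaxID -> species name.
    Returns a dict: accession -> (taxid, species name).
    """
    linked = {}
    for record in genome_records:
        if record["status"] != "Complete Genome":
            continue  # keep complete genomes only
        taxid = record["taxid"]
        if taxid in taxonomy:
            linked[record["accession"]] = (taxid, taxonomy[taxid])
    return linked


if __name__ == "__main__":
    # Made-up accessions and TaxIDs, for illustration only.
    genomes = [
        {"accession": "GENOME_1", "status": "Complete Genome", "taxid": 1},
        {"accession": "GENOME_2", "status": "Scaffold", "taxid": 1},
        {"accession": "GENOME_3", "status": "Complete Genome", "taxid": 2},
    ]
    names = {1: "species one", 2: "species two"}
    print(link_genomes_to_taxonomy(genomes, names))
```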

2. Constructing fingerprint feature fragment database of each strain

The fingerprint feature fragments of each species are obtained from the acquired whole-genome nucleic acid sequences by the following steps, thereby constructing a fingerprint fragment database covering all bacterial species for which whole-genome data are available:

(1) The whole-genome nucleic acid sequence obtained for each bacterial species is fragmented. Assuming that the genome sequence of a given species has length L, nucleic acid fragments of k bp are chosen as feature fragments and the whole-genome sequence is segmented with a step size of 1 bp, so that a genome of length L yields L-k+1 fragments, i.e. all substrings of length k. The collection of these fragments is called set A; one set A is obtained for each species, the fragments that make up set A are called fragments A, and after segmentation only one copy of each repeated fragment A is kept per species. Here k and L are both positive integers greater than 1.
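The segmentation in (1) amounts to collecting every length-k substring into a set. A minimal sketch, with a hypothetical function name `kmer_set` and a toy sequence rather than a real genome:

```python
def kmer_set(sequence: str, k: int) -> set:
    """Return set A: all distinct length-k substrings, step size 1."""
    if k < 1 or len(sequence) < k:
        return set()
    # A sequence of length L has L - k + 1 windows; the set drops repeats.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


if __name__ == "__main__":
    toy = "ACGTACGT"                   # L = 8
    k = 4
    windows = len(toy) - k + 1         # L - k + 1 = 5 windows
    fragments = kmer_set(toy, k)       # ACGT occurs twice, kept once
    print(windows, sorted(fragments))
```

In the toy sequence the window ACGT occurs at both ends, so 5 windows collapse to 4 distinct fragments, which is the deduplication the step describes.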

(2) To allow rapid comparison between fingerprint fragments and the database, the invention converts the letter representation of the bases into numbers for storage. The 4 bases are represented as A: 00, C: 01, G: 10, T: 11, so that each base fragment of k bp is converted into a 2k-bit number, i.e. a number between 0 and 2^(2k)-1. Stored this way a fragment occupies only k/4 bytes (2k bits), whereas with the original character storage (A, T, C, G for the 4 bases) a k-mer fragment occupies k bytes, a compression ratio of 4. In addition, string comparison can be converted into numerical lookup in subsequent calculations, which is significantly faster.

(3) Obtaining fingerprint segments with representative characteristics of each strain according to the following steps:

① After the processing in (2), every fragment A is a number between 0 and 2^(2k)-1. A dictionary is built using these numbers as keys, and all fragments A (stored as numbers) obtained by cutting each species are traversed species by species. When a number (i.e. a fragment A) is encountered, the TaxID of the species is recorded in the dictionary entry for that fragment A; if the same fragment A is found in another species during the traversal, the whole key is deleted from the dictionary, i.e. the fragment A is removed. This continues until all species have been processed.

② All numbers/fragments A remaining in the dictionary (0 to 2^(2k)-1) are traversed, the fragments A (stored as numbers) recorded for only one species are selected and stored by species, giving a set of fingerprint feature fragments (stored as numbers) for each species and forming a fingerprint feature fragment database covering more than 2000 species.

3. Fragmenting whole genome sequencing data of strains to be identified

Each sequencing read in the WGS sequencing data fastq file of the strain to be identified is cut into base fragments of length k with a sliding window of step 1, so that each read yields (read length - k + 1) base fragments. After all reads have been processed, duplicate base fragments are removed to obtain a non-redundant set A of k bp fragments; each fragment A of k bp is then converted into a number by the method of step (2) of the fingerprint database construction, giving the complete set A of the genome of the strain to be identified.
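Read fragmentation and encoding can be combined in one pass, as in the sketch below. Several points are assumptions not stated in the text: the FASTQ input is taken to use plain 4-line records, windows containing a character other than A, C, G or T (for example N) are skipped, and the function names are hypothetical. The rolling update with a bit mask is an implementation convenience; it produces the same numbers as encoding each window separately.

```python
import io

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip().upper()


def encoded_kmers_from_reads(handle, k):
    """Return the non-redundant set A of a sample as 2k-bit integers."""
    mask = (1 << (2 * k)) - 1          # keeps only the last k bases
    fragments = set()
    for sequence in read_sequences(handle):
        value = 0
        valid = 0                      # consecutive A/C/G/T bases seen so far
        for base in sequence:
            code = BASE_CODE.get(base)
            if code is None:           # assumption: ambiguous base resets window
                value, valid = 0, 0
                continue
            value = ((value << 2) | code) & mask
            valid += 1
            if valid >= k:
                fragments.add(value)
    return fragments


if __name__ == "__main__":
    toy_fastq = "@r1\nACGTA\n+\nIIIII\n@r2\nACNGT\n+\nIIIII\n"
    print(sorted(encoded_kmers_from_reads(io.StringIO(toy_fastq), k=2)))
```

For a real file, the same function can be given a handle from `open(path)` instead of `io.StringIO`.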

4. Comparison with the bacterial fingerprint database

The whole-genome fragment set of the strain to be identified is compared with the constructed species fingerprint database, and the species to which the strain belongs is judged from the comparison result. Specifically, a Score is calculated between the complete set A of the strain to be identified and the fingerprint fragment set of each species in the fingerprint database, the results are ranked, and the species with the highest Score is taken as the species to which the strain to be identified belongs, i.e. the identification result.

The scoring function considers two factors: 1. the absolute number of elements in the intersection between the fragments of the strain to be identified and the fingerprint fragments of a species in the database; 2. the proportion of that intersection among the fragments in the fingerprint set of that species. We combine these two factors into a Score by weighted summation, with the scoring function designed as:

Score=α*normalization(N)+(1-α)*normalization(P)

N=card(Sx∩Si)

M=card(Si)

P=N/M

wherein Sx is the set of 2k-bit numbers representing fragments A of the strain to be identified; Si is the set of 2k-bit numbers representing fragments A of a given species in the species fingerprint feature database; α is a weighting coefficient; card denotes the number of elements in a set; N is the number of elements in the intersection of Sx and Si; M is the number of elements in Si; P is the proportion of the intersection among the fingerprint fragments of that species.

According to our tests, the identification result is best when the parameter α is 0.48. N and P are normalized as follows:

normalization(N)=(N-Nmin)/(Nmax-Nmin)

normalization(P)=(P-Pmin)/(Pmax-Pmin)

wherein Nmin, Nmax, Pmin and Pmax are the minimum and maximum values of N and P obtained after calculation against all species in the species fingerprint feature database.
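The scoring and ranking step might be sketched as below. Two points are assumptions read from the surrounding definitions rather than quoted from the text: P is taken as N/M, and the normalization is taken to be min-max scaling over all candidate species. The function names, the handling of the degenerate case where all values are equal, and the toy data are hypothetical.

```python
def min_max(values):
    """Min-max scale a dict of numbers to the range 0..1."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:              # assumption: no spread means no signal
        return {key: 0.0 for key in values}
    span = highest - lowest
    return {key: (value - lowest) / span for key, value in values.items()}


def rank_species(sample_kmers, fingerprint_db, alpha=0.48):
    """Return (TaxID, Score) pairs sorted from highest to lowest Score.

    sample_kmers: set of encoded fragments A of the strain to be identified.
    fingerprint_db: dict mapping TaxID -> set of species-specific fragments.
    """
    n = {t: len(sample_kmers & fp) for t, fp in fingerprint_db.items()}
    p = {t: (n[t] / len(fp) if fp else 0.0) for t, fp in fingerprint_db.items()}
    n_norm, p_norm = min_max(n), min_max(p)
    scores = {t: alpha * n_norm[t] + (1 - alpha) * p_norm[t] for t in fingerprint_db}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy fragments and made-up TaxIDs, for illustration only.
    sample = {1, 2, 3, 4}
    database = {10: {1, 2, 3, 9}, 20: {4, 8}, 30: {7}}
    for taxid, score in rank_species(sample, database):
        print(taxid, round(score, 4))
```

Restricting `fingerprint_db` to the species of one complex before calling `rank_species` corresponds to the shortcut described for strains already known to belong to that complex.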

The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.

Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

The following examples describe the steps of the invention in detail. The examples identify publicly downloadable sequencing data, but the method is clearly applicable to WGS sequencing data of strains obtained in any manner. The invention is implemented as a computer program written in python.
