Method for identifying individual intestinal flora type based on SNP

Document No.: 1467596    Published: 2020-02-21

Reading note: this technology, "Method for identifying individual intestinal flora type based on SNP" (一种基于snp鉴定个体肠道菌群类型的方法), was designed by 宁康, 秦季玥, 朱雪 and 谭重阳 on 2019-11-06. Abstract: The invention belongs to the technical field of intestinal microorganisms and relates in particular to a method for identifying an individual's intestinal flora type based on SNPs, comprising the following steps: S1, obtaining longitudinal sequencing data of the individual intestinal flora, and analyzing all species to obtain a species abundance table; S2, screening the main components of the intestinal flora; S3, analyzing and mining the SNPs of the intestinal flora; and S4, identifying the individual intestinal flora type and guiding health early warning for the intestinal flora. Drawing on microbiomics and bioinformatics, the method analyzes and mines the SNP sites of species that show seasonal cycling. It offers high sensitivity, selectivity and detection throughput, can identify the individual intestinal flora type and guide health early warning for the intestinal flora, and can be used to monitor and evaluate human health.

1. A method for identifying individual gut flora type based on SNP, comprising the steps of:

S1, obtaining longitudinal sequencing data of the individual intestinal flora, and analyzing all species to obtain a species abundance table;

S2, screening the main components of the intestinal flora;

obtaining, with the MetaPhlAn2 software, species abundance information for the reference sequence set from step S1, and selecting the species present in at least 3 samples;

obtaining the per-site depth of each sample with the SAMtools depth command, and calculating the average sequencing depth of each species;

selecting the species whose average sequencing depth is at least 10 in at least 3 samples, and counting the SNPs (single nucleotide polymorphisms) of each such species in each sample;

screening, from the species-level sequencing data of the individual intestinal flora in step S1, the sample genes with a coverage of at least 8, counting the SNPs of these genes in each sample, and determining the main components and subspecies components of the intestinal bacteria;

S3, analyzing and mining the SNPs of the intestinal flora;

according to the main components and subspecies components of the intestinal bacteria determined in step S2, extracting the genome-wide SNP sites and corresponding allele frequencies of the species whose distribution follows a seasonal cycling pattern, building an SNP frequency matrix from only those SNPs whose allele frequency exceeds 0.2, calculating the Manhattan distance between every pair of samples, then performing hierarchical clustering based on the farthest distance (complete linkage), and, after mining the SNPs that cycle seasonally, performing a Wilcoxon rank-sum test;

S4, identifying the individual intestinal flora type, and guiding health early warning for the intestinal flora;

mapping the protein sequences that carry the seasonally cycling SNPs from step S3 to the KEGG database by alignment, obtaining the biological pathway information involved from the highest-scoring alignment, and guiding health early warning for the intestinal flora according to the dynamic changes of the intestinal flora.

2. The method for identifying the type of intestinal flora of an individual based on SNP according to claim 1, wherein the specific operation of step S1 is as follows:

downloading whole-genome sequencing data of intestinal microorganisms, performing format conversion and quality control on the obtained sra data files, analyzing all species, and merging the species abundance tables.

3. The method for SNP-based identification of individual gut flora types according to claim 2, wherein the whole-genome sequencing data of the intestinal microorganisms is Illumina HiSeq 4000 shotgun sequencing data from the NCBI SRA database.

4. The method for identifying the type of intestinal flora of an individual based on SNP according to claim 2, wherein the format conversion of the sra data file is as follows:

converting the original sra file, with the fastq-dump command of the SRA Toolkit, into a fastq file containing the base composition and base sequencing quality information of each sequence.

5. The method for identifying the type of intestinal flora of an individual based on SNP according to claim 2, wherein in step S1, the specific operation of the quality control process is as follows:

performing quality control on the raw sequencing data with Trimmomatic, wherein the SE mode specifies single-end sequencing data, the ILLUMINACLIP parameter removes adapters, the LEADING parameter trims bases with quality below 5 from the start of each read, and the TRAILING parameter trims bases with quality below 5 from the end of each read.

6. The method for identifying the type of intestinal flora based on SNP according to claim 1, wherein step S2 comprises determining the main components and subspecies components of the intestinal flora by drawing a phylogenetic tree or by cluster analysis based on mutation frequency.

7. The method for SNP-based identification of individual gut flora types according to claim 6, wherein the operation of drawing the phylogenetic tree is:

using the ASC_GTRGAMMA nucleotide substitution model and the "-f a" algorithm option of RAxML (raxmlHPC) to perform rapid bootstrap analysis, applying the Lewis method for ascertainment bias correction, forming new sequences by random sampling and then performing sequence alignment, and repeating the process more than 50 times.

8. The method for identifying the intestinal flora type of an individual based on SNP according to claim 1, wherein the step S3 of extracting the genome-wide SNP sites and the corresponding allele frequencies comprises:

(1) alignment stage: for a sequence length of 151 bp, first building an index of the reference genome with the Burrows-Wheeler Aligner (BWA), then aligning the simulation data with the BWA-MEM algorithm, using the -R parameter to add a read group (RG) entry to the header of the sam file, and outputting the sam file;

(2) preprocessing stage: first converting the sam file into a binary bam file with the SAMtools view command, and then sorting the bam file by scaffold position with the SAMtools sort command; next, removing duplicates from the sorted bam file with Picard; then building an index for the deduplicated bam file with the SAMtools index command; and finally calling SNPs on the deduplicated bam file with the VarScan2 mpileup2snp command to obtain a vcf file of variant site information.

9. The method for identifying the type of an individual intestinal flora based on SNP according to claim 8, wherein the data section of the vcf file consists of tab-separated columns, the first eight of which describe the variant site and are, respectively: chromosome name or scaffold name, position of the variant site on the chromosome, ID of the variant site in existing databases, reference base, variant base, quality score, whether the filter criteria were passed, and additional information; each subsequent column holds the information of one sample at that site.

10. The method for identifying the intestinal flora type of an individual based on SNP according to claim 1, wherein in the step S3, the step of hierarchical cluster analysis is:

using the MetaPhlAn2 software, with the marker genes identified from the reference genomes in gff format of the NCBI Genome database, to obtain the species composition and abundance of the community at the species level, merging the species abundance tables, and then extracting the species information to obtain species-level abundance information for all samples.

Technical Field

The invention belongs to the technical field of intestinal microorganisms, and particularly relates to a method for identifying individual intestinal flora type based on SNP.

Background

The large, complex and dynamic microbial community in the human gut comprises archaea, bacteria, viruses and fungi, with more than 1000 microbial species, and has profound effects on the host's metabolic phenotype. The intestinal flora differs and varies widely between individuals; however, current theory holds that the population also shares a set of conserved microbial groups and genes, which may be necessary for normal gut function.

Human intestinal microorganisms consist mainly of five bacterial phyla and one archaeal phylum (Euryarchaeota). The five major bacterial phyla are Firmicutes, Bacteroidetes, Actinobacteria, Proteobacteria and Verrucomicrobia. Firmicutes includes the genera Ruminococcus, Clostridium, Lactobacillus (some strains of which are probiotics), Eubacterium (butyrate producers), Roseburia, etc.; Bacteroidetes includes Bacteroides, Prevotella, etc., which degrade complex polysaccharides; Actinobacteria are represented mainly by the genus Bifidobacterium (some strains of which are probiotics) [Functional interactions between the gut microbiota and host metabolism, Nature 489(7415) (2012) 242-249].

These intestinal microflora play important roles in several areas. 1. Eliminating pathogens to protect the host: in a mouse model of Salmonella infection, Endt K et al. found that gut microbes not only block pathogen invasion but also mediate the elimination of pathogens early in infection [The microbiota mediates pathogen clearance from the gut lumen after non-typhoidal Salmonella diarrhea, PLoS Pathogens 6(9) (2010) e1001097]; bifidobacteria can prevent enteropathogenic infection by producing acetate [Bifidobacteria can protect from enteropathogenic infection through production of acetate, Nature 469(7331) (2011) 543-]. 2. Mediating immune functions: for example, cyclophosphamide (a clinically important anticancer drug) can alter the composition of the gut microbiota and induce the translocation of certain gram-positive bacteria to secondary lymphoid organs, where they stimulate the production of T helper 17 cells and elicit a memory immune response [The intestinal microbiota modulates the anticancer immune effects of cyclophosphamide, Science 342(6161) (2013) 971]. 3. Regulating metabolic processes: the gut microbiome has even been described as an overlooked endocrine organ; its composition changes in obese people and responds to changes in body weight. The gut of obese people contains more Firmicutes and fewer Bacteroidetes, and Bacteroidetes levels increase after weight loss through diet control, which suggests that Bacteroidetes may respond to calorie intake [Human gut microbes associated with obesity, Nature 444(7122) (2006) 1022-].

It is now generally accepted that many factors can affect the species composition and diversity of gut microorganisms, such as diet, age, geographical location, drugs and environmental substances. The effects of these factors may be long-term or transient. One study found that long-term dietary differences may contribute to differences in the gut microbiome between U.S. populations, while short-term dietary changes within an individual may also alter species composition [Application of metagenomics in the human gut microbiome, World J Gastroenterol 21(3) (2015) 803-].

With the development of sequencing technology and bioinformatics analysis platforms, the species analysis of microorganisms has reached higher resolution, and research on intestinal bacteria has moved beyond the phylum level studied with traditional isolation and culture techniques. Genomic variation includes single nucleotide polymorphisms (SNPs), small insertions and deletions (indels, usually shorter than 50 bp), and large structural variations (insertions or deletions of sequences longer than 50 bp, chromosomal inversions, sequence translocations within or between chromosomes, copy number variations, etc.). In microorganisms, genomic variation may change the phenotype of the same species, for example its antibiotic resistance [Impact of gyrA and parC mutations on quinolone resistance, doubling time, and supercoiling degree of Escherichia coli, Antimicrobial Agents and Chemotherapy 43(4) (1999) 868] or its pathogenicity [Pathogenic adaptation of Escherichia coli by natural variation of the FimH adhesin, Proceedings of the National Academy of Sciences 95(15) (1998) 8922]. Such changes may reflect the response of microorganisms to environmental selection pressure, so it is necessary to subdivide microbial species further.

Therefore, there is a clear need to identify the individual intestinal flora type from single nucleotide polymorphisms (SNPs) using microbiomics and bioinformatics methods, so as to guide health early warning for the intestinal flora.

Disclosure of Invention

To solve the above problems, the invention provides a method for identifying individual intestinal flora type based on SNP. Drawing on microbiomics and bioinformatics, the method analyzes and mines the SNP sites of species that show seasonal cycling. It offers high sensitivity, selectivity and detection throughput, can identify the individual intestinal flora type and guide health early warning for the intestinal flora, and can be used to monitor and evaluate human health.

The invention is realized by adopting the following technical scheme:

A method for identifying individual intestinal flora type based on SNP comprises steps S1 to S4:

S1, obtaining longitudinal sequencing data of the individual intestinal flora, and analyzing all species to obtain a species abundance table.

Further, the specific operation of this step is: downloading whole-genome sequencing data of intestinal microorganisms, performing format conversion and quality control on the obtained sra data files, analyzing all species, and merging the species abundance tables.

Further, the whole-genome sequencing data of the intestinal microorganisms is Illumina HiSeq 4000 shotgun sequencing data from the NCBI SRA database.

Further, the format conversion of the sra data file is: converting the original sra file, with the fastq-dump command of the SRA Toolkit, into a fastq file containing the base composition and base sequencing quality information of each sequence.

Further, the quality control is: performing quality control on the raw sequencing data with Trimmomatic, wherein the SE mode specifies single-end sequencing data, the ILLUMINACLIP parameter removes adapters, the LEADING parameter trims bases with quality below 5 from the start of each read, and the TRAILING parameter trims bases with quality below 5 from the end of each read.
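The format conversion and quality control of step S1 can be sketched as two command lines assembled in Python. This is a minimal sketch under stated assumptions: the accession `SRR0000000`, the output directory, the jar name `trimmomatic-0.39.jar`, the adapter file `TruSeq3-SE.fa` and the `2:30:10` clipping thresholds are illustrative placeholders that the text does not specify. Only the SE mode, ILLUMINACLIP, LEADING at 5 and TRAILING at 5 come from the description above.

```python
import shlex


def s1_commands(accession, outdir="fastq", adapters="TruSeq3-SE.fa",
                jar="trimmomatic-0.39.jar"):
    """Return the fastq-dump and Trimmomatic command lines for one run.

    All file names and the ILLUMINACLIP thresholds (2:30:10) are
    illustrative assumptions, not values taken from the patent text.
    """
    raw = f"{outdir}/{accession}.fastq"
    clean = f"{outdir}/{accession}.clean.fastq"
    # Convert the downloaded .sra file into a fastq file.
    convert = ["fastq-dump", "--outdir", outdir, f"{accession}.sra"]
    # Single-end trimming: adapter removal, then low-quality ends (< 5).
    trim = ["java", "-jar", jar, "SE", raw, clean,
            f"ILLUMINACLIP:{adapters}:2:30:10", "LEADING:5", "TRAILING:5"]
    return [convert, trim]


for command in s1_commands("SRR0000000"):
    print(shlex.join(command))
```

The commands are only printed here; running them would require the SRA Toolkit and Trimmomatic to be installed.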

S2, screening the main components of the intestinal flora.

Species abundance information for the reference sequence set from step S1 is obtained with the MetaPhlAn2 software, and the species present in at least 3 samples are selected. The per-site depth of each sample is obtained with the SAMtools depth command, and the average sequencing depth of each species is calculated. The species whose average sequencing depth is at least 10 in at least 3 samples are selected, and the SNPs (single nucleotide polymorphisms) of each such species are counted in each sample. From the species-level sequencing data of the individual intestinal flora in step S1, the sample genes with a coverage of at least 8 are screened, the SNPs of these genes are counted in each sample, and the main components and subspecies components of the intestinal bacteria are determined.

Further, the main components and subspecies components of the intestinal bacteria are determined by drawing a phylogenetic tree or by cluster analysis based on mutation frequency.
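The depth-based screening in step S2 can be sketched as follows. This is a minimal illustration, assuming that `samtools depth` output is available as tab-separated lines (reference name, position, depth) and that a contig-to-species mapping exists; the mapping, the sample names and the toy depths are hypothetical. Note that `samtools depth` omits zero-coverage positions unless run with `-a`, which changes the average.

```python
from collections import defaultdict


def mean_depth_by_species(depth_lines, contig_to_species):
    """Average per-site depth per species from `samtools depth` lines.

    Each line is tab-separated: reference name, 1-based position, depth.
    """
    total = defaultdict(int)
    sites = defaultdict(int)
    for line in depth_lines:
        contig, _pos, depth = line.rstrip("\n").split("\t")
        species = contig_to_species.get(contig)
        if species is None:
            continue  # contig not assigned to a species of interest
        total[species] += int(depth)
        sites[species] += 1
    return {sp: total[sp] / sites[sp] for sp in total}


def select_species(depth_per_sample, min_depth=10, min_samples=3):
    """Keep species with mean depth >= min_depth in >= min_samples samples."""
    hits = defaultdict(int)
    for means in depth_per_sample.values():
        for species, depth in means.items():
            if depth >= min_depth:
                hits[species] += 1
    return sorted(sp for sp, n in hits.items() if n >= min_samples)


# Hypothetical contig names and toy depths, for illustration only.
contigs = {"c1": "E_hallii", "c2": "E_biforme"}
depth_per_sample = {
    "s1": mean_depth_by_species(["c1\t1\t12", "c1\t2\t14", "c2\t1\t4"], contigs),
    "s2": mean_depth_by_species(["c1\t1\t10", "c2\t1\t20"], contigs),
    "s3": mean_depth_by_species(["c1\t1\t11", "c2\t1\t30"], contigs),
    "s4": mean_depth_by_species(["c1\t1\t2", "c2\t1\t5"], contigs),
}
print(select_species(depth_per_sample))  # ['E_hallii']
```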

Further, the operation of drawing the phylogenetic tree is: using the ASC_GTRGAMMA nucleotide substitution model and the "-f a" algorithm option of RAxML (raxmlHPC) to perform rapid bootstrap analysis, applying the Lewis method for ascertainment bias correction, forming new sequences by random sampling and then performing sequence alignment, and repeating the process more than 50 times, preferably 80, 100 or 120 times.
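The "random sampling to form new sequences" in the bootstrap analysis can be illustrated with a short sketch. It assumes that the sampling corresponds to the standard bootstrap, in which alignment columns are drawn with replacement; RAxML performs this internally, so the code below only shows the idea. The sequences are toy data, and the sample labels are borrowed from the season codes used in the figures.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Toy SNP alignment (hypothetical sequences).
alignment = {"2013dry": "ACGTAC", "2014wet": "ACGAAC", "2014dry": "TCGAAC"}
rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(len(replicates), replicates[0])
```

Each replicate would then be used to infer a tree, and the support of a branch is the fraction of replicate trees that contain it.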

S3, analyzing and mining the SNPs of the intestinal flora.

According to the main components and subspecies components of the intestinal bacteria determined in step S2, the genome-wide SNP sites and corresponding allele frequencies are extracted for the species whose distribution follows a seasonal cycling pattern. An SNP frequency matrix is built from only those SNPs whose allele frequency exceeds 0.2, the Manhattan distance between every pair of samples is calculated, and hierarchical clustering based on the farthest distance (complete linkage) is then performed. After the SNPs that cycle seasonally have been mined, a Wilcoxon rank-sum test is performed.
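The frequency matrix, Manhattan distance and farthest-distance clustering described above can be sketched in plain Python. This is a minimal illustration on toy data: the site names and frequencies are invented, and treating frequencies at or below 0.2 as absent (0.0) is one possible reading of the 0.2 threshold.

```python
from itertools import combinations


def frequency_matrix(freqs, min_freq=0.2):
    """Build sample vectors from allele frequencies above min_freq.

    freqs maps sample -> {site: frequency}. Frequencies that do not
    exceed min_freq are treated as 0.0 (an assumption of this sketch).
    """
    kept = {name: {s: v for s, v in f.items() if v > min_freq}
            for name, f in freqs.items()}
    sites = sorted({s for f in kept.values() for s in f})
    return {name: [f.get(s, 0.0) for s in sites] for name, f in kept.items()}


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(vectors):
    """Agglomerative clustering; cluster distance = largest pairwise distance.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = max(manhattan(vectors[x], vectors[y]) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Toy allele frequencies (hypothetical sites and values).
freqs = {
    "2013dry": {"site1": 0.9, "site2": 0.1, "site3": 0.5},
    "2014wet": {"site1": 0.3, "site2": 0.8, "site3": 0.0},
    "2014dry": {"site1": 0.8, "site2": 0.15, "site3": 0.5},
}
merges = complete_linkage(frequency_matrix(freqs))
for a, b, d in merges:
    print(a, b, round(d, 3))
```

In this toy example the two dry-season samples merge first, and the rainy-season sample joins last.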

Further, the extraction of genome-wide SNP sites and corresponding allele frequencies comprises: (1) alignment stage: for a sequence length of 151 bp, first building an index of the reference genome with the Burrows-Wheeler Aligner (BWA), then aligning the simulation data with the BWA-MEM algorithm, using the -R parameter to add a read group (RG) entry to the header of the sam file, and outputting the sam file; (2) preprocessing stage: first converting the sam file into a binary bam file with the SAMtools view command, and then sorting the bam file by scaffold position with the SAMtools sort command; next, removing duplicates from the sorted bam file with Picard; then building an index for the deduplicated bam file with the SAMtools index command; and finally calling SNPs on the deduplicated bam file with the VarScan2 mpileup2snp command to obtain a vcf file of variant site information.
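The alignment and preprocessing stages can be written out as an ordered list of shell commands. The sketch below only assembles the command strings. The file names, the `picard` wrapper command with MarkDuplicates, the `VarScan.jar` name and the read-group fields are assumptions for illustration; the text above names only the tools and the -R parameter.

```python
def snp_calling_commands(sample, reads, ref="reference.fa",
                         varscan_jar="VarScan.jar"):
    """Return shell commands for alignment, preprocessing and SNP calling.

    File names, the read-group fields and the use of Picard
    MarkDuplicates are assumptions made for this sketch.
    """
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    return [
        # (1) Alignment stage
        f"bwa index {ref}",
        f"bwa mem -R '{read_group}' {ref} {reads} > {sample}.sam",
        # (2) Preprocessing stage
        f"samtools view -b -o {sample}.bam {sample}.sam",
        f"samtools sort -o {sample}.sorted.bam {sample}.bam",
        f"picard MarkDuplicates I={sample}.sorted.bam O={sample}.dedup.bam "
        f"M={sample}.metrics.txt REMOVE_DUPLICATES=true",
        f"samtools index {sample}.dedup.bam",
        # SNP calling: VarScan2 reads the pileup from standard input.
        f"samtools mpileup -f {ref} {sample}.dedup.bam | "
        f"java -jar {varscan_jar} mpileup2snp --output-vcf 1 > {sample}.vcf",
    ]


for command in snp_calling_commands("s1", "s1.clean.fastq"):
    print(command)
```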

Furthermore, the data section of the vcf file consists of tab-separated columns, the first eight of which describe the variant site and are, respectively: chromosome name or scaffold name (for bacteria), position of the variant site on the chromosome, ID of the variant site in existing databases (indicated with "." when absent), reference base, variant base, quality score, whether the filter criteria were passed, and additional information (e.g., sequencing depth); each subsequent column holds the information (e.g., mutation frequency) of one sample at that site.
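A record with this layout can be parsed with a few lines of Python. The sketch assumes a VCF file that carries genotype data, in which a ninth FORMAT column names the per-sample fields that follow, and it assumes that the variant frequency is reported in a `FREQ` field as a percentage string. The example line and the sample name are made up for illustration.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed columns and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP:FREQ
    record["samples"] = {}
    for name, column in zip(sample_names, fields[9:]):
        values = dict(zip(keys, column.split(":")))
        freq = values.get("FREQ")
        # Assumed representation: a percentage string such as "40%".
        if freq and freq != ".":
            values["FREQ"] = float(freq.rstrip("%")) / 100
        else:
            values["FREQ"] = None
        record["samples"][name] = values
    return record


# Hypothetical record for one sample.
line = "scaffold_1\t1042\t.\tA\tG\t.\tPASS\tADP=25\tGT:DP:FREQ\t0/1:25:40%"
record = parse_vcf_record(line, ["2013dry"])
print(record["CHROM"], record["POS"], record["samples"]["2013dry"]["FREQ"])
```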

Further, the hierarchical clustering analysis comprises the following steps: using the MetaPhlAn2 software, with the marker genes identified from the reference genomes in gff (general feature format) of the NCBI Genome database, to obtain the species composition and abundance of the community at the species level, merging the species abundance tables, and then extracting the species information to obtain species-level abundance information for all samples.
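The Wilcoxon rank-sum test that step S3 applies to the seasonally cycling SNPs can be sketched as follows. This is a minimal illustration that uses the normal approximation without tie correction, so the p-value is approximate and an exact test is preferable for small groups. The allele frequencies are invented toy values for one SNP in dry-season and rainy-season samples.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (W, p), where W is the rank sum of x. Tied values receive
    average ranks; the variance is not tie-corrected.
    """
    pooled = sorted((v, g) for g, values in enumerate((x, y)) for v in values)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2  # average of 1-based ranks i+1..j
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n, m = len(x), len(y)
    mean = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Toy allele frequencies of one SNP (hypothetical values).
dry = [0.82, 0.91, 0.78, 0.88]
wet = [0.31, 0.42, 0.25, 0.38]
w, p = rank_sum_test(dry, wet)
print(w, round(p, 4))
```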

S4, identifying the individual intestinal flora type, and guiding health early warning for the intestinal flora.

The protein sequences that carry the seasonally cycling SNPs from step S3 are mapped to the KEGG database by alignment, the biological pathway information involved is obtained from the highest-scoring alignment, and health early warning for the intestinal flora is guided according to the dynamic changes of the intestinal flora.
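Selecting the highest-scoring alignment for each protein and tallying the pathways can be sketched as follows. The input format, a list of (query, pathway, score) tuples, is an assumption of this sketch, as are the protein names and scores; the alignment tool that would produce such hits is not specified in the text.

```python
from collections import Counter


def best_hits(hits):
    """Keep, for each query protein, the hit with the highest score."""
    best = {}
    for query, pathway, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (pathway, score)
    return best


def pathway_counts(best):
    """Count how many proteins map to each pathway."""
    return Counter(pathway for pathway, _score in best.values())


# Hypothetical alignment hits: (protein, KEGG pathway ID, alignment score).
hits = [
    ("prot1", "ko00010", 210.0),
    ("prot1", "ko00020", 180.5),
    ("prot2", "ko00010", 95.0),
    ("prot3", "ko02010", 300.0),
]
best = best_hits(hits)
counts = pathway_counts(best)
print(counts.most_common())
```

A tally of this kind is what a bar chart of pathway distribution would display.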

The invention has the beneficial effects that:

1. The method for identifying individual intestinal flora type based on SNP takes the individual intestinal flora as its research object and, drawing on microbiomics and bioinformatics, analyzes and mines the SNP sites of species that show seasonal cycling, offering high sensitivity, selectivity and detection throughput. The data come from NCBI, the National Center for Biotechnology Information; GenBank, which NCBI established, is one of the three major biological sequence databases in the world, so the data source is authoritative and widely used in the field.

2. With the method for identifying individual intestinal flora type based on SNP, the dynamic changes of the intestinal flora can in theory be predicted, so that human health can be monitored and evaluated.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the identification procedure of the present invention;

FIG. 2 is a heatmap of the species composition of the Hadza human gut microbiota;

FIG. 3 shows box plots of the abundance of 12 species with seasonal cycling, where "abundance" denotes species abundance and "season" denotes the season;

FIG. 4 shows the distribution of SNP counts across seasons for 15 species (sample coverage > 3), where "SNP intensity" denotes the number of SNPs and "season" denotes the season;

FIG. 5A is a phylogenetic tree based on the genome-wide SNPs of E. hallii;

FIG. 5B is a phylogenetic tree based on the genome-wide SNPs of E. biforme;

FIG. 6A is a cluster analysis plot based on the genome-wide SNPs of E. hallii;

FIG. 6B is a cluster analysis plot based on the genome-wide SNPs of E. biforme;

wherein, in FIGS. 3, 4, 5A, 5B, 6A and 6B, 2013dry denotes the 2013 dry season, 2014wet denotes the 2014 rainy season, and 2014dry denotes the 2014 dry season;

FIG. 7 is a bar chart of the distribution of the KEGG pathways involving the identified genes, where "pathway" denotes the KEGG pathway.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.

The experimental procedures in the following examples are conventional unless otherwise specified. The experimental materials used in the following examples were all commercially available unless otherwise specified.
