Automatic typing method based on human intestinal flora

Document No.: 1364335. Publication date: 2020-08-11.

Reading note: this technology, "Automatic typing method based on human intestinal flora" (一种基于人肠道菌群的自动化分型方法), was designed by Wang Shuwei (王树伟), Xiao Yunping (肖云平), Shi Xianjun (史贤俊), Lin Bo (林博) and Zhang Jianming (张建明) on 2020-04-20. Abstract: The invention discloses an automatic typing method based on human intestinal flora. LEfSe is used to screen biomarkers for the groups obtained by clustering, after which the specific enterotype is determined. The results are comprehensive and include the cluster plots, the biomarker screening and the enterotype boxplot. All analysis results are organized automatically: after each analysis step completes, its results are summarized, tabulated and visualized. In addition, every operation step is traceable, which simplifies troubleshooting; if an analysis fails, a corresponding error log is produced.

1. An automatic typing method based on human intestinal flora, comprising the following steps:

1) preparing a genus-level relative abundance table for all samples;

2) partitioning the samples with the Partitioning Around Medoids (PAM) algorithm to cluster the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz index;

3) verifying the clustering quality by a silhouette validation technique;

4) performing between-class analysis (BCA) according to the optimal number of clusters;

5) screening, by LEfSe analysis, the species that contributes most to the difference in each group, taking it as the enterotype of that group, and drawing a boxplot.

2. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 2), the Calinski-Harabasz index is defined as:

$$CH_k = \frac{B_k / (k - 1)}{W_k / (n - k)}$$

wherein B_k is the between-cluster sum of squares, W_k is the within-cluster sum of squares, n is the number of samples, and the number of clusters k that maximizes CH_k is selected.

3. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 3), the silhouette width s(i) of each data point i is calculated by the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

where a(i) is the average dissimilarity (or distance) of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity (or distance) of sample i to all objects in the nearest other cluster,

the formula implies that -1 ≤ s(i) ≤ 1; a sample that is closer to its own cluster than to any other cluster has a higher value of s(i), whereas s(i) close to 0 means that the sample lies between two clusters, and a large negative value of s(i) indicates that the sample has been assigned to the wrong cluster.

4. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 4), the between-class analysis (BCA) is performed in R using the ade4 package.

5. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), features that differ between groups are detected by a rank-sum test, and linear discriminant analysis (LDA) is then used for dimensionality reduction and to evaluate the effect size of the different species, thereby obtaining the LDA score.

6. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), each enterotype is named in the form of the letter G followed by a number.

7. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), the LEfSe analysis procedure is used to find the biomarkers that differ significantly between clusters.

8. The method for the automated typing of human intestinal flora according to any one of claims 1 to 7, wherein in step 5), the boxplot is drawn with the ggplot2 package in the R language.

Technical Field

The invention relates to the field of high-throughput microbial sequencing, in particular to an automatic typing method based on human intestinal flora.

Background

In 2011, a European research institution analyzed the gut microbial composition of 22 Europeans by exploiting differences in a bacterial gene, and characterized microbial community compositions that differ markedly both between individuals and within the same individual. They also compared the community composition patterns of these Europeans with those previously reported for Japanese and American subjects. The results showed that the microbial communities are not random combinations: across all of the populations tested, they fall into roughly three types, called enterotypes. The researchers named them the Bacteroides type, the Prevotella type and the Ruminococcus type, meaning that they are enriched in Bacteroides, Prevotella or Ruminococcus, respectively. A study of a larger population (154 Americans and 85 Danes) reached the same conclusion, with the subjects again falling into these three categories. This suggests that the number of microbial community configurations that can thrive in the human gut is limited.

In April 2011, the MetaHIT consortium published the enterotypes found in the human gut microbiome (Arumugam, Raes et al., 2011). The data from that study are public, and the theory behind the computational procedure is explained in the supplementary information of the article. However, the supplementary material does not give the exact set of commands (in the R environment), nor a complete visual presentation of the specific enterotype identification method, that would allow anyone to reproduce all of the results in the article.

Existing enterotype identification has the following shortcomings:

(1) enterotype identification is ambiguous: the method for assigning an enterotype to each clustering result is not clearly defined;

(2) the presentation of results is incomplete: the analysis results are too simple, the data are not mined deeply enough, and visualizations corresponding to the data are lacking.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide an automatic typing method based on human intestinal flora.

In order to achieve the purpose, the invention adopts the scheme that:

An automatic typing method based on human intestinal flora comprises the following steps:

1) preparing a genus-level relative abundance table for all samples;

2) partitioning the samples with the Partitioning Around Medoids (PAM) algorithm to cluster the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz (CH) index;

3) verifying the clustering quality by a silhouette validation technique;

4) performing between-class analysis (BCA) according to the optimal number of clusters;

5) screening, by LEfSe analysis, the species that contributes most to the difference in each group, taking it as the enterotype of that group, and drawing a boxplot.

Preferably, in step 2), the Calinski-Harabasz index is defined as:

$$CH_k = \frac{B_k / (k - 1)}{W_k / (n - k)}$$

wherein B_k is the between-cluster sum of squares, W_k is the within-cluster sum of squares, n is the number of samples, and the number of clusters k that maximizes CH_k is selected.

Preferably, in step 3), the silhouette width s(i) of each data point i is calculated by the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

where a(i) is the average dissimilarity (or distance) of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity (or distance) of sample i to all objects in the nearest other cluster.

The formula implies that -1 ≤ s(i) ≤ 1. A sample that is closer to its own cluster than to any other cluster has a higher value of s(i), whereas s(i) close to 0 means that the sample lies between two clusters, and a large negative value of s(i) indicates that the sample has been assigned to the wrong cluster.

Preferably, in step 4), the between-class analysis (BCA) is performed in R using the ade4 package.

Preferably, in step 5), features that differ between groups are detected by a rank-sum test, and LDA (linear discriminant analysis) is then used for dimensionality reduction and to evaluate the effect size of the different species, thereby obtaining the LDA score.

Preferably, in step 5), each enterotype is named in the form of the letter G followed by a number.

Preferably, in step 5), the LEfSe analysis procedure is used to find the biomarkers that differ significantly between clusters.

Preferably, in step 5), the boxplot is drawn with the ggplot2 package in the R language.

The invention has the following beneficial effects:

(I) LEfSe is used to screen biomarkers for the groups obtained by clustering, and the specific enterotype is then determined.

(II) The results are comprehensive and include the cluster plots, the biomarker screening and the enterotype boxplot.

(III) All analysis results are organized automatically; after each analysis step completes, its results are summarized, tabulated and visualized.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

FIG. 2 is a graph of the optimal cluster number selection according to the present invention.

FIG. 3 is a cluster plot from the between-class analysis of the present invention.

FIG. 4 is a cluster plot from the between-class analysis of the present invention, with sample names.

FIG. 5 is a biomarker bar chart of the present invention.

FIG. 6 is an enterotype boxplot of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of the present invention. In addition, technical features of various embodiments or individual embodiments provided by the present invention may be arbitrarily combined with each other to form a feasible technical solution, but must be based on the realization of the technical solution by a person skilled in the art, and when the technical solution combination is contradictory or cannot be realized, the technical solution combination should be considered to be absent and not to be within the protection scope of the present invention.

The invention provides an automatic typing method based on human intestinal flora, which is shown in figure 1 and comprises the following steps:

1. File preparation step:

A genus-level relative abundance table is obtained for the samples from the different populations by high-throughput sequencing.
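As an illustration of what such a table contains, the following minimal Python sketch converts per-sample genus read counts into relative abundances. The function name, the nested-dictionary layout and the read counts are assumptions made for this example; the source does not specify a file format or an implementation.

```python
def relative_abundance(counts):
    """Convert per-sample genus read counts into relative abundances.

    counts: dict mapping sample name -> dict mapping genus name -> read count.
    Returns the same nested structure, with each sample's values summing to 1.
    """
    table = {}
    for sample, genus_counts in counts.items():
        total = sum(genus_counts.values())
        if total == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        table[sample] = {g: c / total for g, c in genus_counts.items()}
    return table


# Hypothetical read counts, for illustration only.
example_counts = {
    "S1": {"Bacteroides": 60, "Prevotella": 30, "Ruminococcus": 10},
    "S2": {"Bacteroides": 5, "Prevotella": 80, "Ruminococcus": 15},
}
example_table = relative_abundance(example_counts)
```

Each row of `example_table` is one sample's abundance distribution, which is the input to the clustering step.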

2. Step of selecting the optimal number of clusters:

The invention uses the Partitioning Around Medoids (PAM) algorithm to partition the samples and cluster the abundance distributions. PAM is derived from the basic k-means algorithm, but has the advantage of supporting arbitrary distance measures and is more robust than k-means. The number of clusters must be specified in advance as an input (in this sense the procedure is supervised); the algorithm then partitions the data into that number of clusters.
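The idea behind PAM can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used by the invention: it works on a precomputed distance matrix (so any distance measure can be plugged in), builds an initial set of medoids greedily, and then swaps medoids with non-medoids while doing so lowers the total cost. The function names and the toy one-dimensional data are assumptions.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Cluster n points into k groups from a precomputed n x n distance matrix.

    Returns (medoids, labels), where labels[i] is the position in `medoids`
    of the medoid closest to point i.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # BUILD phase: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [p for p in range(n) if p not in medoids]
        best = min(candidates, key=lambda p: total_cost(dist, medoids + [p]))
        medoids.append(best)

    # SWAP phase: replace a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True

    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Toy data: two well-separated groups on a line (illustrative values only).
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, 2)
```

Because the algorithm only reads `dist`, the absolute difference used here could be replaced by any dissimilarity between abundance profiles without changing the clustering code.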

To evaluate the optimal number of clusters, the present invention uses the Calinski-Harabasz (CH) index, which has shown good performance in recovering the number of clusters. It is defined as:

$$CH_k = \frac{B_k / (k - 1)}{W_k / (n - k)}$$

where B_k is the between-cluster sum of squares (i.e., the squared distances between all pairs of points i and j that are not in the same cluster), W_k is the within-cluster sum of squares (i.e., the squared distances between all pairs of points i and j that are in the same cluster), k is the number of clusters and n is the number of samples. This measure implements the idea that the clustering is better when the distances between clusters are much larger than the distances within clusters. Therefore, the number of clusters k that maximizes CH_k is selected.
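The index can be computed as in the following Python sketch. It makes one simplifying assumption that should be stated plainly: the sums of squares are taken in the common centroid-based form for points with Euclidean coordinates, whereas the text above describes them through pairwise squared distances. The function name and the toy data are illustrative.

```python
def calinski_harabasz(points, labels):
    """CH_k = (B_k / (k - 1)) / (W_k / (n - k)) for Euclidean points.

    points: list of coordinate lists; labels: cluster label of each point.
    """
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the CH index needs 2 <= k < n")

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B_k: size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W_k: squared distance of each point to its own centroid
    for rows in clusters.values():
        c = centroid(rows)
        between += len(rows) * sqdist(c, overall)
        within += sum(sqdist(r, c) for r in rows)
    if within == 0:
        raise ValueError("within-cluster sum of squares is zero")
    return (between / (k - 1)) / (within / (n - k))


# Toy data: the first partition matches the two visible groups, the second does not.
pts = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
ch_good = calinski_harabasz(pts, [0, 0, 0, 1, 1, 1])
ch_bad = calinski_harabasz(pts, [0, 1, 0, 1, 0, 1])
```

In practice the index would be evaluated for each candidate k (each with its own clustering), and the k with the largest value kept.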

3. Step of verifying the clustering quality:

Cluster validation methods are very useful for assessing the quality of a clustering with respect to the underlying data points. The invention uses a silhouette validation technique. The silhouette width s(i) of each data point i is calculated by:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

where a(i) is the average dissimilarity (or distance) of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity (or distance) of sample i to all objects in the nearest other cluster.

The formula implies that -1 ≤ s(i) ≤ 1. A sample that is closer to its own cluster than to any other cluster has a higher value of s(i), whereas s(i) close to 0 means that the sample lies between two clusters. A large negative value of s(i) indicates that the sample has been assigned to the wrong cluster.
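A minimal Python sketch of the silhouette width, computed from a precomputed distance matrix, is given below. The function name and the toy data are assumptions for illustration; assigning 0 to a sample that is alone in its cluster is a common convention and is not taken from the source.

```python
def silhouette_widths(dist, labels):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every sample i."""
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    if len(members) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i in range(len(dist)):
        own = [j for j in members[labels[i]] if j != i]
        if not own:
            widths.append(0.0)  # convention for a singleton cluster
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for lab, idx in members.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy data: two well-separated groups on a line (illustrative values only).
values = [0, 1, 2, 10, 11, 12]
d = [[abs(x - y) for y in values] for x in values]
s_good = silhouette_widths(d, [0, 0, 0, 1, 1, 1])
# Same data, but the third sample is deliberately put in the wrong cluster.
s_bad = silhouette_widths(d, [0, 0, 1, 1, 1, 1])
```

With the correct labels every width is close to 1, while the deliberately misassigned sample in `s_bad` receives a strongly negative width, which is the behavior described above.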

4. Between-class analysis (BCA) step:

Between-class analysis (BCA) is performed to support the clustering and to identify the drivers of the enterotypes. The analysis is performed in R using the ade4 package. Prior to this analysis, in the Illumina dataset, genera with very low abundance are removed to reduce noise, namely those whose average abundance across all samples is below 0.01%. Between-class analysis is a special case of principal component analysis with an instrumental variable, where the instrumental variable is a qualitative factor (here, the enterotype cluster). Between-class analysis finds the principal components based on the center of gravity of each group (cluster) rather than on each individual sample.
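The noise-reduction filter applied before BCA can be sketched in Python as follows. The sketch assumes that abundances are stored as fractions, so that 0.01% corresponds to 0.0001; the function name, the table layout and the genus "RareGenus" are hypothetical. It does not re-normalize the remaining abundances, since the source does not say whether that is done.

```python
def drop_low_abundance(table, threshold=0.0001):
    """Remove genera whose mean relative abundance across samples is below threshold.

    table: dict mapping sample -> dict mapping genus -> relative abundance,
    expressed as fractions (0.0001 corresponds to 0.01%).
    """
    samples = list(table)
    genera = sorted({g for row in table.values() for g in row})
    keep = [
        g
        for g in genera
        if sum(table[s].get(g, 0.0) for s in samples) / len(samples) >= threshold
    ]
    return {s: {g: table[s].get(g, 0.0) for g in keep} for s in samples}


# Hypothetical abundances; "RareGenus" averages 0.0025% and is dropped.
abundance = {
    "S1": {"Bacteroides": 0.6, "Prevotella": 0.39995, "RareGenus": 0.00005},
    "S2": {"Bacteroides": 0.2, "Prevotella": 0.8, "RareGenus": 0.0},
}
filtered = drop_low_abundance(abundance)
```

The filtered table is what would then be passed to the between-class analysis itself, which the source performs with the ade4 package in R and which is not reproduced here.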

5. Inter-cluster LEfSe analysis step:

To screen for biomarkers that differ significantly between clusters, the features that differ between groups are first detected by a rank-sum test; LDA (linear discriminant analysis) is then used for dimensionality reduction and to evaluate the effect size of the different species, which yields the LDA score.
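The rank-sum stage can be illustrated with the Kruskal-Wallis H statistic, which is the rank-sum test LEfSe applies across classes. The Python sketch below covers only that first stage, for a single genus, and omits both the tie correction and the conversion of H to a p-value; the subsequent LDA effect-size step is not shown. The function names and abundance values are assumptions for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    if any(len(g) == 0 for g in groups):
        raise ValueError("every group needs at least one value")
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    acc = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        acc += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * acc - 3 * (n + 1)


# Hypothetical abundances of one genus in three clusters.
h_separated = kruskal_wallis_h([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])
h_mixed = kruskal_wallis_h([[0.1, 0.4, 0.7], [0.2, 0.5, 0.8], [0.3, 0.6, 0.9]])
```

A genus whose abundance separates the clusters cleanly gives a much larger H than one whose values are interleaved across clusters, so it is more likely to pass the screening and be carried forward to the LDA step.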

6. Enterotype boxplot display step:

The species that contributes most to the difference in each group is screened by LEfSe analysis and taken as the enterotype of that group, and a boxplot is drawn.
