OrthoMCL clustering result-based rapid analysis method

Document No.: 1447842 Publication date: 2020-02-18

Abstract: This invention, "A rapid analysis method based on OrthoMCL clustering results", was designed by 韩毛振, 张雁, 曹杰, 汪栋 and 罗学才 on 2019-10-30. It discloses a rapid analysis method based on OrthoMCL clustering results, belonging to the fields of comparative genomics and bioinformatics. Starting from the OrthoMCL clustering result, the method automatically identifies the classes of proteins used in pan-genomic analysis, including all representative proteins, core proteins, single-copy core proteins and species-specific proteins. For each class, it counts the number of proteins present in each species and outputs the results by class. The method outputs representative sequences for the proteins in each class, as well as representative sequences for each class in each species. In addition, it outputs the homologous clustering result as the sequences belonging to each homologous protein cluster, laying a foundation for more advanced, personalized pan-genomic analyses.

1. A rapid analysis method based on OrthoMCL clustering results is characterized by comprising the following steps:

step S1, obtaining the nucleic acid sequence and the protein sequence of each species to be analyzed, carrying out homology clustering on the protein sequences of all the species to be analyzed by using OrthoMCL cluster analysis software, and outputting an OrthoMCL clustering result;

step S2, letting N be the number of species used in the pan-genome analysis, and counting, for the cluster in each cluster file of the OrthoMCL clustering result, the number N1 of species it contains and the number of proteins it contains from each species, $m_1, m_2, m_3, \ldots, m_N$, to obtain the OrthoMCL clustering parameters;

step S3, classifying the proteins according to the OrthoMCL clustering parameters;

step S4, outputting an analysis result according to the classification result of step S3 and the species nucleic acid sequences of step S1.
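The counting in step S2 can be sketched in Python as follows. This is a minimal illustration, not the invention's implementation. It assumes each cluster is one text line in the commonly seen OrthoMCL groups layout, `cluster_id: species|protein species|protein ...`. The function name `count_cluster` and the toy identifiers are hypothetical.

```python
from collections import Counter

def count_cluster(line, species):
    """Return (cluster_id, N1, [m_1..m_N]) for one cluster line.

    Assumed line layout: "cluster_id: sp|protein sp|protein ...".
    `species` is the ordered list of the N species in the analysis.
    """
    cluster_id, _, members = line.partition(":")
    # Tally proteins per species prefix (the text before the first "|").
    per_species = Counter(m.split("|", 1)[0] for m in members.split())
    m_counts = [per_species.get(sp, 0) for sp in species]
    n1 = sum(1 for m in m_counts if m > 0)  # species present in the cluster
    return cluster_id.strip(), n1, m_counts

species = ["spA", "spB", "spC"]          # hypothetical species codes, N = 3
line = "cluster7: spA|p1 spA|p2 spC|p9"  # hypothetical cluster line
print(count_cluster(line, species))
```

Here the cluster contains two proteins from `spA`, none from `spB` and one from `spC`, so N1 is 2 while N is 3.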

2. The OrthoMCL clustering result-based rapid analysis method according to claim 1, wherein the classifying the proteins according to the OrthoMCL clustering parameters comprises:

according to the distribution characteristics of core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$, the cluster is a core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of single-copy core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$ and $m_1 = m_2 = m_3 = \cdots = m_N = 1$, the cluster is a single-copy core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of non-essential proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and at least two of $m_1, m_2, m_3, \ldots, m_N$ are non-zero, the cluster is a non-essential protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of specific proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and exactly one of $m_1, m_2, m_3, \ldots, m_N$ is non-zero, the cluster is a specific protein of the pan-genomic analysis, and the cluster file is output.
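The four rules of this claim can be sketched as a single Python function over the per-species counts. The sketch reads the specific-protein rule as "exactly one count is non-zero", which is an interpretation chosen so that it does not overlap the non-essential rule. Because every single-copy core cluster is also a core cluster, the function returns a list of labels. The name `classify` and the label strings are hypothetical.

```python
def classify(m_counts):
    """Return the class labels of one cluster from its counts m_1..m_N."""
    n = len(m_counts)                        # N: species in the analysis
    n1 = sum(1 for m in m_counts if m > 0)   # N1: species present in the cluster
    labels = []
    if n1 == n:
        labels.append("core")
        if all(m == 1 for m in m_counts):
            labels.append("single_copy_core")
    elif n1 >= 2:
        labels.append("non_essential")
    elif n1 == 1:
        labels.append("specific")
    return labels

for counts in ([1, 1, 1], [2, 1, 1], [1, 0, 3], [0, 4, 0]):
    print(counts, classify(counts))
```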

3. The OrthoMCL clustering result-based rapid analysis method according to claim 1, wherein the outputting of the analysis result comprises:

outputting the distribution of each class of protein in each cluster, i.e. outputting $m_1, m_2, m_3, \ldots, m_N$, so that the number of proteins of each class is counted;

the nucleic acid sequences of each class of proteins in each cluster are exported, providing the output text for subsequent pan-genomic analysis.
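The per-class counting described in this claim might look like the following sketch. It assumes the clusters have already been classified and that each one carries its counts $m_1, \ldots, m_N$. The tab-separated layout, the function names `tally` and `format_table`, and the toy data are illustrative assumptions, not the output format of the invention.

```python
from collections import defaultdict

def tally(classified, species):
    """Sum protein counts per species within each class.

    `classified` maps a cluster id to (class_label, [m_1..m_N]).
    """
    totals = defaultdict(lambda: [0] * len(species))
    for category, m_counts in classified.values():
        row = totals[category]
        for i, m in enumerate(m_counts):
            row[i] += m
    return dict(totals)

def format_table(totals, species):
    """Render the totals as tab-separated text, one row per class."""
    lines = ["category\t" + "\t".join(species)]
    for category in sorted(totals):
        lines.append(category + "\t" + "\t".join(map(str, totals[category])))
    return "\n".join(lines)

species = ["spA", "spB", "spC"]  # hypothetical species codes
classified = {                   # hypothetical classified clusters
    "cluster1": ("core", [1, 1, 1]),
    "cluster2": ("core", [2, 1, 1]),
    "cluster3": ("specific", [0, 3, 0]),
}
totals = tally(classified, species)
print(format_table(totals, species))
```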

Technical Field

The invention relates to the field of comparative genomics and bioinformatics, in particular to a rapid analysis method based on an OrthoMCL clustering result.

Background

Comparative genomics is the evolutionary analysis of genomic data from different species, comparing known genes and genome structures to resolve gene function and the genetic mechanisms linking genes to disease and phenotype (Setubal et al., 2017; Shilei Zhao et al., 2019). With the rapid development of sequencing technologies, especially second- and third-generation sequencing, the genomes of many species have been sequenced, and a growing number of species have population-level genome data from multiple samples. How to compare and analyze these genome sequencing data rapidly and effectively is currently a main focus of method development in comparative genomics.

Pan-genome analysis currently involves the following aspects: clustering analysis of homologous proteins, analysis of the homologous clustering results, tree building and evolutionary analysis of proteins, and functional annotation of proteins, including but not limited to carbohydrate-active enzyme annotation (CAZyme), protein function annotation (COG and GO) and metabolic pathway annotation (KEGG pathway). Existing pan-genomic analysis tools include PGAP (Yongbing Zhao et al., 2011), EDGAR (J. Yu et al., 2017) and panX (Wei Ding et al., 2018). These tools cover the vast majority of what pan-genomic analysis requires, but their output is generally highly integrated. Because the corresponding intermediate files are missing, in particular the homologous clustering results of proteins and the associated statistics and protein sequence files, the personalized analysis needed in pan-genomic analysis is difficult to carry out. An important prerequisite for such personalized analysis is therefore to analyze and count the homologous clustering results quickly and effectively, classify the proteins (mainly into core proteins, single-copy core proteins, non-essential proteins and specific proteins) and output the corresponding representative protein sequences as input files for subsequent analysis. However, no dedicated method is currently available, and one needs to be developed.

In pan-genomic analysis, homologous clustering of all proteins within a species is the basis for subsequent analyses. Tools currently available for this include OrthoMCL (https://orthomcl.org/orthomcl/), BLAST and Diamond (Wei Ding et al., 2018), among others. OrthoMCL, which identifies orthologous and paralogous genes in pan-genomic analysis, has detailed tutorials, is easy to use, and is among the more widely used tools in current pan-genomic analysis. Its output is systematic and comprehensive, and serves as the basic file for determining each protein class in pan-genomic analysis.

For these reasons, realizing more advanced personalized analysis in pan-genomic analysis requires fast and effective processing of protein clustering results. It is therefore necessary to provide a rapid and effective method, drawing on comparative genomics and bioinformatics, for analyzing the OrthoMCL clustering results of proteins in the pan-genome.

Disclosure of Invention

In order to solve the problems, the invention provides a rapid analysis method based on an OrthoMCL clustering result, which aims to solve the problem that the prior art has no method for analyzing and counting the homologous clustering result of the protein in pan-genomic analysis, rapidly classifying the corresponding protein and outputting a corresponding representative protein sequence.

The invention is realized by adopting the following technical scheme:

the invention provides a rapid analysis method based on an OrthoMCL clustering result, which comprises the following steps:

step S1, obtaining the nucleic acid sequence and the protein sequence of each species to be analyzed, carrying out homology clustering on the protein sequences of all the species to be analyzed by using OrthoMCL cluster analysis software, and outputting an OrthoMCL clustering result;

step S2, letting N be the number of species used in the pan-genome analysis, and counting, for the cluster in each cluster file of the OrthoMCL clustering result, the number N1 of species it contains and the number of proteins it contains from each species, $m_1, m_2, m_3, \ldots, m_N$, to obtain the OrthoMCL clustering parameters;

step S3, classifying the proteins according to the OrthoMCL clustering parameters;

step S4, outputting an analysis result according to the classification result of step S3 and the species nucleic acid sequences of step S1.

As a further optimization scheme of the present invention, the classifying the proteins according to the OrthoMCL clustering parameter includes:

according to the distribution characteristics of core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$, the cluster is a core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of single-copy core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$ and $m_1 = m_2 = m_3 = \cdots = m_N = 1$, the cluster is a single-copy core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of non-essential proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and at least two of $m_1, m_2, m_3, \ldots, m_N$ are non-zero, the cluster is a non-essential protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of specific proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and exactly one of $m_1, m_2, m_3, \ldots, m_N$ is non-zero, the cluster is a specific protein of the pan-genomic analysis, and the cluster file is output;
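The four classification rules can be restated compactly as follows. This is a sketch, writing N1 as $N_1$. It assumes that $N_1$ is the number of species with a non-zero count, and it reads the specific-protein rule as "exactly one species present", so that the non-essential and specific classes do not overlap. The block uses `align*` and `\text` from the amsmath package.

```latex
\begin{align*}
N_1 &= \bigl|\{\, i : m_i > 0,\ 1 \le i \le N \,\}\bigr| \\
\text{core} &\iff N_1 = N \\
\text{single-copy core} &\iff N_1 = N \ \text{and}\ m_1 = m_2 = \cdots = m_N = 1 \\
\text{non-essential} &\iff 2 \le N_1 < N \\
\text{specific} &\iff N_1 = 1 < N
\end{align*}
```

Under this reading every single-copy core cluster is also a core cluster, while the core, non-essential and specific classes are mutually exclusive.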

as a further optimization scheme of the present invention, the outputting the analysis result includes:

outputting the distribution of each class of protein in each cluster, i.e. outputting $m_1, m_2, m_3, \ldots, m_N$, so as to count the number of proteins of each class;

the nucleic acid sequences of each class of proteins in each cluster, including the sequence of the single copy core protein, are exported to provide the output text for subsequent pan-genomic analysis.
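Exporting the nucleic acid sequences of one cluster can be sketched as below. The sketch assumes the nucleic acid sequences are available as FASTA text whose record identifiers match the `species|protein` member names of the cluster. That naming convention, the helper names `read_fasta` and `cluster_fasta`, and the toy sequences are assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # id is the first token of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}

def cluster_fasta(members, sequences):
    """Build FASTA text for the members of one cluster.

    Members without a known sequence are skipped rather than guessed.
    """
    out = []
    for member in members:
        if member in sequences:
            out.append(">" + member)
            out.append(sequences[member])
    return "\n".join(out) + ("\n" if out else "")

nucleotide_text = """>spA|p1 hypothetical gene
ATGAAA
TAA
>spB|p4
ATGCCCTAA
"""
sequences = read_fasta(nucleotide_text)
print(cluster_fasta(["spA|p1", "spB|p4", "spC|p8"], sequences), end="")
```

Writing one such FASTA file per single-copy core cluster gives files that alignment tools can read directly.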

The method can output files required by main analysis content in the current pan-genomic analysis, and the nucleic acid and protein output files obtained by the processing of the invention can be directly used as input files required by the subsequent personalized analysis of the pan-genomic analysis without other processing.

The invention establishes, on the basis of the OrthoMCL clustering result, automatic identification of the classes of proteins used in pan-genomic analysis, including representative proteins, core proteins, single-copy core proteins and species-specific proteins. For each class, the number of proteins present in each species is counted, and the results are output by class. The method outputs representative sequences for the proteins in each class and for each class in each species. In addition, the method outputs the homologous clustering result as the sequences belonging to each homologous protein cluster, in particular the single-copy core protein sequences, so that multiple sequence alignment, phylogenetic tree analysis and gene selection-pressure calculation can be performed on single-copy proteins in subsequent pan-genomic analysis, laying a foundation for more advanced personalized analysis.

Compared with the prior art, the invention has the beneficial effects that:

(1) high generality: the analysis method is based on the OrthoMCL clustering results and is independent of the organisms studied in the pan-genomic analysis;

(2) high added value: the files required by subsequent pan-genomic analysis can be generated from the OrthoMCL clustering result, and an interface can be provided for the data required by an actual project, so that additional value is produced;

(3) strong usability: the method is simple, easy to understand and use, and convenient to operate.

Drawings

FIG. 1 is a flow chart of the steps of the fast analysis method based on OrthoMCL clustering results of the present invention;

FIG. 2 shows the cluster names and count statistics for 39 homologous proteins among the 4554 single-copy core protein sequences of Trametes in Example 1;

FIG. 3 shows the protein sequences, one per species, of the single-copy core protein cluster10001 in the 9 species of Example 1;

FIG. 4 shows the statistics of the number of specific proteins in 9 species of atypical veillonella in example 2;

FIG. 5 is the protein sequence of the homologous protein cluster60 in 9 species of Sesarillonella sarmentosa according to example 2 corresponding to each species;

FIG. 6 is a partial statistical result of the core protein of Porphyromonas gingivalis of example 3;

FIG. 7 shows the protein sequences, one per species, of the homologous protein cluster459 in 66 species of P. gingivalis.

Detailed Description

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further explained below with reference to the drawings.

A rapid analysis method based on OrthoMCL clustering results, as shown in FIG. 1, includes the following steps:

step S1, obtaining the nucleic acid sequence and the protein sequence of each species to be analyzed, carrying out homology clustering on the protein sequences of all the species to be analyzed by using OrthoMCL cluster analysis software, and outputting an OrthoMCL clustering result;

step S2, letting N be the number of species used in the pan-genomic analysis, and counting, for the cluster in each cluster file of the OrthoMCL clustering result, the number N1 of species it contains and the number of proteins it contains from each species, $m_1, m_2, m_3, \ldots, m_N$, to obtain the OrthoMCL clustering parameters;

step S3, classifying the proteins according to the OrthoMCL clustering parameters, including:

according to the distribution characteristics of core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$, the cluster is a core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of single-copy core proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 = N$ and $m_1 = m_2 = m_3 = \cdots = m_N = 1$, the cluster is a single-copy core protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of non-essential proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and at least two of $m_1, m_2, m_3, \ldots, m_N$ are non-zero, the cluster is a non-essential protein of the pan-genomic analysis, and the cluster file is output;

according to the distribution characteristics of specific proteins across species in the pan-genomic OrthoMCL clustering results: if the number of species contained in the cluster satisfies $N1 < N$ and exactly one of $m_1, m_2, m_3, \ldots, m_N$ is non-zero, the cluster is a specific protein of the pan-genomic analysis, and the cluster file is output;

step S4, outputting an analysis result according to the classification result of step S3 and the species nucleic acid sequences of step S1, comprising:

outputting the distribution of each class of protein in each cluster, i.e. outputting $m_1, m_2, m_3, \ldots, m_N$, so as to count the number of proteins of each class;

the nucleic acid sequences of each class of proteins in each cluster, including the sequence of the single copy core protein, are exported to provide the output text for subsequent pan-genomic analysis.

The method can output files required by main analysis content in the current pan-genomic analysis, and the nucleic acid and protein output files obtained by the processing of the invention can be directly used as input files required by the subsequent personalized analysis of the pan-genomic analysis without other processing.
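The flow of FIG. 1 after the OrthoMCL run (steps S2 to S4) can be sketched end to end on a toy input. This is an illustrative sketch under stated assumptions, not the implementation of the invention. It assumes the clustering result is text with one cluster per line in the form `cluster_id: species|protein ...`, and it reads the specific-protein rule as "exactly one species present". The function name `analyse` and all identifiers in the toy data are hypothetical.

```python
from collections import Counter

def analyse(groups_text, species):
    """Map each protein class to the ids of the clusters that fall in it."""
    n = len(species)
    result = {"core": [], "single_copy_core": [],
              "non_essential": [], "specific": []}
    for line in groups_text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        cluster_id, members = line.split(":", 1)
        cid = cluster_id.strip()
        # Step S2: per-species counts m_1..m_N and species count N1.
        counts = Counter(m.split("|", 1)[0] for m in members.split())
        m_counts = [counts.get(sp, 0) for sp in species]
        n1 = sum(1 for m in m_counts if m > 0)
        # Step S3: classify the cluster from N1 and the counts.
        if n1 == n:
            result["core"].append(cid)
            if all(m == 1 for m in m_counts):
                result["single_copy_core"].append(cid)
        elif n1 >= 2:
            result["non_essential"].append(cid)
        elif n1 == 1:
            result["specific"].append(cid)
    return result

groups_text = """c1: spA|a1 spB|b1 spC|c1
c2: spA|a2 spA|a3 spB|b2 spC|c2
c3: spA|a4 spC|c3
c4: spB|b3 spB|b4
"""
species = ["spA", "spB", "spC"]
# Step S4: output the classification result, here simply printed.
for category, ids in analyse(groups_text, species).items():
    print(category, ids)
```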
