Method for mining omics data based on graph theory and greedy algorithm

Document No.: 1965074 Published: 2021-12-14

Reading note: This technology, "A method for mining omics data based on graph theory and a greedy algorithm", was designed by 王敏, 夏梦雷, 王頔, 闫欣瑶, 夏艺铭, 郑宇� and 申雁冰 on 2021-08-19. Abstract: The invention discloses a method for mining omics data based on graph theory and a greedy algorithm. The method comprises the following steps: calculating the differences of omics objects using statistical methods and visualizing the pathway difference distribution at genome scale; converting the omics objects into the corresponding elementary reactions, constructing an adjacency matrix of compounds with reactants as starting points and products as end points, and building a metabolic network; simplifying the network with a greedy algorithm centered on the significant difference points to obtain the connectivity between the difference points; and analyzing the network topology. The invention integrates statistical and graph-theoretic methods, achieves effective dimension reduction and visualization of omics data, and enables accurate mining of omics data.

1. A method for mining omics data based on graph theory and a greedy algorithm, characterized by comprising the following steps:

S1, calculating the differences of omics objects using a statistical method, and visualizing the pathway difference distribution at genome scale;

S2, converting the omics objects into the corresponding elementary reactions, constructing an adjacency matrix of compounds with reactants as starting points and products as end points, and building a metabolic network;

S3, simplifying the network with a greedy algorithm centered on the significant difference points to obtain the connectivity between the difference points;

S4, analyzing the network topology.

2. The method for mining omics data based on graph theory and a greedy algorithm according to claim 1, wherein the genome-scale visualization of the pathway difference distribution is performed as follows:

calculating the differences of the corresponding genes in the omics object data by statistical difference analysis, normalizing the difference data, coupling the normalized difference data with a colormap on a genome-scale general metabolic network map used as the base map, and presenting the pathway distribution and the differences of the omics-related metabolites through changes in color.

3. The method for mining omics data based on graph theory and a greedy algorithm according to claim 1, wherein the genome-scale general metabolic network map is drawn by computer after the coordinates of each gene and compound have been specified in advance according to the metabolic network formed by the reactions.

4. The method for mining omics data based on graph theory and a greedy algorithm according to claim 2, wherein the difference data are normalized by mapping them to the range 0-1 according to the following formula, gradient colors are set, and a colormap for drawing is established by matching data values to colors one to one:

$$X = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original data, $x_{\min}$ is the minimum of the data set, $x_{\max}$ is the maximum of the data set, and $X$ is the normalized data.

5. The method for mining omics data based on graph theory and a greedy algorithm according to claim 2, wherein the adjacency matrix is constructed as follows:

obtaining all enzymatic reactions contained in the omics data through the API (application program interface) of the KEGG database, splitting each reaction at the "→" symbol to extract the reactants and products one by one, and establishing an adjacency matrix with the reactants as rows, the products as columns, and the normalized difference data as weights;

drawing the adjacency matrix by computer and connecting all related compounds as a graph, thereby building a metabolic network in which all related compounds are visualized.

6. The method for mining omics data based on graph theory and a greedy algorithm according to claim 1, wherein the greedy algorithm proceeds as follows:

taking each difference point as a starting point, calculating in turn the shortest paths to the other difference points; if a shortest path contains no other difference point, marking the two difference points as connected and recording the connections of the non-difference points on the path; and, after all difference points have been processed, deleting the connections of all non-difference points.

7. The method for mining omics data based on graph theory and a greedy algorithm according to claim 1, wherein the topology comprises the PageRank coefficients of the nodes, the eigenvector centrality, and the shortest path and maximum flux path between difference points;

PageRank reflects the importance of each node in the connected network structure; eigenvector centrality reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; and the shortest path and the maximum flux path reflect the simplest connection between two genes.

8. The method for mining omics data based on graph theory and a greedy algorithm according to claim 7, wherein the PageRank coefficient is expressed as follows:

$$\mathrm{PageRank}(p_i) = \frac{1 - q}{N} + q \sum_{p_j \in M(p_i)} \frac{\mathrm{PageRank}(p_j)}{L(p_j)}$$

where $p_1, p_2, \ldots, p_N$ are the nodes, $N$ is the number of nodes, and $q$ is the damping factor; $M(p_i)$ is the set of nodes $p_j$ linked to $p_i$ in a reactant-product relation, and $L(p_j)$ is the number of products of $p_j$;

the eigenvector centrality is calculated as follows:

$$C_{EC}(i) = x_i = c \sum_{j=1}^{n} a_{ij} x_j$$

where $C_{EC}(i)$ is the eigenvector centrality of node $i$, which reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; the influence of a single node can be regarded as a linear combination of the influence of the other nodes, and the higher the centrality, the more important the node is in the network; $c$ is a proportionality constant; $n$ is the number of nodes; $a_{ij}$ is the adjacency-matrix entry for node $i$ and its adjacent node $j$, where $j$ starts from its initial value; and $x$ is the raw data.

9. The method for mining omics data based on graph theory and a greedy algorithm according to any one of claims 1 to 8, wherein the omics objects comprise the transcriptome and the proteome.

Technical Field

The invention relates to the technical field of omics data mining, in particular to a method for mining omics data based on graph theory and greedy algorithm.

Background

As scientific research advances, modern science has become increasingly aware of the importance of whole systems. Some problems cannot be treated simply as local events, because individual parts behave differently once placed in a higher-level structure, owing to the dynamic interactions between them. This has led to a recent definition of systems biology: unlike molecular biology, which has focused only on individual genes and proteins, systems biology studies the interrelationships between cell signaling and gene regulatory networks, the composition of biological systems, and their structure and system-level functions. The application of systems biology methods in biological and medical research, together with high-throughput sequencing technology, allows more relevant information to be collected at the molecular level, mainly from genomics, transcriptomics, proteomics, and metabolomics.

Omics data are complex and diverse. They include information such as the reactants and products participating in metabolic reactions, the corresponding enzymes, and the reversibility of the reactions, and together they form complex biological networks with very large data volumes. Studying omics models with such data volumes and intricate biological mechanisms requires a visualization system that makes the complex biological network intuitively understandable, so that its implicit biological significance can be observed. How to effectively integrate multi-omics data and extract biologically meaningful information from them remains a very challenging problem.

Disclosure of Invention

In view of the technical deficiencies of the prior art, the invention aims to provide a method for mining omics data based on graph theory and a greedy algorithm. Mining the important features of omics data with topological methods is of great significance for subsequent research in biology and medicine.

The technical scheme adopted to achieve this purpose is as follows:

A method for mining omics data based on graph theory and a greedy algorithm comprises the following steps:

S1, calculating the differences of omics objects using a statistical method, and visualizing the pathway difference distribution at genome scale;

S2, converting the omics objects into the corresponding elementary reactions, constructing an adjacency matrix of compounds with reactants as starting points and products as end points, and building a metabolic network;

S3, simplifying the network with a greedy algorithm centered on the significant difference points to obtain the connectivity between the difference points;

S4, analyzing the network topology.

As a preferred technical solution, the genome-scale visualization of the pathway difference distribution is performed as follows:

calculating the differences of the corresponding genes in the omics object data by statistical difference analysis, normalizing the difference data, coupling the normalized difference data with a colormap on a genome-scale general metabolic network map used as the base map, and presenting the pathway distribution and the differences of the omics-related metabolites through changes in color.

As a preferred technical solution, the genome-scale general metabolic network map is drawn by computer after the coordinates of each gene and compound have been specified in advance according to the metabolic network formed by the reactions.

As a preferred technical solution, the difference data are normalized by mapping them to the range 0-1 according to the following formula, gradient colors are set, and a colormap for drawing is established by matching data values to colors one to one:

$$X = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original data, $x_{\min}$ is the minimum of the data set, $x_{\max}$ is the maximum of the data set, and $X$ is the normalized data.

As a preferred technical solution, the adjacency matrix is constructed as follows:

obtaining all enzymatic reactions contained in the omics data through the API (application program interface) of the KEGG database, splitting each reaction at the "→" symbol to extract the reactants and products one by one, and establishing an adjacency matrix with the reactants as rows, the products as columns, and the normalized difference data as weights;

drawing the adjacency matrix by computer and connecting all related compounds as a graph, thereby building a metabolic network in which all related compounds are visualized.

As a preferred technical solution, the greedy algorithm proceeds as follows:

taking each difference point as a starting point, calculating in turn the shortest paths to the other difference points; if a shortest path contains no other difference point, marking the two difference points as connected and recording the connections of the non-difference points on the path; and, after all difference points have been processed, deleting the connections of all non-difference points.

As a preferred technical solution, the topology comprises the PageRank coefficients of the nodes, the eigenvector centrality, and the shortest path and maximum flux path between difference points;

PageRank reflects the importance of each node in the connected network structure; eigenvector centrality reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; and the shortest path and the maximum flux path reflect the simplest connection between two genes.

As a preferred technical solution, the PageRank coefficient is expressed as follows:

$$\mathrm{PageRank}(p_i) = \frac{1 - q}{N} + q \sum_{p_j \in M(p_i)} \frac{\mathrm{PageRank}(p_j)}{L(p_j)}$$

where $p_1, p_2, \ldots, p_N$ are the nodes, $N$ is the number of nodes, and $q$ is the damping factor; $M(p_i)$ is the set of nodes $p_j$ linked to $p_i$ in a reactant-product relation, and $L(p_j)$ is the number of products of $p_j$;

the eigenvector centrality is calculated as follows:

$$C_{EC}(i) = x_i = c \sum_{j=1}^{n} a_{ij} x_j$$

where $C_{EC}(i)$ is the eigenvector centrality of node $i$, which reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; the influence of a single node can be regarded as a linear combination of the influence of the other nodes, and the higher the centrality, the more important the node is in the network; $c$ is a proportionality constant; $n$ is the number of nodes; $a_{ij}$ is the adjacency-matrix entry for node $i$ and its adjacent node $j$, where $j$ starts from its initial value; and $x$ is the raw data.

As a preferred technical solution, the omics objects comprise the transcriptome and the proteome.

Through the API of the KEGG database, all genes corresponding to the transcriptomic and proteomic data are obtained and subjected to statistical difference analysis. A mapping between the difference coefficients and colors is defined, and the genes are visualized at whole-genome scale to show the differences in the omics data at the pathway level. Then, according to the enzymatic reactions corresponding to the genes, a metabolic network is built with the reactants as rows, the products as columns, and the normalized transcriptomic and proteomic data of the corresponding genes as weights. The network is simplified with a greedy algorithm, and finally the network topology is analyzed to achieve accurate mining of the omics data.

Drawings

FIG. 1 is a schematic flow chart of a method for mining omics data based on graph theory and a greedy algorithm according to an embodiment of the present invention;

FIG. 2 is a flowchart of genome-scale visualization of pathway difference distribution according to an embodiment of the present invention;

FIG. 3 is a flow chart of building a metabolic network according to an embodiment of the present invention;

FIG. 4 shows the difference distribution of the Clostridium acetobutylicum transcriptome, furfural-stressed versus unstressed, according to an embodiment of the present invention;

FIG. 5 shows the difference distribution of the Clostridium acetobutylicum proteome, furfural-stressed versus unstressed, according to an embodiment of the present invention;

FIG. 6 is a metabolic network automatically built from omics elementary reactions according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a greedy algorithm reduced metabolic pathway of an embodiment of the present invention;

FIG. 8 shows the KEGG enrichment results for the Clostridium acetobutylicum 24 h transcriptome according to an embodiment of the present invention;

FIG. 9 shows the GO enrichment results for the Clostridium acetobutylicum 24 h proteome according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIGS. 1-9, the method for mining omics data based on graph theory and a greedy algorithm according to the embodiment of the present invention includes the following steps:

S1, calculating the differences of omics objects using a statistical method, and visualizing the pathway difference distribution at genome scale;

wherein the omics objects can be the transcriptome and the proteome;

the genome-scale visualization of the pathway difference distribution can be realized by the following 4 steps:

S11, obtaining the genes related to all metabolic pathways through the API (application program interface) of the KEGG database; the corresponding gene numbers can be extracted from the returned data with the regular expression 'K[0-9]{5}';

S12, looking up the drawing information database by gene number to determine the position information of each gene;

S13, calculating the differences of the corresponding genes in the omics object (transcriptome and proteome) data by statistical difference analysis, and normalizing the difference data by mapping them to the range 0-1 according to the following formula:

$$X = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original data, $x_{\min}$ is the minimum of the data set, $x_{\max}$ is the maximum of the data set, and $X$ is the normalized data; gradient colors are then set according to the normalized data, and the drawing color of each gene is determined by matching data values to colors one to one;

as an alternative example, the statistical difference analysis in this step uses a significance test, i.e., a difference is considered statistically significant when the P value obtained from the test is below 0.05.

S14, using the correspondence between the colormap and the omics data to visualize the pathways on the genome-scale general metabolic network map.

In this step, the genome-scale general metabolic network map serves as the base map, the difference data are coupled with the colormap, and the pathway distribution and the differences of the omics-related metabolites are presented through changes in color, which completes the genome-scale visualization of the pathway difference distribution.
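The gene-number extraction mentioned in S11 can be illustrated with a minimal sketch. The response text below is a hypothetical excerpt written for this example (real KEGG API responses may be laid out differently), and the helper name `extract_gene_numbers` is an assumption, not part of the patent.

```python
import re

# Hypothetical excerpt of a KEGG-style text response (illustrative only).
sample_response = """\
path:map00010\tko:K00844
path:map00010\tko:K12407
path:map00020\tko:K01647
path:map00020\tko:K00844
"""

# The regular expression given in S11: the letter K followed by five digits.
GENE_NUMBER_PATTERN = re.compile(r"K[0-9]{5}")

def extract_gene_numbers(text):
    """Return the distinct gene numbers in text, in order of first appearance."""
    seen = {}
    for match in GENE_NUMBER_PATTERN.findall(text):
        seen.setdefault(match, None)  # dict keys keep insertion order
    return list(seen)

gene_numbers = extract_gene_numbers(sample_response)
```

Duplicates are dropped here because the same gene number can appear in several pathways, while the lookup in S12 only needs each number once.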

The genome-scale general metabolic network map is drawn by computer after the coordinates of each gene and compound have been specified in advance according to the metabolic network formed by the reactions.
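The normalization and color assignment of S13 can be sketched as follows. This is a minimal illustration only: the blue-to-red gradient, the gene names, and the difference values are assumptions chosen for the example, and the patent does not specify how the gradient colors are generated.

```python
def min_max_normalize(values):
    """Map values to the range 0-1 with X = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case (all values equal): an assumption, not from the source.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def gradient_color(t, start=(0, 0, 255), end=(255, 0, 0)):
    """Linearly interpolate between two RGB colors for t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))

# Hypothetical difference values for three genes.
differences = {"geneA": -2.0, "geneB": 0.0, "geneC": 2.0}

normalized = dict(zip(differences, min_max_normalize(list(differences.values()))))
colors = {gene: gradient_color(t) for gene, t in normalized.items()}
```

With this gradient, the gene with the smallest difference value is drawn in pure blue and the gene with the largest in pure red, so each normalized value corresponds to exactly one color.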

S2, converting the omics objects (transcriptome and proteome) into the corresponding elementary reactions, constructing an adjacency matrix of compounds with reactants as starting points and products as end points, and building a metabolic network.

As an alternative embodiment, the metabolic network can be built by the following steps:

S21, obtaining all enzymatic reactions contained in the omics data through the API of the KEGG database;

S22, splitting each reaction at the "→" symbol to extract the reactants and products one by one, and establishing an adjacency matrix with the reactants as rows, the products as columns, and the normalized difference data as weights;

S23, drawing the adjacency matrix by computer and connecting all related compounds as a graph to build the metabolic network.
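Steps S21-S22 can be sketched as below, under stated assumptions: the reaction strings and weights are hypothetical examples, compounds on each side are assumed to be separated by "+", stoichiometric coefficients are assumed absent, and a sparse dictionary of rows is used in place of a dense matrix. When the same reactant-product pair occurs in several reactions, this sketch simply keeps the last weight, which the patent does not specify.

```python
from collections import defaultdict

def parse_reaction(equation, arrow="→"):
    """Split a reaction string at the arrow into reactant and product lists."""
    left, right = equation.split(arrow)
    reactants = [c.strip() for c in left.split("+")]
    products = [c.strip() for c in right.split("+")]
    return reactants, products

def build_adjacency(reactions):
    """Build a weighted adjacency structure: rows are reactants, columns are products.

    reactions is an iterable of (equation, weight) pairs, where weight stands
    for the normalized difference value of the gene catalyzing the reaction.
    """
    adjacency = defaultdict(dict)
    for equation, weight in reactions:
        reactants, products = parse_reaction(equation)
        for reactant in reactants:
            for product in products:
                adjacency[reactant][product] = weight
    return {row: dict(cols) for row, cols in adjacency.items()}

# Hypothetical reactions with illustrative compound identifiers and weights.
reactions = [
    ("C00031 + C00002 → C00092 + C00008", 0.8),
    ("C00092 → C00085", 0.3),
]
adjacency = build_adjacency(reactions)
```

Every reactant of a reaction is linked to every product, so the resulting structure is a directed graph over compounds that can be passed directly to graph-drawing or path-finding routines.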

S3, simplifying the network with a greedy algorithm centered on the significant difference points to obtain the connectivity between the difference points. A gene with p < 0.05 is determined to be a differential gene, and the point representing that gene is a significant difference point.

As an alternative embodiment, the network simplification using the greedy algorithm can be implemented by the following steps:

taking each significant difference point as a starting point, calculating in turn the shortest paths to the other difference points; if a shortest path contains no other difference point, marking the two difference points as connected and recording the connections of the non-difference points on the path; and, after all difference points have been processed, deleting the connections of all non-difference points, thereby simplifying the metabolic network.

Preferably, the shortest paths in this step can be calculated with the Dijkstra or Floyd algorithm.
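The simplification step can be sketched with Dijkstra's algorithm as follows. This is one possible reading of the procedure, not the authors' implementation: edge weights are assumed to be non-negative and are used directly as path lengths, the graph is treated as directed, and the toy graph and node names are hypothetical.

```python
import heapq

def dijkstra(graph, source):
    """Return shortest distances and predecessor links from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    return dist, prev

def path_to(prev, source, target):
    """Rebuild the node sequence from source to target."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def simplify_network(graph, difference_points):
    """Keep only direct links between difference points.

    Two difference points are linked when the shortest path between them
    passes through no other difference point; all intermediate
    non-difference nodes are then dropped from the result.
    """
    diff = set(difference_points)
    reduced = {point: {} for point in difference_points}
    for source in difference_points:
        dist, prev = dijkstra(graph, source)
        for target in difference_points:
            if target == source or target not in dist:
                continue
            inner = path_to(prev, source, target)[1:-1]
            if not diff.intersection(inner):
                reduced[source][target] = dist[target]
    return reduced

# Toy network: A, B, C are difference points; x and y are not.
graph = {"A": {"x": 1.0}, "x": {"B": 1.0}, "B": {"y": 1.0}, "y": {"C": 1.0}, "C": {}}
reduced = simplify_network(graph, ["A", "B", "C"])
```

In the toy network, A is linked to B and B to C, but A is not linked to C because the shortest path from A to C passes through the difference point B.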

S4, analyzing the network topology.

Finally, the topology of the compound connectivity network is analyzed; optionally, the topology comprises the PageRank coefficients of the nodes, the eigenvector centrality, and the shortest path and maximum flux path between difference points.

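The PageRank coefficient is expressed as follows:

$$\mathrm{PageRank}(p_i) = \frac{1 - q}{N} + q \sum_{p_j \in M(p_i)} \frac{\mathrm{PageRank}(p_j)}{L(p_j)}$$

where $p_1, p_2, \ldots, p_N$ are the nodes, $N$ is the number of nodes, and $q$ is the damping factor, set to 0.85; $M(p_i)$ is the set of nodes $p_j$ linked to $p_i$ in a reactant-product relation, and $L(p_j)$ is the number of products of $p_j$.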
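A PageRank calculation with the damping factor q = 0.85 can be sketched by simple iteration. This is a minimal sketch under assumptions: the standard PageRank recurrence is used, nodes without products spread their rank evenly over all nodes (a common convention that the patent does not mention), and the three-node graph is hypothetical.

```python
def pagerank(graph, q=0.85, iterations=100):
    """Iterate PageRank on a directed graph given as {node: [product nodes]}."""
    nodes = sorted(set(graph) | {p for targets in graph.values() for p in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without products is shared evenly (assumption).
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - q) / n + q * dangling / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, ())
            for target in targets:
                # Each node passes its rank on in equal shares to its products.
                new_rank[target] += q * rank[node] / len(targets)
        rank = new_rank
    return rank

# Hypothetical network: A and B both yield C, and C yields A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this example C receives rank from two nodes and ends up with the highest coefficient, while B, which is produced by no other node, keeps only the baseline share (1 - q) / N.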

The eigenvector centrality is calculated as follows:

$$C_{EC}(i) = x_i = c \sum_{j=1}^{n} a_{ij} x_j$$

where $C_{EC}(i)$ is the eigenvector centrality of node $i$, which reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; the influence of a single node can be regarded as a linear combination of the influence of the other nodes, and the higher the centrality, the more important the node is in the network; $c$ is a proportionality constant; $n$ is the number of nodes; $a_{ij}$ is the adjacency-matrix entry for node $i$ and its adjacent node $j$, where $j$ starts from its initial value; and $x$ is the raw data.
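Eigenvector centrality can be approximated by power iteration, sketched below under assumptions: the adjacency matrix is taken to be symmetric and connected, the iteration is applied to A + I (which has the same eigenvectors as A) to avoid oscillation on bipartite graphs, and the result is scaled to unit Euclidean length. The star-shaped example network is hypothetical.

```python
def eigenvector_centrality(adjacency, iterations=200):
    """Approximate the leading eigenvector of a square adjacency matrix.

    Each pass computes x_i + sum_j a_ij * x_j, i.e. one step with A + I,
    then rescales the vector to unit length.
    """
    n = len(adjacency)
    x = [1.0] * n
    for _ in range(iterations):
        new_x = [
            x[i] + sum(adjacency[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(v * v for v in new_x) ** 0.5
        x = [v / norm for v in new_x]
    return x

# Hypothetical star network: node 0 is linked to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(star)
```

The central node obtains the highest score because every other node is adjacent to it, which matches the idea that a node's importance depends on the number and the importance of its neighbors.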

PageRank shows the importance of each node in the connected network structure; eigenvector centrality reflects that the importance of a node depends on both the number and the importance of its adjacent nodes; and the shortest path and the maximum flux path show the simplest connection between two genes.

Taking gene A as an example, the analysis of the network topology in step S4 shows how many substances react with it in the metabolic pathway and whether it is a key, pivotal gene in the whole metabolic network. FIG. 5 is an analytical illustration. The GO analysis in FIGS. 6 and 7 shows the functions of these genes.

Taking Clostridium acetobutylicum ATCC824 as an example, a highly tolerant strain, Tust-001, which can tolerate 4 g/L furfural on a solid plate, was first obtained through gradient exposure to furfural. To clarify the furfural tolerance mechanism, 4 g/L furfural was added to the fermentation broth, transcriptome and proteome sequencing of Tust-001 and the wild-type strain ATCC824 was carried out at 24 h on an Illumina sequencing platform, and the metabolic differences between the two strains were compared. The differentially expressed genes (DEGs) of the two strains were analyzed statistically with DESeq2, and differential genes were determined by |log2FoldChange| ≥ log2(1.5) and p < 0.05. The statistics show that, compared with ATCC824, Tust-001 has 876 up-regulated genes and 963 down-regulated genes, of which 576 are significantly different. In the proteomic data, a total of 4461 proteins were identified, of which 405 were differentially expressed in Tust-001 compared with ATCC824, with 185 up-regulated and 220 down-regulated differential proteins. A total of 67 genes were significantly different in both the transcriptome and the proteome. According to the KEGG pathway enrichment and GO analysis results: 40 genes are involved in membrane synthesis; 11 genes are involved in metal ion transport; 3 genes are associated with active transport; 5 genes are involved in glycoside hydrolysis; 4 genes are involved in DNA mismatch repair; 2 genes are involved in mannose metabolism; and 2 genes are involved in carbamoyl synthesis.
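The selection criterion stated above, |log2FoldChange| ≥ log2(1.5) and p < 0.05, can be written as a small filter. The gene names and values below are hypothetical and serve only to show how the two thresholds combine; they are not taken from the experiment.

```python
import math

FOLD_CHANGE_CUTOFF = math.log2(1.5)  # about 0.585
P_VALUE_CUTOFF = 0.05

def is_differential(log2_fold_change, p_value):
    """Apply both thresholds: effect size and statistical significance."""
    return abs(log2_fold_change) >= FOLD_CHANGE_CUTOFF and p_value < P_VALUE_CUTOFF

# Hypothetical records: gene -> (log2FoldChange, p value).
records = {
    "gene1": (1.2, 0.001),   # large change, significant
    "gene2": (0.3, 0.001),   # significant but change too small
    "gene3": (-0.9, 0.2),    # large change but not significant
    "gene4": (-0.7, 0.04),   # down-regulated, passes both thresholds
}
selected = [gene for gene, (lfc, p) in records.items() if is_differential(lfc, p)]
```

Both up-regulated and down-regulated genes pass the filter, because the fold-change threshold is applied to the absolute value.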

This analysis shows which node has the largest difference and which surrounding nodes it interacts with, making it possible to obtain specific information about a given point (gene) and its function.

The top fifteen genes by PageRank coefficient are: CA_RS09525, CA_RS10735, CA_RS13785, CA_RS13790, CA_RS05520, CA_RS09260, CA_RS11005, CA_RS11032, CA_RS11075, CA_RS1180, CA_RS11602, CA_RS11805, CA_RS08085, CA_RS08190, and CA_RS08193.

Functional analysis of these genes shows that the molecular mechanism by which Clostridium acetobutylicum tolerates furfural is mainly concentrated in the DNA repair system, the central carbon metabolic pathway, and the glutathione metabolic pathway, and that CA_RS09525, CA_RS10735, CA_RS13785, and CA_RS13790 are the main differences in the metabolic network map.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the scope of protection of the present invention.
