Gene regulatory network reconstruction method based on a cross-platform causal network structure

Document No.: 1143089  Publication date: 2020-09-11

Abstract: This technology, a gene regulatory network reconstruction method based on a cross-platform causal network structure, was designed by 李弘�, 张金喜 and 曾晓南 on 2020-05-12. The invention discloses a gene regulatory network reconstruction method based on a cross-platform causal network structure, comprising: establishing a discrete platform node on the basis of a continuous causal network structure to obtain a cross-platform network structure skeleton; learning the cross-platform network structure skeleton with a learning algorithm and connecting each variable to the nodes in the set of variables directly connected to it, to obtain an undirected graph; determining, in the undirected graph, the v-structures present in the skeleton to obtain a partially directed graph; and orienting as many of the remaining undirected edges of the partially directed graph as possible according to constraint rules, to obtain a maximally oriented directed graph. The invention treats the gene regulatory network as a causal graph and the gene sequencing platform as a special node on that graph; when reconstructing the cross-platform gene regulatory network, the platform variable is added to the set of regulating variables of every gene's expression, thereby eliminating the differences introduced by different gene sequencing platforms.

1. A gene regulatory network reconstruction method based on a cross-platform causal network structure, characterized by comprising the following steps:

establishing a discrete platform node on the basis of a continuous causal network structure to obtain a cross-platform network structure skeleton;

learning the cross-platform network structure skeleton based on a learning algorithm, and connecting each variable to the nodes in the set of variables directly connected to it, to obtain an undirected graph;

determining, in the undirected graph, the v-structures present in the cross-platform network structure skeleton to obtain a partially directed graph;

and orienting as many of the remaining undirected edges in the partially directed graph as possible according to constraint rules, to obtain a maximally oriented directed graph.

2. The gene regulatory network reconstruction method based on a cross-platform causal network structure as claimed in claim 1, wherein learning the cross-platform network structure skeleton based on a learning algorithm and connecting each variable to the nodes in the set of variables directly connected to it to obtain an undirected graph specifically comprises:

according to the d-separation principle, when a variable node f_i in the parent-child node set PC(x) of a target node x is conditionally independent of the target node x given a variable set S, determining that there is no direct edge between the variable node f_i and the target node x, and excluding the variable node f_i from PC(x).

3. The gene regulatory network reconstruction method based on a cross-platform causal network structure of claim 2, wherein the variable nodes are determined as follows: by means of an algorithm with three stages, the variables in the variable set V = {v_1, v_2, …, v_n} are taken as the target node one by one until the parent-child node set PC(x) of every variable has been obtained.

4. The method for gene regulatory network reconstruction based on cross-platform causal network structure of claim 3, wherein said three stages comprise a growth stage, a pruning stage and a refining stage.

5. The gene regulatory network reconstruction method based on a cross-platform causal network structure of claim 3, wherein the algorithm for determining variable nodes is the Parents_and_Children algorithm.

6. The method of claim 1, further comprising: providing a mixed-type conditional independence test for checking conditional independence among cross-platform data, which specifically comprises:

testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables as the condition set;

testing the conditional independence between a continuous variable v_i and the platform variable p, given a set of continuous variables (Figure FDA0002488233830000022) as the condition set;

testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables together with p as the condition set.

7. The gene regulatory network reconstruction method based on a cross-platform causal network structure of claim 6, wherein testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables (Figure FDA0002488233830000024) as the condition set, comprises:

taking Z as the given set of conditioning variables, fitting by least squares the linear regression of v_i on Z and the linear regression of v_j on Z, and computing the residuals of each; computing the partial correlation coefficient as the simple correlation coefficient of the two residuals, and applying the Fisher z-transform; setting the null hypothesis H_0: ρ_{ij·Z} = 0 and, at significance level α, rejecting H_0 if the following inequality holds:

$$\sqrt{N - |Z| - 3}\,\bigl|z(\hat{\rho}_{ij\cdot Z})\bigr| > \Phi^{-1}(1 - \alpha/2)$$

where Φ(·) is the cumulative distribution function of the standard normal distribution, N is the sample size, and |Z| is the number of given conditioning variables.

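8. The gene regulatory network reconstruction method based on a cross-platform causal network structure as claimed in claim 7, wherein the Fisher z-transform is performed according to the formula:

$$z(\hat{\rho}_{ij\cdot Z}) = \frac{1}{2}\ln\frac{1 + \hat{\rho}_{ij\cdot Z}}{1 - \hat{\rho}_{ij\cdot Z}}$$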

9. The method of claim 6, wherein testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables together with p as the condition set, comprises:

for two continuous variables v_i and v_j and a given condition set {v_K, p}, computing the partial correlation coefficient under each platform according to the value of the platform variable, to obtain L partial correlation coefficients corresponding to the L platforms; transforming the L partial correlation coefficients with the Fisher z-transform; proposing the null hypothesis H_0 that ρ is zero overall; if H_0 is accepted, v_i and v_j are considered conditionally independent given the condition set {v_K, p}; at significance level α, H_0 is rejected if the following inequality holds:

Figure FDA0002488233830000032

where

Figure FDA0002488233830000033

10. The method of claim 9, wherein the partial correlation coefficient is

Figure FDA0002488233830000034

Technical Field

The invention relates to the field of gene regulation networks, in particular to a gene regulation network reconstruction method based on a cross-platform causal network structure.

Background

Since the post-genome era began in 2001, biological research has turned toward functional genomics. In terms of genome function, the expression of one gene may be regulated by one or more other genes or molecules. Identifying regulatory relationships through traditional biological experiments is very expensive, so discovering regulatory relationships among genes computationally, from large amounts of gene expression data using reverse engineering and related methods, has become a hot topic in gene regulatory network research. However, because sequencing platforms differ in technique and equipment, gene expression data from different platforms are not directly comparable. Gene expression data from a single sequencing platform also suffer from a "high-dimensional, small-sample" imbalance, and to overcome it many recent studies have attempted to reconstruct gene regulatory networks using gene expression data from multiple platforms.

One common approach is to integrate the data from multiple platforms first and then reconstruct the network: data conversion methods that apply some stretching or compression rule are used to merge gene expression data that have batch differences and are not directly comparable into a single, directly comparable gene expression matrix. Another approach is to reconstruct the gene regulatory network for each platform separately and then combine the per-platform results statistically. However, most of these network reconstruction methods were designed for gene expression data from a single platform, and, because of the differences introduced by different gene sequencing platforms, the conditional independence tests used in causal network algorithms cannot handle discrete and continuous variables at the same time.

Disclosure of Invention

The invention provides a gene regulatory network reconstruction method based on a cross-platform causal network structure. The gene regulatory network is treated as a causal graph and the gene sequencing platform as a special node on that graph; when reconstructing the cross-platform gene regulatory network, the platform variable is added to the set of regulating variables of every gene's expression so as to eliminate the differences introduced by different gene sequencing platforms.

In order to solve the above technical problems, an embodiment of the present invention provides a gene regulatory network reconstruction method based on a cross-platform causal network structure, comprising:

establishing a discrete platform node on the basis of a continuous causal network structure to obtain a cross-platform network structure skeleton;

learning the cross-platform network structure skeleton based on a learning algorithm, and connecting each variable to the nodes in the set of variables directly connected to it, to obtain an undirected graph;

determining, in the undirected graph, the v-structures present in the cross-platform network structure skeleton to obtain a partially directed graph;

and orienting as many of the remaining undirected edges in the partially directed graph as possible according to constraint rules, to obtain a maximally oriented directed graph.

Preferably, learning the cross-platform network structure skeleton based on a learning algorithm and connecting each variable to the nodes in the set of variables directly connected to it to obtain an undirected graph specifically comprises:

according to the d-separation principle, when a variable node f_i in the parent-child node set PC(x) of a target node x is conditionally independent of the target node x given a variable set S, determining that there is no direct edge between the variable node f_i and the target node x, and excluding the variable node f_i from PC(x).

Preferably, the variable nodes are determined as follows: by means of an algorithm with three stages, the variables in the variable set V = {v_1, v_2, …, v_n} are taken as the target node one by one until the parent-child node set PC(x) of every variable has been obtained.

Preferably, the three stages include a growth stage, a pruning stage, and a refining stage.

Preferably, the algorithm for determining the variable nodes is the Parents_and_Children algorithm.

Preferably, the gene regulatory network reconstruction method based on a cross-platform causal network structure further comprises: providing a mixed-type conditional independence test for checking conditional independence among cross-platform data, which specifically comprises:

testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables as the condition set;

testing the conditional independence between a continuous variable v_i and the platform variable p, given a set of continuous variables (Figure BDA0002488233840000032) as the condition set;

testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables together with p as the condition set.

Preferably, testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables (Figure BDA0002488233840000034) as the condition set, comprises:

taking Z as the given set of conditioning variables, fitting by least squares the linear regression of v_i on Z and the linear regression of v_j on Z, and computing the residuals of each; computing the partial correlation coefficient as the simple correlation coefficient of the two residuals, and applying the Fisher z-transform; setting the null hypothesis H_0: ρ_{ij·Z} = 0 and, at significance level α, rejecting H_0 if the following inequality holds:

$$\sqrt{N - |Z| - 3}\,\bigl|z(\hat{\rho}_{ij\cdot Z})\bigr| > \Phi^{-1}(1 - \alpha/2)$$

where Φ(·) is the cumulative distribution function of the standard normal distribution, N is the sample size, and |Z| is the number of given conditioning variables.

Preferably, the Fisher z-transform is performed according to the formula:

$$z(\hat{\rho}_{ij\cdot Z}) = \frac{1}{2}\ln\frac{1 + \hat{\rho}_{ij\cdot Z}}{1 - \hat{\rho}_{ij\cdot Z}}$$

Preferably, testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables (Figure BDA0002488233840000037) together with p as the condition set, comprises:

for two continuous variables v_i and v_j and a given condition set {v_K, p}, computing the partial correlation coefficient under each platform according to the value of the platform variable, to obtain L partial correlation coefficients corresponding to the L platforms; transforming the L partial correlation coefficients with the Fisher z-transform; proposing the null hypothesis H_0 that ρ is zero overall; if H_0 is accepted, v_i and v_j are considered conditionally independent given the condition set {v_K, p}; at significance level α, H_0 is rejected if the test inequality holds,

where the critical value is the inverse of the cumulative distribution function of a normal distribution with mean 0 and mean square error L.

Preferably, after the L partial correlation coefficients are transformed, z(i, j | k) = {z_1(i, j | k), z_2(i, j | k), …, z_L(i, j | k)} is obtained.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

1. The gene regulatory network is treated as a causal graph and the gene sequencing platform as a special node on that graph; when reconstructing the cross-platform gene regulatory network, the platform variable is added to the set of regulating variables of every gene's expression, so that the differences introduced by different gene sequencing platforms are eliminated.

2. A cross-platform causal structure learning method and a mixed-type conditional independence test are realized.

Drawings

FIG. 1: the three basic connection structures among the variables of a continuous causal network;

FIG. 2: schematic diagram of v_1 and v_2 being d-separated by Z in an embodiment of the invention;

FIG. 3: schematic diagram of a cross-platform causal network in an embodiment of the invention;

FIG. 4: schematic diagram of the cross-platform causal network skeleton in an embodiment of the invention;

FIG. 5: the partially directed graph in an embodiment of the invention;

FIG. 6: the maximally oriented graph in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIGS. 1 to 6, a preferred embodiment of the present invention provides a gene regulatory network reconstruction method based on a cross-platform causal network structure, comprising:

S1: establishing a discrete platform node on the basis of a continuous causal network structure to obtain a cross-platform network structure skeleton.

Specifically, a continuous causal network structure is one in which the data samples of all variable nodes are continuously distributed. Suppose variables v_1 and v_2 have no direct causal relationship and are linked through a third variable v_3 as an intermediate variable; there are then three basic connection cases: a serial connection, a diverging connection and a converging connection. If the paths between v_1 and v_2 are blocked by a node set Z, then once the values of the variables in Z are fixed, a change in the value of v_1 (or v_2) cannot influence v_2 (or v_1), and v_1 and v_2 are said to be d-separated by Z; that is, v_1 and v_2 are conditionally independent given Z.

Specifically, a discrete platform node that influences all variables is introduced. There is an edge between the platform node and every other variable, directed from the platform node to the variable node: the platform variable is the cause variable and the variable node is the effect variable. Directed edges between variables have the same meaning as in a continuous causal network: a causal relationship v_i → v_j between variables v_i and v_j means that v_i is the cause variable and v_j is the effect variable.
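The construction in S1 can be illustrated with a minimal sketch. The function name `add_platform_node`, the node label `"p"` and the representation of edges as `(cause, effect)` tuples are assumptions made for illustration and are not taken from the patent.

```python
def add_platform_node(variables, edges, platform="p"):
    """Return (nodes, edges) with a platform node pointing at every variable.

    `edges` holds directed gene-to-gene edges as (cause, effect) tuples.
    The platform node is only ever a cause, never an effect.
    """
    if platform in variables:
        raise ValueError(f"{platform!r} is already used as a variable name")
    platform_edges = {(platform, v) for v in variables}
    return list(variables) + [platform], set(edges) | platform_edges


# Hypothetical four-gene example: v1 -> v3 <- v2, v3 -> v4.
genes = ["v1", "v2", "v3", "v4"]
gene_edges = {("v1", "v3"), ("v2", "v3"), ("v3", "v4")}
nodes, all_edges = add_platform_node(genes, gene_edges)
```

Keeping the platform as an ordinary node, rather than correcting the data beforehand, is what later allows it to appear in condition sets.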

S2: learning the cross-platform network structure skeleton based on a learning algorithm, and connecting each variable to the nodes in the set of variables directly connected to it, to obtain an undirected graph.

Specifically, skeleton learning for the causal network uses d-separation and conditional independence tests to find, for each variable, the set of variables directly connected to it (its parent-child node set), and then connects these nodes to obtain an undirected graph. In detail: according to the d-separation principle, if a variable node f_i in the parent-child node set PC(x) of x is conditionally independent of the target node x given a variable set S, then there is no direct edge between f_i and x, and f_i should be excluded from PC(x).

Specifically, the parent and child nodes of a variable are found with the Parents_and_Children algorithm, which takes the variables in the variable set V = {v_1, v_2, ..., v_n} as the target node one by one until the parent-child node set PC(x) of every variable has been obtained. It comprises three stages:

Growth stage: the variables v_i in the candidate node set C(x) are tested one by one for conditional independence from the target node x. If v_i is not conditionally independent of x given any subset S of the current PC(x), v_i is added to the parent-child node set PC(x) of x and deleted from the candidate node set C(x).

Pruning stage: if, with the newly added v_i as the condition set, a variable node v' previously added to PC(x) becomes conditionally independent of the target node x, v' is removed from PC(x). Each variable node remaining in the candidate node set C(x) is likewise tested with v_i as the condition; if some v' in C(x) is conditionally independent of the target node x given the newly added variable node v_i, v' is removed from C(x).

The growth and pruning stages are repeated until every variable in the candidate node set C(x) has been deleted or the number of variables in PC(x) reaches a given upper limit.

Refining stage: for each variable node v_j in PC(x), if there exists a set S such that v_j is conditionally independent of the target node x given S, v_j is deleted from PC(x).
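The three stages can be sketched as follows. This is an illustrative reading of the description, not the patented implementation: the conditional independence test is passed in as a hypothetical callable `indep(v, x, S)`, the refining stage is assumed to search subsets of PC(x) without v_j, and a candidate that is found to be separated in the growth stage is assumed to be discarded so that the loop terminates.

```python
from itertools import combinations


def subsets(items, max_size=None):
    """Yield every subset of `items` (as frozensets), smallest first."""
    items = list(items)
    top = len(items) if max_size is None else min(max_size, len(items))
    for k in range(top + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def parents_and_children(x, variables, indep, max_pc=None):
    """Estimate PC(x) given a test indep(v, x, S) -> bool."""
    candidates = [v for v in variables if v != x]  # C(x)
    pc = []                                        # PC(x)
    while candidates and (max_pc is None or len(pc) < max_pc):
        v = candidates.pop(0)
        # Growth: keep v only if no subset of the current PC(x) separates it from x.
        if any(indep(v, x, s) for s in subsets(pc)):
            continue  # assumption: a separated candidate is dropped from C(x)
        pc.append(v)
        # Pruning: condition on the newly added v alone.
        cond = frozenset([v])
        pc = [u for u in pc if u == v or not indep(u, x, cond)]
        candidates = [c for c in candidates if not indep(c, x, cond)]
    # Refining: drop any member that some subset of the others separates from x.
    for v in list(pc):
        others = [u for u in pc if u != v]
        if any(indep(v, x, s) for s in subsets(others)):
            pc.remove(v)
    return set(pc)


def toy_oracle(v, x, s):
    """Independence facts for the toy graph a -> x -> b -> c, with d isolated."""
    if v == "d":
        return True        # d is unrelated to x
    if v == "c":
        return "b" in s    # c is separated from x once b is known
    return False           # a and b are neighbours of x


pc_x = parents_and_children("x", ["a", "b", "c", "d", "x"], toy_oracle)
```

The subset search is exponential in the size of PC(x), which is one reason the description caps the number of variables in PC(x).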

S3: determining, in the undirected graph, the v-structures present in the cross-platform network structure skeleton to obtain a partially directed graph. Specifically, a v-structure is a converging connection, and the direction of its edges can be determined by conditional independence tests.

S4: orienting as many of the remaining undirected edges in the partially directed graph as possible according to constraint rules, to obtain a maximally oriented directed graph. Specifically, following constraint rules such as creating no additional v-structures and creating no cycles, the remaining undirected edges are oriented one after another until no further edge can be oriented, giving the maximally oriented causal network structure graph; edges whose direction cannot be determined from the constraints are kept as undirected edges in the network graph.

The technical solution of the present invention will be described in detail with reference to the following specific examples.

FIG. 1 shows the three basic connection cases among the variables of a continuous causal network: the serial connection, the diverging connection and the converging connection.

In a specific embodiment, the serial connection is shown in FIG. 1(a). If the value of v_3 is unknown, information obtained about v_1 changes the belief about v_3 and, in turn, the prediction of v_2; information can therefore flow between v_1 and v_2, and the two are correlated. If the value of v_3 is known, information obtained about v_1 no longer affects v_3 and hence has no effect on v_2: v_1 and v_2 cannot communicate through v_3, i.e. the information channel is blocked. Thus v_1 and v_2 are mutually independent given v_3.

The diverging connection is shown in FIG. 1(b). When the information of v_3 is unknown, information can be passed between v_1 and v_2, so v_1 and v_2 are correlated; when the information of v_3 is known, the channel between v_1 and v_2 is blocked, so v_1 and v_2 are mutually independent given v_3.

The converging connection is shown in FIG. 1(c). When v_3 is unknown, v_1 and v_2 are mutually independent; when v_3 is known, v_1 and v_2 become correlated.

FIG. 2 illustrates v_1 and v_2 being d-separated by Z, i.e. v_1 and v_2 being conditionally independent given Z.

In a specific embodiment, let Z be a set of nodes that contains neither node v_1 nor node v_2, and let α be a path between v_1 and v_2. The path α between v_1 and v_2 is said to be d-separated by Z when either of the following conditions is satisfied:

(1) α contains a serial-connection node or a diverging-connection node that is in Z, as shown in FIG. 2(a) and FIG. 2(b);

(2) α contains a converging node v_3, and Z contains neither v_3 nor any of its descendant nodes, as shown in FIG. 2(c).
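The two blocking conditions can be checked mechanically for a single path. The sketch below is illustrative; the function names and the representation of the graph as a set of `(cause, effect)` tuples are assumptions, not part of the patent.

```python
def descendants(node, edges):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        u = stack.pop()
        for cause, effect in edges:
            if cause == u and effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen


def path_blocked(path, edges, z):
    """True if the node sequence `path` is d-separated (blocked) by the set z."""
    z = set(z)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        converging = (prev, mid) in edges and (nxt, mid) in edges
        if converging:
            # Condition (2): neither the converging node nor a descendant is in z.
            if mid not in z and not (descendants(mid, edges) & z):
                return True
        elif mid in z:
            # Condition (1): a serial or diverging node on the path is in z.
            return True
    return False


serial = {("v1", "v3"), ("v3", "v2")}
diverging = {("v3", "v1"), ("v3", "v2")}
converging = {("v1", "v3"), ("v2", "v3"), ("v3", "v4")}
path = ["v1", "v3", "v2"]
```

For example, `path_blocked(path, converging, {"v4"})` is `False`, because conditioning on a descendant of the converging node opens the path.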

FIG. 3 is a cross-platform causal network with 4 variable nodes, into which a discrete platform node that influences all variables has been introduced. Every variable node is influenced by the platform variable p; variable v_3 is the effect variable jointly influenced by v_1 and v_2 and is also the cause variable of v_4.

FIG. 4 is the undirected graph corresponding to the causal network, found by learning the skeleton of the constructed cross-platform causal network.

In a specific embodiment, {v_1, v_3, v_5, v_6, p} is the parent-child node set of variable node x, written PC(x) = {v_1, v_3, v_5, v_6, p}. Two variables v_i and v_j are directly connected when no subset S d-separates v_i and v_j, in which case v_i ∈ PC(v_j) and v_j ∈ PC(v_i).
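Given the parent-child sets of all variables, the undirected skeleton follows from the symmetry condition above. The sketch below is an illustration only: the function name is hypothetical, and keeping an edge only when both memberships hold is an assumed way of handling sets that disagree because of finite-sample test errors.

```python
def skeleton_from_pc(pc):
    """Build undirected edges from a mapping {variable: PC(variable)}.

    An edge v_i - v_j is kept only when v_i is in PC(v_j) AND v_j is in PC(v_i).
    """
    edges = set()
    for vi, neighbours in pc.items():
        for vj in neighbours:
            if vi in pc.get(vj, ()):
                edges.add(frozenset((vi, vj)))
    return edges


# Hypothetical PC sets; "v2" lists x, but x does not list "v2".
pc_sets = {
    "x": {"v1", "p"},
    "v1": {"x", "p"},
    "v2": {"x", "p"},
    "p": {"x", "v1", "v2"},
}
skeleton = skeleton_from_pc(pc_sets)
```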

FIG. 5 is the partially directed graph obtained by determining the v-structures present in the network skeleton.

In a specific embodiment, given variable nodes v_1, v_2 and v_3, if there exists a variable node set S satisfying the following conditions: v_1 and v_3 are conditionally independent given S, and v_1 and v_3 are not conditionally independent given {S, v_2}, then v_1, v_2 and v_3 are determined to form a v-structure, and the undirected edges v_1 - v_2 - v_3 among the three variables are oriented as v_1 → v_2 ← v_3.
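The rule can be sketched as a scan over triples whose end nodes are not adjacent. This is a hedged illustration: `adj` (neighbour sets of the skeleton), `sepsets` (the set S recorded for each non-adjacent pair) and the test callable `indep(a, c, S)` are hypothetical inputs, not names from the patent.

```python
from itertools import combinations


def orient_v_structures(adj, sepsets, indep):
    """Return directed edges (cause, effect) for every detected v-structure."""
    directed = set()
    for mid in adj:
        for a, c in combinations(sorted(adj[mid]), 2):
            if c in adj[a]:
                continue  # a and c are adjacent, so this is not a v-structure
            s = sepsets.get(frozenset((a, c)))
            if s is None:
                continue  # no separating set was recorded for this pair
            # Independent given S, but dependent once the middle node is added.
            if indep(a, c, s) and not indep(a, c, s | {mid}):
                directed.add((a, mid))
                directed.add((c, mid))
    return directed


adj = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"}}


def converging_oracle(a, c, s):
    """v1 -> v2 <- v3: the ends are independent unless v2 is conditioned on."""
    return "v2" not in s


def serial_oracle(a, c, s):
    """v1 -> v2 -> v3: the ends are independent only when v2 is conditioned on."""
    return "v2" in s


pair = frozenset(("v1", "v3"))
arrows = orient_v_structures(adj, {pair: frozenset()}, converging_oracle)
```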

FIG. 6 is the maximally oriented partially directed graph obtained by orienting as many of the remaining undirected edges in the network graph as possible according to the constraint rules.

In a specific embodiment, for v_1 → x - v_5, the edge x - v_5 can be oriented under the constraint that no additional v-structure is created, giving v_1 → x → v_5; the edges v_3 - v_2 - v_4 remain in the causal network graph as undirected edges.
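The propagation step can be sketched with two rules: one that avoids creating a new v-structure and one that avoids creating a cycle. The description names these constraints only in general terms, so the exact rule set below, the function name and the data layout are assumptions for illustration.

```python
def orient_remaining(directed, undirected):
    """Orient undirected edges until neither rule applies any more.

    directed: iterable of (cause, effect); undirected: iterable of node pairs.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}
    skeleton = {frozenset(e) for e in directed} | undirected

    def adjacent(a, b):
        return frozenset((a, b)) in skeleton

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            u, v = tuple(edge)
            for b, c in ((u, v), (v, u)):
                # Rule 1: a -> b, b - c, a and c not adjacent  =>  b -> c
                # (c -> b would create a new v-structure a -> b <- c).
                no_new_v = any(
                    t == b and a != c and not adjacent(a, c) for a, t in directed
                )
                # Rule 2: b -> m -> c and b - c  =>  b -> c
                # (c -> b would close a directed cycle).
                no_cycle = any(s == b and (m, c) in directed for s, m in directed)
                if no_new_v or no_cycle:
                    undirected.discard(edge)
                    directed.add((b, c))
                    changed = True
                    break
    return directed, undirected


d, u = orient_remaining(
    [("v1", "x")],
    [("x", "v5"), ("v3", "v2"), ("v2", "v4")],
)
```

In this example the edge x - v_5 becomes x → v_5, while v_3 - v_2 and v_2 - v_4 stay undirected because no rule constrains them.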

Constructing the cross-platform gene regulatory network with the cross-platform causal discovery algorithm avoids the negative effect of over-smoothing during data preprocessing, which can mistakenly remove part of the biological information in the gene expression data, and so yields a more generally applicable gene regulatory network. The method adds a special platform node to the general causal network model, uses the edges between the platform node and the variables to represent the platform's influence on each variable, and includes the platform variable in the condition set when learning the causal relationships among variables, so as to eliminate the platform's differential influence on the variables. A cross-platform causal network structure learning method is also provided. The structure learning algorithm within the cross-platform causal discovery algorithm consists of three main steps: first, learning the network skeleton to find the undirected graph corresponding to the causal network; second, determining the v-structures present in the network skeleton, which yields a partially directed graph; and third, orienting as many of the remaining undirected edges in the network graph as possible according to the constraint rules, which yields a maximally oriented partially directed graph.

In another embodiment, the gene regulatory network reconstruction method based on a cross-platform causal network structure further comprises: S5, providing a mixed-type conditional independence test for checking conditional independence among cross-platform data, which specifically covers:

First case: testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables (Figure BDA0002488233840000081) as the condition set;

Second case: testing the conditional independence between a continuous variable v_i and the platform variable p, given a set of continuous variables (Figure BDA0002488233840000082) as the condition set;

Third case: testing the conditional independence between a continuous variable v_i and another continuous variable v_j, given a set of continuous variables (Figure BDA0002488233840000083) together with p as the condition set.

Specifically, in the first case, with Z as the given set of conditioning variables, the linear regression of v_i on Z and the linear regression of v_j on Z are fitted by least squares and the residuals of each are computed; the partial correlation coefficient is then computed as the simple correlation coefficient of the two residuals, and the Fisher z-transform is applied:

$$z(\hat{\rho}_{ij\cdot Z}) = \frac{1}{2}\ln\frac{1 + \hat{\rho}_{ij\cdot Z}}{1 - \hat{\rho}_{ij\cdot Z}}$$

The null hypothesis is H_0: ρ_{ij·Z} = 0. At significance level α, H_0 is rejected if the following inequality holds:

$$\sqrt{N - |Z| - 3}\,\bigl|z(\hat{\rho}_{ij\cdot Z})\bigr| > \Phi^{-1}(1 - \alpha/2)$$

where Φ(·) is the cumulative distribution function of the standard normal distribution, N is the sample size, and |Z| is the number of given conditioning variables.
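A minimal standard-library sketch of the first-case test follows. It assumes the usual Fisher z decision rule, rejecting H_0 when sqrt(N - |Z| - 3)·|z| exceeds the standard normal quantile at 1 - α/2; the function names and the small hand-made data set are illustrative and not taken from the patent.

```python
import math
from statistics import NormalDist


def _center(values):
    mean = sum(values) / len(values)
    return [a - mean for a in values]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residuals(y, cond):
    """Least-squares residuals of y regressed on the columns in cond (with intercept)."""
    basis = []
    for col in cond:
        c = _center(col)
        for b in basis:  # Gram-Schmidt: remove what earlier columns already explain
            k = _dot(c, b) / _dot(b, b)
            c = [ci - k * bi for ci, bi in zip(c, b)]
        if _dot(c, c) > 1e-12:  # skip numerically redundant columns
            basis.append(c)
    r = _center(y)
    for b in basis:
        k = _dot(r, b) / _dot(b, b)
        r = [ri - k * bi for ri, bi in zip(r, b)]
    return r


def partial_corr(vi, vj, cond):
    """Simple correlation of the two residual vectors."""
    ri, rj = residuals(vi, cond), residuals(vj, cond)
    return _dot(ri, rj) / math.sqrt(_dot(ri, ri) * _dot(rj, rj))


def is_independent(vi, vj, cond, alpha=0.05):
    """True if H_0 (zero partial correlation) is NOT rejected at level alpha."""
    rho = max(-0.999999, min(0.999999, partial_corr(vi, vj, cond)))
    z = math.atanh(rho)  # Fisher z-transform: 0.5 * ln((1 + rho) / (1 - rho))
    stat = math.sqrt(len(vi) - len(cond) - 3) * abs(z)
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)


# Hand-made data: v1 and v2 share only the common driver v3.
v3 = [1, 1, 1, 1, -1, -1, -1, -1]
v1 = [3, 3, 1, 1, -1, -1, -3, -3]
v2 = [4, 2, 4, 2, -2, -4, -2, -4]
marginal = is_independent(v1, v2, [])
given_v3 = is_independent(v1, v2, [v3])
```

Here `cond` is a list of data columns, so `len(cond)` plays the role of |Z| and `len(vi)` the role of N.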

Specifically, in the second case, the variable v_i and the platform variable p are taken by default to be correlated, and therefore not conditionally independent.

Specifically, in the third case, for two continuous variables v_i and v_j and a given condition set {v_K, p}, the partial correlation coefficient is computed under each platform according to the value of the platform variable, giving L partial correlation coefficients for the L platforms. The L partial correlation coefficients are transformed with the Fisher z-transform to obtain z(i, j | k) = {z_1(i, j | k), z_2(i, j | k), …, z_L(i, j | k)}.

The null hypothesis H_0 is that ρ is zero overall. If H_0 is accepted, v_i and v_j are considered conditionally independent given the condition set {v_K, p}. At significance level α, H_0 is rejected if the following inequality holds:

Figure BDA0002488233840000092

where the critical value is the inverse of the cumulative distribution function of a normal distribution with mean 0 and mean square error L.
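Because the patent's inequality for this case survives only as an image, the sketch below is an assumption rather than a reproduction of it. It standardizes each platform's Fisher z value by sqrt(N_l - |K| - 3), sums the L standardized values, and compares the absolute sum with the 1 - α/2 quantile of a normal distribution with mean 0 and variance L, which is the distribution of a sum of L independent standard normal variables. Reading "mean square error L" as variance L is part of that assumption, and all names are hypothetical.

```python
import math
from statistics import NormalDist


def pooled_independence(rhos, sizes, n_cond, alpha=0.05):
    """True if the pooled null hypothesis (zero partial correlation) is not rejected.

    rhos:   per-platform partial correlation coefficients (length L)
    sizes:  per-platform sample sizes N_l
    n_cond: number of continuous conditioning variables |K|
    """
    if not rhos or len(rhos) != len(sizes):
        raise ValueError("need one sample size per partial correlation")
    total = 0.0
    for rho, n in zip(rhos, sizes):
        dof = n - n_cond - 3
        if dof <= 0:
            raise ValueError("platform sample too small for this condition set")
        rho = max(-0.999999, min(0.999999, rho))
        total += math.sqrt(dof) * math.atanh(rho)  # standardized Fisher z
    n_platforms = len(rhos)
    # ASSUMPTION: under H_0 the sum is normal with mean 0 and variance L.
    critical = NormalDist(0.0, math.sqrt(n_platforms)).inv_cdf(1 - alpha / 2)
    return abs(total) <= critical


weak = pooled_independence([0.02, -0.01], [50, 60], n_cond=1)
strong = pooled_independence([0.5, 0.5], [50, 60], n_cond=1)
```

Computing the coefficient within each platform keeps the discrete platform variable in the condition set without ever regressing on it.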

The invention provides a mixed-type conditional independence test for judging conditional independence among cross-platform data variables. It builds on the conditional independence test based on partial correlation coefficients and includes the discrete platform variable in the condition set when judging conditional independence between variables.

The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the invention, may occur to those skilled in the art and are intended to be included within the scope of the invention.
