Gene regulation and control network construction method based on structure prediction

文档序号:1615532 发布日期:2020-01-10 浏览:17次 中文

阅读说明:本技术 一种基于结构预测的基因调控网络构建方法 (Gene regulation and control network construction method based on structure prediction ) 是由 王之琼 郭上慧 曲路渲 信俊昌 钱唯 于 2019-09-17 设计创作,主要内容包括:本发明提供一种基于结构预测的基因调控网络构建方法。包括以下内容:首先计算系数矩阵,通过计算基因之间的Pearson系数、互信息及最大互信息来确定基因之间的相关性,作为筛选潜在父节点集的依据;然后进行结构预测,将获得的基因之间的系数矩阵作为判定基因潜在父节点集的依据,为每个基因选取潜在父节点集;最后进行基因调控网络的结构学习和参数学习。本发明通过基于Person系数、互信息及最大互信息相结合的方法预测基因的潜在父节点集,缩小结构学习的搜索范围,在一定程度上减少了基因调控网络的构建时间,提升计算性能,可以更加快速、准确地构建大规模基因调控网络,进一步了解生物的基因调控机制。(The invention provides a gene regulation and control network construction method based on structure prediction. The method comprises the following steps: firstly, calculating a coefficient matrix, and determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes to serve as a basis for screening a potential father node set; then, carrying out structure prediction, taking the obtained coefficient matrix between the genes as a basis for judging a potential father node set of the genes, and selecting the potential father node set for each gene; and finally, performing structure learning and parameter learning of the gene regulation network. The method predicts the potential father node set of the gene by a method based on the combination of the Person coefficient, the mutual information and the maximum mutual information, reduces the search range of structure learning, reduces the construction time of the gene regulation and control network to a certain extent, improves the calculation performance, can construct the large-scale gene regulation and control network more quickly and accurately, and further knows the gene regulation and control mechanism of the organism.)

1. A gene regulation network construction method based on structure prediction is characterized by comprising the following steps:

step 1: calculating a coefficient matrix, and determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes to serve as a basis for screening a potential father node set;

step 2: structure prediction, namely selecting a potential father node set for each gene by using a coefficient matrix between the genes obtained in the step 1 as a basis for judging the potential father node set of the genes;

and step 3: structure learning, adopting a structure learning method based on score search, selecting BDe scores by a score function, setting the number of regulation sets of each gene as K, and performing structure learningIn gene xiWith a set of potential parent nodes PiTraversing sets of possible combinations of genes in the search space for a search space, computing BDe scores score for each of the sets, and judging the sets as x genes according to the scoresiThe quality of the parent node set;

and 4, step 4: local network merging, each gene corresponding to a local network GiWill [ G ]1,……,Gn]Merge into global network G [ x [ [ x ]1,G1],[x2,G2],…,[xn,Gn]]Let gene y ∈ GiThen gene xiThe existence of a regulatory relationship with gene y is represented as y → xi

And 5: and parameter learning, namely performing parameter learning on each regulation and control relation in the global network G, wherein the parameters comprise a regulation and control effect and a regulation and control probability, the regulation and control effect is expressed as excitation or inhibition, and the regulation and control probability is expressed as the posterior probability of a regulation and control gene and a target gene.

2. The method as claimed in claim 1, wherein the step 1 of calculating a coefficient matrix, determining the correlation between genes by calculating Pearson coefficients, mutual information and maximum mutual information between genes, as a basis for screening a potential parent node set, comprises the following steps:

step 1.1: expressing the gene expression data as a matrix X (m × n), wherein m represents the number of samples of the gene expression data, n represents the total number of genes, each row of the matrix X (m × n) is an expression data vector of one gene, and the expression data vector specifically expressed as the ith gene is expressed as XiDefining a matrix of 3 coefficients MPearson(n x n) calculation of Pearson coefficients for deposit between genes, definition Mmi(n x n) for storing the calculation results of mutual information, defining MMIC(n x n) for storing the calculation result of the maximum mutual information;

step 1.2: the Pearson coefficient, mutual information and maximum mutual information all have symmetrical attributes, so the calculation is carried outWhen the 3 coefficient matrixes are used, only the upper triangular matrix needs to be calculated, and the gene x is calculated by using the formula (1)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnStoring the calculated Pearson coefficient to MPearsonIn (n x n), M is specificallyPearson[i,i+1],MPearson[i,i+2],……,MPearson[i,n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically MPearson[i+1,i],MPearson[i+2,i],……,MPearson[n,i]Finally obtaining MPearsonThe whole matrix expression of (2);

in the formula, r (X)iY) represents a gene xiExpression data vector XiAnd the Pearson coefficient of the expression data vector Y for gene Y, which belongs to { x ∈ [)i+1,xi+2,…,xn},XipRepresents gene xiExpression data vector XiP sample expression value of (2), YjpThe p-th sample expression value of the expression data vector Y representing the gene Y,represents gene xiExpression data vector XiThe average value of the samples of (a),

Figure FDA0002204067630000023

step 1.3: calculation of Gene x Using equation (2)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnThe mutual information value obtained by calculation is storedPut to MmiIn (n x n), M is specificallymi[i,i+1],Mmi[i,i+2],……,Mmi[i,n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically Mmi[i+1,i],Mmi[i+2,i],……,Mmi[n,i]Finally obtaining MmiIs given by the overall matrix expression of (a),

Figure FDA0002204067630000024

in the formula, I (X)iY) represents a gene xiExpression data vector XiAnd the mutual information value of the expression data vector Y of gene Y, gene Y is belonged to { xi+1,xi+2,…,xn},|C(Xi) I represents the gene xiExpression data vector XiThe value of the determinant of the covariance matrix of (a), | C (Y) | represents the value of the determinant of the covariance matrix of the expression data vector Y of the gene Y, | C (X)iY) represents a vector XiAnd a determinant of a covariance matrix of vector Y;

step 1.4: calculation of Gene x Using equation (3)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnThe maximum mutual information value obtained by calculation is stored in MMICIn (n x n), i.e. MMIC[i,i+1],MMIC[i,i+2],……,MMIC[i,n]And copying the maximum mutual information value obtained by calculation to a corresponding lower triangle, specifically MMIC[i+1,i],MMIC[i+2,i],……,MMIC[n,i]Finally obtaining MMICIs given by the overall matrix expression of (a),

Figure FDA0002204067630000025

in the formula, mic (X)iY) represents a gene xiExpression data vector XiAnd the maximum mutual information value of the expression data vector Y of gene Y,gene y belongs to { x ∈ }i+1,xi+2,…,xn},I(XiAnd Y) represents the gene xiExpression data vector XiAnd a mutual information value of an expression data vector Y of the gene Y, a represents a size of gridding in the X-axis direction of the coordinate axis, B represents a size of gridding in the Y-axis direction of the coordinate axis, and a value of a parameter B is B ═ m0.6Wherein m represents the number of samples of gene expression data.

3. The method for constructing a gene regulatory network based on structural prediction as claimed in claim 1, wherein the structural prediction of step 2 is performed by selecting a potential parent node set for each gene by using a coefficient matrix between genes obtained in step 1 as a basis for determining the potential parent node set of the gene, and the specific steps are as follows:

step 2.1: defining the proportion of the potential father node set as lambda, then, the number num selected by the potential father node set is lambda x n, n is the total number of genes, and the value of lambda is determined through experiments;

step 2.2: reading MPearsonMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes with the highest absolute value corresponding to the first num Pearson coefficient values are selected to be put into PpearsonPerforming the following steps;

step 2.3: reading MmiMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes corresponding to the top num mutual information values with the highest absolute values are selected to be put into PmiPerforming the following steps;

step 2.4: reading MMICMiddle gene xiAnd removing said gene xiThe other genes are sorted according to the Pearson coefficient result, and the genes corresponding to the first num maximum mutual information values with the highest absolute values are selected to be put into PMICPerforming the following steps;

step 2.5: to Ppearson、PmiAnd PMICObtaining a set P by mergingiAnd combining said set PiAs gene xiA set of potential parent nodes.

4. The method as claimed in claim 1, wherein the step 3 of structure learning adopts a structure learning method based on score search, the score function selects BDe scores, the number of the regulation sets of each gene is set to K, and the target gene X is targetediWith a set of potential parent nodes PiTraversing sets of possible genes in the search space for searching the space, calculating BDe scores for each set, and judging the sets as the genes x according to the scoresiThe quality of the parent node set is specifically expressed as:

step 3.1: initializing a small top pile S, defining the size of the small top pile S as K, and calculating a gene x by using a formula (4)iBDe score when the parent node set of (1) is empty0And will be

Figure FDA0002204067630000031

wherein D ═ D1,D2,…,DnDenotes a gene { x }1,x2,…,xn"G denotes the sample set in x1,x2,…,xnIs a Bayesian network of nodes, Γ represents a Gamma function, qiIs the ith gene xiThe number of value combinations in the expression data vector of the parent node, riIs gene xiExpression data vector XiNumber of values in, mijkRepresents gene xiThe parent node of (2) takes the value of j, xiNumber of samples in k value, αijkParameters representing functions subject to Dirichlet distribution,

Figure FDA0002204067630000041

step 3.2: with potential parent node set PiSearching a space, arranging and combining genes in the search space, traversing the search space, calculating H sets according to a formula (5), and calculating BDe score for each set in the H sets by using a formula (4)xAnd sequentially compared with the BDE score in the elements of the top of the small top heap if scorex>score, then will scorexAnd scorexStoring the corresponding set into the small top heap S, if the small top heap S is full, deleting the heap top element, and then, storing the scorexAnd scorexStoring the corresponding set into a small top pile S; if scorexNot more than score, no logging operation is performed, x is a natural number and satisfies 0<x<H;

Figure FDA0002204067630000042

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene xiLocal network G ofiIn (1).

5. The method for constructing a gene regulatory network based on structural prediction according to claim 1, wherein the step 5 of parameter learning is to perform parameter learning on each regulatory relationship in the global network G, the parameters include a regulatory effect and a regulatory probability, the regulatory effect is represented as excitation or inhibition, the regulatory probability is represented as a posterior probability of a regulatory gene and a target gene, and the specific steps are as follows:

step 5.1: for the regulatory relationship y → xiGene y as a regulatory gene, gene xiAs the target gene, gene x was calculated using the formula (1)iAnd the Pearson coefficient of gene y, if said Pearson coefficient is greater than 0, defining gene xiThe regulation relation with the gene y is an incentive relation and is indicated by "+"; defining a gene x if said Pearson coefficient is less than 0iThe regulation relation with the gene y is a suppression relation and is represented by "-";

step 5.2: for the regulatory relationship y → xiCalculating the gene x using the formula (6)iAnd the posterior probability of gene y,

Figure FDA0002204067630000043

in the formula, P (Y | X)i) Is represented by the gene xiExpression data vector XiAnd the posterior probability, P (X), calculated from the expression data vector Y of the gene Yi) Represents gene xiExpression data vector XiP (Y) represents the probability distribution of the expression data vector Y for gene Y, P (X) relative to the joint conditional probability of the expression data vectors for all regulatory genesi| Y) represents the gene Y as the gene xiThe conditional probability of the parent node of (1);

the calculation formula of the joint conditional probability is as follows:

Figure FDA0002204067630000051

in the formula, pa (x)i) Represents gene xiParent node set of, P (X)i| Y) represents the gene Y as the gene xiThe parent node of (2) is the calculated conditional probability.

Technical Field

The invention relates to the field of medical informatics, in particular to a gene regulation and control network construction method based on structure prediction.

Background

Gene expression regulation, i.e., the process by which the expression of one gene is influenced by other genes, includes mainly regulation at the transcriptional and translational levels. Since the final form of gene expression is a protein, transcription to form mRNA is required as a template to ultimately produce the protein, gene regulation at the transcriptional level is critical. Transcription of one gene is affected by the expression product of another gene, acting as an excitation or inhibition, and the proteins produced by itself may affect other genes as well. This complex regulatory relationship ultimately constitutes a gene regulatory network. Understanding the gene regulation mechanism of organisms can understand the occurrence of various biological processes from the genetic point of view, reveal the adverse processes of organisms, and is important for finding the key principles and details of biological systems.

Some current gene regulation network models solve the problem of construction of a gene regulation network to a certain extent, but have some problems at the same time. For example, the Boolean model researches a gene regulation network in a qualitative angle, and is too coarse and simplified; the differential equation model describes the gene regulation and control network quantitatively and accurately through differential equations, but the problem of difficult optimization caused by excessive parameters is solved, and the calculated amount is huge. The Bayesian network model describes a gene regulation and control network through a probability model, and the regulation and control relation is represented by probability, but the calculation amount of the model is greatly increased along with the increase of the complexity of the network.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a gene regulation network construction method based on structure prediction. The construction of the global network can be decomposed into construction problems for the sub-networks, in the construction process of each sub-network, because a set possibly formed by all the remaining genes needs to be searched to obtain an optimal sub-network structure, the calculation time is in an exponential rising trend along with the increase of the number of the genes, a possible parent node set of each gene is predicted by a method based on the combination of a Pearson coefficient, mutual information and maximum mutual information, the optimal structure is searched in the parent node set, the search space is reduced to the greatest extent, and the calculation performance is improved.

In order to solve the technical problems, the invention provides a gene regulation and control network construction method based on structure prediction, which comprises the following steps:

step 1: calculating a coefficient matrix, and determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes to serve as a basis for screening a potential father node set;

step 2: structure prediction, namely selecting a potential father node set for each gene by using a coefficient matrix between the genes obtained in the step 1 as a basis for judging the potential father node set of the genes;

and step 3: structure learning, adopting a structure learning method based on score search, selecting BDe scores by a score function, setting the number of regulation sets of each gene as K, and aiming at the gene xiWith a set of potential parent nodes PiTraversing sets of possible combinations of genes in the search space for a search space, computing BDe scores score for each of the sets, and judging the sets as x genes according to the scoresiThe quality of the parent node set;

and 4, step 4: local network merging, each gene corresponding to a local network GiWill [ G ]1,……,Gn]Merge into global network G [ x [ [ x ]1,G1],[x2,G2],…,[xn,Gn]]Let gene y ∈ GiThen gene xiThe existence of a regulatory relationship with gene y is represented as y → xi

And 5: and parameter learning, namely performing parameter learning on each regulation and control relation in the global network G, wherein the parameters comprise a regulation and control effect and a regulation and control probability, the regulation and control effect is expressed as excitation or inhibition, and the regulation and control probability is expressed as the posterior probability of a regulation and control gene and a target gene.

The step 1 of calculating a coefficient matrix, determining the correlation among the genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes, and using the correlation as a basis for screening a potential father node set, specifically comprises the following steps:

step (ii) of1.1: expressing the gene expression data as a matrix X (m × n), wherein m represents the number of samples of the gene expression data, n represents the total number of genes, each row of the matrix X (m × n) is an expression data vector of one gene, and the expression data vector specifically expressed as the ith gene is expressed as XiDefining a matrix of 3 coefficients MPearson(n x n) calculation of Pearson coefficients for deposit between genes, definition Mmi(n x n) for storing the calculation results of mutual information, defining MMIC(n x n) for storing the calculation result of the maximum mutual information;

step 1.2: as the Pearson coefficient, the mutual information and the maximum mutual information all have symmetrical attributes, only the upper triangular matrix needs to be calculated when the 3 coefficient matrixes are calculated, and the gene x is calculated by using the formula (1)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnStoring the calculated Pearson coefficient to MPearsonIn (n x n), M is specificallyPearson[i,i+1],MPearson[i,i+2],……,MPearson[i,n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically MPearson[i+1,i],MPearson[i+2,i],……,MPearson[n,i]Finally obtaining MPearsonThe whole matrix expression of (2);

Figure BDA0002204067640000021

in the formula, r (X)iY) represents a gene xiExpression data vector XiAnd the Pearson coefficient of the expression data vector Y for gene Y, which belongs to { x ∈ [)i+1,xi+2,…,xn},XipRepresents gene xiExpression data vector XiP sample expression value of (2), YjpThe p-th sample expression value of the expression data vector Y representing the gene Y,

Figure BDA0002204067640000022

represents gene xiExpression data vector XiThe average value of the samples of (a),

Figure BDA0002204067640000031

a sample average value of an expression data vector Y representing a gene Y, and m represents the number of samples of gene expression data;

step 1.3: calculation of Gene x Using equation (2)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnThe mutual information value obtained by calculation is stored in MmiIn (n x n), M is specificallymi[i,i+1],Mmi[i,i+2],……,Mmi[i,n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically Mmi[i+1,i],Mmi[i+2,i],……,Mmi[n,i]Finally obtaining MmiIs given by the overall matrix expression of (a),

Figure BDA0002204067640000032

in the formula, I (X)iY) represents a gene xiExpression data vector XiAnd the mutual information value of the expression data vector Y of gene Y, gene Y is belonged to { xi+1,xi+2,…,xn},|C(Xi) I represents the gene xiExpression data vector XiThe value of the determinant of the covariance matrix of (a), | C (Y) | represents the value of the determinant of the covariance matrix of the expression data vector Y of the gene Y, | C (X)iY) represents a vector XiAnd a determinant of a covariance matrix of vector Y;

step 1.4: calculation of Gene x Using equation (3)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnThe maximum mutual information value obtained by calculation is stored in MMICIn (n x n), i.e. MMIC[i,i+1],MMIC[i,i+2],……,MMIC[i,n]And copying the maximum mutual information value obtained by calculation to a corresponding lower triangle, specifically MMIC[i+1,i],MMIC[i+2,i],……,MMIC[n,i]Finally obtaining MMICIs given by the overall matrix expression of (a),

Figure BDA0002204067640000033

in the formula, mic (X)iY) represents a gene xiExpression data vector XiAnd the maximum mutual information value of the expression data vector Y of gene Y, which belongs to { x ∈ [)i+1,xi+2,…,xn},I(XiAnd Y) represents the gene xiExpression data vector XiAnd a mutual information value of an expression data vector Y of the gene Y, a represents a size of gridding in the X-axis direction of the coordinate axis, B represents a size of gridding in the Y-axis direction of the coordinate axis, and a value of a parameter B is B ═ m0.6Wherein m represents the number of samples of gene expression data.

The step 2 of structure prediction is to select a potential father node set for each gene by using the coefficient matrix between the genes obtained in the step 1 as a basis for judging the potential father node set of the genes, and the specific steps are as follows:

step 2.1: defining the proportion of the potential father node set as lambda, then, the number num selected by the potential father node set is lambda x n, n is the total number of genes, and the value of lambda is determined through experiments;

step 2.2: reading MPearsonMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes with the highest absolute value corresponding to the first num Pearson coefficient values are selected to be put into PpearsonPerforming the following steps;

step 2.3: reading MmiMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes corresponding to the top num mutual information values with the highest absolute values are selected to be put into PmiPerforming the following steps;

step 2.4: reading MMICMiddle baseX is due toiAnd removing said gene xiThe other genes are sorted according to the Pearson coefficient result, and the genes corresponding to the first num maximum mutual information values with the highest absolute values are selected to be put into PMICPerforming the following steps;

step 2.5: to Ppearson、PmiAnd PMICObtaining a set P by mergingiAnd combining said set PiAs gene xiA set of potential parent nodes.

The step 3 of structure learning adopts a structure learning method based on score search, a score function selects BDe scores, the number of regulation sets of each gene is set to be K, and the target gene X is subjected to target gene XiWith a set of potential parent nodes PiTraversing sets of possible genes in the search space for searching the space, calculating BDe scores for each set, and judging the sets as the genes x according to the scoresiThe quality of the parent node set is specifically expressed as:

step 3.1: initializing a small top pile S, defining the size of the small top pile S as K, and calculating a gene x by using a formula (4)iBDe score when the parent node set of (1) is empty0And will be

Figure BDA0002204067640000041

Storing the small top stack S;

wherein D ═ D1,D2,…,DnDenotes a gene { x }1,x2,…,xn"G denotes the sample set in x1,x2,…,xnIs a Bayesian network of nodes, Γ represents a Gamma function, qiIs the ith gene xiThe number of value combinations in the expression data vector of the parent node, riIs gene xiExpression data vector XiNumber of values in, mijkRepresents gene xiThe parent node of (2) takes the value of j, xiNumber of samples in k value, αijkParameters representing functions subject to Dirichlet distribution,

Figure BDA0002204067640000043

step 3.2: with potential parent node set PiSearching a space, arranging and combining genes in the search space, traversing the search space, calculating H sets according to a formula (5), and calculating BDe score for each set in the H sets by using a formula (4)xAnd sequentially compared with the BDE score in the elements of the top of the small top heap if scorex>score, then will scorexAnd scorexStoring the corresponding set into the small top heap S, if the small top heap S is full, deleting the heap top element, and then, storing the scorexAnd scorexStoring the corresponding set into a small top pile S; if scorexNot more than score, no logging operation is performed, x is a natural number and satisfies 0<x<H;

Figure BDA0002204067640000044

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene xiLocal network G ofiIn (1).

The step 5 of learning parameters, wherein the parameters of each regulation and control relation in the global network G are learned, the parameters comprise a regulation and control effect and a regulation and control probability, the regulation and control effect is expressed as excitation or inhibition, and the regulation and control probability is expressed as a posterior probability of a regulation and control gene and a target gene, and the method specifically comprises the following steps:

step 5.1: for the regulatory relationship y → xiGene y as a regulatory gene, gene xiAs the target gene, gene x was calculated using the formula (1)iAnd the Pearson coefficient of gene y, if said Pearson coefficient is greater than 0, defining gene xiThe regulation relation with the gene y is an incentive relation and is indicated by "+"; if the Pearson coefficientLess than 0, define gene xiThe regulation relation with the gene y is a suppression relation and is represented by "-";

step 5.2: for the regulatory relationship y → xiCalculating the gene x using the formula (6)iAnd the posterior probability of gene y,

Figure BDA0002204067640000051

in the formula, P (Y | X)i) Is represented by the gene xiExpression data vector XiAnd the posterior probability, P (X), calculated from the expression data vector Y of the gene Yi) Represents gene xiExpression data vector XiP (Y) represents the probability distribution of the expression data vector Y for gene Y, P (X) relative to the joint conditional probability of the expression data vectors for all regulatory genesi| Y) represents the gene Y as the gene xiThe conditional probability of the parent node of (1);

the calculation formula of the joint conditional probability is as follows:

Figure BDA0002204067640000052

in the formula, pa (x)i) Represents gene xiParent node set of, P (X)i| Y) represents the gene Y as the gene xiThe parent node of (2) is the calculated conditional probability.

The invention has the beneficial effects that:

the invention relates to a gene regulation and control network construction method based on structure prediction, which predicts a potential father node set of a gene by a method of combining a Person coefficient, mutual information and maximum mutual information, reduces the search range of structure learning, reduces the construction time of the gene regulation and control network to a certain extent, improves the calculation performance, can construct a large-scale gene regulation and control network more quickly and accurately, and further knows the gene regulation and control mechanism of organisms.

Drawings

FIG. 1 is a flowchart of a method for constructing a gene regulatory network based on structure prediction according to an embodiment of the present invention.

FIG. 2 is a flow chart of a structure prediction phase in an embodiment of the present invention.

Fig. 3 is a result of a parent node proportion experiment in the embodiment of the present invention.

FIG. 4 is a flowchart of a learning process of a gene regulatory network structure according to an embodiment of the present invention.

FIG. 5 is a flowchart of a learning process of gene regulatory network parameters in an embodiment of the present invention.

FIG. 6 is a time comparison diagram of a method for constructing a gene regulatory network based on structure prediction according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples: a construction method of a gene regulation network based on structure prediction is disclosed, the flow of which is shown in figure 1, and the construction method comprises a calculation coefficient matrix, a structure prediction process, a structure learning process of the gene regulation network and a parameter learning process of the gene regulation network, and specifically comprises the following steps:

step 1: calculating a coefficient matrix, determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes, taking the correlation as a basis for screening potential father node sets, setting genes with strong correlation as possible father node sets of current genes, and selecting escherichia coli genes to construct a gene regulation network in the implementation example, wherein the specific steps are as follows:

step 1.1: expressing the gene expression data as a matrix X (m × n), wherein m represents the number of samples of the gene expression data, n represents the total number of genes, each row of the matrix X (m × n) is an expression data vector of one gene, and the expression data vector specifically expressed as the ith gene is expressed as XiDefining a matrix of 3 coefficients MPearson(n x n) calculation of Pearson coefficients for deposit between genes, definition Mmi(n x n) for storing the calculation results of mutual information, defining MMIC(n x n) for storing the calculation result of the maximum mutual information;

step 1.2: since the Pearson coefficient, mutual information and maximum mutual information all have symmetrical attributes, the 3 coefficients are calculatedWhen the matrix is used, only the upper triangular matrix is needed to be calculated, and the gene x is calculated by using the formula (1)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnStoring the calculated Pearson coefficient to MPearsonIn (n x n), M is specificallyPearson[i,i+1],MPearson[i,i+2],……,MPearson[i,n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically MPearson[i+1,i],MPearson[i+2,i],……,MPearson[n,i]Finally obtaining MPearsonThe Pearson coefficient is widely used to measure the degree of correlation between two variables, with values between-1 and 1,

Figure BDA0002204067640000061

in the formula, r (X)iY) represents a gene xiExpression data vector XiAnd the Pearson coefficient of the expression data vector Y for gene Y, which belongs to { x ∈ [)i+1,xi+2,…,xn},XipRepresents gene xiExpression data vector XiP sample expression value of (2), YjpThe p-th sample expression value of the expression data vector Y representing the gene Y,represents gene xiExpression data vector XiThe average value of the samples of (a),

Figure BDA0002204067640000071

a sample average value of an expression data vector Y representing a gene Y, and m represents the number of samples of gene expression data;

step 1.3: calculation of Gene x Using equation (2)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnTo each otherThe value of the information, storing the mutual information value obtained by calculation in MmiIn (n x n), M is specificallymi[i,i+1],Mmi[i,i+2],……,Mmi[i,n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically Mmi[i+1,i],Mmi[i+2,i],……,Mmi[n,i]Finally obtaining MmiThe mutual information is usually used for measuring the reliability between two variables X and Y, so that the correlation between two genes can be obtained by calculating gene expression data;

Figure BDA0002204067640000072

in the formula, I (X)iY) represents a gene xiExpression data vector XiAnd the mutual information value of the expression data vector Y of gene Y, gene Y is belonged to { xi+1,xi+2,…,xn},|C(Xi) I represents the gene xiExpression data vector XiThe value of the determinant of the covariance matrix of (a), | C (Y) | represents the value of the determinant of the covariance matrix of the expression data vector Y of the gene Y, | C (X)iY) represents a vector XiAnd a determinant of a covariance matrix of vector Y;

step 1.4: calculation of Gene x Using equation (3)iExpression data vector XiAnd gene xi+1,xi+2,…,xnExpression data vector Xi+1,Xi+2,…XnThe maximum mutual information value obtained by calculation is stored in MMICIn (n x n), i.e. MMIC[i,i+1],MMIC[i,i+2],……,MMIC[i,n]And copying the maximum mutual information value obtained by calculation to a corresponding lower triangle, specifically MMIC[i+1,i],MMIC[i+2,i],……,MMIC[n,i]Finally obtaining MMICThe maximum mutual Information belongs to the maximum Information-based Nonparametric Exploration of maximum Information for measuring the linear of two variables X and YOr the intensity of a non-linear or non-linear,

in the formula, mic (X)iY) represents a gene xiExpression data vector XiAnd the maximum mutual information value of the expression data vector Y of gene Y, which belongs to { x ∈ [)i+1,xi+2,…,xn},I(XiAnd Y) represents the gene xiExpression data vector XiAnd a mutual information value of an expression data vector Y of the gene Y, a and B respectively represent the size of gridding (essentially, grid distribution) in the X-axis and Y-axis directions of the coordinate axis, and the value of the parameter B is B-m0.6Wherein m represents the number of samples of gene expression data.

Step 2: and (2) structure prediction, namely selecting a potential father node set for each gene by using a coefficient matrix between the genes obtained in the step (1) as a basis for judging the potential father node sets of the genes, wherein the specific steps are as follows:

step 2.1: defining the proportion of the potential father node set as lambda, then, the number num selected by the potential father node set is lambda x n, n is the total number of genes, and the value of lambda is determined through experiments;

FIG. 3 shows the results of experiments on the ratio of father nodes in the embodiment of the present invention, the experiments employ the expression data of E.coli, the selected genes are 10, 20, 30, 40 and 50, the sample number is 1565, the ratios of father node sets are set to 0.1, 0.2, 0.3 and 0.4, respectively, the method is compared with the Bayesian network-based method, comparing the effects according to the final gene regulation network result, wherein the main indexes comprise TPR (true positive rate), FPR (false positive rate), PPV (precision rate), Sensitivity, Specificity, RECALL (RECALL rate), ACC (accuracy rate) and F-score, and the results can be obtained, when the father node ratio reaches 0.2, the method achieves the similar effect with the comparison method, when the ratio reaches 0.3 and 0.4, the effect presents a convergence state, but the time effect is reduced, so the father node ratio can be set to 0.2.

Step 2.2: reading MPearsonMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes with the highest absolute value corresponding to the first num Pearson coefficient values are selected to be put into PpearsonPerforming the following steps;

step 2.3: reading MmiMiddle gene xiAnd removing said gene xiThe Pearson coefficient results of other genes are sorted, and the genes corresponding to the top num mutual information values with the highest absolute values are selected to be put into PmiPerforming the following steps;

step 2.4: reading MMICMiddle gene xiAnd removing said gene xiThe other genes are sorted according to the Pearson coefficient result, and the genes corresponding to the first num maximum mutual information values with the highest absolute values are selected to be put into PMICPerforming the following steps;

step 2.5: to Ppearson、PmiAnd PMICObtaining a set P by mergingiAnd combining said set PiAs gene xiA set of potential parent nodes.

FIG. 2 is a flow chart of the structure prediction process for predicting gene x in the embodiment of the present inventioniA set of potential parent nodes of, wherein V1,V2,V3Wherein 1, … …, n are respectively the same as the gene xiThe value of the index value of the gene to be calculated is the corresponding coefficient value taken from the coefficient matrix.

And step 3: structure learning, adopting a structure learning method based on score search, selecting BDe (Bayesian Dirichlet) scores by a score function, BDe the score function (Bayesian Dirichlet), seeking a network structure with the maximum probability by using priori knowledge and data, setting the number of regulation and control sets of each gene as K, and regarding gene xiWith a set of potential parent nodes PiFor a search space, traverse sets (structures) of possible composition of genes within the search space, calculate BDe score for each of the sets, and judge the set as gene x based on the score leveliThe quality of the father node set, and the gene regulation network structure learning process flow chart are shown in fig. 4, and the specific steps are as follows:

step 3.1: initialization of oneSmall top pile S, defining the size of small top pile S as K, utilizing formula (4) to calculate gene xiBDe score when the parent node set of (1) is empty0And will be

Figure BDA0002204067640000081

Storing the small top stack S;

Figure BDA0002204067640000091

wherein D ═ D1,D2,…,DnDenotes a gene { x }1,x2,…,xn"G denotes the sample set in x1,x2,…,xnIs a Bayesian network of nodes, Γ represents a Gamma function, qiIs the ith gene xiThe number of value combinations in the expression data vector of the parent node, riIs gene xiExpression data vector XiNumber of values in, mijkRepresents gene xiThe parent node of (2) takes the value of j, xiNumber of samples in k value, αijkParameters representing functions subject to Dirichlet distribution,

Figure BDA0002204067640000092

step 3.2: with potential parent node set PiSearching a space, arranging and combining genes in the search space, traversing the search space, calculating H sets (structures) according to a formula (5), and calculating BDe score for each set in the H sets by using a formula (4)xAnd sequentially compared with the BDE score in the elements of the top of the small top heap if scorex>score, then will scorexAnd scorexStoring the corresponding set into the small top heap S, if the small top heap S is full, deleting the heap top element, and then, storing the scorexAnd scorexStoring the corresponding set into a small top pile S; if scorexNot more than score, no logging operation is performed, x is a natural number and satisfies 0<x<H;

Figure BDA0002204067640000093

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene xiLocal network G ofiIn (1).

And 4, step 4: local network merging, each gene corresponding to a local network GiWill [ G ]1,……,Gn]Merge into global network G [ x [ [ x ]1,G1],[x2,G2],…,[xn,Gn]]Let gene y ∈ GiThen gene xiThe existence of a regulatory relationship with gene y is represented as y → xi

And 5: parameter learning, wherein the parameters of each regulation and control relation in the global network G are learned, the parameters comprise regulation and control action and regulation and control probability, the regulation and control action is expressed as excitation or inhibition, the regulation and control probability expresses posterior probability of a regulation and control gene and a target gene, a flow chart of a gene regulation and control network parameter learning process is shown in FIG. 5, and the specific steps are as follows:

step 5.1: for the regulatory relationship y → xiGene y as a regulatory gene, gene xiAs the target gene, gene x was calculated using the formula (1)iAnd the Pearson coefficient of gene y, if said Pearson coefficient is greater than 0, defining gene xiThe regulation relation with the gene y is an incentive relation and is indicated by "+"; defining a gene x if said Pearson coefficient is less than 0iThe regulation relation with the gene y is a suppression relation and is represented by "-";

step 5.2: for the regulatory relationship y → xiCalculating the gene x using the formula (6)iAnd the posterior probability of gene y,

Figure BDA0002204067640000101

in the formula, P (Y | X)i) Is represented by the gene xiExpression data vector XiAnd the posterior probability, P (X), calculated from the expression data vector Y of the gene Yi) Represents gene xiExpression data vector XiP (Y) represents the probability distribution of the expression data vector Y for gene Y, P (X) relative to the joint conditional probability of the expression data vectors for all regulatory genesi| Y) represents the gene Y as the gene xiThe conditional probability of the parent node of (1);

the calculation formula of the joint conditional probability is as follows:

Figure BDA0002204067640000102

in the formula, pa (x)i) Represents gene xiParent node set of, P (X)i| Y) represents the gene Y as the gene xiThe parent node of (2) is the calculated conditional probability.

Fig. 6 is a time comparison graph of the gene regulatory network construction method based on the structure prediction and the gene regulatory network construction method based on the bayesian model in the embodiment of the present invention, and the time when the parent node ratios are respectively 0.1, 0.2, 0.3, and 0.4 is recorded by respectively using the gene expression data with the gene numbers of 10, 20, 30, 40, and 50, and the result shows that the method is superior to the conventional method in terms of calculation time.

18页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:一种基于HLA分型与结构的肿瘤新抗原的筛选方法

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!