Gene regulation and control network construction method based on structure prediction

文档序号：1615532 发布日期：2020-01-10 浏览：17次中文

阅读说明：本技术 一种基于结构预测的基因调控网络构建方法 (Gene regulation and control network construction method based on structure prediction ) 是由王之琼郭上慧曲路渲信俊昌钱唯于 2019-09-17 设计创作，主要内容包括：本发明提供一种基于结构预测的基因调控网络构建方法。包括以下内容：首先计算系数矩阵,通过计算基因之间的Pearson系数、互信息及最大互信息来确定基因之间的相关性,作为筛选潜在父节点集的依据；然后进行结构预测,将获得的基因之间的系数矩阵作为判定基因潜在父节点集的依据,为每个基因选取潜在父节点集；最后进行基因调控网络的结构学习和参数学习。本发明通过基于Person系数、互信息及最大互信息相结合的方法预测基因的潜在父节点集,缩小结构学习的搜索范围,在一定程度上减少了基因调控网络的构建时间,提升计算性能,可以更加快速、准确地构建大规模基因调控网络,进一步了解生物的基因调控机制。(The invention provides a gene regulation and control network construction method based on structure prediction. The method comprises the following steps: firstly, calculating a coefficient matrix, and determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes to serve as a basis for screening a potential father node set; then, carrying out structure prediction, taking the obtained coefficient matrix between the genes as a basis for judging a potential father node set of the genes, and selecting the potential father node set for each gene; and finally, performing structure learning and parameter learning of the gene regulation network. The method predicts the potential father node set of the gene by a method based on the combination of the Person coefficient, the mutual information and the maximum mutual information, reduces the search range of structure learning, reduces the construction time of the gene regulation and control network to a certain extent, improves the calculation performance, can construct the large-scale gene regulation and control network more quickly and accurately, and further knows the gene regulation and control mechanism of the organism.)

1. A gene regulation network construction method based on structure prediction is characterized by comprising the following steps:

step 1: calculating a coefficient matrix, and determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes to serve as a basis for screening a potential father node set;

step 2: structure prediction, namely selecting a potential father node set for each gene by using a coefficient matrix between the genes obtained in the step 1 as a basis for judging the potential father node set of the genes;

and step 3: structure learning, adopting a structure learning method based on score search, selecting BDe scores by a score function, setting the number of regulation sets of each gene as K, and performing structure learningIn gene x_iWith a set of potential parent nodes P_iTraversing sets of possible combinations of genes in the search space for a search space, computing BDe scores score for each of the sets, and judging the sets as x genes according to the scores_iThe quality of the parent node set;

and 4, step 4: local network merging, each gene corresponding to a local network G_iWill [ G ]₁，……，G_n]Merge into global network G [ x [ [ x ]₁,G₁],[x₂,G₂],…,[x_n,G_n]]Let gene y ∈ G_iThen gene x_iThe existence of a regulatory relationship with gene y is represented as y → x_i；

And 5: and parameter learning, namely performing parameter learning on each regulation and control relation in the global network G, wherein the parameters comprise a regulation and control effect and a regulation and control probability, the regulation and control effect is expressed as excitation or inhibition, and the regulation and control probability is expressed as the posterior probability of a regulation and control gene and a target gene.

2. The method as claimed in claim 1, wherein the step 1 of calculating a coefficient matrix, determining the correlation between genes by calculating Pearson coefficients, mutual information and maximum mutual information between genes, as a basis for screening a potential parent node set, comprises the following steps:

step 1.1: expressing the gene expression data as a matrix X (m × n), wherein m represents the number of samples of the gene expression data, n represents the total number of genes, each row of the matrix X (m × n) is an expression data vector of one gene, and the expression data vector specifically expressed as the ith gene is expressed as X_iDefining a matrix of 3 coefficients M_Pearson(n x n) calculation of Pearson coefficients for deposit between genes, definition M_mi(n x n) for storing the calculation results of mutual information, defining M_MIC(n x n) for storing the calculation result of the maximum mutual information;

step 1.2: the Pearson coefficient, mutual information and maximum mutual information all have symmetrical attributes, so the calculation is carried outWhen the 3 coefficient matrixes are used, only the upper triangular matrix needs to be calculated, and the gene x is calculated by using the formula (1)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nStoring the calculated Pearson coefficient to M_PearsonIn (n x n), M is specifically_Pearson[i，i+1]，M_Pearson[i，i+2]，……，M_Pearson[i，n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically M_Pearson[i+1，i]，M_Pearson[i+2，i]，……，M_Pearson[n，i]Finally obtaining M_PearsonThe whole matrix expression of (2);

in the formula, r (X)_iY) represents a gene x_iExpression data vector X_iAnd the Pearson coefficient of the expression data vector Y for gene Y, which belongs to { x ∈ [)_i+1,x_i+2,…,x_n}，X_ipRepresents gene x_iExpression data vector X_iP sample expression value of (2), Y_jpThe p-th sample expression value of the expression data vector Y representing the gene Y,represents gene x_iExpression data vector X_iThe average value of the samples of (a),

step 1.3: calculation of Gene x Using equation (2)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nThe mutual information value obtained by calculation is storedPut to M_miIn (n x n), M is specifically_mi[i，i+1]，M_mi[i，i+2]，……，M_mi[i，n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically M_mi[i+1，i]，M_mi[i+2，i]，……，M_mi[n，i]Finally obtaining M_miIs given by the overall matrix expression of (a),

in the formula, I (X)_iY) represents a gene x_iExpression data vector X_iAnd the mutual information value of the expression data vector Y of gene Y, gene Y is belonged to { x_i+1,x_i+2,…,x_n}，|C(X_i) I represents the gene x_iExpression data vector X_iThe value of the determinant of the covariance matrix of (a), | C (Y) | represents the value of the determinant of the covariance matrix of the expression data vector Y of the gene Y, | C (X)_iY) represents a vector X_iAnd a determinant of a covariance matrix of vector Y;

step 1.4: calculation of Gene x Using equation (3)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nThe maximum mutual information value obtained by calculation is stored in M_MICIn (n x n), i.e. M_MIC[i，i+1]，M_MIC[i，i+2]，……，M_MIC[i，n]And copying the maximum mutual information value obtained by calculation to a corresponding lower triangle, specifically M_MIC[i+1，i]，M_MIC[i+2，i]，……，M_MIC[n，i]Finally obtaining M_MICIs given by the overall matrix expression of (a),

in the formula, mic (X)_iY) represents a gene x_iExpression data vector X_iAnd the maximum mutual information value of the expression data vector Y of gene Y,gene y belongs to { x ∈ }_i+1,x_i+2,…,x_n}，I(X_iAnd Y) represents the gene x_iExpression data vector X_iAnd a mutual information value of an expression data vector Y of the gene Y, a represents a size of gridding in the X-axis direction of the coordinate axis, B represents a size of gridding in the Y-axis direction of the coordinate axis, and a value of a parameter B is B ═ m^0.6Wherein m represents the number of samples of gene expression data.

3. The method for constructing a gene regulatory network based on structural prediction as claimed in claim 1, wherein the structural prediction of step 2 is performed by selecting a potential parent node set for each gene by using a coefficient matrix between genes obtained in step 1 as a basis for determining the potential parent node set of the gene, and the specific steps are as follows:

step 2.1: defining the proportion of the potential father node set as lambda, then, the number num selected by the potential father node set is lambda x n, n is the total number of genes, and the value of lambda is determined through experiments;

step 2.2: reading M_PearsonMiddle gene x_iAnd removing said gene x_iThe Pearson coefficient results of other genes are sorted, and the genes with the highest absolute value corresponding to the first num Pearson coefficient values are selected to be put into P_pearsonPerforming the following steps;

step 2.3: reading M_miMiddle gene x_iAnd removing said gene x_iThe Pearson coefficient results of other genes are sorted, and the genes corresponding to the top num mutual information values with the highest absolute values are selected to be put into P_miPerforming the following steps;

step 2.4: reading M_MICMiddle gene x_iAnd removing said gene x_iThe other genes are sorted according to the Pearson coefficient result, and the genes corresponding to the first num maximum mutual information values with the highest absolute values are selected to be put into P_MICPerforming the following steps;

step 2.5: to P_pearson、P_miAnd P_MICObtaining a set P by merging_iAnd combining said set P_iAs gene x_iA set of potential parent nodes.

4. The method as claimed in claim 1, wherein the step 3 of structure learning adopts a structure learning method based on score search, the score function selects BDe scores, the number of the regulation sets of each gene is set to K, and the target gene X is targeted_iWith a set of potential parent nodes P_iTraversing sets of possible genes in the search space for searching the space, calculating BDe scores for each set, and judging the sets as the genes x according to the scores_iThe quality of the parent node set is specifically expressed as:

step 3.1: initializing a small top pile S, defining the size of the small top pile S as K, and calculating a gene x by using a formula (4)_iBDe score when the parent node set of (1) is empty₀And will be

wherein D ═ D₁,D₂,…,D_nDenotes a gene { x }₁,x₂,…,x_n"G denotes the sample set in x₁,x₂,…,x_nIs a Bayesian network of nodes, Γ represents a Gamma function, q_iIs the ith gene x_iThe number of value combinations in the expression data vector of the parent node, r_iIs gene x_iExpression data vector X_iNumber of values in, m_ijkRepresents gene x_iThe parent node of (2) takes the value of j, x_iNumber of samples in k value, α_ijkParameters representing functions subject to Dirichlet distribution,

step 3.2: with potential parent node set P_iSearching a space, arranging and combining genes in the search space, traversing the search space, calculating H sets according to a formula (5), and calculating BDe score for each set in the H sets by using a formula (4)_xAnd sequentially compared with the BDE score in the elements of the top of the small top heap if score_x>score, then will score_xAnd score_xStoring the corresponding set into the small top heap S, if the small top heap S is full, deleting the heap top element, and then, storing the score_xAnd score_xStoring the corresponding set into a small top pile S; if score_xNot more than score, no logging operation is performed, x is a natural number and satisfies 0<x<H；

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene x_iLocal network G of_iIn (1).

5. The method for constructing a gene regulatory network based on structural prediction according to claim 1, wherein the step 5 of parameter learning is to perform parameter learning on each regulatory relationship in the global network G, the parameters include a regulatory effect and a regulatory probability, the regulatory effect is represented as excitation or inhibition, the regulatory probability is represented as a posterior probability of a regulatory gene and a target gene, and the specific steps are as follows:

step 5.1: for the regulatory relationship y → x_iGene y as a regulatory gene, gene x_iAs the target gene, gene x was calculated using the formula (1)_iAnd the Pearson coefficient of gene y, if said Pearson coefficient is greater than 0, defining gene x_iThe regulation relation with the gene y is an incentive relation and is indicated by "+"; defining a gene x if said Pearson coefficient is less than 0_iThe regulation relation with the gene y is a suppression relation and is represented by "-";

step 5.2: for the regulatory relationship y → x_iCalculating the gene x using the formula (6)_iAnd the posterior probability of gene y,

in the formula, P (Y | X)_i) Is represented by the gene x_iExpression data vector X_iAnd the posterior probability, P (X), calculated from the expression data vector Y of the gene Y_i) Represents gene x_iExpression data vector X_iP (Y) represents the probability distribution of the expression data vector Y for gene Y, P (X) relative to the joint conditional probability of the expression data vectors for all regulatory genes_i| Y) represents the gene Y as the gene x_iThe conditional probability of the parent node of (1);

the calculation formula of the joint conditional probability is as follows:

in the formula, pa (x)_i) Represents gene x_iParent node set of, P (X)_i| Y) represents the gene Y as the gene x_iThe parent node of (2) is the calculated conditional probability.

Technical Field

The invention relates to the field of medical informatics, in particular to a gene regulation and control network construction method based on structure prediction.

Background

Gene expression regulation, i.e., the process by which the expression of one gene is influenced by other genes, includes mainly regulation at the transcriptional and translational levels. Since the final form of gene expression is a protein, transcription to form mRNA is required as a template to ultimately produce the protein, gene regulation at the transcriptional level is critical. Transcription of one gene is affected by the expression product of another gene, acting as an excitation or inhibition, and the proteins produced by itself may affect other genes as well. This complex regulatory relationship ultimately constitutes a gene regulatory network. Understanding the gene regulation mechanism of organisms can understand the occurrence of various biological processes from the genetic point of view, reveal the adverse processes of organisms, and is important for finding the key principles and details of biological systems.

Some current gene regulation network models solve the problem of construction of a gene regulation network to a certain extent, but have some problems at the same time. For example, the Boolean model researches a gene regulation network in a qualitative angle, and is too coarse and simplified; the differential equation model describes the gene regulation and control network quantitatively and accurately through differential equations, but the problem of difficult optimization caused by excessive parameters is solved, and the calculated amount is huge. The Bayesian network model describes a gene regulation and control network through a probability model, and the regulation and control relation is represented by probability, but the calculation amount of the model is greatly increased along with the increase of the complexity of the network.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a gene regulation network construction method based on structure prediction. The construction of the global network can be decomposed into construction problems for the sub-networks, in the construction process of each sub-network, because a set possibly formed by all the remaining genes needs to be searched to obtain an optimal sub-network structure, the calculation time is in an exponential rising trend along with the increase of the number of the genes, a possible parent node set of each gene is predicted by a method based on the combination of a Pearson coefficient, mutual information and maximum mutual information, the optimal structure is searched in the parent node set, the search space is reduced to the greatest extent, and the calculation performance is improved.

In order to solve the technical problems, the invention provides a gene regulation and control network construction method based on structure prediction, which comprises the following steps:

and step 3: structure learning, adopting a structure learning method based on score search, selecting BDe scores by a score function, setting the number of regulation sets of each gene as K, and aiming at the gene x_iWith a set of potential parent nodes P_iTraversing sets of possible combinations of genes in the search space for a search space, computing BDe scores score for each of the sets, and judging the sets as x genes according to the scores_iThe quality of the parent node set;

The step 1 of calculating a coefficient matrix, determining the correlation among the genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes, and using the correlation as a basis for screening a potential father node set, specifically comprises the following steps:

step (ii) of1.1: expressing the gene expression data as a matrix X (m × n), wherein m represents the number of samples of the gene expression data, n represents the total number of genes, each row of the matrix X (m × n) is an expression data vector of one gene, and the expression data vector specifically expressed as the ith gene is expressed as X_iDefining a matrix of 3 coefficients M_Pearson(n x n) calculation of Pearson coefficients for deposit between genes, definition M_mi(n x n) for storing the calculation results of mutual information, defining M_MIC(n x n) for storing the calculation result of the maximum mutual information;

step 1.2: as the Pearson coefficient, the mutual information and the maximum mutual information all have symmetrical attributes, only the upper triangular matrix needs to be calculated when the 3 coefficient matrixes are calculated, and the gene x is calculated by using the formula (1)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nStoring the calculated Pearson coefficient to M_PearsonIn (n x n), M is specifically_Pearson[i，i+1]，M_Pearson[i，i+2]，……，M_Pearson[i，n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically M_Pearson[i+1，i]，M_Pearson[i+2，i]，……，M_Pearson[n，i]Finally obtaining M_PearsonThe whole matrix expression of (2);

represents gene x_iExpression data vector X_iThe average value of the samples of (a),

a sample average value of an expression data vector Y representing a gene Y, and m represents the number of samples of gene expression data;

step 1.3: calculation of Gene x Using equation (2)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nThe mutual information value obtained by calculation is stored in M_miIn (n x n), M is specifically_mi[i，i+1]，M_mi[i，i+2]，……，M_mi[i，n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically M_mi[i+1，i]，M_mi[i+2，i]，……，M_mi[n，i]Finally obtaining M_miIs given by the overall matrix expression of (a),

in the formula, mic (X)_iY) represents a gene x_iExpression data vector X_iAnd the maximum mutual information value of the expression data vector Y of gene Y, which belongs to { x ∈ [)_i+1,x_i+2,…,x_n}，I(X_iAnd Y) represents the gene x_iExpression data vector X_iAnd a mutual information value of an expression data vector Y of the gene Y, a represents a size of gridding in the X-axis direction of the coordinate axis, B represents a size of gridding in the Y-axis direction of the coordinate axis, and a value of a parameter B is B ═ m^0.6Wherein m represents the number of samples of gene expression data.

The step 2 of structure prediction is to select a potential father node set for each gene by using the coefficient matrix between the genes obtained in the step 1 as a basis for judging the potential father node set of the genes, and the specific steps are as follows:

step 2.4: reading M_MICMiddle baseX is due to_iAnd removing said gene x_iThe other genes are sorted according to the Pearson coefficient result, and the genes corresponding to the first num maximum mutual information values with the highest absolute values are selected to be put into P_MICPerforming the following steps;

step 2.5: to P_pearson、P_miAnd P_MICObtaining a set P by merging_iAnd combining said set P_iAs gene x_iA set of potential parent nodes.

The step 3 of structure learning adopts a structure learning method based on score search, a score function selects BDe scores, the number of regulation sets of each gene is set to be K, and the target gene X is subjected to target gene X_iWith a set of potential parent nodes P_iTraversing sets of possible genes in the search space for searching the space, calculating BDe scores for each set, and judging the sets as the genes x according to the scores_iThe quality of the parent node set is specifically expressed as:

Storing the small top stack S;

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene x_iLocal network G of_iIn (1).

The step 5 of learning parameters, wherein the parameters of each regulation and control relation in the global network G are learned, the parameters comprise a regulation and control effect and a regulation and control probability, the regulation and control effect is expressed as excitation or inhibition, and the regulation and control probability is expressed as a posterior probability of a regulation and control gene and a target gene, and the method specifically comprises the following steps:

step 5.1: for the regulatory relationship y → x_iGene y as a regulatory gene, gene x_iAs the target gene, gene x was calculated using the formula (1)_iAnd the Pearson coefficient of gene y, if said Pearson coefficient is greater than 0, defining gene x_iThe regulation relation with the gene y is an incentive relation and is indicated by "+"; if the Pearson coefficientLess than 0, define gene x_iThe regulation relation with the gene y is a suppression relation and is represented by "-";

step 5.2: for the regulatory relationship y → x_iCalculating the gene x using the formula (6)_iAnd the posterior probability of gene y,

the calculation formula of the joint conditional probability is as follows:

in the formula, pa (x)_i) Represents gene x_iParent node set of, P (X)_i| Y) represents the gene Y as the gene x_iThe parent node of (2) is the calculated conditional probability.

The invention has the beneficial effects that:

the invention relates to a gene regulation and control network construction method based on structure prediction, which predicts a potential father node set of a gene by a method of combining a Person coefficient, mutual information and maximum mutual information, reduces the search range of structure learning, reduces the construction time of the gene regulation and control network to a certain extent, improves the calculation performance, can construct a large-scale gene regulation and control network more quickly and accurately, and further knows the gene regulation and control mechanism of organisms.

Drawings

FIG. 1 is a flowchart of a method for constructing a gene regulatory network based on structure prediction according to an embodiment of the present invention.

FIG. 2 is a flow chart of a structure prediction phase in an embodiment of the present invention.

Fig. 3 is a result of a parent node proportion experiment in the embodiment of the present invention.

FIG. 4 is a flowchart of a learning process of a gene regulatory network structure according to an embodiment of the present invention.

FIG. 5 is a flowchart of a learning process of gene regulatory network parameters in an embodiment of the present invention.

FIG. 6 is a time comparison diagram of a method for constructing a gene regulatory network based on structure prediction according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples: a construction method of a gene regulation network based on structure prediction is disclosed, the flow of which is shown in figure 1, and the construction method comprises a calculation coefficient matrix, a structure prediction process, a structure learning process of the gene regulation network and a parameter learning process of the gene regulation network, and specifically comprises the following steps:

step 1: calculating a coefficient matrix, determining the correlation among genes by calculating Pearson coefficients, mutual information and maximum mutual information among the genes, taking the correlation as a basis for screening potential father node sets, setting genes with strong correlation as possible father node sets of current genes, and selecting escherichia coli genes to construct a gene regulation network in the implementation example, wherein the specific steps are as follows:

step 1.2: since the Pearson coefficient, mutual information and maximum mutual information all have symmetrical attributes, the 3 coefficients are calculatedWhen the matrix is used, only the upper triangular matrix is needed to be calculated, and the gene x is calculated by using the formula (1)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nStoring the calculated Pearson coefficient to M_PearsonIn (n x n), M is specifically_Pearson[i，i+1]，M_Pearson[i，i+2]，……，M_Pearson[i，n]And copying the calculated Pearson coefficient to a corresponding lower triangle, specifically M_Pearson[i+1，i]，M_Pearson[i+2，i]，……，M_Pearson[n，i]Finally obtaining M_PearsonThe Pearson coefficient is widely used to measure the degree of correlation between two variables, with values between-1 and 1,

a sample average value of an expression data vector Y representing a gene Y, and m represents the number of samples of gene expression data;

step 1.3: calculation of Gene x Using equation (2)_iExpression data vector X_iAnd gene x_i+1,x_i+2,…,x_nExpression data vector X_i+1,X_i+2,…X_nTo each otherThe value of the information, storing the mutual information value obtained by calculation in M_miIn (n x n), M is specifically_mi[i，i+1]，M_mi[i，i+2]，……，M_mi[i，n]At the same time, the mutual information value obtained by calculation is copied to the corresponding lower triangle, specifically M_mi[i+1，i]，M_mi[i+2，i]，……，M_mi[n，i]Finally obtaining M_miThe mutual information is usually used for measuring the reliability between two variables X and Y, so that the correlation between two genes can be obtained by calculating gene expression data;

in the formula, mic (X)_iY) represents a gene x_iExpression data vector X_iAnd the maximum mutual information value of the expression data vector Y of gene Y, which belongs to { x ∈ [)_i+1,x_i+2,…,x_n}，I(X_iAnd Y) represents the gene x_iExpression data vector X_iAnd a mutual information value of an expression data vector Y of the gene Y, a and B respectively represent the size of gridding (essentially, grid distribution) in the X-axis and Y-axis directions of the coordinate axis, and the value of the parameter B is B-m^0.6Wherein m represents the number of samples of gene expression data.

Step 2: and (2) structure prediction, namely selecting a potential father node set for each gene by using a coefficient matrix between the genes obtained in the step (1) as a basis for judging the potential father node sets of the genes, wherein the specific steps are as follows:

FIG. 3 shows the results of experiments on the ratio of father nodes in the embodiment of the present invention, the experiments employ the expression data of E.coli, the selected genes are 10, 20, 30, 40 and 50, the sample number is 1565, the ratios of father node sets are set to 0.1, 0.2, 0.3 and 0.4, respectively, the method is compared with the Bayesian network-based method, comparing the effects according to the final gene regulation network result, wherein the main indexes comprise TPR (true positive rate), FPR (false positive rate), PPV (precision rate), Sensitivity, Specificity, RECALL (RECALL rate), ACC (accuracy rate) and F-score, and the results can be obtained, when the father node ratio reaches 0.2, the method achieves the similar effect with the comparison method, when the ratio reaches 0.3 and 0.4, the effect presents a convergence state, but the time effect is reduced, so the father node ratio can be set to 0.2.

Step 2.2: reading M_PearsonMiddle gene x_iAnd removing said gene x_iThe Pearson coefficient results of other genes are sorted, and the genes with the highest absolute value corresponding to the first num Pearson coefficient values are selected to be put into P_pearsonPerforming the following steps;

step 2.5: to P_pearson、P_miAnd P_MICObtaining a set P by merging_iAnd combining said set P_iAs gene x_iA set of potential parent nodes.

FIG. 2 is a flow chart of the structure prediction process for predicting gene x in the embodiment of the present invention_iA set of potential parent nodes of, wherein V₁，V₂，V₃Wherein 1, … …, n are respectively the same as the gene x_iThe value of the index value of the gene to be calculated is the corresponding coefficient value taken from the coefficient matrix.

And step 3: structure learning, adopting a structure learning method based on score search, selecting BDe (Bayesian Dirichlet) scores by a score function, BDe the score function (Bayesian Dirichlet), seeking a network structure with the maximum probability by using priori knowledge and data, setting the number of regulation and control sets of each gene as K, and regarding gene x_iWith a set of potential parent nodes P_iFor a search space, traverse sets (structures) of possible composition of genes within the search space, calculate BDe score for each of the sets, and judge the set as gene x based on the score level_iThe quality of the father node set, and the gene regulation network structure learning process flow chart are shown in fig. 4, and the specific steps are as follows:

step 3.1: initialization of oneSmall top pile S, defining the size of small top pile S as K, utilizing formula (4) to calculate gene x_iBDe score when the parent node set of (1) is empty₀And will be

Storing the small top stack S;

step 3.2: with potential parent node set P_iSearching a space, arranging and combining genes in the search space, traversing the search space, calculating H sets (structures) according to a formula (5), and calculating BDe score for each set in the H sets by using a formula (4)_xAnd sequentially compared with the BDE score in the elements of the top of the small top heap if score_x>score, then will score_xAnd score_xStoring the corresponding set into the small top heap S, if the small top heap S is full, deleting the heap top element, and then, storing the score_xAnd score_xStoring the corresponding set into a small top pile S; if score_xNot more than score, no logging operation is performed, x is a natural number and satisfies 0<x<H；

Wherein g represents the total number of genes in the search space, and f represents the number of genes in the current traversed set;

step 3.3: storing the H sets of the small top stack S into a gene x_iLocal network G of_iIn (1).

And 4, step 4: local network merging, each gene corresponding to a local network G_iWill [ G ]₁，……，G_n]Merge into global network G [ x [ [ x ]₁,G₁],[x₂,G₂],…,[x_n,G_n]]Let gene y ∈ G_iThen gene x_iThe existence of a regulatory relationship with gene y is represented as y → x_i。

And 5: parameter learning, wherein the parameters of each regulation and control relation in the global network G are learned, the parameters comprise regulation and control action and regulation and control probability, the regulation and control action is expressed as excitation or inhibition, the regulation and control probability expresses posterior probability of a regulation and control gene and a target gene, a flow chart of a gene regulation and control network parameter learning process is shown in FIG. 5, and the specific steps are as follows:

step 5.2: for the regulatory relationship y → x_iCalculating the gene x using the formula (6)_iAnd the posterior probability of gene y,

the calculation formula of the joint conditional probability is as follows:

in the formula, pa (x)_i) Represents gene x_iParent node set of, P (X)_i| Y) represents the gene Y as the gene x_iThe parent node of (2) is the calculated conditional probability.

Fig. 6 is a time comparison graph of the gene regulatory network construction method based on the structure prediction and the gene regulatory network construction method based on the bayesian model in the embodiment of the present invention, and the time when the parent node ratios are respectively 0.1, 0.2, 0.3, and 0.4 is recorded by respectively using the gene expression data with the gene numbers of 10, 20, 30, 40, and 50, and the result shows that the method is superior to the conventional method in terms of calculation time.

18页详细技术资料下载

上一篇：一种医用注射器针头装配设备

下一篇：一种基于HLA分型与结构的肿瘤新抗原的筛选方法

Gene regulation and control network construction method based on structure prediction

相关技术

网友询问留言