Protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing

Document No.: 1044844. Publication date: 2020-10-09.

Reading note: This technology, a protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing, was created by 侯维刚, 尹欣, 郭磊 and 巩小雪 on 2020-06-15. Its main content is as follows. The invention discloses a protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing. The protein conformation prediction problem is first converted into a virtual network mapping problem, and a protein conformation prediction heuristic algorithm is constructed on the basis of the resulting mathematical model. Cloud parallel computing is then used to predict the protein conformation: the protein folding direction coding strings obtained by the heuristic algorithm form part of the initial population, the population is divided into sub-populations, and each sub-population independently runs the genetic algorithm for the protein conformation on its own processor. The sub-populations exchange their protein conformations with the minimum free energy and continue the genetic operations until the specified number of generations is reached. By establishing a mathematical model of protein conformation prediction and combining heuristic and parallel genetic algorithms with cloud parallel computing, the invention accelerates protein structure prediction and can predict protein conformations accurately and efficiently.

1. A protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing, characterized by comprising the following steps:

step 1, transforming the protein conformation into a virtual network mapping as follows: let V be the set of amino acids contained in a peptide chain and L the set of peptide bonds connecting the amino acids along the chain; the peptide chain is abstracted into a directed virtual network whose virtual node set is V and whose virtual link set is L; S is the set of HP lattice points, i.e. the physical network node set, and E is the set of links connecting the HP lattice points, i.e. the physical network link set;

step 2, establishing a protein conformation prediction model based on virtual network mapping;

step 3, randomly mapping the virtual nodes at the head end and the tail end of the virtual network that represents the peptide chain to any pair of physical lattice points in the physical network;

step 4, calculating, according to the protein conformation prediction model of step 2, the minimum free energy value and the path and folding direction code corresponding to that free energy value;

step 5, dividing the folding direction codes of step 4 into a plurality of sub-populations and performing cloud parallel computing.

2. The protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing as claimed in claim 1, characterized in that: the protein conformation prediction model specifically comprises the following steps:

the weight variable w_v represents the hydrophobic or hydrophilic character of the amino acid numbered v in the peptide chain, and its value satisfies formula (1):

w_v = 1 if amino acid v is H; w_v = 0 if amino acid v is P (1)

where H represents a hydrophobic amino acid and P represents a hydrophilic amino acid;

a virtual network node v representing an amino acid is mapped to a physical network lattice point s, and to ensure that the virtual node representing any amino acid can be mapped to only one physical lattice point, the constraint of formula (2) is imposed;

to ensure that each physical lattice point accepts at most one virtual node representing an amino acid, the constraint of formula (3) is imposed;

a virtual link l representing a peptide bond is mapped to a link pointing from a physical lattice point s to a physical lattice point d, and to ensure that each virtual link can be mapped to only one physical link, the constraint of formula (4) is imposed;

if the virtual node representing an amino acid is mapped successfully, the virtual nodes adjacent to it satisfy the link flow conservation constraint of formula (5) on the physical network, where l_start denotes the start point of virtual link l, l_end denotes the end point of virtual link l, and l = l₁ or l₂;

to count the number of adjacent but unconnected H-H structures in a given protein conformation, a new binary variable is introduced and defined by formula (6);

this count must satisfy the constraint of formula (7);

the protein conformation with the minimum free energy satisfies formula (8), i.e. the minimum of the negative of the total number of adjacent but unconnected H-H structures identifies the most stable protein conformation.

3. The protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing as claimed in claim 2, characterized in that: the specific calculation steps of step 4 are as follows:

finding all paths between the pair of physical lattice points mapped in step 3 that satisfy the constraints of formulas (2), (3), (4) and (5);

then, according to formulas (7) and (8), computing the free energy value of each path and obtaining the minimum free energy value together with the path and folding direction code corresponding to that value.

4. The protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing as claimed in claim 3, characterized in that: the folding direction code represents the folding direction of each amino acid in the sequence when the protein conformation is formed and uses an absolute direction representation, in which 1 represents folding to the right, 2 represents folding upward, 3 represents folding to the left, and 4 represents folding downward.

5. The protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing as claimed in any one of claims 1-4, characterized in that: the cloud parallel computing involves a head node and a plurality of working nodes; each working node independently completes the protein conformation calculation for its corresponding sub-population; once excellent individuals have appeared in each working node, the optimal individual of each working node is selected and sent to the head node for pairwise exchange; each working node replaces its worst individual with the optimal individual received in the exchange and continues the protein conformation calculation until a preset number of generations is reached.

Technical Field

The invention relates to interdisciplinary technology spanning communications, computing and bioengineering, and in particular to a protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing.

Background

Proteins are the basis of life activity. The protein conformation prediction problem is to determine, from the amino acid sequence, the folding path of a protein and its structure in the natural state, where the natural-state structure is the most stable one. The normal function of a protein is inseparable from its structure, so studying protein structure helps to understand protein function, and studying protein conformation prediction both sheds light on the basic processes of life and promotes applications in medicine, agriculture and biotechnology. In medicine, for example, kuru, Creutzfeldt-Jakob disease and Gerstmann-Sträussler-Scheinker syndrome have been discovered in succession, and these diseases are caused by abnormal protein conformations. In addition, protein profiles can reflect dynamic changes in human health and in the occurrence and development of disease, which helps to prevent or intervene in disease, so they are widely used in both theoretical research and practical applications in medicine and pharmacology. In agriculture, crops produce antibacterial proteins to resist foreign invaders; by extracting the genes of these proteins, their structure can be predicted and their actual characteristics and functions revealed, so that they can be applied in more scenarios. In industry, enzymes are valued for their catalytic efficiency, but the structure and function of natural proteins are easily destroyed under real conditions such as high temperature, high pressure and extreme pH, so it is important to modify protein structures and design stable proteins suitable for industrial use.

Protein crystals are difficult to grow, so determining a crystal structure by X-ray crystallography takes a long time. Multidimensional nuclear magnetic resonance requires large samples of high purity, so at present it can only determine the structures of small proteins. Determining protein structure experimentally is therefore limited by high cost, demanding experimental conditions and long determination times, and many protein structures can in practice only be predicted by protein conformation algorithms. Existing protein conformation prediction algorithms suffer from high complexity, low prediction speed, long running time and low prediction accuracy. The protein conformation prediction problem therefore requires accurate modeling, a corresponding prediction algorithm, and a computing system that can accelerate the prediction.

The two-dimensional HP lattice model, obtained by simplifying the hydrophobic and hydrophilic interactions among the amino acids of a protein, is currently the most widely used mathematical model. The model simplifies the amino acid sequence and places the simplified sequence on a grid. According to the principles of molecular dynamics, the folded conformation with the minimum free energy in the lattice is the protein structure in the natural state. In the HP lattice model, the free energy is defined as the negative of the number of adjacent but unconnected H-H structures, i.e. pairs of H amino acids that occupy neighboring lattice points without being joined by a peptide bond. The protein conformation prediction problem can therefore be solved by finding the conformation that maximizes the number of such H-H structures, that is, by placing each amino acid of the sequence and the peptide bonds connecting the amino acids optimally on the grid.

Predicting a protein conformation is in fact a search for the protein structure with the minimum free energy, which is closely related to the virtual network mapping problem in communications, i.e. how best to deploy the virtual network nodes and virtual links on the underlying physical network. The underlying physical network can be regarded as a two-dimensional HP lattice model, each virtual network node as a hydrophobic or hydrophilic amino acid in an amino acid sequence (peptide chain), and each virtual network link as a peptide bond linking two amino acids. The protein conformation prediction problem can therefore be modeled as a virtual network mapping problem, and no report of such an approach has been found so far. Predicting protein structure by theoretical modeling has been proven to be NP-hard, and solving it exactly is computationally expensive. A virtual network mapping heuristic could find a (near-)optimal solution of the protein conformation prediction model, i.e. the protein structure with the global minimum free energy, more quickly, but no effective heuristic of this kind is currently available.

In addition, predicting protein conformation through mathematical modeling and heuristic algorithms is a serial process whose practical efficiency is low, and long amino acid sequences still incur long running times and high computing costs. With the arrival of the big data era, cloud computing has become an efficient computing mode and technical means for processing massive data. It is therefore necessary to combine a parallel genetic algorithm with a cloud parallel computing system to accelerate protein conformation prediction.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing.

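To achieve this purpose, the invention adopts the following technical scheme: a protein conformation prediction acceleration method based on virtual network mapping and cloud parallel computing, comprising the following steps:

step 1, transforming the protein conformation into a virtual network mapping as follows: let V be the set of amino acids contained in a peptide chain and L the set of peptide bonds connecting the amino acids along the chain; the peptide chain is abstracted into a directed virtual network whose virtual node set is V and whose virtual link set is L; S is the set of HP lattice points, i.e. the physical network node set, and E is the set of links connecting the HP lattice points, i.e. the physical network link set;

step 2, establishing a protein conformation prediction model based on virtual network mapping;

step 3, randomly mapping the virtual nodes at the head end and the tail end of the virtual network that represents the peptide chain to any pair of physical lattice points in the physical network;

step 4, calculating, according to the protein conformation prediction model of step 2, the minimum free energy value and the path and folding direction code corresponding to that free energy value;

step 5, dividing the folding direction codes of step 4 into a plurality of sub-populations and performing cloud parallel computing.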

Compared with the prior art, the invention has the beneficial effects that:

(1) The invention provides a mathematical model of the protein conformation prediction problem based on virtual network mapping. It converts the protein conformation prediction problem into a virtual network mapping problem and establishes a pure integer linear programming (ILP) model of the protein folding problem, so the problem is expressed by simple linear expressions rather than a user-defined nonlinear function and can be solved conveniently by existing programs and methods for ILP problems. The model also extends easily to three-dimensional protein conformations: it applies without modification once a physical network topology representing a three-dimensional HP lattice is supplied. For shorter amino acid sequences, the model can be solved with an integer linear programming tool (e.g., CPLEX).

(2) The invention provides a heuristic protein conformation prediction algorithm. Because the algorithm operates only on two kinds of elements, nodes and links, and does not need to consider whether the physical network is two-dimensional or three-dimensional, it can be applied directly to three-dimensional protein conformations without modification and is therefore extensible. In addition, once a pair of source and destination nodes has been determined, the search for the optimal conformation is completely independent of the searches for other pairs, so the algorithm lends itself to parallel computation and predicts protein conformations quickly for shorter amino acid sequences.

(3) The invention provides a protein conformation prediction acceleration method based on a cloud parallel genetic algorithm and a cloud parallel computing platform. The platform allows the algorithm to be adapted flexibly as required to solve the protein folding problem, and the genetic algorithm is inherently parallel, so the parallel genetic algorithm is tailored to the structure of the platform, which shortens the time needed to predict a protein conformation. For longer amino acid sequences, the method has a shorter solving time than the heuristic algorithm and improves execution efficiency.

Drawings

FIG. 1 is a schematic diagram of the protein conformation prediction problem based on virtual network mapping provided in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the basic structure of the cloud parallel computing system provided in an embodiment of the present invention;

FIG. 3 is a flow chart of accelerated protein conformation prediction based on the cloud parallel genetic algorithm and the cloud parallel computing system provided in an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made with reference to the accompanying drawings. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

As shown in FIG. 1, in this embodiment the protein conformation prediction problem is transformed into a virtual network mapping problem. Let V be the set of amino acids contained in a peptide chain (amino acid sequence) and L the set of peptide bonds connecting the amino acids along the chain. Cyclic peptide chains are not considered, so the two sets satisfy |L| = |V| - 1. The weight variable w_v represents the hydrophobic or hydrophilic character of the amino acid numbered v in the peptide chain, and its value satisfies formula (1):

w_v = 1 if amino acid v is H; w_v = 0 if amino acid v is P (1)

where H represents a hydrophobic amino acid and P represents a hydrophilic amino acid. The peptide chain is abstracted into a directed virtual network whose virtual node set is V and whose virtual link set is L.

In FIG. 1, the peptide chain is abstracted into a directed virtual network with virtual node set V = {v1, v2, v3, v4, v5} and virtual link set L = {l1, l2, l3, l4}, and the node weights are, in order, w_v1 = 1, w_v2 = 1, w_v3 = 0, w_v4 = 1, w_v5 = 1.

S = {s1, s2, s3, ..., s25} is the set of HP lattice points, i.e. the physical network node set, and E = {e1, e2, e3, ..., e40} is the set of links connecting the HP lattice points, i.e. the physical network link set.
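The physical network above can be sketched in a few lines of Python. This is only an illustration under an assumption: the text gives 25 lattice points and 40 links, which matches a 5 x 5 square lattice with each undirected link stored once, so that is what the hypothetical helper `build_lattice` constructs.

```python
def build_lattice(rows, cols):
    """Return (nodes, links) of a rows x cols square lattice.

    Nodes are (x, y) tuples; each undirected link is stored once.
    Assumption: the lattice of the embodiment is a plain square grid.
    """
    nodes = [(x, y) for y in range(rows) for x in range(cols)]
    links = []
    for x, y in nodes:
        if x + 1 < cols:          # link to the right-hand neighbour
            links.append(((x, y), (x + 1, y)))
        if y + 1 < rows:          # link to the neighbour above
            links.append(((x, y), (x, y + 1)))
    return nodes, links


nodes, links = build_lattice(5, 5)
print(len(nodes), len(links))  # 25 40
```

A three-dimensional lattice would only change how the nodes and links are generated, which is in line with the extensibility argued for the model.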

A virtual network node v representing an amino acid is mapped to a physical network lattice point s. To ensure that the virtual node representing any amino acid can be mapped to only one physical lattice point, the constraint of formula (2) is imposed.

To ensure that each physical lattice point accepts at most one virtual node representing an amino acid, the constraint of formula (3) is imposed.

Assuming that the virtual node representing an amino acid has been mapped successfully, the virtual nodes adjacent to it satisfy the link flow conservation constraint of formula (5) on the physical network.

In formula (5), l_start denotes the start point of virtual link l, l_end denotes the end point of virtual link l, and l = l₁ or l₂.

The virtual link l₁ representing a peptide bond is mapped to a link pointing from physical lattice point d to physical lattice point s;

the virtual link l₂ representing a peptide bond is mapped to a link pointing from physical lattice point s to physical lattice point d'.

To count the number of adjacent but unconnected H-H structures in a given protein conformation, a new binary variable is introduced and defined by formula (6).

This count must satisfy the constraint of formula (7).

The protein conformation with the minimum free energy satisfies formula (8), i.e. the minimum of the negative of the total number of adjacent but unconnected H-H structures identifies the most stable protein conformation.

Minimum() denotes a function that outputs the minimum of the expression in parentheses.
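The formula images of the model are not reproduced in this text. As a reading aid, the block below is a sketch of one standard virtual-network-embedding formulation that is consistent with the prose descriptions of formulas (1) to (8). The decision variables x, y and z and the exact form of every constraint are assumptions made for illustration and may differ from the formulas of the original filing; E is assumed to contain both directions of every lattice link.

```latex
% Assumed decision variables (names are illustrative, not taken from the source):
%   x_v^s in {0,1}      : virtual node v is placed on lattice point s
%   y_l^{sd} in {0,1}   : virtual link l is placed on the physical link s -> d
%   z_{uv}^{sd} in {0,1}: H residues u and v, not joined by a peptide bond,
%                         occupy the adjacent lattice points s and d
\begin{align}
w_v &= \begin{cases} 1, & \text{amino acid } v \text{ is H} \\
                     0, & \text{amino acid } v \text{ is P} \end{cases} \tag{1} \\
\sum_{s \in S} x_v^s &= 1 \qquad \forall v \in V \tag{2} \\
\sum_{v \in V} x_v^s &\le 1 \qquad \forall s \in S \tag{3} \\
\sum_{(s,d) \in E} y_l^{sd} &= 1 \qquad \forall l \in L \tag{4} \\
\sum_{d:(s,d) \in E} y_l^{sd} - \sum_{d:(d,s) \in E} y_l^{ds}
  &= x_{l_{start}}^s - x_{l_{end}}^s \qquad \forall l \in L,\ \forall s \in S \tag{5} \\
z_{uv}^{sd} &\in \{0,1\} \qquad \forall u<v,\ |u-v|>1,\ (s,d) \in E \tag{6} \\
z_{uv}^{sd} &\le x_u^s, \qquad z_{uv}^{sd} \le x_v^d \tag{7} \\
&\text{Minimum}\Big( -\sum_{u<v,\ |u-v|>1} w_u w_v \sum_{(s,d) \in E} z_{uv}^{sd} \Big) \tag{8}
\end{align}
```

In this sketch the products w_u w_v are constants, so the objective stays linear, and the upper bounds in (7) are sufficient because the objective pushes every z toward 1.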

Based on the above mathematical model, the heuristic algorithm for protein conformation prediction provided by this embodiment includes the following steps:

Step 1: Randomly map the virtual nodes at the head end and the tail end of the virtual network that represents the peptide chain to any pair of physical lattice points in the physical network.

Step 2: Find all paths between the mapped pair of physical lattice points that satisfy the constraints of formulas (2), (3), (4) and (5).

Step 3: According to formulas (7) and (8), compute the free energy value of each path and obtain the minimum free energy value together with the path and folding direction code corresponding to that value. The folding direction code indicates the folding direction of each amino acid in the sequence when the protein conformation is formed. An absolute direction representation is used: the position of the first amino acid is fixed first, and then 1 indicates folding to the right, 2 folding upward, 3 folding to the left, and 4 folding downward. For example, if a fragment of an amino acid sequence has the folding direction code 234, the second amino acid is above the first, the third is to the left of the second, and the fourth is below the third. As shown in FIG. 1, the minimum free energy value of the currently mapped path is -1, and the corresponding folding direction code is 1143.

FIG. 2 shows the basic structure of the cloud parallel computing system for protein conformation prediction acceleration provided in this embodiment. The platform consists of eight motherboards, each with an Intel Core i7-4790K processor clocked at 4.0 GHz. The parallel computing platform comprises a head node, the MATLAB Job Scheduler (MJS), and a number of working nodes (workers). The MJS splits the upper-level computing task and distributes the subtasks to the workers below it, and the workers compute their subtasks and return the results.

The work flow diagram of the system is shown in fig. 3, and comprises the following steps:

Step 1: Initialize the population. The initial population consists of folding direction coding strings for the amino acid sequence. Let length denote the total number of amino acids in the sequence. Each coding string is a sequence of the digits 1, 2, 3 and 4 with length - 1 entries, one for each fold. Part of the coding strings are obtained from the protein conformation prediction heuristic algorithm described above, and the rest are generated randomly from the digits 1, 2, 3 and 4, so that the folding direction of each amino acid is random; seeding the population in this way gives it a better initial free energy than a purely random population. The population is then divided into sub-populations. The number of sub-populations equals the number of workers actually started by the cloud parallel computing platform, and different numbers of workers can be started for amino acid sequences of different lengths.

Step 2: Each sub-population is placed on its corresponding worker, which independently runs the genetic algorithm for the protein conformation. Any genetic algorithm well known to those skilled in the art may be used for this calculation.

Step 3: Once each worker has produced excellent individuals, the optimal individual of each worker, i.e. the folding direction code corresponding to the protein conformation with the minimum free energy, is selected and sent to the MATLAB Job Scheduler (MJS) for pairwise exchange. Each worker replaces its worst individual with the optimal individual received in the exchange and continues the genetic operations. A reasonable number of generations is set by observing how the minimum free energy of the protein conformation changes over the generations, and the operation stops when the specified number of generations is reached.
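The population handling in steps 1 and 3 can be sketched as below. This is a sequential Python illustration under stated assumptions, not the MATLAB implementation of the embodiment: the function names are hypothetical, workers are simulated as plain lists, the pairing of workers (0 with 1, 2 with 3, and so on) is an assumption because the text does not say how pairs are chosen, and the fitness function is left to the caller (lower is better, as with free energy).

```python
import random


def init_population(length, pop_size, seeds=(), rng=None):
    """Build folding direction strings of length - 1 digits.

    Seeds (e.g. results of the heuristic) come first; the rest are random.
    """
    rng = rng or random.Random()
    population = [s for s in seeds if len(s) == length - 1][:pop_size]
    while len(population) < pop_size:
        population.append("".join(rng.choice("1234") for _ in range(length - 1)))
    return population


def split(population, n_workers):
    """Deal the population out into one sub-population per worker."""
    return [population[i::n_workers] for i in range(n_workers)]


def migrate(subpops, fitness):
    """Pairwise exchange: each worker's worst individual is replaced by
    the best individual of its partner (lower fitness is better)."""
    bests = [min(sp, key=fitness) for sp in subpops]
    for a in range(0, len(subpops) - 1, 2):
        b = a + 1
        for dst, src in ((a, b), (b, a)):
            sp = subpops[dst]
            worst = max(range(len(sp)), key=lambda i: fitness(sp[i]))
            sp[worst] = bests[src]
    return subpops


population = init_population(length=5, pop_size=8, seeds=["1143"],
                             rng=random.Random(0))
subpops = split(population, n_workers=4)
print([len(sp) for sp in subpops])  # [2, 2, 2, 2]

# Toy demonstration of the exchange with numbers standing in for energies.
print(migrate([[3, 1, 5], [4, 2, 6]], fitness=lambda e: e))  # [[3, 1, 2], [4, 2, 1]]
```

Randomly generated strings are not guaranteed to describe self-avoiding conformations, so a real fitness function would have to penalize or repair invalid strings; the text does not specify how this is handled.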

Prediction time and the accuracy of the prediction result are the criteria for judging the quality of an algorithm. In this example, the 12 classical amino acid sequences listed in Table 1 were collected for performance testing. For brevity, repeated residues are written with subscripts; for example, HHHHPPP is abbreviated to H₄P₃. The minimum free energy given in the table is the best solution found to date for each amino acid sequence.
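The shorthand can be expanded mechanically. The helper below is a small illustrative sketch with a hypothetical name, `expand_hp`; it assumes flat shorthand such as H₄P₃, written with either subscript or ordinary digits, and does not handle parenthesized repeats.

```python
import re

# Map Unicode subscript digits to ordinary digits.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def expand_hp(short):
    """Expand shorthand such as 'H₄P₃' or 'H4P3' into 'HHHHPPP'."""
    text = short.translate(SUBSCRIPTS)
    if not re.fullmatch(r"(?:[HP]\d*)+", text):
        raise ValueError(f"not a flat HP shorthand: {short!r}")
    return "".join(letter * int(count or 1)
                   for letter, count in re.findall(r"([HP])(\d*)", text))


print(expand_hp("H₄P₃"))  # HHHHPPP
```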

TABLE 1 HP sequence Listing to be tested

The 12 test sequences were each solved by method 1 (directly solving the mathematical model of protein folding based on virtual network mapping with the IBM ILOG CPLEX Optimization Studio software), method 2 (the protein conformation prediction heuristic algorithm) and method 3 (the prediction acceleration method of the invention), and the resulting free energy values are compared in Table 2.

TABLE 2 comparison of free energy values

As can be seen from Table 2, for the first five protein sequences the objective value obtained by method 1 equals the actual minimum free energy value, which verifies the accuracy of the model. The results of method 2 are the same as the free energy values obtained by method 1 and are the best solutions found so far for these sequences. The results of method 3 differ somewhat from those of method 1 but are also close to the minimum free energy values found so far. For the longer sequences (sequences 6 to 12), method 2 can only obtain a sub-optimal solution or a result close to the optimal solution, while method 3 obtains a free energy value that, although not the minimum, is close to the minimum free energy value found to date.

Table 3 reports the execution times the three methods need to predict the protein conformation. As can be seen from Table 3, methods 2 and 3 both take less time than method 1, so the method of the present invention accelerates protein conformation prediction. The execution time of method 2 is very short for shorter amino acid sequences, whereas method 3 is faster for longer amino acid sequences.

TABLE 3 time comparison table

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
