Gene sequence optimization method, device, equipment and medium

Document No.: 972987    Publication date: 2020-11-03

Reading note: This technology, "Gene sequence optimization method, device, equipment and medium", was designed by 李辰, 蒋东东, 金良, 徐哲, 赵雅倩 and 李仁刚 on 2020-06-24. Main content: The application discloses a gene sequence optimization method, device, equipment and medium, comprising: randomly generating an initial generation population as the initial operation population and performing iterative operation; clustering the initial operation population; calculating the fitness of the individuals in the initial operation population; screening individuals according to the clustering result to obtain a target population; selecting a first preset number of individual groups from the target population for crossover; randomly selecting a second preset number of individuals from the target population for mutation; judging whether the target population reaches a preset convergence condition; if not, determining the current target population as the initial operation population and continuing the iteration; if so, stopping the iteration and determining first target individuals from the target population, thereby optimizing the gene sequence to be optimized, wherein the first target individuals include the individual with the highest fitness in each cluster. The method can achieve population convergence, accurately obtain stable multi-extremum solutions, and effectively optimize the gene sequence.

1. A method for optimizing a gene sequence, comprising:

step S11: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

step S12: clustering the initial operation population;

step S13: performing fitness calculation on each individual in the initial operational population by using a protein and gene interaction scoring function;

step S14: screening the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

step S15: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

step S16: randomly selecting a second preset number of individuals from the target population for variation;

step S17: judging whether the target population reaches a preset convergence condition or not;

step S18: if the target population does not reach the preset convergence condition, determining the current target population as an initial operation population, and jumping to the step S12 to continue iteration;

step S19: if the target population reaches the preset convergence condition, stopping iteration, and determining a first target individual from the target population to realize optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

2. The method of gene sequence optimization according to claim 1, wherein said clustering said initial operational population comprises:

clustering the initial operation population by using a K-means algorithm.

3. The method of gene sequence optimization according to claim 1, wherein said clustering said initial operational population comprises:

clustering the initial operation population by using a DBSCAN algorithm.

4. The method for optimizing gene sequences according to claim 1, wherein the determining whether the target population reaches a preset convergence condition comprises:

judging whether the number of generations evolved by the target population reaches a preset generation threshold.

5. The method for optimizing gene sequences according to claim 1, wherein the determining whether the target population reaches a preset convergence condition comprises:

judging whether a target difference value corresponding to the target population is smaller than a preset difference threshold value or not;

the target difference value is calculated based on binary difference values between all individuals in the current target population and corresponding individuals in the previous generation target population; the binary difference is calculated by using binary codes corresponding to the individuals.

6. The method of claim 5, wherein the determining whether the target population meets a predetermined convergence condition comprises:

judging whether the Gray code bit difference quantity corresponding to the target population is smaller than a preset bit difference threshold;

the Gray code bit difference quantity is a bit difference quantity determined based on the Gray code bit differences between all the individuals in the current target population and the corresponding individuals in the previous generation target population.

7. The method for optimizing gene sequences according to any one of claims 1 to 6, wherein the step of screening the individuals meeting preset conditions from the initial operational population according to the clustering result to obtain a target population comprises the steps of:

screening a third preset number of second target individuals in each cluster; the fitness of the second target individual is higher than that of other individuals in the current cluster; after the second target individual is screened out, screening out a third target individual from other individuals in the initial operational population by using a roulette method to obtain the target population; the target population includes the second target individual and the third target individual.

8. A gene sequence optimizing apparatus comprising:

the initial generation population generating module 11 is configured to randomly generate an initial generation population, and perform iterative operation by using the initial generation population as an initial operation population; wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

a population clustering module 12, configured to cluster the initial operation population;

a fitness calculation module 13, configured to perform fitness calculation on each individual in the initial operational population by using a protein-gene interaction scoring function;

the individual screening module 14 is configured to screen the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

the cross calculation module 15 is configured to select a first preset number of individual groups from the target population to perform cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

an individual variation module 16, configured to randomly select a second preset number of individuals from the target population for variation;

a convergence judging module 17, configured to judge whether the target population reaches a preset convergence condition;

an iteration control module 18, configured to determine the current target population as an initial operation population if the convergence judgment module 17 determines that the target population does not reach the preset convergence condition, and jump to the module 12 to continue iteration;

a target individual determining module 19, configured to stop iteration and determine a first target individual from the target population to optimize the gene sequence to be optimized if the target population reaches the preset convergence condition; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

9. A gene sequence optimization device comprising a processor and a memory, wherein:

the memory is used for storing a computer program;

the processor for executing the computer program to implement the gene sequence optimization method according to any one of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the gene sequence optimization method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of gene sequence processing technologies, and in particular, to a method, an apparatus, a device, and a medium for optimizing a gene sequence.

Background

DNA (deoxyribonucleic acid) is a macromolecular compound built from four kinds of deoxyribonucleotides as basic units. The four deoxyribonucleotides are distinguished by their bases, adenine, cytosine, guanine and thymine, abbreviated as A, C, G and T, and the molecular structure of DNA is generally represented by its base sequence. DNA computing can be achieved by controlling the hybridization reactions between DNA molecules by biochemical means. DNA computing is inherently parallel and therefore has great advantages over the classical digital computer in solving many complex problems that are widespread in nature. DNA computing relies on powerful proteins as auxiliary tools to complete various computations, so it is important to design an appropriate DNA sequence based on the protein-DNA interaction. The genetic algorithm is a bionic algorithm, inspired by biological evolution, for solving complex global optimization problems. Based on the theories of natural evolution and genetic variation, it encodes the object being operated on and performs a global search of a complex feasible region using a probabilistic search technique with multiple search points. It requires no auxiliary information such as gradients or higher-order derivatives, and has unique advantages in finding optimal solutions under certain conditions. Genetic algorithms bear many similarities to DNA computing: their core strategies, including crossover, mutation, and population screening and evolution, are all innate features of DNA. Exploring protein-DNA interactions through genetic algorithms therefore has natural advantages.

DNA sequences are usually screened against various constraints so as to satisfy them to the greatest possible extent, which is essentially a multi-objective optimization problem; the protein-DNA interaction scoring function is one of the important constraints. There is currently no good strategy in the industry for rapidly finding a DNA sequence that is an optimal solution for the protein-DNA interaction. There are two difficulties: (1) multi-extremum optimization: when the scoring function is optimized, multiple DNA sequences satisfy the extremum conditions, and it is difficult to find multiple sets of solutions that satisfy the conditions; (2) multi-dimensional optimization: a DNA sequence contains multiple regions that can bind proteins, and these regions are orthogonal to one another, so the regions must be optimized simultaneously, which leads to a large parameter space that is difficult to solve. When the classical genetic algorithm is applied to this problem, it sometimes suffers from premature convergence, poor local search capability and low computational efficiency. In addition, as the pattern theorem (schema theorem) predicts, even if premature convergence is resolved, the genetic algorithm finally converges to a single optimal solution, so the other extrema of the system cannot be obtained and multi-extremum optimization cannot be achieved. Multi-extremum optimization is very important in DNA sequence optimization: a certain DNA sequence pattern may interact most strongly with the protein, placing the scoring function at its global optimum, yet if DNA sequences of that pattern are extremely unstable and cannot be used in practice, DNA sequences corresponding to suboptimal solutions must be sought instead.

At present, algorithms such as the adaptive genetic algorithm, the artificial immune algorithm and niche particle swarm optimization exist for searching the extrema of a multi-peak function in the multi-extremum optimization of DNA sequences. However, the existing optimization algorithms have four main defects: (1) convergence is difficult in the multi-dimensional case; (2) they converge prematurely to a local extremum and lose the global optimum; (3) the number of populations cannot be controlled, and the results obtained are unstable; (4) the algorithms are complex, with too many parameters and too many prior conditions, so their applicable scenarios are limited.

Disclosure of Invention

In view of this, an object of the present application is to provide a method, an apparatus, a device and a medium for optimizing a gene sequence, which can achieve population convergence and accurately obtain a stable multi-extremum solution, thereby effectively optimizing the gene sequence. The specific scheme is as follows:

in a first aspect, the present application discloses a method for optimizing a gene sequence, comprising:

step S11: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

step S12: clustering the initial operation population;

step S13: performing fitness calculation on each individual in the initial operational population by using a protein and gene interaction scoring function;

step S14: screening the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

step S15: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

step S16: randomly selecting a second preset number of individuals from the target population for variation;

step S17: judging whether the target population reaches a preset convergence condition or not;

step S18: if the target population does not reach the preset convergence condition, determining the current target population as an initial operation population, and jumping to the step S12 to continue iteration;

step S19: if the target population reaches the preset convergence condition, stopping iteration, and determining a first target individual from the target population to realize optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

Optionally, the clustering the initial operation population includes:

clustering the initial operation population by using a K-means algorithm.

Optionally, the clustering the initial operation population includes:

clustering the initial operation population by using a DBSCAN algorithm.

Optionally, the determining whether the target population reaches a preset convergence condition includes:

judging whether the number of generations evolved by the target population reaches a preset generation threshold.

Optionally, the determining whether the target population reaches a preset convergence condition includes:

judging whether a target difference value corresponding to the target population is smaller than a preset difference threshold value or not;

the target difference value is calculated based on binary difference values between all individuals in the current target population and corresponding individuals in the previous generation target population; the binary difference is calculated by using binary codes corresponding to the individuals.

Optionally, the determining whether the target population reaches a preset convergence condition includes:

judging whether the Gray code bit difference quantity corresponding to the target population is smaller than a preset bit difference threshold;

the Gray code bit difference quantity is a bit difference quantity determined based on the Gray code bit differences between all the individuals in the current target population and the corresponding individuals in the previous generation target population.
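The two difference-based convergence checks described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes individuals are stored as integers holding Gray codes, that "corresponding individuals" means individuals at the same index in consecutive generations, and that the per-individual differences are simply summed. The function names are hypothetical.

```python
def gray_to_binary(gray):
    """Decode a reflected Gray code into its natural binary value."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def gray_bit_difference(current, previous):
    """Count differing Gray-code bits between index-paired individuals (assumed pairing)."""
    if len(current) != len(previous):
        raise ValueError("generations must have the same size")
    return sum(bin(a ^ b).count("1") for a, b in zip(current, previous))


def binary_difference(current, previous):
    """Sum of absolute differences of the decoded binary values (assumed aggregation)."""
    if len(current) != len(previous):
        raise ValueError("generations must have the same size")
    return sum(abs(gray_to_binary(a) - gray_to_binary(b))
               for a, b in zip(current, previous))


def has_converged(current, previous, bit_threshold):
    """Converged when the Gray-code bit difference falls below the preset threshold."""
    return gray_bit_difference(current, previous) < bit_threshold


if __name__ == "__main__":
    prev_gen = [0b0100, 0b1101]
    curr_gen = [0b0100, 0b1100]
    print(gray_bit_difference(curr_gen, prev_gen))
    print(binary_difference(curr_gen, prev_gen))
    print(has_converged(curr_gen, prev_gen, bit_threshold=2))
```

Counting bits on the Gray codes directly avoids a decoding step, whereas the binary difference reflects how far the decoded values have moved; which of the two is preferable depends on the application.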

Optionally, the screening, according to the clustering result, the individuals meeting the preset condition from the initial operation population to obtain a target population includes:

screening a third preset number of second target individuals in each cluster; the fitness of the second target individual is higher than that of other individuals in the current cluster; after the second target individual is screened out, screening out a third target individual from other individuals in the initial operational population by using a roulette method to obtain the target population; the target population includes the second target individual and the third target individual.

In a second aspect, the present application discloses a gene sequence optimization apparatus, comprising:

the initial generation population generating module 11 is configured to randomly generate an initial generation population, and perform iterative operation by using the initial generation population as an initial operation population; wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

a population clustering module 12, configured to cluster the initial operation population;

a fitness calculation module 13, configured to perform fitness calculation on each individual in the initial operational population by using a protein-gene interaction scoring function;

the individual screening module 14 is configured to screen the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

the cross calculation module 15 is configured to select a first preset number of individual groups from the target population to perform cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

an individual variation module 16, configured to randomly select a second preset number of individuals from the target population for variation;

a convergence judging module 17, configured to judge whether the target population reaches a preset convergence condition;

an iteration control module 18, configured to determine the current target population as an initial operation population if the convergence judgment module 17 determines that the target population does not reach the preset convergence condition, and jump to the module 12 to continue iteration;

a target individual determining module 19, configured to stop iteration and determine a first target individual from the target population to optimize the gene sequence to be optimized if the target population reaches the preset convergence condition; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

In a third aspect, the present application discloses a gene sequence optimization device comprising a processor and a memory, wherein:

the memory is used for storing a computer program;

the processor is used for executing the computer program to implement the aforementioned gene sequence optimization method.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned gene sequence optimization method.

As can be seen, the present application first randomly generates an initial generation population and performs iterative operation with it as the initial operation population, wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence, and the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized. The initial operation population is then clustered, and the fitness of each individual in it is calculated using a protein-gene interaction scoring function. Individuals meeting preset conditions are screened from the initial operation population according to the clustering result to obtain a target population. A first preset number of individual groups are then selected from the target population for cross calculation, each individual group comprising two individuals that belong to the same cluster, and a second preset number of individuals are randomly selected from the target population for variation. Whether the target population reaches a preset convergence condition is then judged: if not, the current target population is determined as the initial operation population and the iteration continues; if so, the iteration stops and first target individuals are determined from the target population, thereby optimizing the gene sequence to be optimized, wherein the first target individuals comprise the individual with the highest fitness in each cluster. Thus, by clustering the population in each iteration, the application can achieve population convergence and accurately obtain stable multi-extremum solutions, so that the gene sequence is effectively optimized.
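The overall iteration (steps S11 to S19) can be sketched in a few dozen lines. The sketch below is illustrative only and rests on several assumptions that are not part of the application: a hypothetical four-peak `fitness` stands in for the protein-gene interaction scoring function, equal-width binning of the decoded value stands in for K-means or DBSCAN, a single elite is kept per cluster, crossover is single-point, and convergence is a fixed generation count. All constants and function names are hypothetical.

```python
import math
import random

N_BITS = 10           # bits per Gray-coded individual (assumed)
POP_SIZE = 40         # population size (assumed)
N_CLUSTERS = 4        # number of clusters (assumed)
MAX_GENERATIONS = 30  # S17: generation-count convergence condition (assumed)


def gray_to_binary(gray):
    """Decode a reflected Gray code into its natural binary value."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def fitness(individual):
    """Hypothetical four-peak score; a stand-in for the real scoring function."""
    x = gray_to_binary(individual) / (2 ** N_BITS - 1)
    return 1.0 + math.sin(4 * math.pi * x) ** 2


def cluster(population):
    """Stand-in clustering: equal-width bins over the decoded value."""
    return [gray_to_binary(g) * N_CLUSTERS // 2 ** N_BITS for g in population]


def select(population, scores, labels, rng):
    """S14: keep the fittest individual of each cluster, fill the rest by roulette."""
    best = {}
    for idx, lab in enumerate(labels):
        if lab not in best or scores[idx] > scores[best[lab]]:
            best[lab] = idx
    elite_idx = set(best.values())
    rest = [i for i in range(len(population)) if i not in elite_idx]
    drawn = rng.choices(rest, weights=[scores[i] for i in rest],
                        k=len(population) - len(elite_idx))
    return [population[i] for i in best.values()] + [population[i] for i in drawn]


def crossover(target, n_pairs, rng):
    """S15: single-point crossover between two individuals of the same cluster."""
    by_cluster = {}
    for idx, lab in enumerate(cluster(target)):
        by_cluster.setdefault(lab, []).append(idx)
    eligible = [members for members in by_cluster.values() if len(members) >= 2]
    for _ in range(n_pairs if eligible else 0):
        i, j = rng.sample(rng.choice(eligible), 2)
        mask = (1 << rng.randrange(1, N_BITS)) - 1  # low bits to swap
        a, b = target[i], target[j]
        target[i] = (a & ~mask) | (b & mask)
        target[j] = (b & ~mask) | (a & mask)


def mutate(target, n_mutants, rng):
    """S16: flip one random bit in each of n_mutants randomly chosen individuals."""
    for idx in rng.sample(range(len(target)), n_mutants):
        target[idx] ^= 1 << rng.randrange(N_BITS)


def optimize(seed=0):
    rng = random.Random(seed)
    population = [rng.randrange(2 ** N_BITS) for _ in range(POP_SIZE)]  # S11
    for _ in range(MAX_GENERATIONS):                                    # S17/S18
        labels = cluster(population)                                    # S12
        scores = [fitness(g) for g in population]                       # S13
        target = select(population, scores, labels, rng)                # S14
        crossover(target, n_pairs=8, rng=rng)                           # S15
        mutate(target, n_mutants=4, rng=rng)                            # S16
        population = target
    best = {}                                                           # S19
    for g, lab in zip(population, cluster(population)):
        if lab not in best or fitness(g) > fitness(best[lab]):
            best[lab] = g
    return best


if __name__ == "__main__":
    for lab, g in sorted(optimize().items()):
        print(lab, format(g, f"0{N_BITS}b"), round(fitness(g), 4))
```

Returning one individual per cluster, rather than a single global best, is what allows the procedure to report several extrema at once.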

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a method for optimizing a gene sequence disclosed herein;

FIG. 2 is a schematic representation of a protein-DNA interaction disclosed herein;

FIG. 3 is a flow chart of a specific gene sequence optimization method disclosed herein;

FIG. 4 is a diagram illustrating an exemplary function optimization effect disclosed herein;

FIG. 5 is a flow chart of a specific gene sequence optimization method disclosed herein;

FIG. 6 is a flow chart of a specific gene sequence optimization method disclosed herein;

FIG. 7 is a flow chart of a specific gene sequence optimization method disclosed herein;

FIG. 8 is a schematic diagram of a gene sequence optimizing apparatus disclosed in the present application;

FIG. 9 is a diagram of a gene sequence optimizing apparatus according to the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, algorithms such as the adaptive genetic algorithm, the artificial immune algorithm and niche particle swarm optimization exist for searching the extrema of a multi-peak function in the multi-extremum optimization of DNA sequences. However, the existing optimization algorithms have four main defects: (1) convergence is difficult in the multi-dimensional case; (2) they converge prematurely to a local extremum and lose the global optimum; (3) the number of populations cannot be controlled, and the results obtained are unstable; (4) the algorithms are complex, with too many parameters and too many prior conditions, so their applicable scenarios are limited. The present application therefore provides a gene sequence optimization scheme that can achieve population convergence and accurately obtain stable multi-extremum solutions, so that the gene sequence is effectively optimized.

Referring to FIG. 1, the embodiment of the present application discloses a gene sequence optimization method, including:

Step S11: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein the individuals in the initial generation population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized.

In a specific embodiment, the encoding mode may be determined in advance. This embodiment adopts a quaternary encoding corresponding to the DNA code: on the one hand, the quaternary code corresponds directly to the DNA sequence and can conveniently be combined with DNA computing in the future; on the other hand, the required gene length is shorter than with binary encoding. Specifically, Gray codes can be adopted. Note that when individual genes are Gray-coded, the Hamming distances between the individuals obtained in clustering are closer and more meaningful spatially, and the similarity of two genes can be compared directly from their sequences.

Note that in the prior art the most natural choice is standard binary coding. Binary coding, however, has a discontinuity problem in numerical calculations. For example, when counting upward through the natural numbers in binary, all four bits change when the number (or gene) 0111 changes to 1000, although the two values are in fact adjacent. Such a code poses no problem for calculations on a single individual, but it introduces artificial error into the clustering operation. The present embodiment may therefore encode the individuals in the population with Gray code, so that the encoding changes continuously along the sequence. The Gray code is a single-step self-complementary code with reflection and cyclic characteristics; its cyclic and single-step characteristics eliminate the possibility of large errors in random access. The Gray code is a variable-weight code in which no bit has a fixed weight, so magnitude comparison and arithmetic operations are difficult to perform on it directly; when required, it can be converted into natural binary code by a single code conversion and the corresponding operation carried out there. The formula for converting binary code into Gray code, in its standard form, is:

$$G_i = B_i \oplus B_{i+1} \quad (0 \le i \le n-2), \qquad G_{n-1} = B_{n-1},$$

where G is the Gray code, B is the binary code, i denotes the i-th bit (bit n-1 being the most significant), and $\oplus$ denotes exclusive OR; with this formula the n-bit Gray code is obtained directly from the corresponding n-bit binary code word. The formula for decoding Gray code back into binary code, again in its standard form, is:

$$B_{n-1} = G_{n-1}, \qquad B_i = G_i \oplus B_{i+1} \quad (0 \le i \le n-2)$$

After conversion to Gray code, the Gray code can be encoded in the form of the DNA code: G for binary 00, T for 01, A for 10, and C for 11. Conversion among the three codes (binary, Gray and DNA) is carried out according to the size of the system and the specific situation.
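The conversions among binary code, Gray code and DNA code can be illustrated with a short sketch. It uses the standard reflected Gray code and the two-bits-per-base mapping stated above (G = 00, T = 01, A = 10, C = 11); the function names and the choice of integers as the storage format are assumptions made for illustration.

```python
# Two-bit groups to bases, as stated in the text: G=00, T=01, A=10, C=11.
BITS_TO_BASE = {"00": "G", "01": "T", "10": "A", "11": "C"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def binary_to_gray(value):
    """Natural binary -> reflected Gray code (each bit XOR its higher neighbour)."""
    return value ^ (value >> 1)


def gray_to_binary(gray):
    """Reflected Gray code -> natural binary, by XOR-ing all right shifts."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def gray_to_dna(gray, n_bits):
    """Render an n-bit Gray code (n even) as a DNA string, two bits per base."""
    if n_bits % 2:
        raise ValueError("n_bits must be even: each base encodes two bits")
    bits = format(gray, f"0{n_bits}b")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, n_bits, 2))


def dna_to_gray(sequence):
    """Parse a DNA string back into the integer holding its Gray code."""
    return int("".join(BASE_TO_BITS[base] for base in sequence), 2)


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # 0111 -> 1000 flips four bits in binary but only one bit in Gray code.
    print(hamming(7, 8), hamming(binary_to_gray(7), binary_to_gray(8)))
    print(gray_to_dna(binary_to_gray(8), 4))
```

The first printed pair reproduces the 0111 to 1000 example from the text: adjacent values differ in four binary bits but in a single Gray-code bit, which is why Gray-coded individuals cluster more meaningfully.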

Step S12: clustering the initial operation population.

In a specific implementation manner, the present embodiment may use the K-means clustering algorithm to cluster the initial operation population.

In another specific implementation, this embodiment may use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise, a density-based spatial clustering algorithm that is robust to noise) algorithm to cluster the initial operation population.

It should be noted that each of the two clustering methods has its advantages: K-means is fast to compute and introduces only one additional parameter (the number of clusters K), whereas DBSCAN is density-based, does not require the number of clusters to be preset, and remains effective for non-convex functions. In practical application, K for K-means is obtained by testing multiple sets of parameters in parallel; the density radius of DBSCAN is likewise obtained by testing multiple sets of parameters in parallel (it is typically set to 0.5), with the minimum number of classes set to 5% of the population.

That is, in the embodiment of the present application, individuals with similar genes can be grouped into one type by using a preset clustering algorithm.
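As an illustration of grouping individuals with similar genes, the following sketch runs a small K-means on the bit vectors of Gray-coded individuals. It is a simplified stand-in rather than the embodiment itself: the deterministic farthest-first initialization, the helper names and the sample population are all assumptions. For 0/1 vectors the squared Euclidean distance between two individuals equals their Hamming distance, which ties the clustering back to the Gray coding.

```python
def to_bits(gray, n_bits):
    """Expand an integer Gray code into a list of 0.0/1.0 values, MSB first."""
    return [float((gray >> shift) & 1) for shift in range(n_bits - 1, -1, -1)]


def sq_dist(p, q):
    """Squared Euclidean distance (equals Hamming distance for 0/1 vectors)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic initialization (assumed): start at points[0], then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    population = [0b00000000, 0b00000001, 0b00000010,
                  0b11111111, 0b11111110, 0b11111101]
    print(kmeans([to_bits(g, 8) for g in population], k=2))
```

K-means is sensitive to initialization, so in practice several initializations or several values of K would typically be tried, in line with the parallel parameter testing described above.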

Step S13: calculating the fitness of each individual in the initial operational population by using a protein-gene interaction scoring function.

It is noted that protein-DNA interactions can generally be studied in silico by molecular docking, Monte Carlo simulation, molecular dynamics, and the like. The strength of the interaction can be calculated by a scoring function, and in many cases a suitable DNA sequence must be found by optimizing the scoring function of the protein-DNA interaction. One scoring function has the following form:

Figure BDA0002555778870000081

where E is the interaction energy, i.e. the scoring function and the objective function to be optimized; i indexes each basic unit in the DNA (a base in biological terms, a quaternary code in computational terms) and j indexes each basic unit in the protein (an amino acid in biological terms, a functional form in computational terms). A, B, C, a and b are coefficients, r is the interaction distance between the DNA and the protein, and q is the charge.

It should be noted that the embodiment of the present application does not limit the specific form of the scoring function, and different scoring functions may be applied according to specific application scenarios.

For example, FIG. 2 shows a schematic diagram of a protein-DNA interaction disclosed in the present application.

Step S14: screening out, according to the clustering result, the individuals that meet preset conditions from the initial operation population to obtain a target population.

In a specific implementation manner, in the embodiment of the present application, a third preset number of second target individuals may be screened out from each cluster; the fitness of the second target individual is higher than that of other individuals in the current cluster; after the second target individual is screened out, screening out a third target individual from other individuals in the initial operational population by using a roulette method to obtain the target population; the target population includes the second target individual and the third target individual.

That is, the present embodiment may combine the elite selection strategy with the roulette strategy to screen the population according to fitness. Specifically, the top 5% of individuals by fitness in each cluster may be selected and retained, and the remaining individuals may be carried over to the next generation by a roulette method, with a retention probability of

Figure BDA0002555778870000082

where f(H) is the fitness of pattern H, and Σf is the sum of the fitness values of all individuals in the population.
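The combination of elite selection and roulette selection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this embodiment: it assumes the retention probability takes the usual fitness-proportional form f/Σf implied by the definitions above, that fitness values are positive, that at least one elite is kept per cluster, and that the population size is kept constant. The function name `select_next_generation` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Optional, Sequence


def select_next_generation(
    labels: Sequence[int],
    fitness: Sequence[float],
    elite_fraction: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return the indices of the individuals kept for the next generation."""
    rng = rng or random.Random()

    # Group individual indices by cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    # Elite step: keep the fittest elite_fraction of every cluster
    # (assumption: never fewer than one individual per cluster).
    elites: List[int] = []
    for members in clusters.values():
        ranked = sorted(members, key=lambda i: fitness[i], reverse=True)
        n_elite = max(1, int(len(ranked) * elite_fraction))
        elites.extend(ranked[:n_elite])

    # Roulette step: fill the remaining slots from the non-elite individuals,
    # each drawn with probability proportional to its fitness.
    elite_set = set(elites)
    rest = [i for i in range(len(labels)) if i not in elite_set]
    n_rest = len(labels) - len(elites)
    if not rest or n_rest <= 0:
        return elites
    weights = [fitness[i] for i in rest]
    return elites + rng.choices(rest, weights=weights, k=n_rest)


kept = select_next_generation(
    labels=[0, 0, 0, 1, 1, 1],
    fitness=[1.0, 5.0, 2.0, 9.0, 3.0, 4.0],
    rng=random.Random(0),
)
print(kept)
```

Because the elites are taken per cluster, the best individual of every cluster survives even when its fitness is low compared with other clusters, which is how the clustering information is carried into the next generation.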

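The theoretical basis of this embodiment is the pattern (schema) theorem and the building block hypothesis. The pattern theorem is as follows:

$$m(H, t+1) \ge m(H, t)\,\frac{f(H)}{\bar{f}}\left[1 - p_c\,\frac{\delta(H)}{l-1} - O(H)\,p_m\right]$$

where t is the generation number of population screening, f(H) is the fitness of pattern H, $\bar{f}$ is the mean fitness of the population, $p_c$ is the crossover probability, $\delta(H)$ is the length of pattern H, l is the string length, O(H) is the order of pattern H, $p_m$ is the mutation probability, and m refers to the population, with m(H, t) being the number of individuals matching pattern H in generation t.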

It should be noted that each pattern can be understood as a building block. Under the action of the genetic operations, the genetic algorithm combines these building blocks with one another, and the screening follows the pattern theorem, so the global optimal solution can still be approached without relying on higher-order information such as gradients. That is, the present embodiment can eventually converge.

Step S15: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals and wherein both individuals in any one of the groups of individuals belong to the same cluster.

In a specific embodiment, a first preset number of individual groups may be selected from the population, and the two individuals in each group exchange part of their genes in a certain manner. Following the concept of reproductive isolation, crossover occurs only between two individuals in the same cluster and never between different clusters. For example, 20% of the individuals in the population are selected for the crossover operation.
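The cluster-restricted crossover can be sketched as follows. The source only states that part of the genes is exchanged "in a certain manner"; single-point crossover, the bit-list representation, and the name `crossover_within_clusters` are assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple


def crossover_within_clusters(
    population: List[List[int]],
    labels: Sequence[int],
    rate: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, int]]:
    """Single-point crossover (in place) on pairs drawn from the same cluster.

    Returns the list of index pairs that were crossed.
    Assumes every individual is a bit list of the same length (at least 2).
    """
    rng = rng or random.Random()
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    # Only clusters with at least two members can supply a pair.
    eligible = [members for members in clusters.values() if len(members) >= 2]
    pairs: List[Tuple[int, int]] = []
    if not eligible:
        return pairs

    n_pairs = int(len(population) * rate) // 2  # rate = share of individuals crossed
    for _ in range(n_pairs):
        members = rng.choice(eligible)
        i, j = rng.sample(members, 2)           # both parents from one cluster
        point = rng.randrange(1, len(population[i]))
        a, b = population[i], population[j]
        population[i] = a[:point] + b[point:]   # swap the tails
        population[j] = b[:point] + a[point:]
        pairs.append((i, j))
    return pairs
```

Since both parents always come from the same cluster, genes never flow between clusters, which mirrors the reproductive isolation described above.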

Step S16: randomly selecting a second preset number of individuals from the target population for variation.

In a specific embodiment, some individuals may be randomly selected from the population with a certain probability, and a randomly chosen gene in each of these individuals is mutated.
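A minimal sketch of this mutation step is given below, assuming each individual is a list of bits and that mutating a gene means flipping one bit. The function name `mutate` and the default rate are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mutate(
    population: List[List[int]],
    rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, int]]:
    """Flip one random bit (in place) in each individual chosen with probability `rate`.

    Returns (individual index, bit position) for every mutation performed.
    """
    rng = rng or random.Random()
    mutated: List[Tuple[int, int]] = []
    for idx, individual in enumerate(population):
        if rng.random() < rate:
            pos = rng.randrange(len(individual))
            individual[pos] ^= 1      # 0 -> 1 or 1 -> 0
            mutated.append((idx, pos))
    return mutated
```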

Step S17: judging whether the target population reaches a preset convergence condition.

Step S18: if the target population does not reach the preset convergence condition, determining the current target population as the initial operation population, and jumping to step S12 to continue iteration.

Step S19: if the target population reaches the preset convergence condition, stopping iteration, and determining a first target individual from the target population to realize optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

In a specific implementation manner, the embodiment may place the determined first target individual at a corresponding position of the gene sequence to be optimized, so as to optimize the gene sequence.

For example, referring to FIG. 3, FIG. 3 is a flow chart of a specific gene sequence optimization method disclosed in the examples of the present application.

In addition, this embodiment may be tested on a multimodal function system to verify that the scheme can optimize a variety of complex protein-DNA scoring functions and find suitable DNA sequences. The DNA sequence optimization method stably obtains multiple extreme points and performs well. The following three two-dimensional functions are taken as examples for testing two-dimensional DNA sequence optimization:

peaks function:

roots function:

Figure BDA0002555778870000102

schaffers function:

where x and y in all three functions correspond to gene segments.
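For orientation, the sketch below evaluates a Peaks-type test function. It assumes that the Peaks function used here is the standard two-dimensional "peaks" surface commonly used as a MATLAB demo function; the source does not confirm this form, so the definition of `peaks` below is an assumption and not the formula of this embodiment.

```python
import math


def peaks(x: float, y: float) -> float:
    """Standard two-dimensional 'peaks' surface (assumed form, not confirmed by the source)."""
    return (
        3.0 * (1.0 - x) ** 2 * math.exp(-x ** 2 - (y + 1.0) ** 2)
        - 10.0 * (x / 5.0 - x ** 3 - y ** 5) * math.exp(-x ** 2 - y ** 2)
        - math.exp(-(x + 1.0) ** 2 - y ** 2) / 3.0
    )


# Evaluate at the first and third extreme points reported for the Peaks function.
for point in [(-0.01, 1.58), (1.29, 0.00)]:
    print(point, round(peaks(*point), 2))
```

The printed values can be compared with the maximum values reported for the first and third extreme points of the Peaks function in this embodiment.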

Referring to FIG. 4, FIG. 4 is a schematic diagram of a specific function optimization effect disclosed in the present application, showing the results after 20 rounds of optimization of the three functions using the scheme provided in this embodiment. For ease of analysis, three illustrative functions commonly used to test multimodal function optimization were used as scoring functions, and the DNA sequence was converted to a binary floating-point number accurate to two decimal places. When the optimization converges or terminates (the termination condition is set to 100 iterations), the result converges to one (or several) points. For ease of viewing, FIG. 4 shows only the results of the 20 rounds of optimization. Each column corresponds to one function: the Peaks function, the Roots function, and the Schaffers function, respectively.

This embodiment can solve multi-extremum scoring functions. Analysis shows that the classical genetic algorithm (GA) finds only the optimal solution of the Peaks function and loses the two suboptimal peaks; for the Roots function it randomly obtains only one of the six peaks; and for the Schaffers function the solution obtained depends on the peak area: the wider the peak area, the higher the probability of obtaining that solution, and solutions with small peak areas may be lost. The method provided by this embodiment may be named Clust-DNA/GA and includes a K-means/genetic algorithm (K-means/GA) and a DBSCAN/genetic algorithm (DBSCAN/GA). K-means/GA correctly obtains all extremum solutions of the first two functions but cannot handle the non-convex Schaffers function; DBSCAN/GA stably finds all extremum solutions of all three functions.

See Table 1, which lists the extreme points of the Peaks function found using DBSCAN/GA as an example.

See Table 2, which is the analysis table for the Clust-DNA/GA scheme tested on the Peaks function.

See Table 3, which is the analysis table for the Clust-DNA/GA scheme tested on the Roots function.

See Table 4, which is the analysis table for the Clust-DNA/GA scheme tested on the Schaffers function.

Table 1

| | First extreme point | Second extreme point | Third extreme point |
| --- | --- | --- | --- |
| Extremum solution | (-0.01, 1.58) | (-0.46, 0.63) | (1.29, 0.00) |
| Maximum value | 8.11 | 3.78 | 3.59 |

Table 2

| Algorithm | Computation time | Number of iterations | Number of peaks found | Accuracy |
| --- | --- | --- | --- | --- |
| GA | 506.1 s | 100 | 1 | 33.3% |
| K-means/GA | 523.6 s | 100 | 3 | 100% |
| DBSCAN/GA | 529.7 s | 100 | 3 | 100% |

Table 3

| Algorithm | Computation time | Number of iterations | Number of peaks found | Accuracy |
| --- | --- | --- | --- | --- |
| GA | 432.7 s | 100 | 1 | 16.6% |
| K-means/GA | 459.2 s | 100 | 6 | 100% |
| DBSCAN/GA | 461.8 s | 100 | 6 | 100% |

Table 4

| Algorithm | Computation time | Number of iterations | Number of peaks found | Accuracy |
| --- | --- | --- | --- | --- |
| GA | 891.5 s | 100 | 1 | 59.1% |
| K-means/GA | 929.3 s | 100 | 3 | 54.5% |
| DBSCAN/GA | 937.6 s | 100 | 11 | 100% |

It can be seen that, without a significant increase in computation time, the Clust-DNA/GA method achieves a peak-finding accuracy far higher than that of the classical genetic algorithm, and DBSCAN/GA reaches 100% peak finding on all three test functions. In addition, the method provided by this embodiment iterates quickly, converges, gives stable results, and shows no premature convergence in the various test systems. Structurally, this embodiment combines a machine learning method with a genetic algorithm, which greatly broadens the application scenarios of genetic algorithms and provides a concrete approach for directly combining genetic algorithms with algorithms such as dimensionality reduction, clustering, and even neural networks.

It can be understood that this embodiment proposes, for the first time, a scheme that combines clustering with a DNA-inspired genetic algorithm and integrates the two structurally, thereby achieving multi-extremum optimization of a protein-DNA scoring function. The scheme can be used for DNA sequence search and for numerical optimization. In implementing the combination of the clustering algorithm and the genetic algorithm, this embodiment adopts Gray codes so that clustering is performed more accurately, and then completes the fitness calculation after decoding, thereby linking the clustering with the genetic algorithm. Further, the clustering results are screened with an elite selection strategy so that the clustering information is preserved during evolution, while the rest of the population is screened and constructed by roulette, so that the genetic algorithm can exert its strength of survival of the fittest. That is, the embodiment of the present application takes as its inspiration the way DNA enzymes screen for DNA sequences of a specific pattern in the natural environment, and its combination of clustering with a genetic algorithm borrows the concept of reproductive isolation from biological evolution: similarity is computed for each generation of the population by the clustering algorithm, and the optimal individuals in each class are retained to the next generation, so that the population evolves class by class and multi-extremum optimization of the protein-DNA interaction scoring function is accomplished. For a sequence optimization system, clustering is performed directly on the DNA codes; for a numerical optimization system, clustering can be carried out smoothly through Gray codes. Potential multi-extremum solutions are obtained through clustering, the clustering results are preserved by the elite selection strategy, and the strengths of the genetic algorithm are brought to bear for screening and evolution.
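The role of the Gray code can be illustrated with a short sketch. In a reflected Gray code, numerically adjacent values always differ in exactly one bit, so a bit-wise distance between codes tracks numerical closeness better than plain binary does. The function names below are hypothetical, and the integer representation is chosen only for brevity.

```python
def binary_to_gray(n: int) -> int:
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Decode a reflected Gray code back to the integer it represents."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bit_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# 7 -> 8 flips four bits in plain binary (0111 -> 1000) but only one bit in Gray code.
plain_distance = bit_distance(7, 8)
gray_distance = bit_distance(binary_to_gray(7), binary_to_gray(8))
print(plain_distance, gray_distance)
```

Decoding with `gray_to_binary` recovers the numerical value, which corresponds to the step in which the fitness calculation is completed after decoding.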

As can be seen, in the embodiment of the present application an initial generation population is first randomly generated and used as the initial operation population for iterative operation, where every individual in the initial generation population is a Gray code corresponding to the target gene sequence, and the target gene sequence is a sequence composed of a plurality of gene segments in the gene sequence to be optimized. The initial operation population is then clustered, the fitness of each individual in the initial operation population is calculated with a protein-gene interaction scoring function, and the individuals that meet preset conditions are screened out of the initial operation population according to the clustering result to obtain a target population. Next, a first preset number of individual groups are selected from the target population for crossover calculation, where each individual group comprises two individuals belonging to the same cluster, and a second preset number of individuals are randomly selected from the target population for mutation. It is then judged whether the target population reaches a preset convergence condition: if not, the current target population is determined as the initial operation population and iteration continues; if so, iteration stops and the first target individuals, which comprise the individual with the highest fitness in each cluster, are determined from the target population, thereby optimizing the gene sequence to be optimized. In this way, clustering the iterated population allows the population to converge and stable multi-extremum solutions to be obtained accurately, so that the gene sequence is optimized effectively.

Referring to fig. 5, the embodiment of the present application discloses a specific gene sequence optimization method, which comprises:

step S21: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein, the individuals in the primary population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

step S22: clustering the initial operation population;

step S23: performing fitness calculation on each individual in the initial operational population by using a protein and gene interaction scoring function;

step S24: screening the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

step S25: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

step S26: randomly selecting a second preset number of individuals from the target population for variation;

step S27: judging whether the number of evolution generations of the target population reaches a preset generation threshold;

step S28: if the number of evolution generations of the target population has not reached the preset generation threshold, determining the current target population as the initial operation population, and jumping to step S22 to continue iteration;

step S29: if the number of evolution generations of the target population reaches the preset generation threshold, stopping iteration, and determining a first target individual from the target population to realize the optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

That is, in this embodiment, the convergence condition of the target population may be set as the number of evolution generations of the target population reaching the preset generation threshold; if the number of evolution generations reaches the preset generation threshold, it is determined that the target population meets the preset convergence condition.

Referring to fig. 6, the present application discloses a specific gene sequence optimization method, including:

step S31: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein, the individuals in the primary population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

step S32: clustering the initial operation population;

step S33: performing fitness calculation on each individual in the initial operational population by using a protein and gene interaction scoring function;

step S34: screening the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

step S35: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

step S36: randomly selecting a second preset number of individuals from the target population for variation;

step S37: judging whether a target difference value corresponding to the target population is smaller than a preset difference threshold value or not; the target difference value is calculated based on binary difference values between all individuals in the current target population and corresponding individuals in the previous generation target population; the binary difference is calculated by using binary codes corresponding to the individuals.

In a specific implementation manner, in this embodiment, the binary difference values between all individuals in the current target population and the corresponding individuals in the previous generation target population are calculated first, and then the mean value of all the binary difference values is calculated to obtain the target difference value.

In another specific implementation, the embodiment may calculate the MSE (mean squared error) of the binary code of all the individuals in the current target population and the corresponding individuals in the previous generation target population.
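The two variants of this convergence test can be sketched as follows. The sketch assumes that each individual is a Gray-coded bit list (most significant bit first), that individuals of consecutive generations correspond by index, and that the binary difference is the absolute difference of the decoded binary values. These points and all function names are assumptions; the source does not fix them.

```python
from typing import List, Sequence


def gray_bits_to_int(bits: Sequence[int]) -> int:
    """Decode a Gray-coded bit list (most significant bit first) to an integer."""
    value, bit = 0, 0
    for g in bits:
        bit ^= g                     # binary bit = XOR of all Gray bits so far
        value = (value << 1) | bit
    return value


def mean_binary_difference(current: List[Sequence[int]],
                           previous: List[Sequence[int]]) -> float:
    """Mean absolute difference of the decoded binary values, individual by individual."""
    diffs = [abs(gray_bits_to_int(c) - gray_bits_to_int(p))
             for c, p in zip(current, previous)]
    return sum(diffs) / len(diffs)


def binary_mse(current: List[Sequence[int]],
               previous: List[Sequence[int]]) -> float:
    """Mean squared error of the decoded binary values, individual by individual."""
    sq = [(gray_bits_to_int(c) - gray_bits_to_int(p)) ** 2
          for c, p in zip(current, previous)]
    return sum(sq) / len(sq)


def has_converged(current, previous, threshold: float) -> bool:
    """Convergence is declared when the target difference falls below the threshold."""
    return mean_binary_difference(current, previous) < threshold


previous_gen = [[0, 1, 1, 0], [0, 1, 0, 1]]   # decode to 4 and 6
current_gen = [[0, 1, 1, 1], [0, 1, 0, 1]]    # decode to 5 and 6
print(mean_binary_difference(current_gen, previous_gen))
```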

Step S38: if the target difference value corresponding to the target population is greater than or equal to the preset difference threshold, determining the current target population as the initial operation population, and jumping to step S32 to continue iteration;

step S39: if the target difference value corresponding to the target population is smaller than the preset difference value threshold, stopping iteration, and determining a first target individual from the target population to realize optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

That is, in this embodiment, the convergence condition of the target population may be set as the target difference value corresponding to the target population being smaller than the preset difference threshold; if the target difference value is smaller than the preset difference threshold, it is determined that the target population meets the preset convergence condition.

Referring to fig. 7, the present application discloses a specific gene sequence optimization method, including:

step S41: randomly generating an initial generation population, taking the initial generation population as an initial operation population, and performing iterative operation; wherein, the individuals in the primary population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

step S42: clustering the initial operation population;

step S43: performing fitness calculation on each individual in the initial operational population by using a protein and gene interaction scoring function;

step S44: screening the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

step S45: selecting a first preset number of individual groups from the target population for cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

step S46: randomly selecting a second preset number of individuals from the target population for variation;

step S47: judging whether the Gray-code bit difference count corresponding to the target population is smaller than a preset bit-difference threshold; the Gray-code bit difference count is a bit-difference count determined based on the Gray-code bit differences between all the individuals in the current target population and the corresponding individuals in the previous generation target population.

In a specific implementation manner, in this embodiment, the number of bits at which the Gray codes differ may first be calculated between each individual in the current target population and the corresponding individual in the previous generation target population, and the mean of all these bit differences is then taken to obtain the Gray-code bit difference count.
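A minimal sketch of this bit-difference test is given below. It assumes that individuals are Gray-coded bit lists of equal length and that individuals of consecutive generations correspond by index; the function names are hypothetical.

```python
from typing import List, Sequence


def gray_bit_difference(current: List[Sequence[int]],
                        previous: List[Sequence[int]]) -> float:
    """Mean number of differing Gray-code bits between corresponding individuals."""
    diffs = [sum(c_bit != p_bit for c_bit, p_bit in zip(c, p))
             for c, p in zip(current, previous)]
    return sum(diffs) / len(diffs)


def bits_converged(current, previous, threshold: float) -> bool:
    """Convergence is declared when the mean bit difference falls below the threshold."""
    return gray_bit_difference(current, previous) < threshold


prev_population = [[0, 1, 1, 0], [0, 1, 0, 1]]
curr_population = [[0, 1, 1, 1], [0, 1, 0, 1]]   # one bit changed in the first individual
print(gray_bit_difference(curr_population, prev_population))
```

Unlike the binary difference, this measure needs no decoding, since it is computed directly on the Gray codes.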

Step S48: if the Gray-code bit difference count corresponding to the target population is greater than or equal to the preset bit-difference threshold, determining the current target population as the initial operation population, and jumping to step S42 to continue iteration;

step S49: if the Gray-code bit difference count corresponding to the target population is smaller than the preset bit-difference threshold, stopping iteration, and determining a first target individual from the target population to realize optimization of the gene sequence to be optimized; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

That is, in this embodiment, the convergence condition of the target population may be set as the Gray-code bit difference count being smaller than the preset bit-difference threshold; if the Gray-code bit difference count corresponding to the target population is smaller than the preset bit-difference threshold, it is determined that the target population meets the preset convergence condition.

Referring to fig. 8, the embodiment of the present application discloses a gene sequence optimization apparatus, including:

the initial generation population generating module 11 is configured to randomly generate an initial generation population, and perform iterative operation by using the initial generation population as an initial operation population; wherein, the individuals in the primary population are all Gray codes corresponding to the target gene sequence; the target gene sequence is a sequence consisting of a plurality of gene segments in the gene sequence to be optimized;

a population clustering module 12, configured to cluster the initial operation population;

a fitness calculation module 13, configured to perform fitness calculation on each individual in the initial operational population by using a protein-gene interaction scoring function;

the individual screening module 14 is configured to screen the individuals meeting preset conditions from the initial operation population according to the clustering result to obtain a target population;

the cross calculation module 15 is configured to select a first preset number of individual groups from the target population to perform cross calculation; wherein the group of individuals comprises two individuals, and wherein both individuals in any one of the groups of individuals belong to the same cluster;

an individual variation module 16, configured to randomly select a second preset number of individuals from the target population for variation;

a convergence judging module 17, configured to judge whether the target population reaches a preset convergence condition;

an iteration control module 18, configured to determine the current target population as an initial operation population if the convergence judgment module 17 determines that the target population does not reach the preset convergence condition, and jump to the module 12 to continue iteration;

a target individual determining module 19, configured to stop iteration and determine a first target individual from the target population to optimize the gene sequence to be optimized if the target population reaches the preset convergence condition; wherein the first target individuals comprise individuals with the highest fitness in each cluster.

In a specific embodiment, the population clustering module 12 is specifically configured to cluster the initial operation population by using a K-means algorithm.

In another specific embodiment, the population clustering module 12 is specifically configured to cluster the initial operation population by using a DBSCAN algorithm.

In a first specific embodiment, the convergence determining module 17 is specifically configured to determine whether the number of evolution generations of the target population reaches a preset generation threshold.

In a second specific embodiment, the convergence determining module 17 is specifically configured to determine whether a target difference corresponding to the target population is smaller than a preset difference threshold;

the target difference value is calculated based on binary difference values between all individuals in the current target population and corresponding individuals in the previous generation target population; the binary difference is calculated by using binary codes corresponding to the individuals.

In a third specific embodiment, the convergence determining module 17 is specifically configured to determine whether a Gray-code bit difference count corresponding to the target population is smaller than a preset bit-difference threshold;

the Gray-code bit difference count is a bit-difference count determined based on the Gray-code bit differences between all the individuals in the current target population and the corresponding individuals in the previous generation target population.

The individual screening module 14 is specifically configured to screen a third preset number of second target individuals from each cluster; the fitness of the second target individual is higher than that of other individuals in the current cluster; after the second target individual is screened out, screening out a third target individual from other individuals in the initial operational population by using a roulette method to obtain the target population; the target population includes the second target individual and the third target individual.

Referring to FIG. 9, the embodiment of the present application discloses gene sequence optimization equipment, which includes a processor 21 and a memory 22; wherein the memory 22 is used for storing a computer program, and the processor 21 is configured to execute the computer program to implement the gene sequence optimization method disclosed in the foregoing embodiments.

For the specific process of the gene sequence optimization method, reference may be made to the corresponding contents disclosed in the foregoing examples, which are not repeated herein.

Further, the present application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to implement the gene sequence optimization method disclosed in the foregoing embodiments.

For the specific process of the gene sequence optimization method, reference may be made to the corresponding contents disclosed in the foregoing examples, which are not repeated herein.

The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The method, the device, the equipment and the medium for optimizing the gene sequence provided by the application are introduced in detail, specific examples are applied in the description to explain the principle and the implementation mode of the application, and the description of the examples is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
