Molecular structure diagram retrieval method based on evolution calculation multi-view fusion

Document No.: 987766  Publication date: 2020-11-06

Reading note: this technique, a molecular structure diagram retrieval method based on evolution calculation multi-view fusion, was designed and created by 梁新彦, 郭倩, 钱宇华, 朱哲清 and 彭甫镕 on 2020-07-10. The invention relates to a molecular structure diagram retrieval method based on evolution calculation multi-view fusion. The method comprises the following steps: step 1, enhancing the data by operations such as rotation, translation and scaling; step 2, training a plurality of deep models on the enhanced data to serve as multi-view feature extractors; step 3, extracting the multi-view features of the enhanced data with the trained extractors; step 4, searching for a well-performing multi-view fusion model through an evolutionary algorithm to obtain a molecular structure diagram classification model and the fused features of the data set; and step 5, retrieving the molecular structure diagram to be retrieved either directly with the classification model or by ranking the similarity, computed from the fused features, between the diagram to be retrieved and the molecular diagrams in the retrieval library. The invention solves the problem of retrieving chemical molecular structural formulas directly from images, without depending on character encodings of the structural formulas, in the field of chemical informatics.

1. A molecular structure diagram retrieval method based on evolution calculation multi-view fusion is characterized by comprising the following steps:

step 1, data enhancement: given a molecular structure diagram data set, denoted D = {(x_i, y_i) | 1 ≤ i ≤ |D|}, where x_i denotes a molecular structure diagram and y_i denotes the category of x_i; firstly, the molecular structure diagram data set D is enhanced by a data enhancement method; then, the diagrams in the enhanced data set are uniformly scaled to size w × h, wherein: w denotes the image width and h denotes the image height; finally, copies of each image in the enhanced data set are obtained at different reduction ratios, each reduced image is enlarged back to w × h by filling the outer margin with pixel value 255, and the data set formed by all molecular structure images generated by this process is denoted D*;

Step 2, training a multi-view feature extractor: given a group of deep convolutional networks N = {AlexNet, VGG, ResNet, DenseNet, GoogLeNet, Inception}, where N_i denotes a deep convolutional network; first, the number of neurons of each network's classification layer in N is replaced by the number of categories of data set D*, and cross entropy is taken as the loss function of each network; then, the 6 networks in N are trained separately on data set D*;

step 3, extracting the multi-view features of data set D*: the last layer, i.e. the classification layer, of every network in N is removed, and the networks with the classification layer removed are denoted G = {G_i | 1 ≤ i ≤ |N|} and used as the multi-view feature extractors; each network G_i in G extracts one view feature of data set D*, denoted V_v = {v_i^v | 1 ≤ i ≤ n}, where v denotes the view number and n denotes the total number of images; through this process, the 6 view features of data set D* are obtained, denoted V = {V_1, V_2, V_3, V_4, V_5, V_6};

Step 4, searching for a well-performing multi-view fusion model through an evolutionary algorithm:

Step 4.1, parameter conventions: the population size is denoted T; the t-th generation population is denoted P_t = {p_i | 1 ≤ i ≤ T}, where p_i denotes the i-th individual in the population; the set of fusion operators for fusing two view features is F = {f_i | 1 ≤ i ≤ |F|}, and the total number of fusion operators is denoted |F|;

step 4.2, individual encoding: each individual p_i (1 ≤ i ≤ T) in the population is a vector that encodes the numbers of the views participating in fusion and the fusion operators used to fuse those views; the length of p_i is 2|V_i| − 1, where |V_i| denotes the number of views participating in fusion for individual p_i; the first |V_i| elements of p_i encode the view numbers participating in fusion, and the values p_i[j] (1 ≤ j ≤ |V_i|) of these elements are pairwise distinct and satisfy 1 ≤ p_i[j] ≤ |V|; the last |V_i| − 1 elements encode the fusion operators used for view fusion, and the value p_i[j] (1 ≤ p_i[j] ≤ |F|, |V_i| + 1 ≤ j ≤ 2|V_i| − 1) of each of these elements means that the p_i[j]-th fusion operator in F is used to fuse the previous fusion result with the (j − |V_i| + 1)-th view;

step 4.3, individual decoding: each individual p_i can be decoded into a multi-view fusion network; the specific process is as follows: if 2|V_i| − 1 = 1, the individual contains only one view and no fusion operation needs to be performed, i.e. the fusion network is net_1 in equation (1); otherwise, the fusion network corresponding to p_i is obtained according to equations (1) and (2):

net_1 = FC(V_{p_i[1]}, units)    (1)

net_j = FC(f_{p_i[|V_i| + j − 1]}(net_{j−1}, V_{p_i[j]}), units), 2 ≤ j ≤ |V_i|    (2)

Wherein: FC(input, units) denotes a fully-connected layer, and input and units denote its two parameters: input is the feature to be input, and units is the number of neurons in the layer;

finally, the output net_{|V_i|} of the fusion network is mapped to the category space using equation (3):

out = softmax(FC(net_{|V_i|}, classes))    (3)

wherein: classes denotes the total number of molecular structure categories of data set D*;

step 4.4, population initialization: randomly generate T individuals according to step 4.2, denoted P_0 = {p_i | 1 ≤ i ≤ T}; decode each individual in P_0 into a multi-view fusion network according to step 4.3;

step 4.5, fitness value function: each multi-view fusion network is trained by minimizing the cross-entropy loss, and the fitness value of each individual is computed as the classification accuracy

fitness = (1/n) Σ_x I(y = pre_y)

wherein: y is the true category of sample x, pre_y is the category predicted by the multi-view fusion network, i.e. the category corresponding to the maximum probability value in the output of the multi-view fusion network, and I(·) denotes the indicator function, whose value is 1 when its condition is true and 0 otherwise;

step 4.6, generating the next generation population through selection, crossover and mutation: define a set Q_t = ∅ for temporarily storing the population;

Crossover process: randomly select two individuals from the previous generation population P_t and take the one with the higher fitness value, denoted p_1; repeat this process to select individual p_2; generate a random number r in the range 0–1; if r is less than the pre-specified crossover probability p_c, randomly select positions i and j (1 ≤ i ≤ |V_1|, 1 ≤ j ≤ |V_2|) in p_1 and p_2 respectively; by means of i, p_1 is divided at positions i, |V_1| and |V_1| + i − 1 into four parts, denoted [p_1[1], ..., p_1[i]], [p_1[i+1], ..., p_1[|V_1|]], [p_1[|V_1|+1], ..., p_1[|V_1|+i−1]] and [p_1[|V_1|+i], ..., p_1[2|V_1|−1]]; in the same way, p_2 is divided at positions j, |V_2| and |V_2| + j − 1 into four parts, denoted [p_2[1], ..., p_2[j]], [p_2[j+1], ..., p_2[|V_2|]], [p_2[|V_2|+1], ..., p_2[|V_2|+j−1]] and [p_2[|V_2|+j], ..., p_2[2|V_2|−1]]; the offspring of p_1 and p_2 are produced by equations (4) and (5);

o_1 = [p_1[1], ..., p_1[i], p_2[j+1], ..., p_2[|V_2|], p_1[|V_1|+1], ..., p_1[|V_1|+i−1], p_2[|V_2|+j], ..., p_2[2|V_2|−1]]    (4)

o_2 = [p_2[1], ..., p_2[j], p_1[i+1], ..., p_1[|V_1|], p_2[|V_2|+1], ..., p_2[|V_2|+j−1], p_1[|V_1|+i], ..., p_1[2|V_1|−1]]    (5)

perform view de-duplication on the offspring individuals o_1 and o_2 respectively; taking individual o_1 as an example: suppose a view number appears twice in o_1; denote the position of its second occurrence as i and delete the two elements o_1[i] and o_1[|V_1|+i−1] from o_1; repeat this process until no repeated view number appears in o_1; apply the same de-duplication to individual o_2; store the de-duplicated offspring o_1 and o_2 into Q_t; if r is greater than or equal to the pre-specified crossover probability p_c, store the individuals p_1 and p_2 into Q_t; repeat the above steps until the total number of individuals in Q_t is not less than T;

Mutation process: for each individual in Q_t, perform the following steps: generate a random number r in the range 0–1; if r is less than the pre-specified mutation probability p_m, randomly select a position in the individual, denoted i; if i ≤ |V_i|, randomly generate a view number to replace the view number at that position; if i > |V_i|, randomly select a fusion operator in F to replace the fusion operator at that position;

Selection process: define the next generation population set P_{t+1} = ∅; randomly select two individuals, denoted p_1 and p_2, from the combined set P_t ∪ Q_t, and put the one with the larger fitness value into P_{t+1}; repeat this process until the number of individuals in P_{t+1} is not less than that in P_t; find the individual with the largest fitness value in P_t ∪ Q_t, denoted p_best; if p_best is not in P_{t+1}, replace the individual with the smallest fitness value in P_{t+1} with p_best; decode each individual in P_{t+1} into the corresponding multi-view fusion network according to step 4.3, and then compute the fitness value of each multi-view fusion network in turn according to step 4.5;

step 4.7, repeat step 4.6 N times and take the model with the maximum fitness value, determined by the individuals in P_N, as the final fusion model, denoted EF; an individual sharing pool (denoted P_share) is maintained throughout the model evolution process to prevent identical individuals from being evaluated repeatedly: all individual codes generated during evolution are stored into P_share in the form of character strings; before training, each newly generated individual p is checked for existence in P_share; if present, the fitness value of the corresponding individual in P_share is assigned to p directly; otherwise, p is decoded into the corresponding multi-view fusion model and its fitness value is obtained by training the model;

step 5, two retrieval modes based on the EF model are provided: first, the retrieval problem is treated as an ultra-large-scale classification problem: the diagram to be retrieved is input directly into EF to obtain its category distribution probabilities, the output probabilities are sorted in descending order, and the molecular structure diagrams corresponding to the first K values are output; second, the last layer of EF, i.e. the classification layer, is first removed, and the truncated network is denoted EF*; then, the diagrams in the database D* are input into EF* in turn, and the output of its last layer is taken as the feature of the corresponding diagram; the diagram to be retrieved is input into EF*, and the output of EF* is taken as the feature of the retrieved diagram; from the features of the diagram to be retrieved and of the diagrams in D*, the cosine similarity between the diagram to be retrieved and every molecular structure diagram in D* is computed in turn; the computed similarity values are sorted in descending order, and the molecular structure diagrams corresponding to the first K values are output.

2. The molecular structure diagram retrieval method based on evolution calculation multi-view fusion as claimed in claim 1, wherein: the data enhancement method adopts any one of up-down flipping, left-right flipping, random rotation, shifting, zooming, cropping, translation, contrast adjustment, brightness adjustment, chroma adjustment, saturation adjustment, Gaussian blur, sharpening, adding Gaussian noise, adding salt-and-pepper noise, adding Poisson noise and adding multiplicative noise.

3. The molecular structure diagram retrieval method based on evolution calculation multi-view fusion as claimed in claim 1, wherein: the group of deep convolutional networks adopts any one of the networks AlexNet, ZF-Net, VGG, NiN, ResNet, DenseNet, GoogLeNet and Inception.

Technical Field

The invention relates to the field of chemical molecular structure retrieval in chemical informatics, and in particular to a molecular structure diagram retrieval method based on evolution calculation and multi-view fusion.

Background

The retrieval of chemical structural formulas is one of the core tasks in the field of chemical informatics: it is a mode of searching chemical information that takes an input chemical molecular structure diagram as the retrieval content, and it is a search commonly used by chemists in scientific research or when ordering chemical reagents.

At present, chemical structure retrieval mainly adopts methods based on predefined molecular structure encodings. For example, in structural formula retrieval based on the Simplified Molecular-Input Line-Entry System (SMILES), the chemical structure is first broken into fragments represented by symbols, the fragments are then arranged into a long string to form a linear code of the chemical structure, and a string-comparison strategy is then adopted to realize molecular structure retrieval. SMILES-based retrieval is currently the most widely adopted of the structural formula retrieval methods mastered in China. However, this approach may fail in the face of irregular or complex SMILES matching.

Methods based on molecular structure encoding require all molecular structure diagrams to be character-encoded according to a pre-designed coding scheme; this process is time-consuming and labor-intensive, and labeling errors are easy to make. Designing an encoding that does not depend on expert rules, so that the molecular structure diagram itself can be used directly as the retrieval object, is therefore necessary and important. In the field of computer vision, deep learning, as one of the most successful representation learning methods at present, has succeeded in face recognition, object classification and the like, so it is feasible to automatically establish feature representations of molecular structure diagrams with existing deep learning models. Effective feature representation plays a core and fundamental role in molecular structure retrieval performance; however, a single deep model cannot capture these features well. If existing deep models of different kinds can be used to extract features of a diagram from different views, and the features of the different views are fused in a suitable way, this is of great significance for graph-based retrieval of molecular structural formulas. The invention first enhances the data set by operations such as rotation and scaling, then extracts multi-view features from the enhanced data by means of several existing deep models, and then, for the fusion of the extracted multi-view features, provides a molecular structure diagram retrieval method based on an evolution calculation multi-view fusion model.

Disclosure of Invention

The invention aims to provide a molecular structure diagram retrieval method based on evolution calculation multi-view fusion.

The technical scheme adopted by the invention is as follows: a molecular structure diagram retrieval method based on evolution calculation multi-view fusion comprises the following steps:

step 1, data enhancement: given a molecular structure diagram data set, denoted D = {(x_i, y_i) | 1 ≤ i ≤ |D|}, where x_i denotes a molecular structure diagram and y_i denotes the category of x_i; firstly, the molecular structure diagram data set D is enhanced by a data enhancement method; then, the diagrams in the enhanced data set are uniformly scaled to size w × h, wherein: w denotes the image width and h denotes the image height; finally, copies of each image in the enhanced data set are obtained at different reduction ratios, each reduced image is enlarged back to w × h by filling the outer margin with pixel value 255, and the data set formed by all molecular structure images generated by this process is denoted D*;
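The shrink-and-pad copies of step 1 can be sketched as follows. This is a minimal illustration assuming grayscale numpy images and nearest-neighbour down-sampling; the helper name `shrink_and_pad` is an assumption, and the patent does not fix any particular image library.

```python
import numpy as np

def shrink_and_pad(img: np.ndarray, ratio: float) -> np.ndarray:
    """Shrink a grayscale h*w image by `ratio` (nearest neighbour), then
    pad the outer margin with pixel value 255 back to the original size,
    as described for the reduced copies of step 1."""
    h, w = img.shape
    nh, nw = max(1, int(h * ratio)), max(1, int(w * ratio))
    # nearest-neighbour down-sampling via row/column index selection
    rows = np.arange(nh) * h // nh
    cols = np.arange(nw) * w // nw
    small = img[np.ix_(rows, cols)]
    # white (255) canvas of the original size, shrunken image centred on it
    out = np.full((h, w), 255, dtype=img.dtype)
    top, left = (h - nh) // 2, (w - nw) // 2
    out[top:top + nh, left:left + nw] = small
    return out
```

Each ratio produces one extra copy of every image, so the enhanced set D* grows by one copy per ratio per image.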

Step 2, training a multi-view feature extractor: given a group of deep convolutional networks N = {AlexNet, VGG, ResNet, DenseNet, GoogLeNet, Inception}; first, the number of neurons of each network's classification layer in N is replaced by the number of categories of data set D*, and cross entropy is taken as the loss function of each network; then, the 6 networks in N are trained separately on data set D*;

step 3, extracting the multi-view features of data set D*: the last layer, i.e. the classification layer, of every network in N is removed, and the networks with the classification layer removed are denoted G = {G_i | 1 ≤ i ≤ |N|} and used as the multi-view feature extractors; each network G_i in G extracts one view feature of data set D*, denoted V_v = {v_i^v | 1 ≤ i ≤ n}, where v denotes the view number and n denotes the total number of images; through this process, the 6 view features of data set D* are obtained, denoted V = {V_1, V_2, V_3, V_4, V_5, V_6};

Step 4, searching for a well-performing multi-view fusion model through an evolutionary algorithm:

Step 4.1, parameter agreement: the population size is denoted T; the t generation population is represented as Pt={piI is more than or equal to 1 and less than or equal to T, wherein piRepresenting the ith individual in the population; fusion operator set F ═ { F) for fusing two view featuresiI is more than or equal to 1 and less than or equal to | F | }, and the total number of fusion operators is recorded as | F |;

step 4.2, individual encoding: each individual p_i (1 ≤ i ≤ T) in the population is a vector that encodes the numbers of the views participating in fusion and the fusion operators used to fuse those views; the length of p_i is 2|V_i| − 1, where |V_i| denotes the number of views participating in fusion for individual p_i; the first |V_i| elements of p_i encode the view numbers participating in fusion, and the values p_i[j] (1 ≤ j ≤ |V_i|) of these elements are pairwise distinct and satisfy 1 ≤ p_i[j] ≤ |V|; the last |V_i| − 1 elements encode the fusion operators used for view fusion, and the value p_i[j] (1 ≤ p_i[j] ≤ |F|, |V_i| + 1 ≤ j ≤ 2|V_i| − 1) of each of these elements means that the p_i[j]-th fusion operator in F is used to fuse the previous fusion result with the (j − |V_i| + 1)-th view;
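The encoding of step 4.2 can be illustrated with a small validity check. The function name and the operator count `n_ops` are assumptions for illustration, not fixed by the patent; an individual selecting views [3, 1, 5] and fusing them with operators [2, 4] is the vector [3, 1, 5, 2, 4] of length 2·3 − 1.

```python
def is_valid_individual(p, n_views=6, n_ops=4):
    """Check the structural constraints of step 4.2: length 2k-1, the first
    k elements are pairwise-distinct view numbers in 1..n_views, and the
    last k-1 elements are fusion operator indices in 1..n_ops."""
    k = (len(p) + 1) // 2            # number of views taking part in fusion
    if len(p) != 2 * k - 1:          # even-length vectors cannot be decoded
        return False
    views, ops = p[:k], p[k:]
    if len(set(views)) != k:         # view numbers must be pairwise distinct
        return False
    if not all(1 <= v <= n_views for v in views):
        return False
    return all(1 <= f <= n_ops for f in ops)
```

A single-view individual such as [2] is valid and requires no operators.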

step 4.3, individual decoding: each individual p_i can be decoded into a multi-view fusion network; the specific process is as follows: if 2|V_i| − 1 = 1, the individual contains only one view and no fusion operation needs to be performed, i.e. the fusion network is net_1 in equation (1); otherwise, the fusion network corresponding to p_i is obtained according to equations (1) and (2):

net_1 = FC(V_{p_i[1]}, units)    (1)

net_j = FC(f_{p_i[|V_i| + j − 1]}(net_{j−1}, V_{p_i[j]}), units), 2 ≤ j ≤ |V_i|    (2)

Wherein: FC(input, units) denotes a fully-connected layer, and input and units denote its two parameters: input is the feature to be input, and units is the number of neurons in the layer;

finally, the output net_{|V_i|} of the fusion network is mapped to the category space using equation (3):

out = softmax(FC(net_{|V_i|}, classes))    (3)

wherein: classes denotes the total number of molecular structure categories of data set D*;
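A toy sketch of the decoding of step 4.3, with numpy stand-ins for the fully-connected layer FC(input, units) and for the fusion operator set F. The two example operators (concatenation and truncated element-wise sum) and the helper names are assumptions for illustration; the patent does not enumerate F.

```python
import numpy as np

# assumed example fusion operators standing in for the patent's set F
F = [
    lambda a, b: np.concatenate([a, b]),   # operator 1: concatenation
    lambda a, b: a[:len(b)] + b,           # operator 2: element-wise sum
]

def fc(x, units, seed=0):
    """Stand-in for a fully-connected layer FC(x, units): a fixed random
    linear map onto `units` neurons (no training, illustration only)."""
    w = np.random.default_rng(seed).standard_normal((units, len(x)))
    return w @ x

def decode(p, views, units=8):
    """Decode individual p into the fused feature net_{|V_i|} following
    the recurrence of equations (1) and (2)."""
    k = (len(p) + 1) // 2
    net = fc(views[p[0] - 1], units)          # net_1, equation (1)
    for j in range(1, k):                     # net_j, equation (2)
        op = F[p[k + j - 1] - 1]              # operator fusing view j+1
        net = fc(op(net, views[p[j] - 1]), units)
    return net
```

Here `views` is a list of the six view feature vectors for one sample; a real implementation would replace `fc` with a trainable layer and map `net` to the category space as in equation (3).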

step 4.4, population initialization: randomly generate T individuals according to step 4.2, denoted P_0 = {p_i | 1 ≤ i ≤ T}; decode each individual in P_0 into a multi-view fusion network according to step 4.3;

step 4.5, fitness value function: each multi-view fusion network is trained by minimizing the cross-entropy loss, and the fitness value of each individual is computed as the classification accuracy

fitness = (1/n) Σ_x I(y = pre_y)

wherein: y is the true category of sample x, pre_y is the category predicted by the multi-view fusion network, i.e. the category corresponding to the maximum probability value in the output of the multi-view fusion network, and I(·) denotes the indicator function, whose value is 1 when its condition is true and 0 otherwise;
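The fitness value of step 4.5 reduces to classification accuracy once the indicator function is written out; a minimal numpy sketch (function name assumed):

```python
import numpy as np

def fitness(y_true: np.ndarray, probs: np.ndarray) -> float:
    """y_true: (n,) true categories; probs: (n, classes) network outputs.
    pre_y is the category of maximum probability per sample, and the
    fitness is the mean of the indicator I(y == pre_y)."""
    pre_y = probs.argmax(axis=1)
    return float(np.mean(y_true == pre_y))
```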

step 4.6, generating the next generation population through selection, crossover and mutation: define a set Q_t = ∅ for temporarily storing the population;

Crossover process: randomly select two individuals from the previous generation population P_t and take the one with the higher fitness value, denoted p_1; repeat this process to select individual p_2; generate a random number r in the range 0–1; if r is less than the pre-specified crossover probability p_c, randomly select positions i and j (1 ≤ i ≤ |V_1|, 1 ≤ j ≤ |V_2|) in p_1 and p_2 respectively; by means of i, p_1 is divided at positions i, |V_1| and |V_1| + i − 1 into four parts, denoted [p_1[1], ..., p_1[i]], [p_1[i+1], ..., p_1[|V_1|]], [p_1[|V_1|+1], ..., p_1[|V_1|+i−1]] and [p_1[|V_1|+i], ..., p_1[2|V_1|−1]]; in the same way, p_2 is divided at positions j, |V_2| and |V_2| + j − 1 into four parts, denoted [p_2[1], ..., p_2[j]], [p_2[j+1], ..., p_2[|V_2|]], [p_2[|V_2|+1], ..., p_2[|V_2|+j−1]] and [p_2[|V_2|+j], ..., p_2[2|V_2|−1]]; the offspring of p_1 and p_2 are produced by equations (4) and (5);

o_1 = [p_1[1], ..., p_1[i], p_2[j+1], ..., p_2[|V_2|], p_1[|V_1|+1], ..., p_1[|V_1|+i−1], p_2[|V_2|+j], ..., p_2[2|V_2|−1]]    (4)

o_2 = [p_2[1], ..., p_2[j], p_1[i+1], ..., p_1[|V_1|], p_2[|V_2|+1], ..., p_2[|V_2|+j−1], p_1[|V_1|+i], ..., p_1[2|V_1|−1]]    (5)

perform view de-duplication on the offspring individuals o_1 and o_2 respectively; taking individual o_1 as an example: suppose a view number appears twice in o_1; denote the position of its second occurrence as i and delete the two elements o_1[i] and o_1[|V_1|+i−1] from o_1; repeat this process until no repeated view number appears in o_1; apply the same de-duplication to individual o_2; store the de-duplicated offspring o_1 and o_2 into Q_t; if r is greater than or equal to the pre-specified crossover probability p_c, store the individuals p_1 and p_2 into Q_t; repeat the above steps until the total number of individuals in Q_t is not less than T;
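One possible reading of the crossover and de-duplication steps above, in plain Python. Positions are 1-based in the text and 0-based here, and the segment layout of equations (4) and (5) is an interpretation of the four-part split, not the patent's verbatim code.

```python
def crossover(p1, p2, i, j):
    """Split p1 at view position i and p2 at view position j (1-based),
    then exchange the matching view/operator segments to form offspring
    o1 and o2; each offspring remains a valid length-(2k-1) individual."""
    k1, k2 = (len(p1) + 1) // 2, (len(p2) + 1) // 2
    # p1's four parts: views [0:i], views [i:k1], ops [k1:k1+i-1], ops [k1+i-1:]
    o1 = p1[:i] + p2[j:k2] + p1[k1:k1 + i - 1] + p2[k2 + j - 1:]
    o2 = p2[:j] + p1[i:k1] + p2[k2:k2 + j - 1] + p1[k1 + i - 1:]
    return o1, o2

def dedup(o):
    """Delete the second occurrence of any repeated view number together
    with its paired fusion operator, as in the de-duplication step."""
    o = list(o)
    k = (len(o) + 1) // 2
    idx = 1
    while idx < k:
        if o[idx] in o[:idx]:
            del o[k + idx - 1]   # paired operator (higher index deleted first)
            del o[idx]           # the repeated view number itself
            k -= 1
        else:
            idx += 1
    return o
```

For both offspring the number of operator genes stays one less than the number of view genes, so the results decode under step 4.3 without further repair.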

Mutation process: for each individual in Q_t, perform the following steps: generate a random number r in the range 0–1; if r is less than the pre-specified mutation probability p_m, randomly select a position in the individual, denoted i; if i ≤ |V_i|, randomly generate a view number to replace the view number at that position; if i > |V_i|, randomly select a fusion operator in F to replace the fusion operator at that position;
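A minimal sketch of the mutation step, assuming 6 views and a hypothetical operator count `n_ops`; the function name and the seeded generator are illustrative choices.

```python
import random

def mutate(p, n_views=6, n_ops=4, pm=0.2, rng=random.Random(42)):
    """With probability pm, replace either one view number or one fusion
    operator of individual p at a uniformly chosen position."""
    p = list(p)
    if rng.random() >= pm:
        return p                     # no mutation this time
    k = (len(p) + 1) // 2
    i = rng.randrange(len(p))
    if i < k:                        # view part: draw a fresh view number
        p[i] = rng.randint(1, n_views)
    else:                            # operator part: draw a fresh operator
        p[i] = rng.randint(1, n_ops)
    return p
```

Note that a view mutation may reintroduce a duplicate view number; the de-duplication of the crossover step can be reapplied if needed.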

Selection process: define the next generation population set P_{t+1} = ∅; randomly select two individuals, denoted p_1 and p_2, from the combined set P_t ∪ Q_t, and put the one with the larger fitness value into P_{t+1}; repeat this process until the number of individuals in P_{t+1} is not less than that in P_t; find the individual with the largest fitness value in P_t ∪ Q_t, denoted p_best; if p_best is not in P_{t+1}, replace the individual with the smallest fitness value in P_{t+1} with p_best; decode each individual in P_{t+1} into the corresponding multi-view fusion network according to step 4.3, and then compute the fitness value of each multi-view fusion network in turn according to step 4.5;
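The binary-tournament selection with elitism described above can be sketched as follows; storing fitness values in a dict keyed by the encoded tuple is an assumption for illustration.

```python
import random

def select(parents, offspring, fit, T, rng=random.Random(0)):
    """Binary tournament over P_t ∪ Q_t until T individuals are chosen,
    then force the globally best individual in (elitism)."""
    pool = parents + offspring
    nxt = []
    while len(nxt) < T:
        a, b = rng.sample(pool, 2)           # draw two distinct individuals
        nxt.append(a if fit[tuple(a)] >= fit[tuple(b)] else b)
    best = max(pool, key=lambda p: fit[tuple(p)])
    if best not in nxt:                      # replace the worst by the best
        worst = min(range(len(nxt)), key=lambda i: fit[tuple(nxt[i])])
        nxt[worst] = best
    return nxt
```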

step 4.7, repeat step 4.6 N times and take the model with the maximum fitness value, determined by the individuals in P_N, as the final fusion model, denoted EF; an individual sharing pool (denoted P_share) is maintained throughout the model evolution process to prevent identical individuals from being evaluated repeatedly: all individual codes generated during evolution are stored into P_share in the form of character strings; before training, each newly generated individual p is checked for existence in P_share; if present, the fitness value of the corresponding individual in P_share is assigned to p directly; otherwise, p is decoded into the corresponding multi-view fusion model and its fitness value is obtained by training the model;
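The sharing pool P_share of step 4.7 behaves like a string-keyed cache; a minimal sketch, where `evaluate_fn` stands in for the decode-and-train procedure (an assumed name, not the patent's).

```python
P_share = {}  # individual code (string) -> fitness value

def evaluate(p, evaluate_fn):
    """Return the cached fitness of individual p if its code string is
    already in P_share; otherwise train once and store the result."""
    key = ",".join(map(str, p))      # individual code as a character string
    if key not in P_share:           # only unseen individuals are trained
        P_share[key] = evaluate_fn(p)
    return P_share[key]
```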

step 5, two retrieval modes based on the EF model are provided: first, the retrieval problem is treated as an ultra-large-scale classification problem: the diagram to be retrieved is input directly into EF to obtain its category distribution probabilities, the output probabilities are sorted in descending order, and the molecular structure diagrams corresponding to the first K values are output; second, the last layer of EF, i.e. the classification layer, is first removed, and the truncated network is denoted EF*; then, the diagrams in the database D* are input into EF* in turn, and the output of its last layer is taken as the feature of the corresponding diagram; the diagram to be retrieved is input into EF*, and the output of EF* is taken as the feature of the retrieved diagram; from the features of the diagram to be retrieved and of the diagrams in D*, the cosine similarity between the diagram to be retrieved and every molecular structure diagram in D* is computed in turn; the computed similarity values are sorted in descending order, and the molecular structure diagrams corresponding to the first K values are output.
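The second retrieval mode of step 5 is a cosine-similarity top-K search over the fused features; a numpy sketch with assumed array shapes.

```python
import numpy as np

def top_k(query: np.ndarray, library: np.ndarray, k: int) -> np.ndarray:
    """query: (d,) feature of the diagram to retrieve; library: (m, d)
    features of the diagrams in D*; returns the indices of the K diagrams
    with the highest cosine similarity, in descending order."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                   # cosine similarity to every diagram
    return np.argsort(-sims)[:k]
```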

A further scheme of the technical scheme is that the data enhancement method adopts up-down flipping, left-right flipping, random rotation, shifting, zooming, cropping, translation, contrast adjustment, brightness adjustment, chroma adjustment, saturation adjustment, Gaussian blur, sharpening, adding Gaussian noise, adding salt-and-pepper noise, adding Poisson noise and adding multiplicative noise.

A further scheme of the technical scheme is that the group of deep convolutional networks adopts the networks AlexNet, ZF-Net, VGG, NiN, ResNet, DenseNet, GoogLeNet and Inception.

The invention has the following advantages:

First, the whole process uses only the molecular structure diagram itself. This avoids the problems of traditional molecular structure retrieval methods, which require complex character encodings of the molecular structure diagram such as Molfile and SMILES: the encoding process is error-prone, and wrong encodings seriously affect retrieval.

Second, different deep convolutional networks are used to extract multi-view features of the molecular structure diagram, and an evolution calculation method automatically selects the useful views and the optimal way of fusing them, without excessive human participation, making the method easy to use and convenient for retrieval.

Third, the retrieval algorithm can be deployed on GPU and TPU hardware, retrieves simple and complex molecular structural formulas at a consistent speed, and guarantees efficient retrieval.

Drawings

FIG. 1 is the overall flow of the molecular structure diagram retrieval method based on evolution calculation multi-view fusion;

FIG. 2 is the overall framework of the molecular structure diagram retrieval method based on evolution calculation multi-view fusion;

FIG. 3 shows individuals p_1 and p_2 generating offspring o_1 and o_2 by crossover;

FIG. 4 is a multi-view fusion network of six views to be fused and its individual encoding.

Detailed Description

As shown in FIGS. 1 to 4, a molecular structure diagram retrieval method based on evolution calculation multi-view fusion comprises the following steps:

step 1, data enhancement: given a molecular structure diagram data set, denoted D = {(x_i, y_i) | 1 ≤ i ≤ |D|}, where x_i denotes a molecular structure diagram and y_i denotes the category of x_i; firstly, the molecular structure diagram data set D is enhanced by a data enhancement method; then, the diagrams in the enhanced data set are uniformly scaled to size w × h, wherein: w denotes the image width and h denotes the image height; finally, copies of each image in the enhanced data set are obtained at different reduction ratios, each reduced image is enlarged back to w × h by filling the outer margin with pixel value 255, and the data set formed by all molecular structure images generated by this process is denoted D*;

step 2, training a multi-view feature extractor: given a group of deep convolutional networks N = {AlexNet, VGG, ResNet, DenseNet, GoogLeNet, Inception}; first, the number of neurons of each network's classification layer in N is replaced by the number of categories of data set D*, and cross entropy is taken as the loss function of each network; then, the 6 networks in N are trained separately on data set D*;

step 3, extracting the multi-view features of data set D*: the last layer, i.e. the classification layer, of every network in N is removed, and the networks with the classification layer removed are denoted G = {G_i | 1 ≤ i ≤ |N|} and used as the multi-view feature extractors; each network G_i in G extracts one view feature of data set D*, denoted V_v = {v_i^v | 1 ≤ i ≤ n}, where v denotes the view number and n denotes the total number of images; through this process, the 6 view features of data set D* are obtained, denoted V = {V_1, V_2, V_3, V_4, V_5, V_6};

Step 4, searching for a well-performing multi-view fusion model through an evolutionary algorithm:

Step 4.1, parameter agreement: the population size is denoted T; the t generation population is represented as Pt={piI is more than or equal to 1 and less than or equal to T, wherein piRepresenting the ith individual in the population; fusion operator set F ═ { F) for fusing two view featuresiI is more than or equal to 1 and less than or equal to | F | }, and the total number of fusion operators is recorded as | F |;

step 4.2, individual encoding: each individual p_i (1 ≤ i ≤ T) in the population is a vector that encodes the numbers of the views participating in fusion and the fusion operators used to fuse those views; the length of p_i is 2|V_i| − 1, where |V_i| denotes the number of views participating in fusion for individual p_i; the first |V_i| elements of p_i encode the view numbers participating in fusion, and the values p_i[j] (1 ≤ j ≤ |V_i|) of these elements are pairwise distinct and satisfy 1 ≤ p_i[j] ≤ |V|; the last |V_i| − 1 elements encode the fusion operators used for view fusion, and the value p_i[j] (1 ≤ p_i[j] ≤ |F|, |V_i| + 1 ≤ j ≤ 2|V_i| − 1) of each of these elements means that the p_i[j]-th fusion operator in F is used to fuse the previous fusion result with the (j − |V_i| + 1)-th view;

step 4.3, individual decoding: each individual p_i can be decoded into a multi-view fusion network; the specific process is as follows: if 2|V_i| − 1 = 1, the individual contains only one view and no fusion operation needs to be performed, i.e. the fusion network is net_1 in equation (1); otherwise, the fusion network corresponding to p_i is obtained according to equations (1) and (2):

Figure BDA0002578228700000073

Wherein: input (units) represents a full-connection layer, the input and the units represent two parameters of the full-connection layer, the input is a characteristic to be input, and the units are the number of neurons in the layer;

finally, the output H_{|V_i|} of the fusion network is mapped to the category space using equation (3),

ŷ = softmax(Dense(H_{|V_i|}, classes))    (3)

where classes denotes the total number of molecular structure classes of the data set D*;
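The sequential decode-and-fuse of step 4.3 amounts to a left fold over the encoded views. The concrete contents of the operator set F are not specified in the text, so element-wise sum, element-wise max, and concatenation are assumed here, and the Dense layers are omitted for brevity:

```python
import numpy as np

# Hypothetical fusion-operator set F (the patent leaves the concrete
# operators unspecified).
F = {1: lambda a, b: a + b,
     2: np.maximum,
     3: lambda a, b: np.concatenate([a, b], axis=-1)}

def decode_and_fuse(p, view_feats):
    """Left-fold decode: start from view p[1], then fuse the running
    result with each later view using the encoded operator genes."""
    k = (len(p) + 1) // 2                  # number of views |V_i|
    h = view_feats[p[0] - 1]               # view numbers are 1-based
    for j in range(1, k):
        op = F[p[k + j - 1]]               # operator paired with view j+1
        h = op(h, view_feats[p[j] - 1])
    return h

feats = [np.full(4, i + 1.0) for i in range(6)]   # toy view features V_1..V_6
h = decode_and_fuse([2, 5, 1, 1, 2], feats)       # (V_2 + V_5), then max with V_1
```

With these toy features, V_2 + V_5 gives a vector of 7s, and the max with V_1 (all 1s) leaves it unchanged.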

Step 4.4, population initialization: randomly generate T individuals according to step 4.2, denoted P_0 = {p_i | 1 ≤ i ≤ T}; decode each individual in P_0 into a multi-view fusion network according to step 4.3;

Step 4.5, fitness value function: train each multi-view fusion network by minimizing the cross-entropy loss, and calculate the fitness value of each individual as the classification accuracy

fitness = (1/n) Σ_{i=1}^{n} I(y_i = pre_y_i)

where y is the real category of the sample x, pre_y is the category predicted by the multi-view fusion network, i.e. the category corresponding to the maximum probability value in the output of the multi-view fusion network, and I(·) represents the indicator function, whose value is 1 when the condition is true and 0 otherwise;
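The fitness evaluation thus reduces to classification accuracy over the network's output probabilities; a minimal sketch:

```python
import numpy as np

def fitness(y_true, probs):
    """Accuracy fitness: pre_y is the argmax class of the fusion
    network's output probabilities; the mean of the indicator
    I(y == pre_y) is the fraction of correct predictions."""
    pre_y = probs.argmax(axis=1)
    return float((pre_y == y_true).mean())

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
y = np.array([0, 1, 1, 1])
acc = fitness(y, probs)   # 3 of 4 predictions correct
```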

Step 4.6, generating the next-generation population through selection, crossover, and mutation: define a set Q_t = ∅ for temporarily storing the population;

Crossover process: randomly select two individuals from the previous-generation population P_t and keep the one with the higher fitness value, denoted p_1; repeat the process to select an individual p_2; randomly generate a random number r in the range 0–1; if r is less than a pre-specified crossover probability p_c, randomly select positions i and j (1 ≤ i ≤ |V_1|, 1 ≤ j ≤ |V_2|) in p_1 and p_2 respectively; with the aid of i, p_1 is divided at positions i, |V_1| and |V_1| + i - 1 into four parts, denoted [p_1[1], ..., p_1[i]], [p_1[i+1], ..., p_1[|V_1|]], [p_1[|V_1|+1], ..., p_1[|V_1|+i-1]] and [p_1[|V_1|+i], ..., p_1[2|V_1|-1]]; in the same way, p_2 is divided at positions j, |V_2| and |V_2| + j - 1 into four parts, denoted [p_2[1], ..., p_2[j]], [p_2[j+1], ..., p_2[|V_2|]], [p_2[|V_2|+1], ..., p_2[|V_2|+j-1]] and [p_2[|V_2|+j], ..., p_2[2|V_2|-1]]; the offspring of p_1 and p_2 are produced by equations (4) and (5):

o_1 = [p_1[1], ..., p_1[i], p_2[j+1], ..., p_2[|V_2|], p_1[|V_1|+1], ..., p_1[|V_1|+i-1], p_2[|V_2|+j], ..., p_2[2|V_2|-1]]    (4)

o_2 = [p_2[1], ..., p_2[j], p_1[i+1], ..., p_1[|V_1|], p_2[|V_2|+1], ..., p_2[|V_2|+j-1], p_1[|V_1|+i], ..., p_1[2|V_1|-1]]    (5)

then remove duplicated views from the offspring individuals o_1 and o_2; taking o_1 as an example, suppose a view number appears twice in o_1: record the position of its second occurrence as i, delete the two elements o_1[i] and o_1[|V_1|+i-1] from o_1, and repeat this process until no repeated view number appears in o_1; apply the same de-duplication to o_2; store the de-duplicated offspring o_1 and o_2 into Q_t; if r is greater than or equal to the pre-specified crossover probability p_c, store the individuals p_1 and p_2 into Q_t; repeat the above steps until Q_t contains no fewer than T individuals;
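The segment exchange of equations (4)–(5) and the subsequent view de-duplication can be sketched in pure Python; the 1-based cut points i and j are passed in explicitly here rather than drawn at random:

```python
def crossover(p1, p2, i, j):
    """Equations (4)-(5): swap the tail view segments and the matching
    operator segments of p1 and p2 at 1-based cut points i and j."""
    k1, k2 = (len(p1) + 1) // 2, (len(p2) + 1) // 2   # |V_1|, |V_2|
    o1 = p1[:i] + p2[j:k2] + p1[k1:k1 + i - 1] + p2[k2 + j - 1:]
    o2 = p2[:j] + p1[i:k1] + p2[k2:k2 + j - 1] + p1[k1 + i - 1:]
    return dedup(o1), dedup(o2)

def dedup(o):
    """Delete the second occurrence of any duplicated view number
    together with its paired fusion-operator gene."""
    k = (len(o) + 1) // 2
    views, ops = o[:k], o[k:]
    t = 1
    while t < len(views):
        if views[t] in views[:t]:
            del views[t]        # second occurrence of the view
            del ops[t - 1]      # its paired operator gene
        else:
            t += 1
    return views + ops

# p1 fuses views (1,2,3) with operators (1,1); p2 fuses (2,4) with (2).
o1, o2 = crossover([1, 2, 3, 1, 1], [2, 4, 2], i=2, j=1)
```

Each offspring stays a valid encoding: the number of operator genes is always one fewer than the number of (distinct) view genes.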

Mutation process: for each individual in Q_t, perform the following steps: randomly generate a random number r in the range 0–1; if r is less than a pre-specified mutation probability p_m, randomly select a position in the individual, denoted i; if i ≤ |V|, randomly generate a view number to replace the view number at that position; if i > |V|, randomly select a fusion operator in F to replace the fusion operator at that position;
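The mutation step can be sketched as below; note that, as described, it replaces a view gene without re-checking distinctness:

```python
import random

def mutate(p, num_views_total, num_ops, pm, rng=random):
    """With probability pm replace one gene: a position in the view
    segment gets a fresh random view number, a position in the
    operator segment gets a fresh random operator index."""
    p = list(p)
    if rng.random() < pm:
        k = (len(p) + 1) // 2           # the first |V| genes are views
        i = rng.randrange(len(p))
        if i < k:
            p[i] = rng.randint(1, num_views_total)
        else:
            p[i] = rng.randint(1, num_ops)
    return p

assert mutate([1, 2, 4, 1, 2], 6, 4, 0.0) == [1, 2, 4, 1, 2]  # r >= pm: unchanged
q = mutate([1, 2, 4, 1, 2], 6, 4, 1.0)                        # always mutates
```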

Selection process: define the next-generation population set P_{t+1} = ∅; randomly select two individuals, denoted p_1 and p_2, from the combined set P_t ∪ Q_t, and put the individual with the larger fitness value into P_{t+1}; repeat this process until the number of individuals in P_{t+1} is not less than that in P_t; find the individual with the greatest fitness value in P_t ∪ Q_t, denoted p_best; if p_best is not in P_{t+1}, replace the individual with the smallest fitness value in P_{t+1} with p_best; decode each individual of P_{t+1} into the corresponding multi-view fusion network according to step 4.3, and then sequentially calculate the fitness value of each multi-view fusion network according to step 4.5;
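The binary-tournament selection with elitism can be sketched as follows (a sketch only, assuming all individuals in the pool are distinct):

```python
import random

def next_generation(pool, fit, size, rng=random):
    """Binary tournament over P_t ∪ Q_t, then elitism: the best
    individual is forced into P_{t+1} if selection missed it."""
    nxt = []
    while len(nxt) < size:
        a, b = rng.sample(range(len(pool)), 2)
        nxt.append(pool[a] if fit[a] >= fit[b] else pool[b])
    best = pool[max(range(len(pool)), key=lambda i: fit[i])]
    if best not in nxt:
        # replace the weakest member (assumes distinct individuals)
        worst = min(range(len(nxt)), key=lambda i: fit[pool.index(nxt[i])])
        nxt[worst] = best
    return nxt

random.seed(0)
pool = [[1, 2, 1], [3, 4, 2], [2, 5, 1]]   # toy individuals
fit = [0.60, 0.90, 0.75]
nxt = next_generation(pool, fit, 3)
```

Elitism guarantees the fittest individual survives into P_{t+1} regardless of tournament luck.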

Step 4.7, repeat step 4.6 N times, and select the model with the maximum fitness value determined by the individuals of P_N as the final fusion model, denoted EF; throughout the whole process of model evolution, an individual sharing pool (denoted P_share) is set up to avoid repeated computation for identical individuals: all individual encodings generated during evolution are stored into P_share in the form of character strings; before training, each newly generated individual p is checked against P_share, and if it already exists, the fitness value of the corresponding individual in P_share is directly assigned to p; otherwise, p is decoded into the corresponding multi-view fusion model, and its fitness value is obtained by training that model;
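The sharing pool P_share is effectively a fitness cache keyed by the individual's string encoding; a minimal sketch:

```python
def evaluate_with_share(p, train_fn, share):
    """P_share: cache fitness by the individual's string encoding so
    an identical individual is never decoded and retrained twice."""
    key = ",".join(map(str, p))
    if key not in share:
        share[key] = train_fn(p)   # decode, train, return fitness
    return share[key]

share = {}                                    # P_share
calls = []                                    # counts actual trainings
train = lambda p: (calls.append(1), 0.8)[1]   # stand-in for real training
f1 = evaluate_with_share([1, 3, 2], train, share)
f2 = evaluate_with_share([1, 3, 2], train, share)   # cache hit, no retrain
```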

Step 5, two retrieval modes are provided based on the EF model: in the first, the retrieval problem is treated as an ultra-large-scale classification problem: the graph to be retrieved is input directly into EF to obtain its class distribution probability, the output probabilities are sorted in descending order, and the molecular structure graphs corresponding to the first K values are output; in the second, the last layer of EF, i.e. the classification layer, is first removed; then the graphs in the retrieval library D* are input in turn into the truncated network, and the output of its last layer is taken as the feature of the corresponding graph; the graph to be retrieved is likewise input into the truncated network, whose output is taken as the feature of the graph to be retrieved; using the feature of the graph to be retrieved and the features of the graphs in D*, the cosine similarity between the graph to be retrieved and every molecular structure graph in D* is computed in turn; the computed similarity values are sorted in descending order, and the molecular structure graphs corresponding to the first K values are output.
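The second retrieval mode (feature extraction plus cosine-similarity ranking) can be sketched as follows, with toy 2-dimensional features standing in for the truncated network's outputs:

```python
import numpy as np

def topk_by_cosine(query, library, k):
    """Rank library molecular-structure features by cosine similarity
    to the query feature and return the top-K indices."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                      # cosine similarity per library graph
    return np.argsort(-sims)[:k], sims  # descending order, first K

library = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # toy D* features
idx, sims = topk_by_cosine(np.array([1.0, 1.0]), library, k=2)
```

The third library entry points in the same direction as the query, so it ranks first.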

A further scheme of the above technical scheme is that the data enhancement method adopts any of up-down flipping, left-right flipping, random rotation, shifting, scaling, cropping, translation, contrast adjustment, brightness adjustment, hue adjustment, saturation adjustment, Gaussian blur, sharpening, and the addition of Gaussian noise, salt-and-pepper noise, Poisson noise, or multiplicative noise.
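A few of the listed enhancement operations can be sketched with plain numpy (the others, such as Gaussian blur or saturation adjustment, would need an imaging library):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Sample enhancements from the listed set: flips, a 90-degree
    rotation, and additive Gaussian noise."""
    return [np.flipud(img),                      # up-down flip
            np.fliplr(img),                      # left-right flip
            np.rot90(img),                       # rotation
            img + rng.normal(0, 5, img.shape)]   # Gaussian noise

img = np.arange(16.0).reshape(4, 4)   # toy molecular-structure "image"
variants = augment(img)
```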

A further scheme of the above technical scheme is that the group of deep convolutional networks adopts any of AlexNet, ZF-Net, VGG, NiN, ResNet, DenseNet, GoogLeNet, and Inception.

Experimental results show that the proposed method can automatically produce a multi-view fusion model and effectively improve the image-based retrieval precision of molecular structures.
