Molecular property prediction method and system

Document No.: 1100153. Publication date: 2020-09-25.

Reading note: this technology, "Molecular property prediction method and system" (一种分子性质预测方法及系统), was designed by 马英晋, 马硕, 张宝花, 刘倩 and 金钟 on 2020-05-13. Main content: The invention provides a molecular property prediction method and system relating to the fields of quantum chemistry/computational chemistry, chemical informatics, and machine learning/artificial intelligence. Within the framework of the chemical multi-world interpretation, it uses density functional theory, chemical informatics and machine learning/artificial intelligence, takes information such as the molecular structure, basis set and functional as input, and outputs the predicted molecular properties through a machine learning model. The invention can make predictions for any type of molecular structure and any computational strategy, and is more accurate than general empirical methods and regression analysis methods.

1. A molecular property prediction method, comprising the steps of:

selecting convergence strategies, functionals and basis sets of a plurality of molecules with known structures as training data to train a machine learning model, wherein the machine learning model comprises one or more of a random forest (RF) model, a bidirectional long short-term memory network (Bi-LSTM) model, a message passing neural network (MPNN) model and a multilevel graph convolutional neural network (MGCN) model;

inputting the molecular structure information, the convergence strategy, the functional and the basis set of the molecule to be predicted as input information into a machine learning model for predicting the molecular property, wherein the method comprises the following steps:

inputting molecular structure information, functional types and basis functions of molecules into an RF model for prediction;

inputting the molecular structure information, functional types and basis functions of the one-hot form into a Bi-LSTM model for prediction;

inputting the molecular structure information into an MPNN model for prediction;

inputting the molecular structure information into an MGCN model for prediction;

predicting the properties of the molecules through the machine learning model to serve as a preliminary prediction result;

if the functional and the basis set in the input information belong to a known space, taking the preliminary prediction result as the final prediction result; otherwise, adopting an approximation strategy to infer the property of the molecule from the preliminary prediction result to obtain the final prediction result; wherein the known space is the result space corresponding to the functional and the basis set selected during model training, and the result space is the set of states of a molecule of given structure once the functional and the basis set are determined.

2. The method of claim 1, wherein the molecular structural information comprises a SMILES code.

3. The method of claim 2, wherein the step of the RF model predicting the molecular property comprises:

receiving SMILES codes, functional types and basis functions of molecules;

counting the numbers of atoms, branches, atoms on branches, rings, atoms on rings and double bonds in the molecule based on its SMILES code, concatenating these counts into a vector representing the structural features of the molecule, and sending the vector to a random forest classifier;

the random forest classifier gives probabilities that input molecular structures respectively belong to five typical structures, namely a linear structure, a branched chain structure, a cyclic structure, a linear chain olefin structure and a polyphenyl structure;

based on the number of basis functions, utilizing five pre-trained feedforward neural network models respectively corresponding to the five typical structures to respectively predict molecular property parameters;

and superposing the molecular property parameters predicted by the five models to obtain the predicted molecular property.

4. The method of claim 2, wherein the step of predicting the molecular property by the Bi-LSTM model comprises:

receiving SMILES codes, functional types and basis functions in a one-hot form;

pre-training a weight matrix by using a word2vec algorithm, converting the SMILES code in the form of one-hot into a real number vector by using the matrix, and sending the real number vector into a bidirectional LSTM layer;

extracting structural features contained in SMILES through a bidirectional LSTM layer to obtain a forward output vector and a backward output vector;

taking the sum of two output vectors of the bidirectional LSTM layer as input through the Attention layer, and outputting a new vector after processing;

and sending the new vector and the basis function into a fully connected network for fitting to obtain the predicted molecular property.

5. The method of claim 1, wherein the step of predicting molecular properties by the MPNN model comprises:

modeling the molecule as a graph G according to the molecular structure information, and taking the set of vertex vectors and the set of edge vectors of the graph G together as input, wherein the components of each vertex vector store the atom type of the atom corresponding to the vertex, whether the atom is on an aromatic ring, and its hybridization type, and each edge vector stores the type of the chemical bond corresponding to the edge;

the vertex vectors are point-embedded through a one-layer vertex network and converted into an n × d matrix, where n is the number of vertices and d is the hidden-layer dimension of the vertex network;

performing convolution operation for multiple times on the point embedding to obtain high-level characteristic representation of the graph G;

and sending the high-level feature representation together with the functional and basis function information into a fully connected network for fitting to obtain the predicted molecular property.

6. The method of claim 5, wherein the performing a plurality of convolution operations on the point embedding results in a high level feature representation of graph G by:

the t-th convolution operation is defined by a message function M_t and a vertex update function U_t, and the hidden state h_v^t of vertex v is updated through the message m_v^{t+1}:

m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_vw)

h_v^{t+1} = U_t(h_v^t, m_v^{t+1})

where N(v) represents the neighborhood of vertex v; M_t is defined as M(h_v, h_w, e_vw) = A(e_vw) h_w, where w denotes a vertex, h_w denotes the hidden state of vertex w, e_vw denotes the edge vector of the edge connecting vertices v and w, and A(e_vw) is an edge network that maps the edge vector e_vw to a d × d matrix; the vertex update function U_t is a gated recurrent unit;

the high-level feature representation of graph G is obtained using a set2set model R:

R(\{ h_v^T \mid v \in G \})

7. The method of claim 1, wherein the step of the MGCN model predicting the property of the molecule comprises:

modeling the molecule as a graph G according to the molecular structure information, and taking the set of vertex vectors and the set of edge vectors of the graph G together as input, wherein the components of each vertex vector store the atom type of the atom corresponding to the vertex, whether the atom is on an aromatic ring, and its hybridization type, and each edge vector stores the type and the bond length of the chemical bond corresponding to the edge;

converting the vertex vector set and the edge vector set into a vertex embedding matrix and an edge embedding matrix, and converting the bond length into a distance tensor, wherein the components of the tensor represent the distance between atoms;

obtaining a high-level feature representation of the graph G by using the interaction layer constructed in the form of a hierarchical structure;

and sending the high-level feature representation and the number of the basis functions into a full-connection network for fitting to obtain the predicted molecular property.

8. The method of claim 7, wherein the high-level feature representation of graph G is obtained using the interaction layers by: recording the edge state output by the l-th interaction layer as

Figure FDA0002490266480000031

wherein N represents all molecules in all molecular systems, d_ij represents the distance between atoms i and j, and h_e is the edge state update function, whose specific form is:

Figure FDA0002490266480000035

wherein η is a constant, W^{ue} is a weight matrix, ⊕ represents element-wise addition, and ⊙ represents element-wise multiplication;

wherein h_v is the vertex state update function, whose specific form is:

Figure FDA0002490266480000037

wherein v is a vertex representing an atom in the graph, and u is a point inside the Gaussian radial basis; f represents a function, and fa, fd and fe represent the functions related to vertices, distances and edges, respectively; M(x) represents a one-layer linear network, i.e., a fully connected layer, of the form M(x) = Wx + b, where W is a weight matrix, x is the argument in parentheses, and b is a constant term;

the outputs of the T interaction layers are concatenated with the initial vertex state to obtain a vector a_i;

the high-level feature representation g of graph G is:

Figure FDA00024902664800000310

wherein r represents a weight matrix in the last readout layer of the MGCN model; σ represents the softplus function, which is the activation function.

9. The method of claim 1, wherein the approximation strategy includes two types: a similar spatial strategy and a hyperplane strategy;

the similar spatial strategy is: for the case where one of the basis set and the functional of the molecule is known and the other is unknown, prediction is based on basis set similarity or functional similarity, wherein basis set similarity means that two different basis sets have the same number of basis functions, and functional similarity means that two functionals belong to the same class; for a given input molecule, if the basis set is known and the functional is unknown, a space with the same basis set and a similar functional can be found among the known spaces, and the corresponding machine learning model is called directly to predict the property of the molecule;

the hyperplane strategy is: for the case where both the basis set and the functional of the molecule are unknown, molecules of the same kind are used as linking molecules to form a hyperplane space, and the gradient relationship between the property features of the same molecule in different result spaces is derived within the hyperplane by simple fitting or by a machine learning method; the most general gradient relationship between molecular property features in different result spaces is obtained by averaging over a plurality of linking molecules; based on this gradient relationship, the property of the molecule to be predicted in any result space is derived from the machine learning model data.
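The similar spatial strategy described above can be illustrated with a minimal Python sketch. Everything named here (`Space`, `find_similar_space`, the `functional_class` labels and the basis function counts) is a hypothetical illustration rather than part of the claimed method; the sketch only encodes the two similarity rules stated in the text (same number of basis functions, same functional class).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Space:
    """One result space; all field names and values are illustrative assumptions."""
    functional: str        # e.g. "B3LYP"
    functional_class: str  # coarse class label, e.g. "hybrid" or "GGA"
    basis: str             # basis set name
    n_basis: int           # number of basis functions for the molecule at hand


def find_similar_space(query: Space, known: List[Space]) -> Optional[Space]:
    """Return a known space that is 'similar' to the query, or None."""
    for s in known:
        same_basis = s.basis == query.basis
        same_functional = s.functional == query.functional
        # Basis known, functional unknown: same basis set, functional of the same class.
        if same_basis and not same_functional and s.functional_class == query.functional_class:
            return s
        # Functional known, basis unknown: same functional, same number of basis functions.
        if same_functional and not same_basis and s.n_basis == query.n_basis:
            return s
    return None  # neither rule applies; the hyperplane strategy would be needed


# Hypothetical known spaces (the basis function counts are made-up numbers).
known_spaces = [
    Space("B3LYP", "hybrid", "6-31G*", 100),
    Space("PBE", "GGA", "cc-pVDZ", 120),
]
```

A query such as `Space("PBE0", "hybrid", "6-31G*", 100)` would be mapped to the B3LYP/6-31G* space, whose trained model could then be called directly.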

10. A molecular property prediction system, comprising:

the input module is responsible for inputting the molecular structure information, the convergence strategy, the functional and the basis set of the molecule to be predicted as input information;

the prediction module comprises a machine learning model, wherein the machine learning model comprises one or more of a random forest (RF) model, a bidirectional long short-term memory network (Bi-LSTM) model, a message passing neural network (MPNN) model and a multilevel graph convolutional neural network (MGCN) model; the prediction module trains the machine learning model using the convergence strategies, functionals and basis sets of a plurality of molecules with known structures as training data, inputs the input information into the machine learning model to predict the molecular property, and outputs a preliminary prediction result; the RF model predicts from the molecular structure information, functional type and basis functions of the molecule, the Bi-LSTM model predicts from the one-hot molecular structure information, functional type and basis functions, the MPNN model predicts from the molecular structure information, and the MGCN model predicts from the molecular structure information;

the scheduling module is responsible for transmitting the input information to the prediction module, judging the preliminary prediction result output by the prediction module, and transmitting the resulting final prediction result to the output module; the judgment is as follows: if the functional and the basis set in the input information belong to a known space, the preliminary prediction result is taken as the final prediction result; otherwise, an approximation strategy is adopted to infer the property of the molecule from the preliminary prediction result to obtain the final prediction result; the known space is the result space corresponding to the functional and the basis set selected during model training, and the result space is the set of states of a molecule of given structure once the functional and the basis set are determined;

and the output module is responsible for outputting the final prediction result of the molecular property.

Technical Field

The invention relates to the fields of quantum chemistry/computational chemistry, chemical informatics, machine learning/artificial intelligence, in particular to a theoretical method for predicting unknown molecular properties by means of density functional theory, chemical informatics, machine learning/artificial intelligence under a chemical multi-world theoretical framework.

Background

The calculation of the intrinsic properties of molecules is one of the core problems of quantum chemistry/computational chemistry. Early solutions were based on empirical, semi-empirical or model-Hamiltonian theories, such as Hückel molecular orbital theory and classical valence bond theory. Since the 1990s, with the rapid development of computer software and hardware, theoretical methods based on the ab initio Hamiltonian have come to dominate. These include Hartree-Fock self-consistent field theory, electron correlation methods based on the Hartree-Fock wave function, density functional theory, and methods based on Green's functions. Compared with wave-function-based theories such as Hartree-Fock, the main difference of density functional theory is that it uses the electron density in place of the wave function and solves for the behavior of the electrons in the system on that basis. The many-electron wave function has 3n variables (n is the number of electrons, each with three spatial variables), whereas the electron density is a function of only three variables, so it is more convenient to handle both conceptually and in practice. Although density functional theory was earlier generally considered unable to give sufficiently accurate results in quantum chemical calculations, its accuracy has improved greatly as the approximations it uses have been refined into better exchange-correlation functionals. With its relatively low computational scaling (N^{3-4}, where N is the system size) and reliable accuracy, density functional theory has become the most popular theoretical calculation method in computational chemistry for small and medium-sized molecular systems.

In general, a density functional calculation requires at least the choice of an exchange-correlation functional and of a basis set for the atoms in the molecule. However, there are at least hundreds of exchange-correlation functionals, and basis sets are even more numerous (a basis set is the combination of basis functions used for a given atom; the basis functions are the independent basis vectors in which the molecular orbitals are linearly expanded). Both functionals and basis sets are also customizable to some degree. The number of basis set and functional combinations in density functional calculations can therefore be regarded as unlimited. As a result, researchers have to run test calculations with different basis sets and functionals when computing a property, and results obtained with one functional and basis set cannot be directly extrapolated to other combinations. This inconvenience greatly reduces researchers' working efficiency.

Disclosure of Invention

The invention aims to provide a molecular property prediction method and a molecular property prediction system, which are used for predicting various properties of unknown molecules by means of density functional theory, chemical informatics, machine learning/artificial intelligence under a chemical multi-world theory framework.

In order to achieve the purpose, the invention adopts the following technical scheme:

a molecular property prediction method, comprising the steps of:

the convergence strategy, functional and basis set of molecules of a plurality of known structures are selected as training data to train a machine learning model: one or more of a Random Forest (RF) model, a bidirectional long-short term memory network (Bi-LSTM) model, a Message Passing Neural Network (MPNN) model, and a multi-layer graph convolutional neural network (MGCN) model;

inputting molecular structure information (such as SMILES coding), convergence strategy, functional and basis set of a molecule to be predicted into a machine learning model for predicting molecular properties, wherein the method comprises the following steps:

inputting molecular structure information, functional types and basis functions of molecules into an RF model for prediction;

inputting the molecular structure information, functional types and basis functions of the one-hot form into a Bi-LSTM model for prediction;

inputting the molecular structure information into an MPNN model for prediction;

inputting the molecular structure information into an MGCN model for prediction;

predicting the properties of the molecules through the machine learning model to serve as a preliminary prediction result;

if the functional and the basis set in the input information belong to a known space, taking the preliminary prediction result as the final prediction result, and otherwise adopting an approximation strategy to infer the property of the molecule from the preliminary prediction result to obtain the final prediction result; the known space is the result space corresponding to the functional and the basis set selected when the model is trained, and the result space is the set of states of a molecule of given structure once the functional and the basis set are determined.

A molecular property prediction system comprises an input module, a prediction module, a scheduling module and an output module, wherein,

the input module is responsible for inputting the molecular structure information, the convergence strategy, the functional and the basis set of the molecule to be predicted as input information;

a prediction module comprising a machine learning model: one or more of an RF model, a Bi-LSTM model, an MPNN model, and an MGCN model; the prediction module trains the machine learning model using the convergence strategies, functionals and basis sets of a plurality of molecules with known structures as training data, inputs the input information into the machine learning model to predict the molecular property, and outputs a preliminary prediction result; the RF model predicts from the molecular structure information, functional type and basis functions of the molecule, the Bi-LSTM model predicts from the one-hot molecular structure information, functional type and basis functions, the MPNN model predicts from the molecular structure information, and the MGCN model predicts from the molecular structure information;

the scheduling module is responsible for transmitting the input information to the prediction module, judging the preliminary prediction result output by the prediction module, and transmitting the resulting final prediction result to the output module; the judgment is as follows: if the functional and the basis set in the input information belong to a known space, the preliminary prediction result is taken as the final prediction result, and otherwise an approximation strategy is adopted to infer the property of the molecule from the preliminary prediction result to obtain the final prediction result; the known space is the result space corresponding to the functional and the basis set selected during model training, and the result space is the set of states of a molecule of given structure once the functional and the basis set are determined;

and the output module is responsible for outputting the final prediction result of the molecular property.

The method has the advantages that: under the framework of the chemical multi-world explanation provided by the invention, information such as a molecular structure, a basis set, a functional and the like is received as input, a prediction result of molecular properties is output, and the prediction can be made on any type of molecular structure and any calculation strategy, so that the method is more accurate than a general empirical method and a regression analysis method.

Drawings

Fig. 1 is an overall architecture diagram of the intelligent prediction system.

Fig. 2 illustrates the chemical multi-world interpretation under density functional theory.

Fig. 3 is a schematic diagram of the RF model.

Fig. 4 is a model structure diagram of the Bi-LSTM model.

Fig. 5 is a model structure diagram of the MPNN model.

Fig. 6 is a model structure diagram of the MGCN model.

Fig. 7 is a flow chart of the behavior of the scheduling module.

Fig. 8 is a schematic diagram of the similar spatial strategy and the hyperplane strategy.

Detailed Description

The invention is inspired by the many-worlds interpretation (MWI) of quantum mechanics. It proposes a chemical multi-world interpretation (chemical MWI) under density functional theory and combines it with chemical informatics and machine learning/artificial intelligence to predict molecular properties under different combinations of computational schemes (exchange-correlation functional, basis set).

The many-worlds interpretation was proposed in 1957 by Hugh Everett III of Princeton University. He assumed that the evolution of every isolated system follows the Schrödinger equation and that the wave function does not collapse; a measurement yields only one result even though the quantum system is in a superposition state. He held that there is a correlation, called the relative state, between the measurement and the system being measured, and that a measurement does not cause a collapse but a splitting of the world. In the 1960s and 1970s, the theory was put forward again by Bryce DeWitt of the University of Texas and became one of the hot topics in physics.

In the chemical multi-world interpretation proposed by the invention, the Kohn-Sham equation to be solved in density functional theory, the set of chemical components, and so on are assumed to be the unique starting point, and the combinations of different basis sets and functionals serve as the critical conditions for splitting, generating different worlds. Each split world contains the intrinsic properties of the molecule calculated under density functional theory with a specific functional and basis set, such as the wave function, electronegativity, orbital energy levels, oscillator strength, computation time, and any other property associated with the molecule.

Under the framework of the chemical multi-world explanation provided by the invention, the invention further provides a molecular property prediction method and a molecular property prediction system combining chemical informatics and machine learning/artificial intelligence. The method and the system receive the molecular structure and the adopted calculation strategy (combination of the basis set and the functional) as input, output the prediction result of the molecular property, can predict any type of molecular structure and any calculation strategy, and are more accurate than common empirical methods and regression analysis methods.

This embodiment provides a molecular property prediction system that predicts molecular properties using the molecular property prediction method. The system is divided into four modules: an input module, a prediction module, a scheduling module and an output module. The overall system architecture is shown in Fig. 1, and the modules are described below.

(1) Input module

This module receives the information entered by the user, including the molecular structure file, the computational strategy (computational method), the convergence strategy (e.g., quasi-Newton method, steepest descent method) and the model to be used, and passes this information to the scheduling module.
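As a rough illustration of the information collected by the input module, the following Python sketch bundles the user input into one record and checks it before it would be handed to the scheduling module. The class name `InputInfo`, its field names, the default values and the string labels for convergence strategies are all assumptions made for this example; the source does not specify a data format.

```python
from dataclasses import dataclass, field
from typing import List

# The four model names come from the text; the string spellings are assumptions.
ALLOWED_MODELS = {"RF", "Bi-LSTM", "MPNN", "MGCN"}
# Only the two convergence strategies given as examples in the text are listed here.
ALLOWED_CONVERGENCE = {"quasi-newton", "steepest-descent"}


@dataclass
class InputInfo:
    """User input passed from the input module to the scheduling module (hypothetical layout)."""
    structure_file: str                      # path to the molecular structure file
    functional: str                          # exchange-correlation functional, e.g. "B3LYP"
    basis_set: str                           # basis set, e.g. "6-31G*"
    convergence: str = "quasi-newton"        # convergence strategy
    models: List[str] = field(default_factory=lambda: ["RF"])  # models to use

    def validate(self) -> None:
        """Raise ValueError if a field is empty or names an unsupported option."""
        if not self.structure_file:
            raise ValueError("a molecular structure file is required")
        if self.convergence not in ALLOWED_CONVERGENCE:
            raise ValueError(f"unsupported convergence strategy: {self.convergence}")
        unknown = [m for m in self.models if m not in ALLOWED_MODELS]
        if unknown:
            raise ValueError(f"unsupported model(s): {unknown}")
```

Validating at this stage keeps malformed requests from reaching the scheduling module, which is one possible design choice rather than something the source prescribes.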

(2) Prediction module

Four types of machine learning/artificial intelligence models are built into this module: a random forest plus feedforward neural network model (RF for short), a bidirectional long short-term memory network model with an attention mechanism (Bi-LSTM for short), a message passing neural network model (MPNN for short) and a multilevel graph convolutional neural network model (MGCN for short). Any one or more of the models can be selected as needed. Once trained, the four models can predict various properties of a molecule from its molecular structure and number of basis functions.

The principles of the four models are as follows:

a) The structure of the RF model is shown in Fig. 3. Its computation can be divided into five stages: input, preprocessing, classification, fitting and output. In the input stage, the model receives the SMILES code of the molecule, the functional type and the basis function (denoted x). In the preprocessing stage, the numbers of atoms, branches, atoms on branches, rings, atoms on rings and double bonds in the molecule are counted from its SMILES code; these counts are concatenated into a vector representing the structural features of the molecule, which is sent to a random forest classifier. In the classification stage, the random forest classifier gives the probabilities (denoted P_L, P_D, P_R, P_A, P_P) that the input molecular structure belongs to each of five typical structures (linear, branched, cyclic, linear olefin and polyphenyl). In the fitting stage, five pre-trained feedforward neural network models (one per typical structure) each predict the property parameter from the number of basis functions (denoted f_L(x), f_D(x), f_R(x), f_A(x), f_P(x)). Finally, the prediction output by the model is the superposition of the predictions of the submodels. For a linear property, for example, the model can be expressed as

y = P_L f_L(x) + P_D f_D(x) + P_R f_R(x) + P_A f_A(x) + P_P f_P(x)
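The preprocessing and output stages of the RF model can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `smiles_counts` is a crude string heuristic (it ignores `%nn` ring closures and does not compute the atoms-on-branch or atoms-on-ring counts), and `mixture_prediction` takes the class probabilities and the five fitted functions as plain inputs instead of a trained random forest and trained networks. Both function names are hypothetical.

```python
import re
from typing import Callable, Dict

# Keys for the five typical structures: linear, branched, ring, olefin (A), polyphenyl.
STRUCTURES = ("L", "D", "R", "A", "P")

# Bracket atoms count once; two-letter symbols are tried before single letters.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOSPFI]|[bcnops]")


def smiles_counts(smiles: str) -> Dict[str, int]:
    """Count a few structural features directly from a SMILES string (heuristic)."""
    atoms = len(_ATOM.findall(smiles))
    # Remove bracket atoms so that charges and isotopes are not read as ring digits.
    stripped = re.sub(r"\[[^\]]+\]", "", smiles)
    ring_digits = sum(ch.isdigit() for ch in stripped)
    return {
        "atoms": atoms,
        "branches": smiles.count("("),     # each '(' opens one branch
        "rings": ring_digits // 2,         # each ring-closure digit appears twice
        "double_bonds": smiles.count("="),
    }


def mixture_prediction(
    probs: Dict[str, float],
    experts: Dict[str, Callable[[float], float]],
    x: float,
) -> float:
    """Evaluate y = sum_k P_k * f_k(x) over the five typical structures."""
    return sum(probs[k] * experts[k](x) for k in STRUCTURES)
```

For example, with `probs = {"L": 0.5, "D": 0.5, "R": 0.0, "A": 0.0, "P": 0.0}` and linear stand-in experts, the output is simply the probability-weighted average of the linear and branched predictions.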

b) The structure of the Bi-LSTM model, shown in Fig. 4, can be divided into five layers. First is the input layer, which receives the one-hot SMILES code, the functional type and the basis function (denoted x) as input and passes the SMILES code to the word embedding layer. In the word embedding layer, a weight matrix (called the word embedding, denoted W) is pre-trained with the word2vec algorithm; the one-hot SMILES code is converted into real-valued vectors using this word embedding and sent to the bidirectional LSTM layer (one forward LSTM layer and one backward LSTM layer). The bidirectional LSTM layer extracts the high-level structural features contained in the SMILES, giving a forward and a backward output vector (denoted H_f and H_b). Next is the Attention layer, which receives as input the sum of the LSTM output vectors (denoted H, H = H_f + H_b). The output of the Attention layer is the vector c:

c = H a^T

a = softmax(w^T tanh(H))

Finally, in the output layer, the output c of the Attention layer and the basis function x are sent together into a fully connected network for fitting to obtain the final property prediction.
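The attention pooling step, a = softmax(w^T tanh(H)) followed by c = H a^T, can be sketched in plain Python. The sketch assumes that H is stored as a d × n nested list (d hidden features, n sequence positions) and that w is a length-d weight vector; these shapes and the function names are assumptions, since the source does not state them.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def add_matrices(Hf: Matrix, Hb: Matrix) -> Matrix:
    """H = H_f + H_b, element by element."""
    return [[a + b for a, b in zip(rf, rb)] for rf, rb in zip(Hf, Hb)]


def attention_pool(H: Matrix, w: List[float]) -> Tuple[List[float], List[float]]:
    """Return (c, a) with a = softmax(w^T tanh(H)) and c = H a^T."""
    d, n = len(H), len(H[0])
    # One score per sequence position t: w^T tanh(H[:, t]).
    scores = [sum(w[i] * math.tanh(H[i][t]) for i in range(d)) for t in range(n)]
    # Numerically stable softmax over the n positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    a = [e / z for e in exps]
    # c is the attention-weighted sum of the columns of H.
    c = [sum(H[i][t] * a[t] for t in range(n)) for i in range(d)]
    return c, a
```

With w set to all zeros, every position receives the same weight and c reduces to the mean of the columns of H, which is a convenient sanity check.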

c) The structure of the MPNN model is shown in Fig. 5. Its computation can be divided into five stages: input, preprocessing, message passing, readout and output. In the input stage, the molecule is modeled as a graph (denoted G) according to the molecular structure information, and the input of the model comprises the set of vertex vectors (denoted x_v) and the set of edge vectors (denoted e_vw) of G. The components of each vertex vector store the atom type of the atom corresponding to the vertex, whether the atom is on an aromatic ring, and its hybridization type; each edge vector stores the type of the chemical bond corresponding to the edge. In the preprocessing stage, x_v is passed through a one-layer vertex network to obtain the point embedding, an n × d matrix, where n is the number of vertices and d is the hidden-layer dimension of the vertex network. The message-passing stage performs T convolution operations on the point embedding. The t-th convolution operation is defined by a message function M_t and a vertex update function U_t, and the hidden state h_v^t of vertex v is updated by the "message" m_v^{t+1}. The operations performed by the message-passing stage can thus be summarized as:

m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_vw)

h_v^{t+1} = U_t(h_v^t, m_v^{t+1})

where N(v) denotes the neighborhood of vertex v; M_t is defined as M(h_v, h_w, e_vw) = A(e_vw) h_w, where w denotes a vertex, h_w denotes the hidden state of vertex w, e_vw denotes the edge vector of the edge connecting vertices v and w, and A(e_vw) is a network (called the "edge network") that maps the edge vector e_vw to a d × d matrix (called the "edge embedding"). The vertex update function U_t is a gated recurrent unit (GRU). In the readout stage, the readout function R is used to obtain a high-level representation of graph G:

R(\{ h_v^T \mid v \in G \})

where R is a set2set model. In the output stage, this representation is sent together with the functional and basis function information into a fully connected network for fitting to obtain the predicted molecular property.
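One message-passing step with the message function M(h_v, h_w, e_vw) = A(e_vw) h_w can be sketched as follows. Two simplifications are assumed here and are not part of the source: the edge network A is replaced by a lookup table from bond type to a fixed d × d matrix, and the GRU update U_t is replaced by a stand-in tanh(h + m), so the sketch shows the aggregation pattern only. All names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, h: Vector) -> Vector:
    """Multiply a d x d matrix by a length-d vector."""
    return [sum(a * x for a, x in zip(row, h)) for row in A]


def aggregate_messages(
    h: Dict[int, Vector],
    edges: Dict[Tuple[int, int], str],
    edge_net: Dict[str, Matrix],
) -> Dict[int, Vector]:
    """m_v = sum over neighbors w of A(e_vw) h_w, for every vertex v."""
    d = len(next(iter(h.values())))
    messages = {v: [0.0] * d for v in h}
    for (v, w), bond in edges.items():
        A = edge_net[bond]  # stand-in for the edge network A(e_vw)
        # Bonds are undirected, so each endpoint receives a message from the other.
        for src, dst in ((w, v), (v, w)):
            m = matvec(A, h[src])
            messages[dst] = [a + b for a, b in zip(messages[dst], m)]
    return messages


def update_step(h: Dict[int, Vector], messages: Dict[int, Vector]) -> Dict[int, Vector]:
    """Stand-in for the GRU update U_t: h_v <- tanh(h_v + m_v). NOT a real GRU."""
    return {v: [math.tanh(a + b) for a, b in zip(h[v], messages[v])] for v in h}
```

Repeating `aggregate_messages` and `update_step` T times would correspond to the T convolution operations; a faithful implementation would use a learned edge network and a GRU in their place.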

d) The structure of the MGCN model is shown in Fig. 6. Its computation can be divided into five stages: input, preprocessing, message passing, readout and output. In the input stage, the molecule is modeled as a graph (denoted G) according to the molecular structure information, and the input of the model comprises the set of vertex vectors (denoted a^0) and the set of edge vectors (denoted e) of G. The components of each vertex vector store the atom type of the atom corresponding to the vertex, whether the atom is on an aromatic ring, and its hybridization type; each edge vector stores the type and the bond length of the chemical bond corresponding to the edge. In the preprocessing stage, the embedding layer converts the vertex vector set and the edge vector set into the vertex embedding

Figure BDA0002490266490000058

And edge embedding

Figure BDA0002490266490000059

Meanwhile, the Radial Basis Function (RBF) layer converts the bond length into a distance tensor

Figure BDA00024902664900000510

The component d_ij of D represents the distance between atoms i and j. In the message-passing stage, the interaction layers are organized in a hierarchical structure in order to model the quantum interactions between atoms. The edge state and the vertex state output by the l-th interaction layer are recorded as

Figure BDA0002490266490000062

Then:

wherein N represents all molecules in all molecular systems, dijRepresents the distance between atoms i and j; h iseIs an edge state update function, hvIs a vertex state update function, heThe concrete form of (A) is as follows:

where η is a constant, set here to 0.8; W_ue is a weight matrix; ⊕ denotes element-wise addition and ⊙ denotes element-wise multiplication. The specific form of h_v is:

where v is a vertex representing an atom in the graph; u is a point in the Gaussian radial basis, an auxiliary parameter used to help represent spatial properties; f denotes a function, with f_a, f_d and f_e denoting functions related to the vertex, the distance and the edge, respectively; and M(x) denotes a linear network, i.e. a fully connected layer of the form M(x) = Wx + b, where W is a weight matrix, x stands for the argument in the parentheses, and b is a constant (bias) term that is adjusted automatically during optimization. Then the outputs of the T interaction layers are concatenated with the initial vertex state

Figure BDA0002490266490000069

to obtain a vector a_i. Thereafter, the readout phase generates a high-level feature representation g of the graph G:

Figure BDA00024902664900000610

In the formula, r denotes a weight matrix in the final readout layer of the MGCN model, which is optimized automatically during model training, and σ denotes the softplus function, which serves as the activation function. In the output stage, the high-level feature representation g and the number of basis functions are fed together into a fully connected network for fitting to obtain the prediction result of the molecular property.
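The building blocks named in the MGCN description can be sketched as follows. This is a hedged illustration, not the formulas of the embodiment, which are given only as figures: the Gaussian form of the radial basis expansion, the parameter `gamma`, the centers, and in particular the exact form of the edge update (η e_ij ⊕ (1 − η) W_ue (a_i ⊙ a_j)) are assumptions chosen to be consistent with the symbols η, W_ue, ⊕, ⊙, M(x) and softplus defined in the text.

```python
import numpy as np

def rbf_expand(dist, centers, gamma=10.0):
    """Expand distances on Gaussian radial basis functions (assumed form).

    dist: array of interatomic distances d_ij; centers: the points u of the basis.
    Returns an array with one extra trailing axis of length len(centers).
    """
    dist = np.asarray(dist, dtype=float)
    return np.exp(-gamma * (dist[..., None] - np.asarray(centers)) ** 2)

def softplus(x):
    """Activation sigma(x) = log(1 + exp(x)), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)

def linear(x, W, b):
    """Linear network M(x) = Wx + b, applied to each row of x."""
    return x @ W.T + b

def edge_update(e_ij, a_i, a_j, W_ue, eta=0.8):
    """Assumed edge update: keep a fraction eta of the old edge state and mix in
    a term built from the element-wise product of the two vertex states."""
    return eta * e_ij + (1.0 - eta) * (W_ue @ (a_i * a_j))

# Toy usage: distances of a three-atom system expanded on four centers.
D = np.array([[0.0, 0.96, 0.96],
              [0.96, 0.0, 1.51],
              [0.96, 1.51, 0.0]])
D_rbf = rbf_expand(D, centers=np.linspace(0.0, 1.5, 4))
e_next = edge_update(np.zeros(3), np.ones(3), np.ones(3), np.eye(3))
```

With η = 0.8 the edge state changes slowly from layer to layer, so each interaction layer adds only a correction of weight 0.2 derived from the two atoms that the edge connects.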

(3) Scheduling module

This module is primarily responsible for interacting with the machine learning/artificial intelligence model library of the prediction module, and its behavior depends on the user input information passed on from the input module. For ease of description, the set of states that a molecule of a given structure has once a calculation strategy has been fixed is referred to as a "result space"; each result space corresponds to one property calculation scheme, i.e. a combination of a particular convergence strategy, a particular functional and a particular basis set. A number of combinations of convergence strategy, functional and basis set are selected, several machine learning models are trained for each combination, and the trained models are packaged in the prediction module. The result space corresponding to a selected combination of functional and basis set is referred to as a "known space", and all other result spaces are referred to as "unknown spaces".

The behavior of the scheduling module may be as follows:

1) If the functional and basis set information in the input information belongs to a known space, the scheduling module passes the input information directly to the corresponding model in the machine learning model library. The model performs inference on the input information and returns the prediction result to the scheduling module, which passes it on to the output module.

2) If the functional and basis set information in the input information belongs to an unknown space, the scheduling module uses an approximation strategy to infer the property prediction result of the molecule. The approximation strategies fall into two categories:

a) Similar-space strategy

This strategy compares the similarity between the known and unknown basis sets or functionals applicable to the input molecule, where "known" means that the basis set or functional is contained in a known space and "unknown" means that it is not. Basis set similarity here means that two basis sets, although of different types, have the same number of basis functions. Functional similarity means that two functionals belong to the same class. For a given input molecule, if the basis set is known and the functional is unknown, a space with the same basis set and a similar functional can be found among the known spaces; the scheduling module then calls the corresponding model from the prediction module to obtain a prediction result and passes the result to the output module.

b) Hyper-plane (fitting) strategy

This strategy corresponds to the case where both the basis set and the functional adopted for the input molecule are unknown. Because the same molecule can serve as a linking molecule that connects different chemical worlds, the space spanned by the linking molecules is a hyperplane. Within the hyperplane, the gradation of the property features of the same molecule across different result spaces can be derived by simple fitting or by the machine learning methods described above. With several linking molecules, a more general gradation of molecular property features across result spaces can be obtained by averaging. Based on the known gradation, the property features of the molecule to be predicted in any result space can be inferred from a small amount of data built into the models.
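The behavior of the scheduling module described above can be sketched in Python as follows. All names here are hypothetical and serve only as illustration: `KNOWN_MODELS` stands in for the machine learning model library (with plain functions instead of trained models), `FUNCTIONAL_CLASS` is an assumed lookup of functional classes, and the hyperplane strategy is reduced to a straight-line fit over property values of molecules shared between a known result space and the requested one, which is only one possible reading of "simple fitting".

```python
# Hypothetical model library: (functional, basis set) -> model (plain callables here).
KNOWN_MODELS = {
    ("B3LYP", "6-31G"): lambda mol: 1.0 * mol["n_atoms"],
    ("PBE", "6-31G"): lambda mol: 1.1 * mol["n_atoms"],
}

# Assumed classification used to decide whether two functionals are "similar".
FUNCTIONAL_CLASS = {"B3LYP": "hybrid", "PBE0": "hybrid", "PBE": "GGA", "BLYP": "GGA"}

def fit_line(xs, ys):
    """Least-squares fit y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def schedule(mol, functional, basis, link_data=None):
    """Return (strategy, prediction) for the requested result space.

    link_data (hypothetical): (known_key, values_in_known_space,
    values_in_requested_space) for molecules shared by both result spaces.
    """
    key = (functional, basis)
    if key in KNOWN_MODELS:                       # 1) known space
        return "known", KNOWN_MODELS[key](mol)
    cls = FUNCTIONAL_CLASS.get(functional)
    for (f, b), model in KNOWN_MODELS.items():    # 2a) similar space
        if cls is not None and b == basis and FUNCTIONAL_CLASS.get(f) == cls:
            return "similar", model(mol)
    if link_data is not None:                     # 2b) hyperplane (fitting)
        known_key, xs, ys = link_data
        a, b0 = fit_line(xs, ys)
        return "hyperplane", a * KNOWN_MODELS[known_key](mol) + b0
    raise LookupError("no strategy available for %s/%s" % key)

mol = {"n_atoms": 3}
r_known = schedule(mol, "B3LYP", "6-31G")
r_similar = schedule(mol, "PBE0", "6-31G")
r_hyper = schedule(mol, "M06", "cc-pVDZ",
                   link_data=(("B3LYP", "6-31G"), [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
```

The order of the checks mirrors the text: an exact match is used when available, a similar space is tried next, and the fitting strategy is the last resort because it relies on additional reference data.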

(4) Output module

The output module receives the prediction result transmitted by the scheduling module and outputs the result.

The molecular property prediction system of this embodiment is implemented in the Python language. By means of the RDKit module, the system supports molecular files in SDF format as input, and it constructs Python-supported object types from these files. In addition, the functional and basis set types adopted for the molecule must be specified at input. Once the basis set type has been determined, the system calculates the corresponding number of basis functions using the information provided by the "Basis Set Exchange" quantum chemistry database. The random forest classifier of the RF model is implemented with the scikit-learn module, and the five feedforward neural networks are implemented with the TensorFlow deep learning framework. The Bi-LSTM, MPNN and MGCN models are all implemented with the PyTorch deep learning framework.
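The calculation of the number of basis functions from the basis set type can be illustrated with the following sketch. It is an assumption-laden simplification: instead of querying the Basis Set Exchange database, it uses a small hard-coded table `N_BASIS` (a hypothetical name) covering only H, C, N and O in the STO-3G and 6-31G basis sets, and it takes the molecule as a plain list of element symbols rather than an object built from an SDF file.

```python
# Basis functions per atom for two common basis sets (H and second-row elements only).
# Hard-coded for illustration; the embodiment obtains this information from the
# Basis Set Exchange database.
N_BASIS = {
    "STO-3G": {"H": 1, "C": 5, "N": 5, "O": 5},
    "6-31G": {"H": 2, "C": 9, "N": 9, "O": 9},
}

def count_basis_functions(elements, basis_set):
    """Sum the per-atom basis function counts over all atoms of the molecule."""
    table = N_BASIS[basis_set]
    return sum(table[el] for el in elements)

water = ["O", "H", "H"]
methane = ["C", "H", "H", "H", "H"]
n_water_sto3g = count_basis_functions(water, "STO-3G")   # 5 + 1 + 1
n_water_631g = count_basis_functions(water, "6-31G")     # 9 + 2 + 2
```

The resulting count is the quantity that is fed, together with the learned molecular representation, into the fully connected output network.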

The above embodiments are only intended to illustrate the technical solution of the present invention, but not to limit it, and a person skilled in the art can modify the technical solution of the present invention or substitute it with an equivalent, and the protection scope of the present invention is subject to the claims.
