Molecular similarity search


This technology, "Molecular similarity search," was designed and created by E. Erez on 2021-03-15. Abstract: A system for finding molecules similar to a query molecule, comprising: a GCN, a PFS vector extractor, a Compensated Vector Comparator (CVC), and a candidate vector selector. The GCN has been trained to output molecular attribute vectors from an input query molecular vector or an input candidate molecular vector, respectively. The GCN converts a query Atomic Feature Set (AFS) vector and candidate AFS vectors into a query property feature set (PFS) embedding vector and candidate PFS embedding vectors. The PFS vector extractor extracts the query PFS embedding vector and the candidate PFS embedding vectors from a hidden layer of the trained GCN. The Compensated Vector Comparator (CVC) computes a Compensated Similarity Measure (CSM) for at least one pair of the query PFS embedding vector and one candidate PFS embedding vector. The candidate vector selector selects only those candidate molecular vectors whose CSM value is above a predetermined threshold.

1. A method for finding molecules similar to a query molecule, the method comprising:

converting a query Atomic Feature Set (AFS) vector and a candidate AFS vector into a query property feature set (PFS) embedding vector and a candidate PFS embedding vector using a GCN that has been trained to output molecular attribute vectors from an input query molecular vector or an input candidate molecular vector, respectively;

extracting the query PFS embedding vector and the candidate PFS embedding vector from a hidden layer of the trained GCN;

computing a Compensated Similarity Measure (CSM) for at least one pair of the query PFS embedding vector and one of the candidate PFS embedding vectors; and

selecting only those candidate molecular vectors whose CSM value is above a predetermined threshold value.

2. The method of claim 1, wherein the compensation attempts to compensate for inaccuracies caused by varying positions of an atomic feature set at an input layer of the trained GCN.

3. The method of claim 1, wherein the calculating comprises:

for each candidate PFS embedding vector:

summing all possible combinations of dot products between the attribute feature sets in the query PFS embedding vector and the attribute feature sets in the candidate PFS embedding vector; and

normalizing the dot product sum by dividing it by the number of attribute feature sets in the candidate PFS embedding vector.

4. The method of claim 1, wherein the trained GCN comprises an input layer, four hidden layers, and an output layer.

5. The method of claim 1, wherein each of the PFS embedding vectors comprises a plurality of attribute feature sets.

6. The method of claim 1, wherein the attribute of the trained GCN is one of: solubility, blood-brain barrier penetration, and toxicity.

7. The method of claim 4, wherein extracting the query PFS embedding vector and the candidate PFS embedding vectors is performed at an output of the fourth said hidden layer.

8. The method of claim 1, wherein the candidate AFS vector is a vector used to train the GCN.

9. The method of claim 1, wherein adjusting the predetermined threshold value changes the number of candidate molecular vectors that are considered similar to the query molecular vector.

10. A system for finding molecules similar to a query molecule, the system comprising:

a GCN that has been trained to output molecular attribute vectors from an input query molecular vector or an input candidate molecular vector, respectively, for converting a query Atomic Feature Set (AFS) vector and a candidate AFS vector into a query property feature set (PFS) embedding vector and a candidate PFS embedding vector;

a PFS vector extractor for extracting the query PFS embedding vector and the candidate PFS embedding vectors from a hidden layer of the trained GCN;

a Compensated Vector Comparator (CVC) for calculating a Compensated Similarity Measure (CSM) for at least one pair of the query PFS embedding vector and one of the candidate PFS embedding vectors; and

a candidate vector selector for selecting only those candidate molecular vectors for which said CSM value is above a predetermined threshold value.

11. The system of claim 10, wherein the Compensated Vector Comparator (CVC) attempts to compensate for inaccuracies caused by varying positions of an atomic feature set at an input layer of the trained GCN.

12. The system of claim 11, wherein the CVC comprises:

a dot product summer for summing, for each candidate PFS embedding vector, all possible combinations of dot products between the attribute feature sets in the query PFS embedding vector and the attribute feature sets in the candidate PFS embedding vector, thereby producing a dot product sum (DPS); and

a DPS normalizer to normalize, for each candidate PFS embedding vector, the DPS by dividing the DPS by the number of attribute feature sets in the candidate PFS embedding vector.

13. The system of claim 10, wherein the trained GCN comprises an input layer, four hidden layers, and an output layer.

14. The system of claim 10, wherein each of the PFS embedding vectors comprises a plurality of attribute feature sets.

15. The system of claim 10, wherein the attribute of the trained GCN is one of: solubility, blood-brain barrier penetration, and toxicity.

16. The system of claim 13, wherein the PFS vector extractor extracts the query PFS embedding vector and the candidate PFS embedding vectors from an output of the fourth hidden layer.

17. The system of claim 10, wherein the candidate AFS vector is a vector used to train the GCN.

18. The system of claim 10, wherein the candidate vector selector is operable to change the predetermined threshold value in order to change the number of candidate molecular vectors that are considered similar to the query molecular vector.

Technical Field

The present application relates generally to similarity searches, and specifically to molecular similarity searches.

Background

One of the mainstays of the pharmaceutical industry is small-molecule drugs. Pharmaceutical researchers search for molecules that will, for example, inhibit enzymes or activate receptors in a desired way. It is known to use Artificial Intelligence (AI) for molecular property prediction.

Drug manufacturers use molecular similarity searches to attempt to predict attributes such as: solubility, the extent to which a molecule may dissolve into the blood or enter a cell membrane; toxicity, the extent to which a molecule may damage an organism; and blood-brain barrier (BBB) penetration, whether a molecule enters the brain. After first screening molecules for structure, researchers employ deep learning techniques to find molecules with desirable properties similar to known molecules.

Researchers use a neural network, in this case a Convolutional Neural Network (CNN) or a Graph Convolutional Network (GCN), as a mathematical model to identify the properties of the molecules. These may be implemented on software platforms (e.g., RDKit, DeepChem, etc.).

Referring now to FIGS. 1A and 1B, a GCN 1 is shown, which includes a plurality of neural layers: an input layer 2, a plurality of hidden layers 3, and an output layer 4. Each layer comprises a plurality of nodes 6, and the nodes in each layer may be connected by a plurality of connections 7. Each node may be fully connected to each node in the previous and subsequent layers, but this is not required.

As described in detail below, an input vector V_i representing the structure and atomic characteristics of a molecule enters GCN 1 at the input layer 2 and traverses the hidden layers 3, and an output vector V_o leaves GCN 1 at the output layer 4.

There are two main modes of operating the GCN: training mode and operational mode (the latter including testing, validation, and routine use of GCN 1). During training, input vectors V_i having known output values V_o are passed through GCN 1. The nodes 6, weights W, connections 7, and other features of GCN 1 (described further below) are adjusted, for example via a cross-entropy loss, so that when V_i traverses GCN 1, GCN 1 converts V_i to equal the known value V_o at the output layer 4. Training the GCN to perform accurate conversions is a complex task, as is well known in the art.
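
As an illustration of this training mode, the following is a minimal PyTorch sketch of a supervised training loop; the two-layer stand-in model, the random data, and the multi-label BCE loss (a common cross-entropy variant for 12-bit targets) are assumptions for the sketch, not the structure of the patent's GCN 1.

```python
# A minimal sketch of the training loop described above (PyTorch).
# The placeholder model and the random data are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for GCN 1
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 12),           # 12 outputs, as in Tox21
)
loss_fn = nn.BCEWithLogitsLoss()  # multi-label cross-entropy variant
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

V_i = torch.randn(64, 128)                    # inputs with known outputs
V_o = torch.randint(0, 2, (64, 12)).float()   # known 12-bit attribute vectors

for epoch in range(10):
    optimizer.zero_grad()
    pred = model(V_i)          # forward pass: V_i traverses the network
    loss = loss_fn(pred, V_o)  # compare against the known V_o
    loss.backward()            # adjust weights W to reduce the loss
    optimizer.step()
```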

Once the GCN is trained, another set of input vectors is used to test and verify whether the GCN's conversion is reliable and accurate. Another set of test input vectors, also having known output values, is passed through GCN 1, and the actual V_o results are compared with the known V_o values. If the results are acceptable, the GCN is considered to have been trained. Once trained, the GCN can be used to predict the output of unknown query vectors.

Researchers strive to create a perfect transformation model within the GCN that will generate the desired output for a given input. For example, the structural and atomic properties (called features) of a molecule can be input into the GCN, and the toxicological properties of such a molecule can be predicted at the output. As known to those skilled in the art, various deep learning techniques are used to improve the GCN during its training phase. These techniques include, but are not limited to, neighbor feature aggregation layers, normalization layers, pooling layers, nonlinear conversion layers, readout layers, and the like. Current GCN technology is described in: the website publication "Deep Learning" at http://www.deeplearningbook.org; "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation", published by ACM in 2019; and "Semi-Supervised Classification with Graph Convolutional Networks", published at ICLR 2017.

Using the toxicology example mentioned above, the United States Environmental Protection Agency, the National Toxicology Program, the National Center for Advancing Translational Sciences, and the Food and Drug Administration formed the Tox21 consortium, which created the Tox21 molecular attribute dataset. The Tox21 dataset includes a database of over 12,000 molecules for training, validation, and testing of GCNs. Each training molecule has a set of 12 known toxicological attributes, which are used by GCN 1 during training to self-adjust the nodes 6, connections 7, weights W, and other GCN characteristics mentioned above, training the GCN to output the correct Tox21 12-bit attribute set for a given input molecule.

The Tox21 dataset has a set of input vectors with known output vectors that can be used to train GCN 1. Other sets of vectors are included in the dataset for testing and validation. There are approximately 12,000 vectors available in total. The training subset is selected to reflect the range of input types used with GCN 1. Likewise, the validation vectors are a set of molecules that test the full breadth of the GCN's performance but are not used during training. Finally, once GCN 1 has been tested and validated, unknown molecular vectors are input into GCN 1 and their Tox21 attributes are predicted at output layer 4.
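
Since the background mentions DeepChem as a platform, the Tox21 train/validation/test split can be loaded roughly as sketched below; the loader's defaults and return shapes may differ between DeepChem versions, so treat this as an assumption-laden illustration.

```python
# A sketch of loading the Tox21 splits, assuming DeepChem is installed;
# featurizer and splitter defaults vary by DeepChem version.
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_tox21()
train, valid, test = datasets
print(len(tasks))  # 12 toxicological attributes
print(train.X.shape[0], valid.X.shape[0], test.X.shape[0])  # ~12,000 total
```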

Referring now to FIGS. 2A and 2B, the input and output vectors of GCN 1 are shown. Each input vector V_i includes an Atomic Feature Set (AFS) 10 for each of the s atoms in the molecule and a Spatial Data File (SDF) 11. Each AFS 10 describes one atom in the input molecule and includes 128 features. The SDF 11 defines the structure and adjacency of the atoms within the molecule and is used by GCN 1 to account for the effects of adjacent atoms.
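
As an illustration of what one input vector carries, the sketch below builds per-atom feature sets and the adjacency information for a water molecule with RDKit; the four features used here are stand-ins for the patent's 128 features, chosen only for the example.

```python
# A sketch of an input vector's pieces for H2O: per-atom feature sets
# (a few illustrative features instead of 128) plus the adjacency that
# an SDF file would carry. Assumes RDKit is installed.
import numpy as np
from rdkit import Chem

mol = Chem.AddHs(Chem.MolFromSmiles('O'))   # water: O plus two explicit H

afs = np.array([[atom.GetAtomicNum(),       # one Atomic Feature Set per atom
                 atom.GetDegree(),
                 atom.GetFormalCharge(),
                 int(atom.GetIsAromatic())]
                for atom in mol.GetAtoms()], dtype=float)

adjacency = Chem.GetAdjacencyMatrix(mol)    # structure/adjacency of atoms
print(afs.shape)   # (3, 4): s = 3 atoms, 4 illustrative features each
print(adjacency)   # O bonded to each H; the two H atoms are not bonded
```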

The output vector V_o is a 12-bit binary vector representing the Tox21 molecular attributes 13 of the molecule. These 12 attributes are divided into a 7-bit "nuclear receptor panel" with seven toxicological attributes: (1) estrogen receptor α, LBD (ER, LBD); (2) estrogen receptor α, full (ER, full); (3) aromatase; (4) aryl hydrocarbon receptor (AhR); (5) androgen receptor, full (AR, full); (6) androgen receptor, LBD (AR, LBD); (7) peroxisome proliferator-activated receptor gamma (PPAR-γ); and a 5-bit "stress response panel" with five toxicological attributes: (8) nuclear factor (erythroid-derived 2)-like 2/antioxidant response element (Nrf2/ARE); (9) heat shock factor response element (HSE); (10) ATAD5; (11) mitochondrial membrane potential (MMP); (12) p53.
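
For illustration only, a 12-bit output vector can be read against the attribute names above as follows; the bit ordering and abbreviated names here are assumptions for the example, not a specification of the Tox21 encoding.

```python
# Illustrative only: viewing a 12-bit Tox21 output vector V_o against
# the attribute names listed above (ordering is an assumption).
nuclear_receptor_panel = ['ER-LBD', 'ER-full', 'Aromatase', 'AhR',
                          'AR-full', 'AR-LBD', 'PPAR-gamma']
stress_response_panel = ['Nrf2/ARE', 'HSE', 'ATAD5', 'MMP', 'p53']
attributes = nuclear_receptor_panel + stress_response_panel

V_o = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]   # example prediction
print([name for name, bit in zip(attributes, V_o) if bit])  # active bits
```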

Disclosure of Invention

According to a preferred embodiment of the present invention, a method for finding molecules similar to a query molecule is provided. The method includes: converting a query Atomic Feature Set (AFS) vector and a candidate AFS vector into a query property feature set (PFS) embedding vector and a candidate PFS embedding vector using a GCN that has been trained to output molecular attribute vectors from an input query molecular vector or an input candidate molecular vector, respectively. The method further includes: extracting the query PFS embedding vector and the candidate PFS embedding vector from a hidden layer of the trained GCN; computing a Compensated Similarity Measure (CSM) for at least one pair of the query PFS embedding vector and one candidate PFS embedding vector; and selecting only those candidate molecular vectors whose CSM value is above a predetermined threshold value.

Furthermore, in accordance with a preferred embodiment of the present invention, the compensation attempts to compensate for inaccuracies caused by the varying position of the atomic feature set at the input layer of the trained GCN.

Further in accordance with a preferred embodiment of the present invention, the calculating comprises: for each candidate PFS embedding vector, summing all possible combinations of dot products between the attribute feature sets in the query PFS embedding vector and the attribute feature sets in the candidate PFS embedding vector, and normalizing the dot product sum by dividing it by the number of attribute feature sets in the candidate PFS embedding vector.

Furthermore, in accordance with a preferred embodiment of the present invention, the trained GCN includes an input layer, four hidden layers, and an output layer.

In addition, according to a preferred embodiment of the present invention, each PFS embedding vector includes a plurality of attribute feature sets.

Furthermore, according to a preferred embodiment of the invention, the attribute of the trained GCN is one of solubility, blood-brain barrier penetration, or toxicity.

Furthermore, in accordance with a preferred embodiment of the present invention, extracting the query PFS embedding vector and the candidate PFS embedding vector is performed at the output of the fourth hidden layer.

Furthermore, in accordance with a preferred embodiment of the present invention, the candidate AFS vectors are vectors used to train the GCN.

Additionally in accordance with a preferred embodiment of the present invention, adjusting the predetermined threshold value changes the number of candidate molecular vectors that are considered similar to the query molecular vector.

There is also provided, in accordance with a preferred embodiment of the present invention, a system for finding molecules similar to a query molecule. The system comprises: a GCN, a PFS vector extractor, a Compensated Vector Comparator (CVC), and a candidate vector selector. The GCN has been trained to output molecular attribute vectors from an input query molecular vector or an input candidate molecular vector, respectively. The GCN converts a query Atomic Feature Set (AFS) vector and candidate AFS vectors into a query property feature set (PFS) embedding vector and candidate PFS embedding vectors. The PFS vector extractor extracts the query PFS embedding vector and the candidate PFS embedding vectors from a hidden layer of the trained GCN. The Compensated Vector Comparator (CVC) computes a Compensated Similarity Measure (CSM) for a pair of one query PFS embedding vector and one candidate PFS embedding vector. The candidate vector selector selects only those candidate molecular vectors whose CSM value is above a predetermined threshold value.

Additionally, in accordance with a preferred embodiment of the present invention, a Compensated Vector Comparator (CVC) attempts to compensate for inaccuracies caused by the varying position of the atomic feature set at the input layer of the trained GCN.

Further in accordance with a preferred embodiment of the present invention, the CVC includes a dot product summer and a DPS normalizer. The dot product summer sums, for each candidate PFS embedding vector, all possible combinations of dot products between the attribute feature sets in the query PFS embedding vector and the attribute feature sets in the candidate PFS embedding vector, producing a dot product sum (DPS). The DPS normalizer normalizes the DPS for each candidate PFS embedding vector by dividing it by the number of attribute feature sets in that candidate PFS embedding vector.

Further in accordance with a preferred embodiment of the present invention the candidate vector selector varies the value of the predetermined threshold value so as to vary the number of candidate molecular vectors that are considered similar to the query molecular vector.

Drawings

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:

FIGS. 1A and 1B are illustrations of a GCN including multiple neural layers;

FIGS. 2A and 2B are diagrams of input and output vectors of a GCN;

FIG. 3 is a schematic representation of a toxicology molecular similarity search system;

FIG. 4 is a diagram of layers in an embodiment of a trained GCN;

FIG. 5A is a diagram of a TFS embedding vector;

FIG. 5B is a diagram of a Compensated Vector Comparator (CVC);

FIG. 6A is a diagram of an exemplary query TFS embedding vector;

FIG. 6B is a graphical representation of an example of the sum of TFS dot products;

FIG. 7 is a diagram of a general molecular similarity search system.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

Detailed Description

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

Applicants have recognized that, in a toxicology-trained Graph Convolutional Network (GCN), as an input vector comprising a plurality of Atomic Feature Sets (AFS) traverses from the input layer through a plurality of hidden layers, its AFS data is converted into Toxicity Feature Set (TFS) data, which is then further converted into a toxicology attribute vector at the output layer.

Applicants have recognized that this is true not only for toxicology, but also for other molecular attributes, such as blood-brain barrier (BBB) penetration, solubility, and others. In such a GCN trained on a specific molecular attribute, the AFS data is converted into property feature sets (PFS) as the input vector traverses the GCN, and then further converted into the appropriate attribute vector at the output layer. This application uses toxicology as an example.

Applicants have also recognized that, instead of using the toxicology output vector from such a toxicology GCN, TFS embedding vectors may be extracted from within the hidden layers of the GCN and used outside the GCN to mathematically compare their toxicology attributes to those of other extracted TFS embedding vectors.

Applicants have appreciated that the order in which atoms are presented to the input layer of the GCN may affect output accuracy. For example, the AFS vector of a water molecule, having two hydrogen atoms and one oxygen atom, can be presented to the GCN input layer as H-H-O, H-O-H, or O-H-H.

Referring now to FIG. 3, a molecular similarity search system 14 is shown. The system 14 comprises: a GCN 16 that has been trained using the Tox21 dataset; a toxicological molecule candidate database 18, the database 18 comprising, for example, the Tox21 molecular vectors c_AFS,i (as depicted in FIG. 2A); a Toxicity Feature Set (TFS) vector extractor 20 for extracting a query TFS embedding vector q_TFS and a plurality of candidate TFS embedding vectors c_TFS,i from within GCN 16; a TFS embedding vector database 22 for storing the TFS embedding vectors q_TFS and c_TFS,i; a Compensated Vector Comparator (CVC) 24 for calculating a Compensated Similarity Measure (CSM) M_CVC,i between the TFS embedding vectors q_TFS and c_TFS,i, thereby minimizing the influence of the order of the atomic data in q_AFS and c_AFS; a CSM database 26 for storing the values M_CVC,i; and a candidate vector selector 28 for selecting those candidate vectors c_AFS,i that are considered similar to the query vector q_AFS.

Any suitable GCN may be used. Referring now to FIG. 4, the layers in the embodiment of the trained GCN 16 of FIG. 3 are shown. GCN 16 may be configured with an input layer 30 comprising 128 nodes, four hidden layers 32 each also comprising 128 nodes, and an output layer 34 comprising 12 nodes. The GCN 16 of FIG. 4 utilizes its four hidden layers to compute the effect of neighboring atoms as defined in the SDF file mentioned above. At the input layer 30, the computation considers only the atomic feature set of each of the molecule's atoms alone. For example, if H-O-H is presented at the input layer 30, only the feature set of the first H is computed at the first node, only that of O at the second node, and only that of the second H at the third node.
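
To make the shape of such a network concrete, the following is a minimal PyTorch sketch of a GCN with the dimensions described above (128-node input, four 128-node hidden layers, 12-node output), using the standard Kipf-style propagation rule H' = ReLU(Â·H·W). The normalization, dropout, and readout details of GCN 16 are deliberately simplified away; this is an illustration under those assumptions, not the patented network.

```python
# A sketch of a GCN with the layer shape described above. The mean
# readout and the un-normalized adjacency are simplifying assumptions.
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    def __init__(self, in_dim=128, hidden_dim=128, out_dim=12, n_hidden=4):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * n_hidden
        self.hidden = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims, dims[1:])])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, afs, adjacency):
        # A_hat: adjacency with self-loops, so each atom keeps its own features
        a_hat = adjacency + torch.eye(adjacency.shape[0])
        h = afs
        for layer in self.hidden:
            h = torch.relu(layer(a_hat @ h))  # each layer mixes in one more hop
        h = h.mean(dim=0)        # crude readout over atoms (graph gather)
        return self.out(h)       # 12 attribute logits

gcn = SimpleGCN()
afs = torch.randn(3, 128)        # H-O-H: 3 atoms, 128 features each
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])   # bonds: H-O and O-H
print(gcn(afs, adj).shape)           # torch.Size([12])
```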

At the first hidden layer 32, the impact of the feature sets of the first-level neighboring atoms is also computed: at the first node H-O is included, at the second node H-O-H is included, and at the third node O-H is included. At the third hidden layer 32, second-level neighbors are included, which gives H-O-H at the first node and H-O-H at the third node, and at the fourth hidden layer 32, third-level neighbors are included. In the H2O example there is no third neighbor, but in the Tox21 dataset the molecules have about 20 atoms each, and the influence of neighboring atoms on the computation may be greater. The sketch below illustrates how this receptive field grows with depth.
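
As a sketch of this neighbor-widening effect: with self-loops added, the k-th power of the adjacency matrix is nonzero at (u, v) exactly when atom v lies within k hops of atom u. For H-O-H, two hops already cover the whole molecule (the atom indexing H=0, O=1, H=2 is an assumption for the example).

```python
# Receptive-field growth with depth, illustrated on H-O-H with NumPy.
import numpy as np

A = np.array([[0, 1, 0],    # H(0)-O(1) and O(1)-H(2) bonds
              [1, 0, 1],
              [0, 1, 0]])
A_hat = A + np.eye(3, dtype=int)   # add self-loops

for k in range(1, 4):
    reach = np.linalg.matrix_power(A_hat, k) > 0
    print(f'layer {k}: atom 0 sees atoms', np.nonzero(reach[0])[0])
# layer 1: atom 0 sees atoms [0 1]    (itself and O)
# layer 2: atom 0 sees atoms [0 1 2]  (the whole molecule; no third neighbor)
```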

As noted above, a number of deep learning techniques are applied within the GCN to improve its performance and accuracy. In a preferred embodiment of the invention, at the output of the first hidden layer 32 there are: a nonlinear translation (NLT) layer 36 containing 128 ReLUs; a dropout layer 38 set to 0.1; a batch normalization layer 40; and a graph pooling layer 42 set to max-pool over the feature vectors of atoms and their neighbors in the bond graph. At the output of the second hidden layer 32 there are: a nonlinear translation (NLT) layer 36 containing 128 ReLUs; a dropout layer 38 set to 0.1; and a batch normalization layer 40. At the output of the third hidden layer 32 there are: a nonlinear translation (NLT) layer 36 containing 128 ReLUs; and a batch normalization layer 40. At the output of the fourth hidden layer 32 there are: a nonlinear translation (NLT) layer 36 containing 128 ReLUs; a batch normalization layer 40; a graph pooling layer 42; a dense layer 44; another batch normalization layer 40; a graph gather layer 46; and a Softmax layer 48.

It should be understood that the particular technique, number of layers, and number of nodes employed in the GCN16 may vary and are presented herein as examples of configuring a neural network.

Applicants have recognized that the vectors in the Tox21 dataset may be used not only to train the GCN, but also to generate the candidate TFS embedding vectors c_TFS,i that are compared with the query TFS embedding vector q_TFS.

Returning to FIG. 3, the molecular similarity search system 14 obtains candidate vectors c_AFS,i from the toxicological molecule candidate database 18, which contains, for example, about 12,000 Tox21 molecular sample vectors, and passes them through the toxicologically trained GCN 16. The TFS vector extractor 20 extracts the candidate TFS embedding vectors c_TFS,i from the output of the fourth hidden layer 32 (as shown in FIG. 4), before any of the output adjustment layers mentioned above, and stores them in the TFS embedding vector database 22. The query vector q_AFS is also input to GCN 16, and the TFS vector extractor 20 may extract the query TFS embedding vector q_TFS and store it in the TFS embedding vector database 22.
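
One conventional way to pull an embedding out of a hidden layer is sketched below with a PyTorch forward hook on a stand-in network; the placeholder model, its layer indices, and the 3-atom input are assumptions for illustration, not the structure of GCN 16.

```python
# A sketch of extracting a hidden-layer embedding with a forward hook,
# conceptually what the TFS vector extractor 20 does.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 128), nn.ReLU(),   # hidden layers 1..4 (placeholder)
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),   # "fourth hidden layer"
    nn.Linear(128, 12),               # output adjustment / readout
)

captured = {}
def hook(module, inputs, output):
    captured['tfs'] = output.detach()   # per-atom embeddings, one row per atom

model[7].register_forward_hook(hook)    # index 7 = ReLU after 4th hidden layer

v_afs = torch.randn(3, 128)             # a 3-atom molecule's AFS vectors
_ = model(v_afs)                        # forward pass through the network
print(captured['tfs'].shape)            # torch.Size([3, 128]): the TFS vectors
```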

Referring briefly to FIG. 5A, a TFS embedding vector V_TFS is shown; it may be a candidate vector c_TFS,i or the query vector q_TFS. The TFS embedding vector includes a plurality of TFSs 50, one TFS 50 for each of the t atoms in the molecular vector. Such TFS embedding vectors may be stored in the TFS embedding vector database 22.

Applicants have recognized that the arrangement of the atomic feature sets in the input vector V_AFS may also affect the toxicity feature sets in the TFS embedding vector V_TFS. Applicants have also recognized that calculations performed on a TFS embedding vector V_TFS require compensation for the positioning of the TFSs within V_TFS. Applicants have realized that, in the toxicology example, by using the normalized sum of TFS dot products between a pair of embedding vectors as a metric, such positional effects are minimized and a more accurate similarity metric for the vector pair can be calculated.

Referring now to FIG. 5B, the CVC 24 is shown, including a dot product adder 51 and a dot product normalizer 52. The dot product adder 51 may obtain the query TFS embedding vector q_TFS and a candidate TFS embedding vector c_TFS,i from the TFS embedding vector database 22 and calculate the dot product sum of the vectors.

Referring now to FIG. 6A, an exemplary query TFS embedding vector q_TFS and candidate TFS embedding vector c_TFS,i are shown. The query TFS embedding vector q_TFS includes two toxicity feature sets 50, TFS_q1 and TFS_q2; the candidate TFS embedding vector c_TFS,i includes three toxicity feature sets 50, TFS_c1, TFS_c2, and TFS_c3. Referring now to FIG. 6B, an example of the sum of TFS dot products between the embedding vectors q_TFS and c_TFS,i is illustrated. The dot product adder 51 computes the sum of the dot products over all combinations of the toxicity feature sets 50 of the query TFS embedding vector q_TFS and the toxicity feature sets 50 of the candidate TFS embedding vector c_TFS,i, denoted DPS(q_TFS, c_TFS,i), as shown in FIG. 6B and in equation (1):

DPS(q_TFS, c_TFS,i) = [TFS_q1 · TFS_c1] + [TFS_q1 · TFS_c2] + [TFS_q1 · TFS_c3] + [TFS_q2 · TFS_c1] + [TFS_q2 · TFS_c2] + [TFS_q2 · TFS_c3]    equation (1)

The dot product normalizer 52 then normalizes DPS(q_TFS, c_TFS,i) by dividing it by the number of atoms t in the candidate vector c_TFS,i (3 in this example) to complete the CSM calculation, as shown in equation (2):

MCVC,inormalized DPS (q)TFS,cTFS,i)=[DPS(qTFS,cTFS,i)]T equation (2)

The CVC 24 then stores M_CVC,i for each query-candidate TFS pair q_TFS - c_TFS,i in the CSM database 26. The candidate vector selector 28 uses M_CVC,i as a score and selects only those candidate vectors c_AFS,i whose score is above a candidate score threshold. Candidates that score above this threshold are considered similar to the query vector q_AFS.
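
Putting equations (1) and (2) together with the selection step, the following is a minimal NumPy sketch of the CVC pipeline; the random vectors, atom counts, and threshold value are illustrative assumptions.

```python
# A sketch of the CVC computation: sum all pairwise dot products between
# the query's TFSs and a candidate's TFSs (equation 1), normalize by the
# candidate's atom count t (equation 2), then threshold the scores.
import numpy as np

rng = np.random.default_rng(0)
q_tfs = rng.standard_normal((2, 128))             # query: 2 atoms (FIG. 6A)
candidates = [rng.standard_normal((t, 128)) for t in (3, 5, 4)]

def csm(q, c):
    dps = (q @ c.T).sum()     # equation (1): all TFS dot-product combinations
    return dps / c.shape[0]   # equation (2): normalize by atom count t

scores = np.array([csm(q_tfs, c) for c in candidates])
threshold = 0.0                                   # predetermined threshold
similar = np.nonzero(scores > threshold)[0]       # candidate vector selector
print(scores, similar)
```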

It should be noted that the above-described embodiments may be implemented on any suitable computing device. All of the databases may be implemented as separate databases or as portions of a single database. The extracted TFS embedding vectors may be used for any calculation, not only the similarity measure shown above. TFS embedding vectors can be extracted from GCNs trained with any training vector set, not just the toxicity vectors shown above.

Applicants have also recognized that by enabling candidate vector selector 28 to adjust the threshold score by which candidates are considered similar, a user may have the flexibility to adjust the size of the candidate pool without having to retrain the neural network.

Applicants have also recognized that the computations may be implemented as simple Boolean functions and may be performed on all candidate vectors in parallel on an associative memory array (e.g., a Gemini associative processing unit, commercially available from GSI Technology Inc., USA).

As mentioned above, any molecular attribute (e.g., solubility, BBB penetration, or another attribute) may be used to train such a GCN. Referring now to FIG. 7, a generalized molecular similarity search system 60 is shown, comprising: a GCN 62 that has been trained using any known molecular attribute; a molecular candidate database 64 comprising the molecular vectors c_AFS,i (as depicted in FIG. 2A); a property feature set (PFS) vector extractor 66 for extracting a query PFS embedding vector q_PFS and a plurality of candidate PFS embedding vectors c_PFS,i from within GCN 62; a PFS embedding vector database 68 for storing the PFS embedding vectors q_PFS and c_PFS,i; a Compensated Vector Comparator (CVC) 70 for computing the Compensated Similarity Measure (CSM) M_CVC,i between the PFS embedding vectors q_PFS and c_PFS,i, which attempts to minimize the influence of the order of the atomic data in q_AFS and c_AFS; a CSM database 72 for storing the values M_CVC,i; and a candidate vector selector 74 for selecting those candidate vectors c_AFS,i that are considered similar to the query vector q_AFS.

Unless specifically stated otherwise, as apparent from the preceding discussion, it is appreciated that, throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," or the like refer to the action and/or processes of any type of general-purpose computer (e.g., a client/server system, a mobile computing device, a smart appliance, a cloud computing unit, or a similar electronic computing device) that manipulates and/or transforms data within the computing system's registers and/or memories into other data within the computing system's memories, registers, or other such information storage, transmission, or display devices.

Embodiments of the present invention may include apparatuses for performing the operations herein. The apparatus may be specially constructed for the desired purposes, or it may comprise a computing device or system, typically having at least one processor and at least one memory, selectively activated or reconfigured by a computer program stored in the computer. When instructed by software, the resulting apparatus may transform a general-purpose computer into the inventive elements discussed herein. The instructions may define the inventive apparatus in operation with the desired computer platform. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including optical disks, magneto-optical disks, read-only memories (ROMs), volatile and non-volatile memories, random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, key disks, or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus. The computer-readable storage medium may also be implemented in cloud storage.

Some general purpose computers may include at least one communication element to enable communication with a data network and/or a mobile communication network.

The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
