Drug molecule attribute determination method, device and storage medium

Document No. 1044857; published 2020-10-09

Abstract: This technology, Drug molecule attribute determination method, device and storage medium (药物分子属性确定方法、装置及存储介质), was designed by 叶阁焰, 刘伟, and 黄俊洲 on 2020-07-30. The application discloses a method and a device for determining drug molecule attributes and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring a text character string of a drug molecule to be detected, the text character string being used for describing a chemical structural formula of the drug molecule to be detected; acquiring three-dimensional structure information of the drug molecule to be detected according to the text character string; and determining the drug forming property of the drug molecule to be detected according to the three-dimensional structure information. The embodiment of the application provides a new drug molecule attribute prediction scheme that acquires the three-dimensional structure information of the drug molecule to be detected. The three-dimensional structure information gives the position of each atom of the drug molecule in three-dimensional space, and the spatial structure of a drug molecule can influence its properties. Drug molecule attributes can therefore be accurately predicted based on the three-dimensional structure information, which in turn can increase the discovery speed of new candidate drugs and reduce research and development cost.

1. A method for determining a property of a drug molecule, comprising:

acquiring a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected;

acquiring three-dimensional structure information of the drug molecules to be detected according to the text character strings;

and determining the drug forming property of the drug molecules to be detected according to the three-dimensional structure information.

2. The method of claim 1, further comprising:

acquiring two-dimensional structure information of the drug molecules to be detected according to the text character strings;

acquiring the atomic characteristics and chemical bond characteristics of the drug molecules to be detected according to the text character strings;

determining the drug forming property of the drug molecule to be detected according to the three-dimensional structure information, wherein the determining comprises the following steps:

and determining the drug forming property of the drug molecules to be detected according to the three-dimensional structure information, the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics.

3. The method according to claim 1, wherein the obtaining the three-dimensional structure information of the drug molecule to be tested according to the text string comprises:

acquiring the three-dimensional structure coordinates of the drug molecules to be detected according to the text character strings;

and on the premise that the three-dimensional structure shape of the drug molecule to be detected is not changed, transforming the current three-dimensional structure coordinate of the drug molecule to be detected, and taking the obtained three-dimensional structure coordinate matrix as the three-dimensional structure information.

4. The method according to claim 2, wherein the obtaining the two-dimensional structure information of the drug molecule to be tested according to the text string comprises:

acquiring an adjacency matrix of the two-dimensional structure chart of the drug molecules to be detected according to the text character string;

and normalizing the adjacent matrix of the two-dimensional structure chart, and taking the obtained normalized adjacent matrix as the two-dimensional structure information.

5. The method of claim 2, wherein the determining the drug property of the drug molecule to be tested according to the three-dimensional structure information, the two-dimensional structure information, the atomic feature and the chemical bond feature comprises:

performing feature splicing processing on the three-dimensional structure information, the two-dimensional structure information, the atomic features and the chemical bond features to obtain a first splicing matrix;

inputting the first splicing matrix into a molecular attribute prediction network;

obtaining a predicted attribute value output by the molecular attribute prediction network; the prediction attribute value is used for indicating the drug forming attribute of the drug molecule to be detected.

6. The method according to claim 3, wherein the obtaining three-dimensional structure coordinates of the drug molecule to be tested according to the text string comprises:

acquiring a chemical structural formula of the drug molecule to be detected according to the text character string;

determining M three-dimensional structures with different configurations according to the chemical structural formula of the drug molecule to be detected; wherein the root mean square error between two three-dimensional structures having different configurations is greater than a first threshold; M is a positive integer;

performing energy minimization treatment on the M three-dimensional structures under a target molecular force field;

determining a target three-dimensional structure with the minimum energy in the M three-dimensional structures;

removing hydrogen atoms in the target three-dimensional structure to obtain a three-dimensional structure of the drug molecule to be detected;

and acquiring the three-dimensional coordinates of each atom in the drug molecules to be detected under the three-dimensional structure to obtain the three-dimensional structure coordinates of the drug molecules to be detected.

7. The method according to claim 3, wherein the transforming the current three-dimensional structure coordinates of the drug molecules to be detected to obtain the three-dimensional structure coordinate matrix of the drug molecules to be detected on the premise that the three-dimensional structure shape of the drug molecules to be detected remains unchanged comprises:

acquiring a random rotation matrix and a translation transformation matrix;

on the premise that the three-dimensional structure shape of the drug molecule to be detected is unchanged, respectively performing random rotation and translation transformation on the three-dimensional structure of the drug molecule to be detected according to the random rotation matrix and the translation transformation matrix to obtain a three-dimensional structure coordinate matrix;

wherein, the three-dimensional structure coordinate matrix comprises the new three-dimensional structure coordinate of the drug molecule to be detected.

8. The method according to claim 4, wherein the normalizing the adjacency matrix of the two-dimensional structure diagram to obtain a normalized adjacency matrix comprises:

transforming the value of the diagonal element of the adjacency matrix from a first value to a second value to obtain a new adjacency matrix;

and normalizing the new adjacency matrix according to rows to obtain the normalized adjacency matrix.

9. The method of claim 5, further comprising:

acquiring a training data set, wherein the training data set comprises sample molecules and attribute labels matched with the sample molecules;

acquiring a three-dimensional structure coordinate matrix, a normalized adjacency matrix, an atomic characteristic and a chemical bond characteristic of the sample molecule;

performing characteristic splicing treatment on the three-dimensional structure coordinate matrix, the normalized adjacency matrix, the atomic characteristics and the chemical bond characteristics of the sample molecules to obtain a second spliced matrix;

training an initial neural network by taking the second splicing matrix as an input of the initial neural network and taking an attribute label matched with the sample molecule as an output of the initial neural network;

obtaining a difference value between the predicted attribute value output by the initial neural network and the attribute label of the sample molecule based on a target loss function;

and responding to the difference value being larger than a second threshold value, and repeatedly and iteratively updating the network parameters of the initial neural network until the difference value is not larger than the second threshold value, so as to obtain the molecular attribute prediction network.

10. The method according to claim 5 or 9, wherein the molecular property prediction network comprises a feature coding layer, a pooling layer, and a linear layer;

the inputting the first splicing matrix into a molecular attribute prediction network to obtain a prediction attribute value output by the molecular attribute prediction network includes:

inputting the first splicing matrix into the feature coding layer and the pooling layer in sequence;

and inputting the coding vector output by the pooling layer into the linear layer, and taking the output of the linear layer as a prediction attribute value of the drug molecule to be detected.

11. The method according to claim 10, wherein the feature encoding layer comprises N layers of feature encoders with the same structure, which are sequentially stacked, where N is a positive integer; the method further comprises the following steps:

inputting the second splicing matrix as an input feature into a first layer feature encoder of the feature encoding layer;

sequentially coding the input features through each layer of feature encoder which is stacked until the last layer of feature encoder; wherein the output of the feature encoder of the previous layer is used as the input of the feature encoder of the next layer;

and taking the output of the last layer of feature encoder as the output feature of the feature encoding layer.

12. The method of claim 11, wherein each layer of feature encoder comprises a multi-head attention layer and a feedforward neural network layer;

the sequentially coding the input features through each layer of feature encoder which is stacked includes:

for an ith head structure of a multi-head attention layer contained in a jth layer feature encoder, acquiring a first linear transformation matrix, a second linear transformation matrix and a third linear transformation matrix corresponding to the ith head structure; wherein i and j are positive integers, and 1 ≤ j ≤ N;

performing linear transformation processing on the input features of the ith head structure according to the first linear transformation matrix, the second linear transformation matrix and the third linear transformation matrix respectively, to obtain, in that order, a query sequence, a key sequence and a value sequence of the ith head structure; acquiring the output feature of the ith head structure according to the query sequence, the key sequence and the value sequence of the ith head structure;

performing feature splicing processing on the output features of each head structure to obtain combined features;

performing linear transformation processing on the combined features based on a fourth linear transformation matrix to obtain output features of the multi-head attention layer;

and inputting the output features of the multi-head attention layer into the feedforward neural network layer, and taking the output of the feedforward neural network layer as the input features of the (j+1)-th layer feature encoder.

13. A drug molecule property determination apparatus, comprising:

the first acquisition module is configured to acquire a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected;

the second acquisition module is configured to acquire three-dimensional structure information of the drug molecules to be detected according to the text character strings;

and the prediction module is configured to determine the drug forming property of the drug molecule to be detected according to the three-dimensional structure information.

14. A computer device, characterized in that the device comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the method of drug molecule property determination according to any of claims 1 to 12.

15. A storage medium having stored therein at least one program code, the at least one program code being loaded into and executed by a processor to carry out the method of drug molecule property determination according to any one of claims 1 to 12.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a storage medium for determining drug molecule attributes.

Background

AI (Artificial Intelligence) is an emerging science and technology, currently under study and development, for simulating, extending, and expanding human intelligence. AI technology is widely used in many scenarios, such as drug development.

In a drug development scenario, Molecular Property Prediction (MPP) of drugs, also called prediction of the drug forming property of a drug, is performed. Exemplary drug molecule attributes include, but are not limited to, the Absorption, Distribution, Metabolism, Excretion, and Toxicity properties of drug molecules.

In the process of drug research and development, the discovery speed of new candidate drugs can be improved and the research and development cost can be reduced by predicting the drug forming property of drug molecules. In other words, accurately predicting the molecular properties of a drug is a key to increasing the speed of discovering new drug candidates and reducing the development cost.

Disclosure of Invention

The embodiment of the application provides a method and a device for determining drug molecule attributes and a storage medium, which can remarkably improve the prediction accuracy of the drug molecule attributes. The technical scheme is as follows:

in one aspect, a method for determining a drug molecule property is provided, which includes:

acquiring a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected;

acquiring three-dimensional structure information of the drug molecules to be detected according to the text character strings;

and determining the drug forming property of the drug molecules to be detected according to the three-dimensional structure information.

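In another aspect, there is provided a drug molecule property determination apparatus comprising:

the first acquisition module is configured to acquire a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected;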

the second acquisition module is configured to acquire three-dimensional structure information of the drug molecules to be detected according to the text character strings;

and the prediction module is configured to determine the drug forming property of the drug molecule to be detected according to the three-dimensional structure information.

In a possible implementation manner, the second obtaining module is further configured to obtain two-dimensional structure information of the drug molecule to be detected according to the text character string; acquiring the atomic characteristics and chemical bond characteristics of the drug molecules to be detected according to the text character strings;

the prediction module is configured to determine the drug formation property of the drug molecule to be tested according to the three-dimensional structure information, the two-dimensional structure information, the atomic feature and the chemical bond feature.

In a possible implementation manner, the second obtaining module includes:

a first obtaining unit configured to obtain three-dimensional structure coordinates of the drug molecule to be detected according to the text character string;

and the first processing unit is configured to transform the current three-dimensional structure coordinate of the drug molecule to be detected on the premise that the three-dimensional structure shape of the drug molecule to be detected remains unchanged, and take the obtained three-dimensional structure coordinate matrix as the three-dimensional structure information.

In a possible implementation manner, the second obtaining module further includes:

the second acquisition unit is configured to acquire an adjacency matrix of the two-dimensional structure chart of the drug molecules to be detected according to the text character string;

and the second processing unit is configured to perform normalization processing on the adjacent matrix of the two-dimensional structure chart, and take the obtained normalized adjacent matrix as the two-dimensional structure information.

In one possible implementation, the prediction module is configured to:

inputting the first splicing matrix into a molecular attribute prediction network;

In a possible implementation manner, the first obtaining unit is configured to:

acquiring a chemical structural formula of the drug molecule to be detected according to the text character string;

performing energy minimization treatment on the M three-dimensional structures under a target molecular force field;

determining a target three-dimensional structure with the minimum energy in the M three-dimensional structures;

removing hydrogen atoms in the target three-dimensional structure to obtain a three-dimensional structure of the drug molecule to be detected;
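The selection and hydrogen-removal steps above can be sketched as follows. This is a minimal illustration, not the applicants' implementation: it assumes the M conformers have already been generated and energy-minimized by an external cheminformatics toolkit, the `Conformer` container and `select_heavy_atom_coords` helper are hypothetical names, and the coordinates and energies are made-up values.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Conformer:
    """Hypothetical container for one energy-minimized 3D structure."""
    symbols: List[str]   # element symbol per atom, e.g. "C" or "H"
    coords: List[Coord]  # one (x, y, z) tuple per atom
    energy: float        # force-field energy after minimization


def select_heavy_atom_coords(conformers: List[Conformer]) -> List[Coord]:
    """Pick the lowest-energy conformer and drop its hydrogen atoms."""
    if not conformers:
        raise ValueError("at least one conformer (M >= 1) is required")
    target = min(conformers, key=lambda c: c.energy)
    return [xyz for sym, xyz in zip(target.symbols, target.coords) if sym != "H"]


# Illustrative values only: two made-up conformers of a three-atom fragment.
conf_a = Conformer(["C", "O", "H"],
                   [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (-0.5, 0.9, 0.0)], 12.3)
conf_b = Conformer(["C", "O", "H"],
                   [(0.0, 0.0, 0.0), (1.4, 0.1, 0.0), (-0.6, 0.8, 0.1)], 10.7)
heavy_coords = select_heavy_atom_coords([conf_a, conf_b])
```

Here `heavy_coords` holds one coordinate triple per non-hydrogen atom of the lower-energy conformer, which corresponds to the three-dimensional structure coordinates used in the later steps.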

In one possible implementation, the first processing unit is configured to:

acquiring a random rotation matrix and a translation transformation matrix;

wherein, the three-dimensional structure coordinate matrix comprises the new three-dimensional structure coordinate of the drug molecule to be detected.
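A rigid transformation (rotation plus translation) keeps every interatomic distance unchanged, which is one way to read "the three-dimensional structure shape is unchanged". The sketch below illustrates this under stated assumptions: the rotation is sampled from a random unit quaternion, the translation range is arbitrary, and the helper names and sample coordinates are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def random_rotation_matrix(rng: random.Random) -> Matrix:
    """Build a 3x3 rotation matrix from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / norm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def transform(coords: Matrix, rotation: Matrix, translation: List[float]) -> Matrix:
    """Rotate each atom position, then shift it by the translation vector."""
    out = []
    for p in coords:
        rotated = [sum(rotation[i][k] * p[k] for k in range(3)) for i in range(3)]
        out.append([rotated[i] + translation[i] for i in range(3)])
    return out


def pairwise_distances(coords: Matrix) -> List[float]:
    """All interatomic distances, used to check that the shape is preserved."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]


rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.2, 0.0]]  # made-up atoms
rotation = random_rotation_matrix(rng)
translation = [rng.uniform(-1.0, 1.0) for _ in range(3)]
new_coords = transform(coords, rotation, translation)

before = pairwise_distances(coords)
after = pairwise_distances(new_coords)
```

Comparing `before` and `after` confirms that the new coordinate matrix describes the same molecular shape in a different pose.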

In one possible implementation, the second processing unit is configured to:

transforming the value of the diagonal element of the adjacency matrix from a first value to a second value to obtain a new adjacency matrix;

and normalizing the new adjacency matrix according to rows to obtain the normalized adjacency matrix.
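The two normalization steps can be illustrated with a small sketch. The text does not state the first and second values here; the example assumes the common choice of changing the diagonal from 0 to 1 (adding a self-connection for each atom) and then dividing each row by its sum. The function name and the six-atom ring example are illustrative only.

```python
from typing import List


def normalize_adjacency(adj: List[List[int]], diagonal_value: float = 1.0) -> List[List[float]]:
    """Set the diagonal to `diagonal_value`, then normalize each row to sum to 1."""
    n = len(adj)
    with_diag = [
        [diagonal_value if i == j else float(adj[i][j]) for j in range(n)]
        for i in range(n)
    ]
    normalized = []
    for row in with_diag:
        total = sum(row)
        # Guard against an all-zero row (possible only if diagonal_value is 0).
        normalized.append([v / total for v in row] if total else row)
    return normalized


# Six-membered ring (benzene skeleton): atom i is bonded to atoms i-1 and i+1.
n = 6
benzene = [[1 if abs(i - j) in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
norm_adj = normalize_adjacency(benzene)
```

In this example every row contains the atom itself and its two ring neighbours, so each nonzero entry of `norm_adj` is one third.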

In one possible implementation, the training process of the molecular property prediction network includes:

acquiring a training data set, wherein the training data set comprises sample molecules and attribute labels matched with the sample molecules;

acquiring a three-dimensional structure coordinate matrix, a normalized adjacency matrix, an atomic characteristic and a chemical bond characteristic of the sample molecule;

obtaining a difference value between the predicted attribute value output by the initial neural network and the attribute label of the sample molecule based on a target loss function;
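The threshold-controlled training loop can be sketched with a deliberately simple stand-in model. The example below is an assumption-laden illustration: a linear model replaces the neural network, mean squared error plays the role of the target loss function, and the learning rate, threshold, step cap, and toy data are all hypothetical.

```python
from typing import List, Tuple


def mse(preds: List[float], labels: List[float]) -> float:
    """Mean squared error, standing in for the target loss function."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def train(features: List[List[float]], labels: List[float],
          threshold: float = 1e-4, lr: float = 0.1,
          max_steps: int = 10_000) -> Tuple[List[float], float, float]:
    """Update parameters until the loss is no larger than the threshold."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    loss = float("inf")
    n = len(labels)
    for _ in range(max_steps):
        preds = [sum(w * x for w, x in zip(weights, row)) + bias for row in features]
        loss = mse(preds, labels)
        if loss <= threshold:  # stop once the difference is small enough
            break
        errors = [p - y for p, y in zip(preds, labels)]
        # Gradient-descent update of every parameter from the same errors.
        for d in range(dim):
            grad = 2.0 * sum(e * row[d] for e, row in zip(errors, features)) / n
            weights[d] -= lr * grad
        bias -= lr * 2.0 * sum(errors) / n
    return weights, bias, loss


# Toy data generated from y = 2*x0 + 1*x1 (illustrative, not molecular data).
features = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
labels = [1.0, 2.0, 3.0, 0.0]
weights, bias, final_loss = train(features, labels)
```

The `max_steps` cap is a practical safeguard added for the sketch so that the loop terminates even if the threshold is never reached.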

In one possible implementation, the molecular property prediction network includes a feature encoding layer, a pooling layer, and a linear layer;

the training process of the molecular attribute prediction network comprises the following steps:

inputting the first splicing matrix into the feature coding layer and the pooling layer in sequence;

and inputting the coding vector output by the pooling layer into the linear layer, and taking the output of the linear layer as a prediction attribute value of the drug molecule to be detected.
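The pooling and linear stages can be illustrated as follows. This sketch assumes mean pooling over atoms (the text does not fix the pooling type) and uses hypothetical weights; the `encoded` matrix stands in for the output of the feature coding layer, with one row per atom.

```python
from typing import List


def mean_pool(atom_features: List[List[float]]) -> List[float]:
    """Collapse per-atom feature rows into one molecule-level vector."""
    n = len(atom_features)
    dim = len(atom_features[0])
    return [sum(row[d] for row in atom_features) / n for d in range(dim)]


def linear(vector: List[float], weights: List[float], bias: float) -> float:
    """Single-output linear layer: dot product plus bias."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


# Stand-in for the feature coding layer output: 3 atoms, 2 features each.
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(encoded)
prediction = linear(pooled, weights=[0.5, -0.25], bias=0.1)
```

Pooling makes the molecule-level vector independent of the number of atoms, so molecules of different sizes can share the same linear layer.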

In one possible implementation manner, the feature encoding layer includes N layers of feature encoders with the same structure, which are sequentially stacked, where N is a positive integer; the training process of the molecular attribute prediction network comprises the following steps:

inputting the second splicing matrix as an input feature into a first layer feature encoder of the feature encoding layer;

and taking the output of the last layer of feature encoder as the output feature of the feature encoding layer.

In one possible implementation manner, each layer of feature encoder comprises a multi-head attention layer and a feedforward neural network layer; the training process of the molecular attribute prediction network comprises the following steps:

performing feature splicing processing on the output features of each head structure to obtain combined features;

performing linear transformation processing on the combined features based on a fourth linear transformation matrix to obtain output features of the multi-head attention layer;
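The head-wise computation, splicing, and final linear transformation can be sketched in plain Python. The text does not say how a head's output is obtained from its query, key, and value sequences; the sketch assumes the standard scaled dot-product attention with a softmax, and all matrix sizes and random weights are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """One head: project to query/key/value, then scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(attn, v)


def multi_head_attention(x: Matrix, heads: Sequence[Tuple[Matrix, Matrix, Matrix]],
                         w_o: Matrix) -> Matrix:
    """Concatenate the head outputs per atom, then apply the fourth matrix."""
    outputs = [attention_head(x, *head) for head in heads]
    combined = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(combined, w_o)


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(3, 4)  # 3 atoms, 4 input features (made-up sizes)
heads = [(rand_matrix(4, 2), rand_matrix(4, 2), rand_matrix(4, 2)) for _ in range(2)]
w_o = rand_matrix(4, 4)  # plays the role of the fourth linear transformation matrix
mha_out = multi_head_attention(x, heads, w_o)
```

With two heads of width 2, the spliced feature has width 4 per atom, which `w_o` maps back to the input width so that the result can feed the feedforward layer.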

In another aspect, a computer device is provided, the device comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded and executed by the processor to implement the above-mentioned drug molecule property determination method.

In another aspect, a storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the above-mentioned drug molecule property determination method.

In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium, the computer program code being read by a processor of a computer device from the computer readable storage medium, the computer program code being executed by the processor to cause the computer device to perform the above-mentioned drug molecule property determination method.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

the embodiments of the present application provide a novel drug molecule attribute prediction scheme for use in drug research and development. When predicting drug molecule attributes, the scheme obtains three-dimensional structure information of the drug molecule to be detected. The three-dimensional structure information gives the position of each atom of the drug molecule in three-dimensional space, and the spatial structure of a drug molecule can affect its properties. Drug molecule attributes can therefore be accurately predicted based on the three-dimensional structure information, which in turn can improve the discovery speed of new candidate drugs and reduce research and development cost.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic illustration of a drug development process provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation environment related to a method for determining a drug molecule property provided in an embodiment of the present application;

FIG. 3 is a flow chart of a method for determining a drug molecule property provided in an embodiment of the present application;

FIG. 4 is a three-dimensional structure diagram of a molecule provided in an embodiment of the present application;

FIG. 5 is a three-dimensional structure obtained by subjecting the three-dimensional structure shown in FIG. 4 to random rotational and translational transformations;

FIG. 6 is a two-dimensional structural view of a benzene ring according to an embodiment of the present application;

FIG. 7 is a flow chart of a method for determining a drug molecule property provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a molecular property prediction network according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a feature encoding layer according to an embodiment of the present application;

FIG. 10 is a graph showing the results of an experiment provided in the examples of the present application;

FIG. 11 is a schematic representation of another experimental result provided in the examples of the present application;

FIG. 12 is a schematic structural diagram of a drug molecule property determination apparatus provided in an embodiment of the present application;

FIG. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

It will be understood that the terms "first," "second," and the like may be used herein to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. "At least one" means one or more; for example, at least one molecule may be any integer number of molecules greater than or equal to one, such as one molecule, two molecules, or three molecules. "A plurality" means two or more; for example, a plurality of molecules may be any integer number of molecules greater than or equal to two, such as two molecules or three molecules.

The embodiment of the application provides a method and a device for determining drug molecule attributes and a storage medium. The method relates to the field of Artificial Intelligence (AI).

AI is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.

In detail, artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning, and the like.

Machine Learning (ML) is an interdisciplinary field that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning. Deep learning is a new research direction in the field of machine learning.

Some noun terms or abbreviations referred to in the embodiments of the present application are described below.

Drug molecule attributes: include the Absorption, Distribution, Metabolism, Excretion, and Toxicity properties of drug molecules, among others.

FIG. 1 shows the main flow of drug development, which includes target identification and validation, compound screening and lead discovery, preclinical research, and clinical implementation. After target identification and validation are completed, candidate drugs need to be screened. During screening, the absorption, distribution, metabolism, excretion, toxicity, and other properties of drug molecules can be predicted by a drug molecule attribute prediction algorithm, which helps research and development personnel screen drug molecules, can greatly improve research and development efficiency, and can reduce drug research and development cost.

Simplified Molecular Input Line Entry System (SMILES): a specification that describes the structure of a molecule using American Standard Code for Information Interchange (ASCII) character strings.

A SMILES expression is a character string that describes a chemical structure. For example, the SMILES expression for cyclohexane (C6H12) is C1CCCCC1, that is, C1CCCCC1 represents cyclohexane. The SMILES expression for ethyl acetate is CC(=O)OCC, that is, CC(=O)OCC represents ethyl acetate.
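To make the notation concrete, the sketch below reads the standard SMILES strings for cyclohexane (C1CCCCC1) and ethyl acetate (CC(=O)OCC) into an atom list and a bond list, from which an adjacency matrix follows. It is a toy reader, not a SMILES parser: it assumes only single-letter uppercase atoms, parentheses for branches, single-digit ring closures, and the bond symbols `-`, `=`, and `#` (whose bond order it ignores), and it rejects everything else. The function names are hypothetical; a real system would rely on a cheminformatics toolkit.

```python
from typing import Dict, List, Tuple


def smiles_to_bonds(smiles: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return (atoms, bonds) for a tiny SMILES subset (see assumptions above)."""
    atoms: List[str] = []
    bonds: List[Tuple[int, int]] = []
    prev = None                      # atom the next atom will bond to
    branch_stack: List[int] = []     # attachment points of open branches
    ring_open: Dict[str, int] = {}   # ring-closure digit -> opening atom
    for ch in smiles:
        if ch.isalpha():
            if not ch.isupper():
                raise ValueError(f"unsupported SMILES character: {ch!r}")
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                bonds.append((ring_open.pop(ch), prev))
            else:
                ring_open[ch] = prev
        elif ch in "-=#":
            continue  # bond order is ignored in this sketch
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def adjacency_matrix(n: int, bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Symmetric 0/1 adjacency matrix of the molecular graph."""
    adj = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj


ring_atoms, ring_bonds = smiles_to_bonds("C1CCCCC1")    # cyclohexane
ester_atoms, ester_bonds = smiles_to_bonds("CC(=O)OCC")  # ethyl acetate
ring_adj = adjacency_matrix(len(ring_atoms), ring_bonds)
```

Hydrogen atoms are implicit in both strings, so only the heavy atoms appear in the atom lists.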

The following describes an implementation environment related to the drug molecule property determination scheme provided in the embodiments of the present application.

Herein, drug molecule property determination is also referred to as drug molecule property prediction (Molecular Property Prediction).

Referring to fig. 2, the implementation environment includes: a first computer device 201 and a second computer device 202.

Illustratively, the first computer device 201 may be used to train a molecular property prediction network, and the second computer device 202 may utilize the molecular property prediction network trained by the first computer device 201 to predict drug molecular properties. Of course, the first computer device 201 and the second computer device 202 may also be the same device, that is, the device may predict the drug molecule attribute based on the neural network model after the neural network model is trained, which is not specifically limited in this embodiment of the present application.

In an example, the first computer device 201 is a server, and the second computer device 202 is a terminal.

Exemplarily, in such a scenario, the terminal is provided with a related application and sends the SMILES expression of the drug molecule to be tested to the server through that application. Based on the received SMILES expression, the server obtains the three-dimensional structure information, two-dimensional structure information, atomic features, and chemical bond features of the drug molecule to be tested, and predicts the drug molecule attribute using the drug molecule attribute prediction algorithm provided in the embodiments of the present application (that is, by calling the molecular attribute prediction network). The server then returns the predicted value output by the molecular attribute prediction network to the terminal through the related application, and the terminal presents the prediction result to the user.

The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. In addition, the terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited specifically herein.

Alternatively, the drug molecule property prediction scheme provided in the embodiment of the present application can also be completed independently and locally by the terminal. That is, the implementation environment shown in fig. 2 may include only the terminal.

Exemplarily, in such a scenario, a relevant application is installed on the terminal. Based on the SMILES expression of the drug molecule to be detected, the terminal obtains the three-dimensional structure information, two-dimensional structure information, atomic characteristics and chemical bond characteristics of the drug molecule to be detected, predicts the drug molecule attribute by using the drug molecule attribute prediction algorithm provided in the embodiment of the present application (i.e., by calling the molecular attribute prediction network), and presents the prediction result to the user.

In summary, the drug molecule attribute prediction scheme provided in the embodiment of the present application may be executed jointly by the terminal and the server, or may be executed independently by the terminal, which is not specifically limited in the embodiment of the present application.

Based on the above implementation environment, the drug molecule property prediction scheme provided in the embodiment of the present application has two aspects. On the one hand, three-dimensional structure information of molecules is introduced and a data enhancement (DA) method based on this information is provided, which improves the accuracy of molecular attribute prediction. On the other hand, a Transformer model from the natural language processing field is introduced, that is, a new method for applying the Transformer model in the molecular attribute prediction field is provided; this leverages the strong expressive capability of the Transformer model and further improves the accuracy of molecular attribute prediction.

It should be noted that the drug property prediction scheme provided by the embodiment of the present application can be applied to a drug development process to predict the drug property of a drug molecule, so as to improve the speed of discovering a new candidate drug and reduce the development cost.

The drug property prediction schemes provided in the examples of the present application are explained in detail by the following examples.

Fig. 3 is a flowchart of a method for determining a drug molecule property according to an embodiment of the present disclosure. The method is executed by a computer device, which may comprise only a terminal, or may comprise both a terminal and a server. Referring to fig. 3, a method flow provided by the embodiment of the present application includes:

301. acquiring a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected.

In the embodiments of the present application, the drug molecule to be detected refers to a drug molecule whose molecular property is to be predicted.

Illustratively, the text string referred to above is a SMILES expression. A SMILES expression describes a chemical structure with a string of characters by converting the chemical structure of the molecule into a spanning tree. During the conversion, hydrogen atoms are generally removed and rings are opened. In the representation, the atoms at the two ends of a broken ring bond are marked with the same number, and branches are written in parentheses.

To summarize, the transformation rules are: hydrogen atoms are omitted; single bonds between adjacent atoms are generally not written; double bonds are represented by "=" and triple bonds by "#"; the chemical structure is written out as a single chain, with side chains placed in parentheses immediately after the atom they are attached to.
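The transformation rules above can be illustrated by splitting a SMILES string into its symbols. The following is a minimal sketch and not a full SMILES parser: the names `tokenize_smiles` and `describe` are hypothetical, and only a small subset of the notation is assumed (common organic atoms, bracket atoms, "=", "#", parentheses, and single-digit ring-closure labels).

```python
import re

# Assumed subset of SMILES: bracket atoms, two-letter halogens, common
# organic atoms (upper case = aliphatic, lower case = aromatic),
# double/triple bond symbols, branch parentheses, one-digit ring labels.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#()]|\d")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; reject anything outside the subset."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    return tokens


def describe(token):
    """Name the role a token plays under the rules described in the text."""
    if token == "=":
        return "double bond"
    if token == "#":
        return "triple bond"
    if token == "(":
        return "branch start"
    if token == ")":
        return "branch end"
    if token.isdigit():
        return "ring-closure label"
    return "atom"


# Acetic acid: the side chain "=O" sits in parentheses right after the
# carbon it is attached to; hydrogens and single bonds are left implicit.
acetic_acid = [(t, describe(t)) for t in tokenize_smiles("CC(=O)O")]
```

In practice a cheminformatics toolkit would handle the complete grammar; the sketch only makes the roles of the individual symbols visible.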

302. And acquiring the three-dimensional structure information of the drug molecules to be detected according to the text character strings of the drug molecules to be detected.

The embodiment of the application provides a data enhancement method based on three-dimensional structure information of drug molecules.

Illustratively, the three-dimensional structure information of the drug molecule to be detected is a three-dimensional structure coordinate of the drug molecule to be detected.

3021. And acquiring the three-dimensional structure coordinates of the drug molecules to be detected according to the text character strings of the drug molecules to be detected.

As an example, in the embodiment of the present application, the three-dimensional structure coordinates (x, y, z) of each atom in the drug molecule to be detected are obtained through the software RDKit, and the obtaining process is as follows. Namely, the method for acquiring the three-dimensional structure coordinates of the drug molecules to be detected according to the text character strings comprises the following steps:

step a, acquiring a chemical structural formula of a drug molecule to be detected according to a text character string of the drug molecule to be detected.

In this step, the molecular representation of the drug molecule to be detected is obtained from its SMILES expression by applying the inverse of the transformation rules described in step 301, and hydrogen atoms are added back.

Step b, determining M three-dimensional structures with different configurations according to the chemical structural formula of the drug molecule to be detected.

Illustratively, the value of M is 10, i.e., three-dimensional structures with 10 different configurations (conformers) are obtained. The spatial configuration of a molecule refers to the geometrical shape formed by the spatial distribution of the groups or atoms in the molecule. The atoms in a molecule are not stacked together at random, but are combined into a whole according to certain rules, so that the molecule presents a certain geometric shape (namely, a configuration) in space.

In order to avoid very similar configurations, in one possible implementation, the following condition is also satisfied between any two three-dimensional structures having different configurations: the RMSD (Root Mean Square Deviation) is greater than a first threshold. The first threshold may be 0.5 angstroms, which is not specifically limited in the embodiments of the present application.
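The RMSD condition can be sketched as a greedy filter over candidate structures. This is an illustration under stated assumptions rather than the procedure of the embodiment: the function names are hypothetical, each structure is a list of (x, y, z) tuples in the same atom order, and the structures are assumed to be already superimposed (a real pipeline would align them before measuring the deviation).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    Assumption: both structures are already superimposed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        (pa - pb) ** 2
        for a, b in zip(coords_a, coords_b)
        for pa, pb in zip(a, b)
    )
    return math.sqrt(squared / len(coords_a))


def keep_distinct(conformers, threshold=0.5):
    """Greedily keep conformers whose RMSD to every kept one exceeds threshold."""
    kept = []
    for candidate in conformers:
        if all(rmsd(candidate, other) > threshold for other in kept):
            kept.append(candidate)
    return kept
```

With a threshold of 0.5 angstroms, a candidate that deviates from an already kept structure by only 0.1 angstroms would be discarded, while one that deviates by 1.0 angstrom would be kept.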

Step c, performing energy minimization treatment on the M three-dimensional structures in the target molecular force field.

As an example, the target Molecular Force Field is MMFF94(Merck Molecular Force Field 94), which is not specifically limited in this embodiment.

Taking M = 10 as an example, the embodiment of the present application uses the MMFF94 force field to optimize, that is, to minimize the energy of, each of the 10 three-dimensional structures with different configurations obtained in step b.

Step d, determining a target three-dimensional structure with the minimum energy among the M three-dimensional structures; and removing hydrogen atoms in the target three-dimensional structure to obtain the three-dimensional structure of the drug molecule to be detected.

Taking the value of M as 10 as an example, in the embodiment of the present application, the three-dimensional structure with the lowest energy (referred to as a target three-dimensional structure herein) is selected from the optimized three-dimensional structures with 10 configurations as the three-dimensional structure of the drug molecule to be detected, and the hydrogen atom in the three-dimensional structure is removed.

Step e, acquiring the three-dimensional coordinates of each atom of the drug molecule to be detected under this three-dimensional structure, thereby obtaining the three-dimensional structure coordinates of the drug molecule to be detected.

After the three-dimensional structure coordinates of the drug molecule to be detected are obtained, the following step 3022 is performed before the coordinates are input into the neural network model, so as to achieve data enhancement.
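Steps d and e can be sketched as follows, assuming the energy-minimized structures from step c are already available. The data layout (a dict with "energy" and "atoms" keys), the function name, and the numeric values are hypothetical and chosen only for illustration; in the embodiment these quantities would come from RDKit and the MMFF94 force field.

```python
def lowest_energy_heavy_atom_coords(conformers):
    """Pick the minimum-energy conformer and return its non-hydrogen coordinates.

    Each conformer is a hypothetical dict of the form
    {"energy": float, "atoms": [(element, (x, y, z)), ...]}.
    """
    if not conformers:
        raise ValueError("at least one conformer is required")
    # Step d: the target structure is the one with the lowest energy.
    target = min(conformers, key=lambda conf: conf["energy"])
    # Steps d/e: drop hydrogens and keep the remaining atom coordinates.
    return [xyz for element, xyz in target["atoms"] if element != "H"]


# Illustrative values only, not measured energies or real geometries.
candidates = [
    {"energy": 12.4, "atoms": [("C", (0.0, 0.0, 0.0)), ("H", (1.1, 0.0, 0.0))]},
    {"energy": 9.8, "atoms": [("C", (0.1, 0.0, 0.0)), ("H", (1.2, 0.0, 0.0))]},
]
coords = lowest_energy_heavy_atom_coords(candidates)
```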

3022. And on the premise that the three-dimensional structure shape of the drug molecule to be detected is not changed, transforming the current three-dimensional structure coordinate of the drug molecule to be detected to obtain a three-dimensional structure coordinate matrix of the drug molecule to be detected.

Illustratively, the transformation process includes, but is not limited to, a random rotation process and a translation process.

Correspondingly, the current three-dimensional structure coordinate of the drug molecule to be detected is subjected to transformation processing, and the transformation processing comprises the following steps:

acquiring a random rotation matrix and a translation transformation matrix; respectively carrying out random rotation and translation transformation on the three-dimensional structure of the drug molecules to be detected according to the random rotation matrix and the translation transformation matrix on the premise that the three-dimensional structure shape of the drug molecules to be detected is kept unchanged to obtain a three-dimensional structure coordinate matrix; wherein, the three-dimensional structure coordinate matrix comprises a new three-dimensional structure coordinate of the drug molecule to be detected.

In other words, this step randomly rotates and translates the three-dimensional structure determined in step 3021 using a random rotation matrix and a translation matrix, respectively, and ensures that the shape of the three-dimensional structure of the drug molecule to be tested remains unchanged.

FIG. 4 shows the three-dimensional structure of the molecule Norbormide (C33H25N3O3); after random rotation and translation, the result shown in FIG. 5 is obtained. A comparison of fig. 4 and fig. 5 shows that the three-dimensional structure coordinates of the molecule are changed, but the three-dimensional structure shape of the molecule remains unchanged.
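The data enhancement of step 3022 can be sketched as a rigid transformation, which changes coordinates but preserves all interatomic distances. The sketch below is one possible realization under assumptions: the function names and the `max_shift` parameter are hypothetical, the rotation is built with Rodrigues' formula from a random axis and angle (not necessarily uniformly distributed over all rotations), and the text does not specify how the random matrices are drawn.

```python
import math
import random


def random_rotation_matrix(rng):
    """Rotation by a random angle about a random axis (Rodrigues' formula)."""
    while True:
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        if norm > 1e-12:
            break
    x, y, z = (a / norm for a in axis)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]


def augment(coords, rng, max_shift=1.0):
    """Apply one random rotation and one random translation to all atoms."""
    rot = random_rotation_matrix(rng)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [
        tuple(sum(rot[i][j] * p[j] for j in range(3)) + shift[i] for i in range(3))
        for p in coords
    ]


def pairwise_distances(coords):
    """All interatomic distances; these define the shape of the structure."""
    return [
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    ]
```

Comparing `pairwise_distances` before and after `augment` is a simple way to check that the shape of the three-dimensional structure is unchanged.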

303. And determining the drug forming property of the drug molecules to be detected according to the three-dimensional structure information of the drug molecules to be detected.

In one possible implementation manner, the three-dimensional structure information of the drug molecules to be detected is input into a molecular attribute prediction network, and the molecular attribute prediction network is called to determine the drug forming attributes of the drug molecules to be detected.

Namely, the method for determining the drug forming property of the drug molecule to be detected according to the three-dimensional structure information comprises the following steps:

inputting the three-dimensional structure coordinate matrix of the drug molecule to be detected into a molecule attribute prediction network to obtain a prediction attribute value output by the molecule attribute prediction network; wherein, the output prediction attribute value is used for indicating the drug forming attribute of the drug molecule to be detected.

According to the method provided by the embodiment of the application, in the process of drug research and development, a new drug molecule attribute prediction scheme is provided, and when the drug molecule attribute is predicted, three-dimensional structure information of a drug molecule to be detected is obtained, wherein the three-dimensional structure information of the drug molecule gives the position distribution of each atom in the drug molecule in a three-dimensional space, and the spatial structure of the drug molecule can influence the property of the drug molecule, so that the drug molecule attribute can be accurately predicted based on the three-dimensional structure information of the drug molecule, and further the discovery speed of a new candidate drug can be improved and the research and development cost can be reduced.

In one embodiment, the three-dimensional structure information of the drug molecule to be detected is obtained through the steps 3021 and 3022; in addition, the two-dimensional structure information of the drug molecule to be detected is obtained in the embodiment of the present application. Illustratively, the two-dimensional structure information is an adjacency matrix of the two-dimensional structure graph of the molecule. That is, the step 302 further includes: acquiring the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics of the drug molecule to be detected according to the text character string of the drug molecule to be detected.

3023. Acquiring an adjacency matrix of a two-dimensional structure chart of the drug molecules to be detected according to the text character strings of the drug molecules to be detected; and (3) carrying out normalization processing on the adjacent matrix of the two-dimensional structure chart of the drug molecules to be detected to obtain the normalized adjacent matrix of the drug molecules to be detected.

Illustratively, the SMILES expression may be imported and converted into a two-dimensional structure diagram by most molecular editing software. The conversion into the two-dimensional Structure Diagram may use a Structure Diagram Generation Algorithm (SDGA), which is not specifically limited in this embodiment.

In a possible implementation manner, normalizing the adjacency matrix of the two-dimensional structure diagram to obtain a normalized adjacency matrix includes: transforming the value of the diagonal element of the adjacency matrix from a first value to a second value to obtain a new adjacency matrix; and normalizing the new adjacency matrix according to the rows to obtain a normalized adjacency matrix. The value of the first value may be 0, and the value of the second value may be 1, which is not specifically limited in this embodiment of the application.

As an example, take a benzene ring (SMILES: c1ccccc1). FIG. 6 shows the two-dimensional structure of the benzene ring, which contains 6 carbon atoms; its adjacency matrix is a 6 × 6 matrix in which an element is 1 if the two corresponding carbon atoms are bonded and 0 otherwise.

Atom self-connections (each atom is connected to itself) are added on the basis of the adjacency matrix, that is, the values on the diagonal of the adjacency matrix are changed from 0 to 1. Finally, in order to facilitate data processing, the resulting matrix is normalized by rows to obtain the normalized adjacency matrix. Illustratively, the normalization converts each matrix element to a decimal between 0 and 1; for example, if each row is divided by its sum, every non-zero element of the benzene matrix becomes 1/3.
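The benzene example can be worked through in a few lines. This is a minimal sketch with hypothetical function names; it assumes the ring atoms are numbered consecutively and that row normalization means dividing each row by its sum, which the text does not state explicitly.

```python
def ring_adjacency(n_atoms):
    """Adjacency matrix of a simple ring: atom i is bonded to i-1 and i+1."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        j = (i + 1) % n_atoms
        adj[i][j] = adj[j][i] = 1
    return adj


def normalize_adjacency(adj):
    """Set the diagonal to 1 (self-connections), then divide each row by its sum."""
    n = len(adj)
    with_self = [[1 if i == j else adj[i][j] for j in range(n)] for i in range(n)]
    return [[value / sum(row) for value in row] for row in with_self]


benzene = normalize_adjacency(ring_adjacency(6))
# Each carbon has two ring neighbours plus itself, so every non-zero
# entry becomes 1/3 and every row sums to 1.
```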

3024. And acquiring the atomic characteristics and the chemical bond characteristics of the drug molecules to be detected according to the text character strings of the drug molecules to be detected.

For this step, the atomic characteristics and the chemical bond characteristics of the drug molecule to be detected can be obtained from its text string through the RDKit software, which is not specifically limited in the embodiment of the present application.

Illustratively, the above step 303 may be replaced by: and determining the drug forming property of the drug molecules to be detected according to the three-dimensional structure information, the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics of the drug molecules to be detected.

In one possible implementation manner, the three-dimensional structure information, the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics of the drug molecules to be detected are input into a molecular attribute prediction network, and the molecular attribute prediction network is called to determine the drug formation attributes of the drug molecules to be detected. Namely, the method for determining the drug forming property of the drug molecule to be detected according to the three-dimensional structure information, the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics comprises the following steps:

3031. and performing characteristic splicing treatment on the three-dimensional structure coordinate matrix, the normalized adjacency matrix, the atomic characteristics and the chemical bond characteristics of the drug molecules to be detected to obtain a first splicing matrix.

The concat function may be used to perform feature concatenation, which is not specifically limited in this embodiment of the present application. The resulting mosaic matrix is also referred to herein as the first mosaic matrix.
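Feature splicing can be sketched as column-wise concatenation of per-atom blocks. The layout is an assumption: the text does not specify how the four inputs are arranged, so this sketch takes every block, including the chemical bond characteristics, as a matrix with one row per atom. The function name and all numeric values are illustrative only.

```python
def concat_features(*blocks):
    """Concatenate per-atom feature blocks column-wise (row i describes atom i)."""
    n_atoms = len(blocks[0])
    if any(len(block) != n_atoms for block in blocks):
        raise ValueError("every block needs one row per atom")
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_atoms)
    ]


# Hypothetical two-atom example (values chosen for illustration).
coords = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]]   # 3D structure coordinates
norm_adj = [[0.5, 0.5], [0.5, 0.5]]           # normalized adjacency matrix
atom_feats = [[6.0], [8.0]]                   # e.g. atomic number
bond_feats = [[2.0], [2.0]]                   # e.g. bond order seen by each atom
first_splicing_matrix = concat_features(coords, norm_adj, atom_feats, bond_feats)
```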

3032. Inputting the first splicing matrix of the drug molecules to be detected into a molecular attribute prediction network to obtain a prediction attribute value output by the molecular attribute prediction network; wherein, the output prediction attribute value is used for indicating the drug forming attribute of the drug molecule to be detected.

Illustratively, the drug-forming properties of a drug molecule include, but are not limited to: absorption, distribution, metabolism, excretion, toxicity, and the like. The output prediction attribute value can give the prediction value of each drug forming attribute of the drug molecules to be detected. Assuming that the attribute value of each drug property ranges from 0 to 10, taking toxicity as an example, 0 represents no toxicity, and 10 represents the highest toxicity.

Fig. 7 shows one possible structure of a molecular property prediction network. Referring to fig. 7, the molecular property prediction network includes a feature encoding Layer 701, a pooling Layer 702, and a Linear Layer (Linear Layer) 703.

Illustratively, the feature coding layer 701 introduces a Transformer model in the field of natural language processing, that is, the embodiment of the present application provides a new method for applying the Transformer model in the field of molecular property prediction. In the embodiment of the application, the three-dimensional structure information, the two-dimensional structure information, the atomic characteristics and the chemical bond characteristics of the drug molecules to be detected are obtained, and the characteristics are spliced to be used as the input of the characteristic coding layer 701, so that the method greatly improves the prediction accuracy of the molecular properties.

In a possible implementation manner, the pooling layer 702 may be an average pooling layer, and the linear layer 703 may include several linear layers, which is not specifically limited in this embodiment.
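The roles of the pooling layer 702 and the linear layer 703 can be sketched as follows. This is a simplified illustration with hypothetical function names: average pooling collapses the per-atom encodings into one molecule-level vector, and a single linear layer maps that vector to the predicted attribute values. The weights here are placeholders; in the embodiment they would be learned.

```python
def mean_pool(atom_encodings):
    """Average the per-atom encodings into a single molecule vector."""
    n_atoms = len(atom_encodings)
    return [sum(column) / n_atoms for column in zip(*atom_encodings)]


def linear(vector, weights, bias):
    """Compute W x + b, with weights given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Two atoms with 2-dimensional encodings, one predicted attribute.
molecule_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
prediction = linear(molecule_vector, weights=[[1.0, 1.0]], bias=[0.5])
```

Because the pooled vector has a fixed length regardless of the number of atoms, molecules of different sizes can share the same linear layer.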

Illustratively, the three-dimensional structure coordinates, the normalized adjacency matrix, the atomic features and the chemical bond features of the drug molecule to be detected are spliced and input into the neural network model. After the input data passes through the feature coding layer 701 of the neural network model, the atom encodings of the drug molecule to be detected are obtained (the neural network model encodes the features of the bonds around each atom into that atom's encoding).

According to the method provided by the embodiment of the application, a new drug molecule attribute prediction scheme is provided for the drug research and development process. When the drug molecule attribute is predicted, the three-dimensional structure information, two-dimensional structure information, atomic characteristics and chemical bond characteristics of the drug molecule to be detected are obtained; by drawing on information from multiple aspects, the drug molecule attribute can be predicted accurately, which in turn can increase the discovery speed of new candidate drugs and reduce research and development cost. In addition, a Transformer model from the field of natural language processing is introduced and a new method for applying it in the field of molecular attribute prediction is provided; this leverages the strong expressive capability of the Transformer model and further improves the accuracy of molecular attribute prediction.

Fig. 8 is a flowchart of a method for determining a drug molecule property according to an embodiment of the present disclosure. The method is executed by a computer device, which may comprise only a terminal, or may comprise both a terminal and a server. Aiming at the problem of drug molecule attribute prediction in the drug research and development process, the embodiment of the application provides a drug molecule attribute prediction scheme, which can efficiently predict properties of drug molecules such as ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) and help drug researchers to screen and design drug molecules. Referring to fig. 8, a method flow provided by the embodiment of the present application includes:

801. acquiring a training data set, wherein the training data set comprises sample molecules and attribute labels matched with the sample molecules; acquiring a three-dimensional structure coordinate matrix, a normalized adjacency matrix, an atomic characteristic and a chemical bond characteristic of a sample molecule; and performing characteristic splicing treatment on the three-dimensional structure coordinate matrix, the normalized adjacency matrix, the atomic characteristics and the chemical bond characteristics of the sample molecules to obtain a second spliced matrix.

The step can be executed with reference to the step 302 described above, which is not described herein again.

802. And taking the second splicing matrix as the input of the initial neural network, and taking the attribute label matched with the sample molecule as the output of the initial neural network to train the initial neural network.

The attribute label of the sample molecule is the true value of the drug-forming attribute of the sample molecule.

As can be seen from fig. 7, in the scheme for predicting the drug molecular property, the forward process of model training (feed forward) includes the following steps:

8021. and acquiring a three-dimensional structure coordinate matrix of the sample molecules, the characteristics of each atom in the sample molecules, the characteristics of each chemical bond in the sample molecules and an adjacent matrix of a two-dimensional structure chart of the sample molecules according to the SMILES expression of the sample molecules.

8022. Carrying out random rotation and translation transformation on the three-dimensional structure of the sample molecule to realize data enhancement; normalizing the adjacent matrix of the two-dimensional structure chart of the sample molecules; and performing characteristic splicing on the processed three-dimensional structure coordinate matrix, the adjacent matrix of the two-dimensional structure chart, the characteristics of each atom in the sample molecule and the characteristics of each chemical bond in the sample molecule.

8023. The spliced matrix (referred to as a second splicing matrix in the present document) is used as input data of the neural network model, the input data is input into the neural network model, and the coding vectors of the sample molecules are obtained through the feature coding layer 701 and the pooling layer 702 in the neural network model.

The neural network model involved in this step is the initial neural network involved in step 802 described above.

8024. The encoding vector of the sample molecule passes through the linear layer 703 to obtain the final output of the neural network model, and the output value is the predicted value of the drug-forming attribute of the sample molecule.

803. Obtaining a difference value between the predicted attribute value output by the initial neural network and the attribute label of the sample molecule based on a target loss function; and, in response to the difference value being greater than a second threshold, iteratively updating the network parameters of the initial neural network until the difference value is not greater than the second threshold, to obtain the molecular attribute prediction network.

In the model training process, a loss function is usually used to determine whether the model converges. The loss function may be a cross entropy loss function, which is not specifically limited in this embodiment of the present application. Typically, a loss function is used to calculate the degree of difference between the predicted value of the model output and the attribute label.

When the predicted value of the model output is determined to be matched with the attribute label based on the loss function, for example, when the difference degree between the two is smaller than a second threshold value, the two are considered to be matched, and the training is finished. Or, the training may also be ended after the number of training iterations reaches a preset number, which is not specifically limited in the embodiment of the present application.

Illustratively, the embodiment of the present application compares the predicted value of the drug-forming attribute of the sample molecule obtained in the forward calculation (Forward) process with the true value to obtain a difference value, takes the difference value as the loss function (Loss Function) of the neural network model, calculates the gradient of each network layer in the backward calculation (Backward) process, and updates the network parameters of the neural network model by using the Adaptive Moment Estimation (Adam) algorithm.

As an example, in the embodiment of the present application, the Encoder (encoding module) portion of a Transformer model is used in the feature encoding layer 701, where a schematic structural diagram of the Encoder is shown in fig. 9.

That is, the Encoder includes N layers of feature encoders with the same structure, stacked in sequence, where N is a positive integer. Feature encoding in the embodiment of the present application includes: inputting the second splicing matrix as an input feature into the first-layer feature encoder of the Encoder; encoding the input feature in turn through each stacked feature encoder until the last layer, wherein the output of the previous-layer feature encoder is used as the input of the next-layer feature encoder; and taking the output of the last-layer feature encoder as the output feature of the Encoder.
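The training loop of steps 802 and 803 (forward pass, loss, gradient, Adam update, and the two stopping criteria) can be sketched on a deliberately tiny model. Everything model-specific here is an assumption for illustration: a single weight `w` stands in for the network, mean squared error stands in for the target loss function, and the hyperparameter defaults are the commonly used Adam values rather than values given in the text.

```python
import math


def train(xs, ys, threshold=1e-3, max_steps=5000, lr=0.1,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y = w * x with Adam; stop on small loss or after max_steps."""
    w, m, v = 0.0, 0.0, 0.0
    for step in range(1, max_steps + 1):
        # Forward pass: mean squared difference between prediction and label.
        errors = [w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        if loss <= threshold:
            # Difference no longer exceeds the threshold: training ends.
            return w, loss, step - 1
        # Backward pass: gradient of the loss with respect to w.
        grad = 2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        # Adam update with bias-corrected first and second moment estimates.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** step)
        v_hat = v / (1.0 - beta2 ** step)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    # Iteration budget exhausted: report the final loss.
    errors = [w * x - y for x, y in zip(xs, ys)]
    return w, sum(e * e for e in errors) / len(xs), max_steps
```

The function returns the fitted weight, the final loss, and the number of updates performed, so a caller can tell whether training ended because of the threshold or because of the iteration limit.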

In another possible implementation manner, the attention mechanism may be combined into a natural language processing task, a network model combined with the attention mechanism highly focuses on feature information of a specific target in a training process, and network parameters can be effectively adjusted for different targets to mine more hidden feature information.

The attention (Attention) mechanism originates from research on human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on a portion of all available information while ignoring the rest. This is commonly referred to as the attention mechanism. The attention mechanism is a brain signal processing mechanism characteristic of human vision. By rapidly scanning the global image, human vision identifies a target area that needs particular attention, namely the focus of attention, and then devotes more attention resources to that area to obtain more detailed information about the target while suppressing other useless information.

In summary, the attention mechanism has two main aspects: firstly, determining which part of the input needs to be concerned; the second is to allocate limited information processing resources to important parts. The attention mechanism in deep learning is similar to the selective visual attention mechanism of human beings in nature, and the core goal is to select more critical information for the current task from a plurality of information.

As an example, each layer of feature encoder includes a multi-head attention layer and a feedforward neural network layer; that is, the feature encoder uses a multi-head attention mechanism. Correspondingly, encoding the input features in turn through each stacked feature encoder includes:

8031. For an ith head structure of a multi-head attention layer contained in a jth layer feature encoder, acquiring a first linear transformation matrix, a second linear transformation matrix and a third linear transformation matrix corresponding to the ith head structure; wherein i and j are positive integers, and 1 ≤ j ≤ N.

Herein, the first linear transformation matrix, the second linear transformation matrix and the third linear transformation matrix are denoted by the symbols $W_i^Q$, $W_i^K$ and $W_i^V$, respectively.

8032. Performing linear transformation processing on the input characteristics of the ith head structure according to the first transformation matrix, the second transformation matrix and the third transformation matrix respectively to obtain a query sequence, a key sequence and a value sequence of the ith head structure in sequence; and acquiring the output characteristics of the ith head structure according to the query sequence, the key sequence and the value sequence of the ith head structure.

First, the input feature of the ith head structure is multiplied by $W_i^Q$, $W_i^K$ and $W_i^V$ respectively, to obtain in turn the query sequence $Q_i$, the key sequence $K_i$ and the value sequence $V_i$ of the ith head structure.

Then, the output feature $Z_i$ of the ith head structure is calculated based on the query sequence $Q_i$, the key sequence $K_i$ and the value sequence $V_i$ of the ith head structure:

$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $d_k$ denotes the dimension of the key sequence $K_i$.
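The computation of a single head structure can be sketched in plain Python. This assumes the standard scaled dot-product attention of the Transformer model; the helper names are hypothetical, matrices are lists of rows, and the weights passed in are placeholders for the trained matrices.

```python
import math


def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])                        # dimension of each key vector
    k_t = [list(col) for col in zip(*k)]   # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors, with larger weights on the positions whose keys best match the query.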

8033. Performing feature splicing on the output features of the head structures to obtain a combined feature.

The feature splicing may be performed by a concat() method to obtain the combined feature Z.

Expressed as a formula: $Z = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_m)\,W^O$, where m is the number of head structures, $\mathrm{head}_i$ denotes the output feature of the ith head structure, and $W^O$ is the fourth linear transformation matrix applied in step 8034.

8034. Performing linear transformation on the combined feature based on a fourth linear transformation matrix to obtain the output feature of the multi-head attention layer.

The fourth linear transformation matrix may be denoted herein by the symbol $W^O$. The matrices $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ may be randomly initialized and then obtained through training, which is not specifically limited in the embodiments of the present application.
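Steps 8033 and 8034 can be sketched as follows, assuming the per-head output features are already available as row-major matrices with one row per atom. The helper names `concat_heads` and `matmul` and the toy values are illustrative assumptions only.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_heads(heads):
    """Splice the per-head outputs along the feature axis, row by row."""
    n_rows = len(heads[0])
    return [[value for head in heads for value in head[r]] for r in range(n_rows)]

# Two heads (m = 2), each producing a 2 x 2 output for 2 atoms (assumed values).
heads = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
# Fourth linear transformation matrix W^O, here a fixed 4 x 2 toy matrix;
# in practice it would be learned during training.
w_o = [[1, 0], [0, 1], [1, 0], [0, 1]]

combined = concat_heads(heads)   # [[1, 2, 5, 6], [3, 4, 7, 8]]
output = matmul(combined, w_o)   # [[6, 8], [10, 12]]
```

Splicing widens each atom's feature row from 2 to 4 entries, and the product with `w_o` maps it back to the width expected by the next layer.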

8035. Inputting the output feature of the multi-head attention layer into the feedforward neural network layer, and taking the output of the feedforward neural network layer as the input feature of the (j+1)th feature encoder layer.

Illustratively, the feedforward neural network layer may apply two linear transformations and one nonlinear transformation to the output feature of the multi-head attention layer, which is not specifically limited in the embodiments of the present application.
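A minimal sketch of such a feedforward layer is given below. It assumes ReLU as the nonlinear transformation, placed between the two linear transformations; the embodiments do not name a specific nonlinearity, and the weights shown are toy values.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def feed_forward(x, w1, b1, w2, b2):
    """Linear transformation, ReLU (assumed), then a second linear transformation."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# One atom with two features (assumed toy values).
x = [[1.0, -2.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0], [1.0]]
b2 = [0.5]
y = feed_forward(x, w1, b1, w2, b2)  # [[1.5]]
```

The negative feature is zeroed by the ReLU, so only the first feature contributes to the output.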

804. Acquiring a text character string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected.

This step can be performed with reference to step 301 described above.

805. Acquiring a three-dimensional structure coordinate matrix, a normalized adjacency matrix, atomic features and chemical bond features of the drug molecule to be detected according to the text character string of the drug molecule to be detected; and performing feature splicing on the three-dimensional structure coordinate matrix, the normalized adjacency matrix, the atomic features and the chemical bond features of the drug molecule to be detected to obtain a first splicing matrix.

This step can be performed with reference to step 302 described above.
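The feature splicing in step 805 can be sketched as a row-wise concatenation. The sketch assumes that all four inputs are arranged with one row per atom, including the chemical bond features (for example, aggregated over the bonds of each atom); this layout, the function name `splice_features` and the numeric values are assumptions made for illustration.

```python
def splice_features(coords, norm_adj, atom_feats, bond_feats):
    """Concatenate the per-atom rows of the four inputs into one matrix."""
    n = len(coords)
    if not all(len(m) == n for m in (norm_adj, atom_feats, bond_feats)):
        raise ValueError("all inputs need exactly one row per atom")
    return [coords[i] + norm_adj[i] + atom_feats[i] + bond_feats[i] for i in range(n)]

# Toy two-atom molecule (all values are placeholders, not real chemistry).
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]   # n x 3 coordinate matrix
norm_adj = [[0.5, 0.5], [0.5, 0.5]]           # n x n normalized adjacency matrix
atom_feats = [[6.0], [8.0]]                   # n x 1 atomic features
bond_feats = [[1.0], [1.0]]                   # n x 1 bond features per atom

first_splicing_matrix = splice_features(coords, norm_adj, atom_feats, bond_feats)
```

Each row of `first_splicing_matrix` then has 3 + 2 + 1 + 1 = 7 entries describing one atom.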

806. Inputting the first splicing matrix of the drug molecules to be detected into the trained molecular attribute prediction network to obtain a prediction attribute value output by the molecular attribute prediction network; wherein, the output prediction attribute value is used for indicating the drug forming attribute of the drug molecule to be detected.

This step can be performed with reference to step 303 described above.

The method provided by the embodiment of the application has at least the following beneficial effects:

On the one hand, the embodiments of the present application introduce three-dimensional structure information of molecules and provide a data enhancement method based on that information, which improves the accuracy of molecular attribute prediction. On the other hand, the Transformer model from the natural language processing field is introduced, providing a new way of applying the Transformer model to molecular attribute prediction; this leverages the strong expressive capability of the Transformer model and further improves the accuracy of molecular attribute prediction.

To sum up, the three-dimensional structure information, the two-dimensional structure information, the atomic features and the bond features of the drug molecule to be detected are obtained and spliced to serve as the input data of the Transformer model, which greatly improves the prediction accuracy of drug molecule attributes.

Illustratively, the drug molecule attribute prediction scheme provided in the embodiments of the present application was compared experimentally with the drug molecule attribute prediction schemes provided in the related art on the standard data set MoleculeNet, and the experimental results are shown in Fig. 10 and Fig. 11.

For ROC-AUC (the area under the receiver operating characteristic curve), a larger value is better; for RMSE (root mean square error), a smaller value is better.
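For reference, the two metrics can be computed as in the following sketch. The ROC-AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half); the function names and sample values are illustrative assumptions and are not taken from the experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])        # sqrt(4 / 3)
```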

Fig. 10 shows the experimental results of different prediction schemes on the classification data sets, where the data sets are split by the Scaffold method and the 3 algorithms compared are RF on Morgan (a random forest algorithm based on Morgan molecular fingerprints), D-MPNN (a graph neural network) and the prediction scheme provided by the embodiments of the present application. Fig. 11 shows the experimental results of the 3 algorithms on the regression data sets; similarly, it can be clearly seen that the prediction scheme provided by the embodiments of the present application also performs better than the other prediction schemes on the three regression data sets.
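A scaffold-based split keeps molecules that share a core structure in the same subset, so that the test set contains scaffolds unseen during training. The sketch below assumes that a scaffold key for each molecule has already been computed by a cheminformatics toolkit, and it uses an 80/10/10 split filled from the largest scaffold groups first; these fractions and the function name `scaffold_split` are assumptions, since the embodiments do not give the split details.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train, validation or test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first, ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold keys for 10 molecules.
scaffolds = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
train, valid, test = scaffold_split(scaffolds)
```

Because groups are never divided, the realized subset sizes can deviate from the requested fractions when scaffold groups are large.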

In the embodiments of the present application, the data enhancement method based on the three-dimensional structure information of drug molecules is applied to a Transformer model; in actual implementation, the Transformer model may be replaced with another neural network model (for example, a graph neural network). Likewise, although an average pooling layer is used as the atomic information aggregator, in actual implementation it may be replaced with another aggregator such as a max pooling layer or Set2Set, which is not specifically limited in the embodiments of the present application.
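The difference between the average pooling and max pooling aggregators can be sketched as follows, assuming the encoder outputs one feature row per atom and the aggregator reduces these rows to a single molecule-level vector. The function names and values are illustrative assumptions.

```python
def mean_pool(atom_features):
    """Average each feature column over all atoms."""
    n = len(atom_features)
    return [sum(col) / n for col in zip(*atom_features)]

def max_pool(atom_features):
    """Take the maximum of each feature column over all atoms."""
    return [max(col) for col in zip(*atom_features)]

# Two atoms with two encoded features each (assumed toy values).
atom_features = [[1.0, 4.0], [3.0, 2.0]]
molecule_mean = mean_pool(atom_features)  # [2.0, 3.0]
molecule_max = max_pool(atom_features)    # [3.0, 4.0]
```

Both aggregators are invariant to the order of the atoms, which is why either can serve as the readout before the final prediction.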

Fig. 12 is a schematic structural diagram of a drug molecule attribute determination apparatus according to an embodiment of the present application. Referring to Fig. 12, the apparatus includes:

a first obtaining module 1201 configured to obtain a text string of a drug molecule to be detected; the text character string is used for describing a chemical structural formula of the drug molecule to be detected;

a second obtaining module 1202, configured to obtain three-dimensional structure information of the drug molecule to be detected according to the text character string;

a predicting module 1203 configured to determine a drug forming property of the drug molecule to be detected according to the three-dimensional structure information.

The apparatus provided by the embodiments of the present application offers a new drug molecule attribute prediction scheme for drug research and development. When predicting a drug molecule attribute, the apparatus obtains three-dimensional structure information of the drug molecule to be detected, which gives the position distribution of each atom of the drug molecule in three-dimensional space. Because the spatial structure of a drug molecule can influence its properties, the drug molecule attribute can be accurately predicted based on this three-dimensional structure information, which in turn can speed up the discovery of new candidate drugs and reduce research and development costs.
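The data flow between the three modules can be sketched as a simple pipeline. The class name, method name and the stand-in callables below are hypothetical and perform no real chemistry; the text string "CCO" is assumed here to be a SMILES-style string. They only show how the output of each module feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]

@dataclass
class PropertyDeterminationApparatus:
    """Chains modules 1201, 1202 and 1203: string, 3D structure, attribute."""
    obtain_string: Callable[[str], str]         # first obtaining module 1201
    obtain_3d_info: Callable[[str], Matrix]     # second obtaining module 1202
    predict_property: Callable[[Matrix], float] # predicting module 1203

    def determine(self, molecule_name: str) -> float:
        text = self.obtain_string(molecule_name)
        structure = self.obtain_3d_info(text)
        return self.predict_property(structure)

# Stand-in callables for illustration only (placeholders, not a real model).
apparatus = PropertyDeterminationApparatus(
    obtain_string=lambda name: {"ethanol": "CCO"}[name],
    obtain_3d_info=lambda text: [[float(i), 0.0, 0.0] for i, _ in enumerate(text)],
    predict_property=lambda coords: float(len(coords)),
)
result = apparatus.determine("ethanol")
```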

In a possible implementation manner, the second obtaining module includes:

a first obtaining unit configured to obtain three-dimensional structure coordinates of the drug molecule to be detected according to the text character string;

In a possible implementation manner, the second obtaining module further includes:

a second obtaining unit configured to obtain an adjacency matrix of the two-dimensional structure graph of the drug molecule to be detected according to the text character string;

a second processing unit configured to normalize the adjacency matrix of the two-dimensional structure graph to obtain a normalized adjacency matrix.
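One common way to normalize an adjacency matrix is the symmetric normalization with self-loops used in graph convolutional networks, i.e. multiplying A + I on both sides by the inverse square root of its degree matrix. The sketch below uses that scheme purely as an assumption; the normalization actually used by the second processing unit may differ.

```python
import math

def normalize_adjacency(adj):
    """Symmetric normalization of A + I by the square roots of the degrees (assumed scheme)."""
    n = len(adj)
    # Add self-loops so that every atom has a degree of at least 1.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

# Chain of three atoms (atom 0 bonded to atom 1, atom 1 bonded to atom 2).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
norm_adj = normalize_adjacency(adj)
```

The result stays symmetric, and entries between non-bonded atoms remain zero.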

In one possible implementation, the prediction module is configured to:

performing feature splicing processing on the three-dimensional structure coordinate matrix, the normalized adjacency matrix, the atomic features and the chemical bond features to obtain a first splicing matrix;

inputting the first splicing matrix into a molecular attribute prediction network to obtain a prediction attribute value output by the molecular attribute prediction network; the prediction attribute value is used for indicating the drug forming attribute of the drug molecule to be detected.