Similarity-based method, system, terminal and readable storage medium for predicting occurrence frequency of side effects of new drug

Document No.: 1965119 | Publication date: 2021-12-14

Abstract: This technology, "Similarity-based method, system, terminal and readable storage medium for predicting occurrence frequency of side effects of new drug", was designed by Wang Jianxin (王建新), Zhao Haochen (赵皓晨), Zheng Kai (郑凯) and Zhao Qichang (赵其昌) on 2021-09-15. The invention discloses a similarity-based method, system, terminal and readable storage medium for predicting the occurrence frequency of side effects of a new drug. Drug similarity information, side effect similarity information and known drug side effect frequency information are acquired, and drug similarity vectors and side effect similarity vectors are generated. An interaction graph is then generated for each pairing of a drug similarity type with a side effect similarity type, and a neural network captures the drug-side effect interaction information from these graphs. Multilayer perceptrons encode the drug and side effect similarity vectors into a drug embedding and a side effect embedding. Finally, the drug embedding, the side effect embedding and the drug-side effect interaction embedding are concatenated, and a multilayer perceptron predicts the side effects of the drug and their occurrence frequency. Because the method does not rely on known side effect frequency information for the new drug, it can predict the side effect occurrence frequency of new drugs, filling a gap in current prediction techniques.

1. A similarity-based method for predicting the occurrence frequency of side effects of a new drug, characterized in that the method comprises the following steps:

step 1: constructing a training data set based on known drug-side effect frequency information;

step 2: acquiring drug similarity information and side effect similarity information;

step 3: constructing a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and mapping the similarity vectors into a feature space of the same dimension to obtain feature vectors of the drugs and of the side effects;

step 4: constructing a drug-side effect pair interaction diagram based on the feature vectors of the drugs and the feature vectors of the side effects;

step 5: building a network architecture of a drug side effect occurrence frequency prediction model, and performing network training using the training data set and the corresponding drug and side effect data to obtain a trained drug side effect occurrence frequency prediction model;

wherein the known drug-side effect frequency information in the training data set serves as the label, and the drug and side effect data in step 5 comprise at least the drug-side effect pair interaction diagram;

step 6: for a new drug to be predicted, acquiring the drug similarity information of the new drug, processing the drug data as in step 3 and step 4, and inputting the result into the trained drug side effect occurrence frequency prediction model to obtain a prediction of the side effect occurrence frequency of the new drug.

2. The method of claim 1, wherein: if the known drug-side effect frequency information in the step 1 comprises data coded according to whether the frequency scores of the drug and the side effects are known or not, the drug side effect occurrence frequency prediction model in the step 5 is used for predicting whether the new drug and the side effects have an association relation or not;

the training process of the drug side effect occurrence frequency prediction model is as follows:

step 5.1: building a network architecture of a medicine side effect occurrence frequency prediction model based on a deep convolutional neural network and a multilayer perceptron;

a deep convolutional neural network performs feature extraction on the drug-side effect pair interaction diagram to obtain a drug-side effect interaction embedding, and multilayer perceptrons perform feature extraction on the drug feature vector and the side effect feature vector respectively to obtain a drug embedding and a side effect embedding; the drug-side effect interaction embedding, the drug embedding and the side effect embedding are concatenated and input to a multilayer perceptron to obtain a prediction score for the drug-side effect association pair;

step 5.2: performing network training with the drug feature vector, the side effect feature vector and the drug-side effect pair interaction diagram as network input and the known drug-side effect frequency information in the training data set as the label.

3. The method of claim 1, wherein: if the known drug-side effect frequency information in step 1 includes data encoded with frequency score values of the drug and the side effect, the drug side effect occurrence frequency prediction model in step 5 is used to predict frequency data of the new drug and the side effect based on the encoding rule;

the training process of the drug side effect occurrence frequency prediction model is as follows:

step 5-1: building a network architecture of a drug side effect occurrence frequency prediction model;

a deep convolutional neural network performs feature extraction on the drug-side effect pair interaction diagram to obtain a drug-side effect interaction embedding, and multilayer perceptrons perform feature extraction on the drug feature vector and the side effect feature vector respectively to obtain a drug embedding and a side effect embedding; the drug-side effect interaction embedding, the drug embedding and the side effect embedding are concatenated, and the concatenated vector is input to a multilayer perceptron to obtain frequency data of the drug-side effect association pair based on the coding rule;

step 5-2: performing network training with the drug feature vector, the side effect feature vector and the drug-side effect pair interaction diagram as network input and the drug-side effect frequency data based on the coding rule in the training data set as the label.

4. The method of claim 1, wherein: if the known drug-side effect frequency information in step 1 comprises an adjacency matrix DMA, encoded by whether the frequency score of a drug and a side effect is known, and an adjacency matrix DMF, encoded by the frequency score value of the drug and the side effect, the drug side effect occurrence frequency prediction model in step 5 is used for predicting whether the new drug is associated with a side effect and, for each associated new drug and side effect, further predicting their frequency data based on the coding rule;

the training process of the drug side effect occurrence frequency prediction model is as follows:

s5-1: building a network architecture of a medicine side effect occurrence frequency prediction model based on a deep convolutional neural network and a multilayer perceptron;

when the prediction score of the drug-side effect association pair is smaller than a preset decision threshold, the drug is considered not to have the corresponding side effect; when the prediction score is greater than or equal to the preset decision threshold, the drug is considered to have the corresponding side effect, and the concatenated vector is input to a further multilayer perceptron to obtain frequency data of the drug-side effect association pair based on the coding rule;

s5-2: performing network training with the drug feature vector, the side effect feature vector and the drug-side effect pair interaction diagram as network input and the known drug-side effect frequency information in the training data set as the label.

5. The method of claim 1, wherein: the formula of the feature vectors of the drugs and the side effects in the step 3 is as follows:

for drug $d_i$, the class-$k$ similarity feature vector is:

$$x_{d_i}^{k} = P_k \, v_{d_i}^{k}$$

where $x_{d_i}^{k}$ denotes the feature vector of drug $d_i$ for class-$k$ similarity, $P_k$ is a linear transformation matrix of dimension $r \times N_d$, $N_d$ is the number of drugs in the training set, $v_{d_i}^{k}$ is the similarity vector of drug $d_i$ for the $k$-th similarity type, and $r$ is the dimension of the similarity vector after the linear transformation;

for side effect $s_j$, the class-$l$ similarity feature vector is:

$$y_{s_j}^{l} = Q_l \, u_{s_j}^{l}$$

where $y_{s_j}^{l}$ denotes the feature vector of side effect $s_j$ for class-$l$ similarity, $Q_l$ is a linear transformation matrix of dimension $r \times N_s$, $N_s$ is the number of side effects in the training set, and $u_{s_j}^{l}$ is the similarity vector of side effect $s_j$ for the $l$-th similarity type.

6. The method of claim 1, wherein: the process for constructing the drug-side effect pair interaction diagram in step 4 is as follows:

for the class-$k$ drug similarity of drug $d_i$ and the class-$l$ side effect similarity of side effect $s_j$, a drug-side effect pair interaction diagram is generated, expressed as the outer product of the feature vector $x_{d_i}^{k}$ of drug $d_i$ and the feature vector $y_{s_j}^{l}$ of side effect $s_j$:

$$E_{d_i,s_j}^{k,l} = x_{d_i}^{k} \otimes y_{s_j}^{l}$$

where $E_{d_i,s_j}^{k,l}$ is the drug-side effect pair interaction diagram between the class-$k$ drug similarity of drug $d_i$ and the class-$l$ side effect similarity of side effect $s_j$, $x_{d_i}^{k}$ denotes the feature vector of drug $d_i$ for class-$k$ similarity, and $y_{s_j}^{l}$ denotes the feature vector of side effect $s_j$ for class-$l$ similarity;

the drug-side effect pair interaction diagram between a drug and a side effect consists of the interaction diagrams generated for every pairing of the drug's similarity categories with the side effect's similarity categories.

7. The method of claim 1, wherein: the drug similarity information consists of the drug similarity matrices SMD_Similarity, SMD_Experimental, SMD_Database, SMD_{Text_mining}, SMD_{Combined_score}, SMD_Structure, SMD_Target and SMD_Word, and the side effect similarity information consists of the side effect similarity matrices SME_Semantic and SME_Word.

8. A system based on the method of any one of claims 1-7, characterized by comprising:

a training data set construction module for constructing a training data set based on known drug-side effect frequency information;

a drug similarity information acquisition module for acquiring drug similarity information;

a side effect similarity information acquisition module for acquiring side effect similarity information;

a drug and side effect feature vector generation module for constructing a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and mapping the similarity vectors into a feature space of the same dimension to obtain the feature vectors of the drugs and of the side effects;

a drug-side effect pair interaction diagram construction module for constructing a drug-side effect pair interaction diagram based on the feature vectors of the drugs and the feature vectors of the side effects;

a drug side effect occurrence frequency prediction model building module for building a network architecture of a drug side effect occurrence frequency prediction model and performing network training using the training data set and the corresponding drug and side effect data to obtain a trained drug side effect occurrence frequency prediction model;

and a prediction module for processing the data of the new drug to be predicted as in steps 2 to 4 and inputting the processed drug data into the trained drug side effect occurrence frequency prediction model to obtain a prediction of the side effect occurrence frequency of the new drug.

9. A terminal, characterized by comprising:

one or more processors;

a memory storing one or more computer programs;

the processor invokes the memory-stored computer program to implement:

the steps of the method of any one of claims 1 to 7.

10. A readable storage medium, characterized by: a computer program is stored, which is invoked by a processor to implement:

the steps of the method of any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of computer bioinformatics and machine learning, and particularly relates to a method, a system, a terminal and a readable storage medium for predicting the occurrence frequency of new drug side effects based on similarity.

Background

Estimating the occurrence frequency of a drug's side effects is critical to its risk-benefit assessment. Currently, side effect frequencies are estimated from the intervention and placebo groups of randomized controlled trials. Although such trials are the standard way to eliminate selection bias in clinical medicine, they are limited by sample size and duration. Moreover, many side effects are not observed in clinical trials and are recognized only after the drug has entered the market. For this reason, drug side effects remain a major cause of morbidity and mortality in healthcare, resulting in billions of dollars of losses each year. For example, the appetite suppressant Fen-Phen was withdrawn from the market after the deaths of many patients who had taken it. Analyzing and predicting drug side effects by bioinformatics means is therefore of considerable practical significance.

In recent years, many computational models have been developed to predict drug side effects from drug-related databases. However, most methods only address whether a drug has one or more side effects and cannot determine how frequently those side effects occur. The frequency of side effects is central to the risk-benefit assessment of drugs. Accurate estimation of side effect frequency is not only critical for patient care in clinical practice but also important for pharmaceutical companies, because it reduces the risk of a drug being withdrawn from the market. Although two methods have been proposed to predict the occurrence frequency of drug side effects, both rely heavily on known drug-side effect associations or frequencies and cannot predict side effect frequencies for new drugs. For example, based on known drug side effect frequencies, Galeano et al. constructed a drug-side effect adjacency matrix and proposed a new matrix decomposition model to predict the frequency of potential drug side effects. This model achieves good prediction performance, but when the given sample is a new drug with no side effect information, a method that relies on known side effect frequencies cannot predict its potential side effects. Furthermore, Zhao et al. developed a deep learning framework that predicts side effect frequencies by integrating chemical structure similarity, known drug side effect frequency scores, side effect semantic similarity and pre-trained word vector representations. The core of that model is to construct a drug-side effect bipartite graph and, using an attention mechanism, learn the feature representation of each node from its direct neighbors. However, drugs that do not belong to the training data set have no neighbor nodes in the constructed heterogeneous graph, so this model also cannot predict the side effect frequencies of new drugs.

Therefore, it is very important to provide a method capable of predicting the frequency of side effects of a new drug.

Disclosure of Invention

To address the lack, in the prior art, of a method capable of predicting the side effect occurrence frequency of a new drug, the invention provides a similarity-based method, system, terminal and readable storage medium that predict the side effect occurrence frequency of a new drug from multiple kinds of drug and side effect similarity information. The method makes full use of the rich information contained in the similarities to form drug similarity vectors, side effect similarity vectors and drug-side effect pair interaction diagrams, from which a drug side effect occurrence frequency prediction model is built by network training. It can help biological experiment researchers discover the side effects of a new drug more accurately and determine how frequently they occur.

In one aspect, the invention relates to a method for predicting the occurrence frequency of side effects of a new drug based on similarity, which comprises the following steps:

step 1: constructing a training data set based on known drug-side effect frequency information;

step 2: acquiring drug similarity information and side effect similarity information;

step 3: constructing a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and mapping the similarity vectors into a feature space of the same dimension to obtain feature vectors of the drugs and of the side effects;

step 4: constructing a drug-side effect pair interaction diagram based on the feature vectors of the drugs and the feature vectors of the side effects;

The method thereby enables prediction of the side effects of a new drug and of their occurrence frequency.

Optionally, if the known drug-side effect frequency information in step 1 includes data encoded according to whether the frequency scores of the drug and the side effects are known, the drug side effect occurrence frequency prediction model in step 5 is used for predicting whether the new drug and the side effects have an association relationship, and the training process of the drug side effect occurrence frequency prediction model is as follows:

step 5.1: building a network architecture of a medicine side effect occurrence frequency prediction model based on a deep convolutional neural network and a multilayer perceptron;

For example, in some implementations the decision threshold is set to 0.5; if the drug is judged not to have the corresponding side effect, the output frequency score for that drug-side effect pair is 0.

Step 5.2: performing network training with the drug feature vector, the side effect feature vector and the drug-side effect pair interaction diagram as network input and the known drug-side effect frequency information in the training data set as the label.
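The thresholding rule can be illustrated with a minimal Python sketch. It assumes the comparison stated in claim 4 (a score below the threshold means no side effect; a score at or above it passes the pair on to a frequency predictor). The function names and the toy frequency head are hypothetical stand-ins, not part of the invention.

```python
from typing import Callable, Sequence


def predict_frequency(
    assoc_score: float,
    features: Sequence[float],
    freq_head: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> float:
    """Return 0.0 when no association is predicted, else the frequency head's output."""
    if assoc_score < threshold:
        # The drug is considered not to have this side effect.
        return 0.0
    # The drug is considered to have the side effect: predict its frequency.
    return freq_head(features)


def toy_freq_head(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained perceptron; clamps to the 1..5 scale."""
    return min(5.0, max(1.0, sum(features)))


print(predict_frequency(0.3, [1.0, 2.0], toy_freq_head))  # 0.0
print(predict_frequency(0.5, [1.0, 2.0], toy_freq_head))  # 3.0
```

Keeping the association decision separate from the frequency head mirrors the two-stage description: only pairs judged to be associated receive a non-zero frequency score.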

Alternatively, if the known drug-side effect frequency information in step 1 includes data encoded with a frequency score value of the drug and the side effect, the drug side effect occurrence frequency prediction model in step 5 is used to predict frequency data of the new drug and the side effect based on the encoding rule; the training process of the drug side effect occurrence frequency prediction model is as follows:

step 5-1: building a network architecture of a drug side effect occurrence frequency prediction model;

a deep convolutional neural network performs feature extraction on the drug-side effect pair interaction diagram to obtain a drug-side effect interaction embedding, and multilayer perceptrons perform feature extraction on the drug feature vector and the side effect feature vector respectively to obtain a drug embedding and a side effect embedding; the drug-side effect interaction embedding, the drug embedding and the side effect embedding are concatenated, and the concatenated vector is input to a multilayer perceptron to obtain frequency data of the drug-side effect association pair based on the coding rule;
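The final fusion stage described above (concatenate the three embeddings, then apply a multilayer perceptron) can be sketched in plain Python. This sketch covers only that last stage: the three embeddings are taken as given inputs, the convolutional encoder is not modelled, and the layer sizes, random initialisation and function names are illustrative assumptions rather than the patent's configuration.

```python
import random


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights for illustration; a trained model would load learned values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_predict(drug_emb, se_emb, inter_emb, layers):
    """Concatenate the three embeddings and run them through an MLP."""
    x = list(drug_emb) + list(se_emb) + list(inter_emb)  # concatenation
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))                # hidden layers
    weights, bias = layers[-1]
    return dense(x, weights, bias)[0]                    # single output value


rng = random.Random(0)
drug_emb, se_emb, inter_emb = [0.2] * 4, [0.1] * 4, [0.3] * 8
layers = [init_layer(16, 8, rng), init_layer(8, 1, rng)]
score = fuse_and_predict(drug_emb, se_emb, inter_emb, layers)
print(score)
```

In practice a deep learning framework would replace these hand-written layers; the point here is only the data flow from three embeddings to one output.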

Alternatively, if the known drug-side effect frequency information in step 1 includes the adjacency matrix DMA encoded with whether the frequency scores of the drug and the side effect are known or not and the adjacency matrix DMF encoded with the frequency score values of the drug and the side effect; the drug side effect occurrence frequency prediction model in the step 5 is used for predicting whether the new drug has a correlation with the side effect, and further predicting the frequency data of the new drug and the side effect based on the coding rule aiming at the new drug and the side effect with the correlation; the training process of the drug side effect occurrence frequency prediction model is as follows:

s5-1: building a network architecture of a medicine side effect occurrence frequency prediction model based on a deep convolutional neural network and a multilayer perceptron;

Optionally, the formula of the feature vector of the drug and the side effect in step 3 is as follows:

Optionally, the process for constructing the drug-side effect pair interaction map in step 4 is as follows:

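for the class-$k$ drug similarity of drug $d_i$ and the class-$l$ side effect similarity of side effect $s_j$, the drug-side effect pair interaction diagram is the outer product of the two feature vectors:

$$E_{d_i,s_j}^{k,l} = x_{d_i}^{k} \otimes y_{s_j}^{l}$$

where $E_{d_i,s_j}^{k,l}$ is the drug-side effect pair interaction diagram between the class-$k$ drug similarity of drug $d_i$ and the class-$l$ side effect similarity of side effect $s_j$, $x_{d_i}^{k}$ denotes the feature vector of drug $d_i$ for class-$k$ similarity, and $y_{s_j}^{l}$ denotes the feature vector of side effect $s_j$ for class-$l$ similarity;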
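As a small worked example of steps 3 and 4, the sketch below projects one drug similarity vector and one side effect similarity vector into a shared r-dimensional space and takes their outer product. The matrix values, toy sizes and function names are invented for illustration; in the method the projection matrices would be learned during training.

```python
def project(matrix, vector):
    """Multiply an r x N matrix by a length-N similarity vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def outer(u, v):
    """Outer product of two length-r vectors: an r x r interaction diagram."""
    return [[a * b for b in v] for a in u]


# Toy sizes (assumed): N_d = 3 training drugs, N_s = 2 side effects, r = 2.
P_k = [[1.0, 0.0, 1.0],
       [0.0, 2.0, 0.0]]      # r x N_d projection for drug similarity type k
Q_l = [[1.0, 1.0],
       [0.5, 0.0]]           # r x N_s projection for side effect similarity type l
drug_sim = [0.5, 1.0, 0.0]   # similarity of one drug to each training drug
se_sim = [1.0, 0.0]          # similarity of one side effect to each side effect

x = project(P_k, drug_sim)     # [0.5, 2.0]
y = project(Q_l, se_sim)       # [1.0, 0.5]
interaction = outer(x, y)      # [[0.5, 0.25], [2.0, 1.0]]
print(interaction)
```

Because a drug's similarity vector only requires its similarity to the training drugs, the same computation applies to a drug that has no known side effect data. With 8 drug similarity types and 2 side effect similarity types, 16 such diagrams would be produced per drug-side effect pair.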

Optionally, the drug similarity information consists of the drug similarity matrices SMD_Similarity, SMD_Experimental, SMD_Database, SMD_{Text_mining}, SMD_{Combined_score}, SMD_Structure, SMD_Target and SMD_Word, and the side effect similarity information consists of the side effect similarity matrices SME_Semantic and SME_Word.

In a second aspect, the present invention provides a system based on the method for predicting the occurrence frequency of side effects of a new drug, comprising:

a training data set construction module for constructing a training data set based on known drug-side effect frequency information;

a drug similarity information acquisition module for acquiring drug similarity information;

a side effect similarity information acquisition module for acquiring side effect similarity information;

In a third aspect, the present invention provides a terminal, comprising:

one or more processors;

a memory storing one or more computer programs;

the processor invokes the memory-stored computer program to implement:

the steps of the similarity-based method for predicting the occurrence frequency of side effects of a new drug.

In a fourth aspect, the present invention provides a readable storage medium storing a computer program for invocation by a processor to implement:

the steps of the similarity-based method for predicting the occurrence frequency of side effects of a new drug.

Advantageous effects

1. The invention provides a method for predicting the occurrence frequency of side effects of a new drug based on drug and side effect similarity information. The method makes full use of the rich information contained in the similarities to form drug similarity vectors, side effect similarity vectors and drug-side effect pair interaction diagrams, from which a drug side effect occurrence frequency prediction model is built by network training. The method enables prediction of side effect frequencies for new drugs and can help biological experiment researchers discover the side effects of a new drug more accurately and determine how frequently they occur.

2. In a further preferred scheme of the invention, various types of similarity information are selected, so that the reliability of the prediction result is improved.

3. In a further preferred scheme of the invention, the convolutional neural network and the multilayer perceptron are used for feature extraction and prediction, and particularly, the prediction of whether the new drug has an association relation with the side effect and the frequency prediction of the new drug and the side effect are further realized.

Drawings

Fig. 1 is a schematic flow chart of a method for predicting the occurrence frequency of a new drug side effect based on similarity, which is provided in embodiment 1 of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and specific embodiments. These examples are given solely for illustration and do not limit the scope of the invention; equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.

Example 1:

the embodiment provides a method for predicting the occurrence frequency of side effects of a new drug based on similarity, which comprises the following steps:

s1: a training data set is constructed based on known drug-side effect frequency information.

The data set in this embodiment consists of an adjacency matrix DMA, encoded by whether the frequency score of a drug and a side effect is known, and an adjacency matrix DMF, encoded by the frequency score value of the drug and the side effect. Both matrices have dimension N_d × N_s, where N_d is the number of drugs in the training set and N_s is the number of side effects in the training set.

For example, a data set of known drug-side effect frequency information is collected from the SIDER v4.0 database. The constructed adjacency matrices DMA and DMF both have dimension 757 × 994, where 757 is the number of drugs and 994 is the number of side effects in the data set. Each element of the association adjacency matrix DMA encodes whether the frequency score of the corresponding drug and side effect is known: if the data set contains a frequency score for the pair, the corresponding entry of DMA is 1, otherwise it is 0. Each element of DMF encodes the specific frequency score of the drug-side effect pair. Following the study by Galeano et al., frequency is divided into five levels: very rare (frequency 1), rare (frequency 2), infrequent (frequency 3), frequent (frequency 4) and very frequent (frequency 5). If the data set contains a frequency score for a pair, the corresponding entry of DMF is that score, otherwise it is 0. In other possible embodiments the encoding rule may be adjusted to the application; the rule above is only an example. The prediction follows whatever encoding rule is used for the samples: if the samples are scored with the five-level rule above, the prediction is a side effect frequency score on those five levels; if the samples use occurrence frequency values directly, the prediction is the occurrence frequency of the new drug's side effect.
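The encoding of DMA and DMF can be sketched as follows. The input format (a dictionary keyed by drug and side effect names) and the example names are assumptions made for illustration; the actual SIDER records would need to be parsed into this shape first.

```python
def build_matrices(drugs, side_effects, known_freqs):
    """Build DMA (0/1: frequency known?) and DMF (frequency level 1..5, else 0)."""
    d_idx = {d: i for i, d in enumerate(drugs)}
    s_idx = {s: j for j, s in enumerate(side_effects)}
    dma = [[0] * len(side_effects) for _ in drugs]
    dmf = [[0] * len(side_effects) for _ in drugs]
    for (drug, side_effect), freq in known_freqs.items():
        if not 1 <= freq <= 5:
            raise ValueError(f"frequency level must be 1..5, got {freq}")
        i, j = d_idx[drug], s_idx[side_effect]
        dma[i][j] = 1      # a frequency score is known for this pair
        dmf[i][j] = freq   # the five-level frequency score itself
    return dma, dmf


# Hypothetical toy data: 2 drugs, 3 side effects, 2 known frequency scores.
drugs = ["drug_a", "drug_b"]
side_effects = ["nausea", "headache", "rash"]
known = {("drug_a", "nausea"): 4, ("drug_b", "rash"): 1}

DMA, DMF = build_matrices(drugs, side_effects, known)
print(DMA)  # [[1, 0, 0], [0, 0, 1]]
print(DMF)  # [[4, 0, 0], [0, 0, 1]]
```

On the data set described above, the same function would yield two 757 × 994 matrices.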

S2: drug similarity information and side effect similarity information are obtained.

In this embodiment, 8 drug similarity matrices of different similarity types and 2 side effect similarity matrices of different similarity types are constructed for the drugs and side effects in the training set.

The drug similarity matrices are as follows:

SMD_Similarity, SMD_Experimental, SMD_Database, SMD_{Text_mining}, SMD_{Combined_score}, SMD_Structure, SMD_Target, SMD_Word.

The side effect similarity matrices are as follows:

the matrix SME_Semantic, obtained from the semantic descriptors of the side effects, and the matrix SME_Word, obtained from the word vector information of the side effects.

The drug similarity matrices SMD_Similarity, SMD_Experimental, SMD_Database, SMD_{Text_mining} and SMD_{Combined_score} are constructed directly from the known association information between drugs in the STITCH database, from which the five similarity score types named 'Similarity', 'Experimental', 'Database', 'Text Mining' and 'Combined Score' were collected. SMD_Structure is derived from the structural information of the drugs, SMD_Target from their target information, and SMD_Word from their word vector information.

The drug similarity matrix SMD_Structure is constructed from the SMILES sequence information of the drugs. The SMILES sequence of each drug is collected from the PubChem database and input into the Python toolkit RDKit, which assigns each drug a 2048-dimensional molecular descriptor fingerprint vector. SMD_Structure is then constructed from these fingerprint vectors, where SMD_Structure(i, j) is the entry of the drug similarity matrix SMD_Structure for drugs d_i and d_j, and FV_i and FV_j denote the molecular descriptor fingerprint vectors of d_i and d_j obtained from their SMILES sequence information.
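The text does not preserve the similarity formula applied to the fingerprint vectors. The sketch below assumes the Tanimoto (Jaccard) coefficient, a common choice for binary fingerprints; this is an assumption, not a statement of what the patent uses. Fingerprints are represented as plain 0/1 lists rather than RDKit objects so that the example has no external dependencies.

```python
def tanimoto(fv_i, fv_j):
    """Tanimoto coefficient of two equal-length binary fingerprint vectors."""
    both = sum(1 for a, b in zip(fv_i, fv_j) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fv_i, fv_j) if a or b)   # bits set in at least one
    return both / either if either else 0.0


def structure_similarity_matrix(fingerprints):
    """Pairwise similarity matrix under the assumed Tanimoto measure."""
    return [[tanimoto(u, v) for v in fingerprints] for u in fingerprints]


# Toy 4-bit fingerprints (real ones would have 2048 bits).
fps = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
smd_structure = structure_similarity_matrix(fps)
print(smd_structure[0][1])  # 1 shared bit out of 3 set bits, about 0.333
```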

For SMD_Target, this example collects the known protein target information of the drugs from the DrugBank database; each drug is represented by an 847-dimensional target feature vector. Each dimension of the vector represents a protein, and its value is 1 if the drug targets that protein and 0 otherwise. The target similarity between two drugs is computed with the cosine similarity coefficient:

$$SMD_{Target}(i,j) = \frac{\sum_{k} TV^{i}_{k}\, TV^{j}_{k}}{\sqrt{\sum_{k} \left(TV^{i}_{k}\right)^{2}}\;\sqrt{\sum_{k} \left(TV^{j}_{k}\right)^{2}}}$$

where SMD_Target(i, j) is the entry of the matrix SMD_Target for drugs d_i and d_j, $TV^{i}$ and $TV^{j}$ are the target feature vectors of any two drugs d_i and d_j, and $TV^{i}_{k}$ and $TV^{j}_{k}$ are the k-th elements of those vectors. The drug similarity matrix SMD_Target is obtained by computing the target similarity values between all pairs of drugs.
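A minimal sketch of the cosine similarity computation for binary target vectors follows. The toy vectors have 4 dimensions instead of 847, and returning 0 for a drug with no known targets is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(tv_i, tv_j):
    """Cosine similarity of two target feature vectors."""
    dot = sum(a * b for a, b in zip(tv_i, tv_j))
    norm_i = math.sqrt(sum(a * a for a in tv_i))
    norm_j = math.sqrt(sum(b * b for b in tv_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumed convention for drugs without known targets
    return dot / (norm_i * norm_j)


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy target vectors: 1 means the drug targets that protein.
targets = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
smd_target = similarity_matrix(targets)
print(round(smd_target[0][1], 3))  # 0.707
print(smd_target[0][2])            # 0.0
```

The same function applies unchanged to the word vectors used for SMD_Word and SME_Word, since those matrices are also described as cosine similarities.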

Drug similarity matrix SMD_WordConstructed according to SMILES sequence information as follows:

First, the SMILES sequence information of each drug is collected from the PubChem database and input into a pre-trained Mol2vec model to obtain the drug's word encoding vector. The cosine similarity between the word encoding vectors of any two drugs is taken as the word encoding similarity value of that drug pair. Finally, the drug similarity matrix SMD_Word is constructed by computing the word encoding similarity values between all pairs of drugs.

The side effect similarity matrix SME_Semantic is based on the semantic description information of all side effects. One side effect corresponds to one or more semantic descriptors, so each side effect is represented by its related semantic descriptors, and a DAG is constructed from the semantic descriptors of the side effect as follows:

In this example, the semantic descriptors of the side effects are collected from the ADReCS database. For each side effect in the data set, a corresponding directed acyclic graph (DAG) is then constructed from its semantic descriptors, where the nodes of the graph represent the semantic descriptors of the side effect and the directed edges represent the relationships between the semantic descriptors. For example, a side effect s_i is represented as a DAG made up of a node set, which contains the descriptor of s_i and its ancestor nodes, and an edge set, which contains the edges connecting the descriptors. From the DAG of each side effect, the contribution value of each node in the graph to the side effect can be calculated:

In the formula, θ is the semantic contribution attenuation factor: as the distance between node t and the descriptor of side effect s_i increases, the contribution of t to the semantics of s_i decreases. t* denotes a descendant node of node t in the DAG. The semantic similarity matrix SME_Semantic of the side effects is then calculated by the following formula:
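The two formulas are not reproduced in this text. A minimal sketch is given here under the assumption that they follow the widely used DAG-based semantic similarity scheme: a descriptor contributes 1 to itself, each ancestor receives θ times the largest contribution among its child nodes, and the similarity of two side effects is the summed contribution of their shared nodes divided by the summed contribution of all nodes of both DAGs. The function names, the `parents` dictionary, the toy descriptors and the value θ = 0.5 are all hypothetical.

```python
def semantic_contributions(term, parents, theta=0.5):
    """Contribution of `term` and of each of its ancestors to `term`.

    parents: dict mapping a descriptor to the list of its parent descriptors.
    theta:   assumed attenuation factor (the text gives no value).
    """
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = theta * contrib[node]
            # Keep the largest contribution reaching this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, theta=0.5):
    """Assumed similarity: shared contributions over total contributions."""
    da = semantic_contributions(a, parents, theta)
    db = semantic_contributions(b, parents, theta)
    shared = set(da) & set(db)
    numerator = sum(da[t] + db[t] for t in shared)
    denominator = sum(da.values()) + sum(db.values())
    return numerator / denominator

# Hypothetical descriptor hierarchy (child -> parents).
parents = {
    "nausea": ["gastrointestinal disorder"],
    "vomiting": ["gastrointestinal disorder"],
    "gastrointestinal disorder": ["root"],
}
sim = semantic_similarity("nausea", "vomiting", parents)
```

In this toy hierarchy each side effect has a total contribution of 1 + 0.5 + 0.25 = 1.75, and the two shared ancestors contribute 1.5 in total, so the similarity is 1.5 / 3.5.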

The side effect similarity matrix SME_Word is constructed from the names of the side effects as follows:

First, the names of the side effects are collected from a database and input into a pre-trained GloVe word vector model to obtain the word-encoding vector of each side effect. The cosine similarity between the word-encoding vectors of any two side effects is taken as the word-encoding similarity value of that side effect pair, and the side effect word similarity matrix SME_Word is constructed from these values.

In this example, the 8 different types of drug similarity matrices all have dimensions 757 × 757, and the 4 different types of side effect similarity matrices all have dimensions 994 × 994.

It should be understood that, in other possible embodiments, a subset of the above similarity matrices may be selected, or other similarity matrices may be added to them; the present invention does not specifically limit this.

S3: Construct a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and map the similarity vectors to a feature mapping space of the same dimension to obtain the feature vectors of the drugs and the side effects.

In this example, collecting the various types of similarity vectors for drugs and side effects specifically means defining a set of drug similarity matrices and a set of side effect similarity matrices, and collecting the features of the drugs and of the side effects from these two sets, respectively.

The drug similarity matrix set is:

The side effect similarity matrix set is:

Taking the k-th similarity matrix in the drug similarity matrix set as an example, a similarity vector can be collected for drug d_i:

where each element is the value corresponding to drug d_i in the k-th similarity matrix.

For the l-th similarity matrix in the side effect similarity matrix set, a similarity vector can be collected for side effect s_j:

where each element is the value corresponding to side effect s_j in the l-th similarity matrix.

By traversing every element of the two sets in this way, multiple types of similarity vectors can be collected. Each drug obtains one similarity vector for each type of drug similarity, and each side effect obtains one similarity vector for each type of side effect similarity.
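Collecting the similarity vectors can be sketched as below, assuming that the similarity vector of drug d_i in the k-th matrix is simply row i of that matrix. The function name and the toy 3 × 3 matrices are hypothetical; in the example of the text the drug matrices are 757 × 757.

```python
def collect_similarity_vectors(matrices, index):
    """Return one similarity vector (row `index`) per similarity matrix."""
    return [list(matrix[index]) for matrix in matrices]

# Two toy drug similarity matrices standing in for two similarity types.
smd_structure = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]
smd_target = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
drug_matrices = [smd_structure, smd_target]

# Similarity vectors of the drug with index 1, one per similarity type.
vectors_d1 = collect_similarity_vectors(drug_matrices, 1)
```

The same function applies to the side effect matrix set, with the index of side effect s_j in place of the drug index.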

Feature mapping of all the similarity vectors means projecting the multiple similarity vectors of the drugs and side effects into a feature mapping space of the same dimension, yielding the initial feature vectors of the drugs and the side effects, respectively. Taking drug d_i and side effect s_j as an example, for the k-th drug similarity vector and the l-th side effect similarity vector, the feature vectors of d_i and s_j are respectively as follows:

where P_k and Q_l are linear transfer matrices; P_k has dimension r × 757 and Q_l has dimension r × 994, where r is the dimension of the similarity vector after transformation by the linear transfer matrix. In this example, r was determined by a ten-fold cross-validation test and set to 32.
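The dimension bookkeeping of this mapping can be illustrated with a plain matrix-vector product. This is a sketch under assumptions: in the model P_k and Q_l are learned during training, whereas here they are filled with random values, and the helper names, the seeds and the absence of a bias term are choices made for the illustration.

```python
import random

def make_projection(r, n, seed=0):
    """Random r x n matrix standing in for a learned linear transfer matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(r)]

def project(matrix, vector):
    """Map an n-dimensional similarity vector to an r-dimensional feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

R, N_DRUGS, N_SIDE_EFFECTS = 32, 757, 994

P_k = make_projection(R, N_DRUGS, seed=1)         # r x 757
Q_l = make_projection(R, N_SIDE_EFFECTS, seed=2)  # r x 994

# Toy similarity vectors: each is similar only to one reference item.
drug_similarity = [0.0] * N_DRUGS
drug_similarity[0] = 1.0
side_effect_similarity = [0.0] * N_SIDE_EFFECTS
side_effect_similarity[3] = 1.0

x = project(P_k, drug_similarity)         # drug feature vector, length r
y = project(Q_l, side_effect_similarity)  # side effect feature vector, length r
```

Both outputs have length r = 32 even though the inputs have different lengths, which is what allows the outer product in the next step.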

The invention performs feature mapping on all the similarity vectors, and can obtain a plurality of drug and side effect feature vectors with the same dimension.

S4: Perform the outer product operation, in turn, on the drug feature vectors and the side effect feature vectors generated from the different types of similarity in step S3 to obtain a plurality of drug-side effect pair interaction maps.

In this example, the outer product operation is specifically as follows:

In the formula, the result is a matrix of dimension r × r, also called a drug-side effect pair interaction map. The outer product operation is carried out in turn on each drug feature vector and each side effect feature vector. For example, in this example each drug-side effect pair has 16 interaction maps, constructed from the 8 types of drug similarity and the 2 types of side effect similarity.
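The construction of the interaction maps can be sketched as follows. The function names are hypothetical, and a toy dimension r = 4 is used in place of r = 32 to keep the example small; the counts of 8 drug feature vectors and 2 side effect feature vectors follow the example in the text.

```python
def outer(x, y):
    """Outer product of two vectors: entry (a, b) is x[a] * y[b]."""
    return [[xa * yb for yb in y] for xa in x]

def interaction_maps(drug_vectors, side_effect_vectors):
    """One map per (drug similarity type, side effect similarity type) pair."""
    return [outer(x, y) for x in drug_vectors for y in side_effect_vectors]

r = 4  # toy dimension; the example in the text uses r = 32
drug_vectors = [[float(k + i) for i in range(r)] for k in range(8)]
side_effect_vectors = [[1.0] * r, [0.5] * r]

maps = interaction_maps(drug_vectors, side_effect_vectors)
```

With 8 drug feature vectors and 2 side effect feature vectors this yields 8 × 2 = 16 maps, each of size r × r.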

S5: Build the network architecture of the drug side effect occurrence frequency prediction model, and train the network using the training data set and the corresponding drug and side effect data to obtain the trained drug side effect occurrence frequency prediction model.

In this embodiment, a deep convolutional neural network extracts features from the drug-side effect interaction maps to obtain the drug-side effect interaction embedding, and multilayer perceptrons extract features from the drug feature vectors and the side effect feature vectors to obtain the drug embedding and the side effect embedding, respectively. The drug-side effect interaction embedding, the drug embedding and the side effect embedding are concatenated and input into a multilayer perceptron to obtain the prediction score of the drug-side effect association pair.

When the prediction score of the drug-side effect association pair is below the preset decision threshold (0.5), the drug is judged not to have the corresponding side effect, and the output occurrence frequency between the drug and the side effect is 0. When the prediction score is greater than or equal to the preset decision threshold (0.5), the drug is judged to have the corresponding side effect, and the concatenated vector is input into a new multilayer perceptron to obtain the frequency data of the drug-side effect association pair based on the coding rule.
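The two-stage decision described above can be sketched as a small function. The names `predict_frequency` and `frequency_head` are hypothetical; `frequency_head` stands in for the second multilayer perceptron and is replaced here by a dummy callable.

```python
THRESHOLD = 0.5  # decision threshold stated in the text

def predict_frequency(association_score, fused_vector, frequency_head,
                      threshold=THRESHOLD):
    """Return 0 when no association is predicted, else the frequency prediction.

    association_score: output of the association multilayer perceptron
    fused_vector:      concatenated drug, side effect and interaction embeddings
    frequency_head:    callable standing in for the second multilayer perceptron
    """
    if association_score < threshold:
        return 0
    return frequency_head(fused_vector)

# Dummy frequency head for illustration only: always predicts frequency 3.
dummy_head = lambda vector: 3

no_association = predict_frequency(0.2, [0.1, 0.4], dummy_head)
with_association = predict_frequency(0.5, [0.1, 0.4], dummy_head)
```

A score exactly equal to the threshold is treated as an association, matching the "greater than or equal to" condition in the text.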

In this embodiment, the deep convolutional neural network consists of 6 hidden layers, each with 32 channels, a stride of 2 and a 2 × 2 convolution kernel; each multilayer perceptron consists of 3 hidden layers, each of dimension r, where the value of r is determined by a 10-fold cross-validation experiment on the training set; the decision threshold is 0.5. In other possible embodiments, the network parameters may be optimized or adjusted.
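The effect of a 2 × 2 kernel with stride 2 on the spatial size of an interaction map can be worked out with the standard convolution output-size formula. The sketch below assumes no padding and a 32 × 32 input (r = 32); the text specifies neither the padding nor how the 16 maps are combined before the first layer, so the function name and these settings are assumptions.

```python
def conv_output_size(size, kernel=2, stride=2, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

sizes = [32]  # r = 32 in the example
for _ in range(5):
    sizes.append(conv_output_size(sizes[-1]))
# sizes is now [32, 16, 8, 4, 2, 1]
```

Under these assumptions each layer halves the map, and five layers already reduce 32 × 32 to 1 × 1. How the sixth hidden layer operates on the map is not stated in the text.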

During training, the model parameters are first initialized; the training data, the drug feature vectors, the side effect feature vectors and the drug-side effect pair interaction maps are then input as described above; the mean squared loss is used as the loss function of the whole model and is back-propagated layer by layer; and the parameters of the model are updated iteratively with Adam.

The mean squared loss function Loss₁ for predicting drug-side effect pair associations is expressed as follows:

In the formula, M₁ and M₂ denote the numbers of positive and negative samples in the training set, and the two label symbols denote the true association labels and the predicted labels of the training data, respectively. The loss function Loss₂ for predicting the frequency scores of drug-side effect pairs is defined as follows:

In the formula, the two label symbols denote the true labels and the predicted labels of the positive samples in the training data, respectively. The two loss functions are trained jointly, and the total loss function Loss_total becomes:

In the formula, M₃ and M₄ denote the numbers of parameters in the model, and μ is a hyperparameter that determines the degree of regularization; it is set to 0.0005.
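The three loss formulas are not reproduced in this text, so the following is only a sketch of one plausible form: a mean squared error over all M₁ + M₂ samples for the association task, a mean squared error over the positive samples for the frequency task, and an L2 penalty on two parameter groups, each averaged over its parameter count and weighted by μ. The function names, the split into two parameter groups and the exact normalization are assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_weight(params):
    """Average squared parameter value (assumed form of the regularizer)."""
    return sum(w * w for w in params) / len(params)

def total_loss(assoc_true, assoc_pred, freq_true, freq_pred,
               params_a, params_b, mu=0.0005):
    loss1 = mse(assoc_true, assoc_pred)  # all positive and negative samples
    loss2 = mse(freq_true, freq_pred)    # positive samples only
    reg = mu * (mean_squared_weight(params_a) + mean_squared_weight(params_b))
    return loss1 + loss2 + reg

loss = total_loss(
    assoc_true=[1, 0], assoc_pred=[0.5, 0.5],
    freq_true=[4], freq_pred=[3],
    params_a=[1.0, 1.0], params_b=[2.0],
)
```

In this toy call the association loss is 0.25, the frequency loss is 1, and the regularization term is 0.0005 × (1 + 4) = 0.0025.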

In other possible embodiments, other algorithms may be selected for parameter optimization in the model training process.

S6: For a new drug to be predicted, acquire the drug similarity information of the new drug, process the drug data as in steps S3 and S4, and input the result into the trained drug side effect occurrence frequency prediction model to obtain the side effect occurrence frequency prediction result for the new drug.

For example, in this embodiment, predictions are made for a new drug with unknown frequency information, and the prediction results are sorted in descending order to generate a list of potential drug-side effect pair frequencies.

In this embodiment, the drug side effect occurrence frequency prediction model first predicts whether the new drug is associated with a side effect and then, for each new drug-side effect pair predicted to be associated, predicts the frequency data based on the coding rule. The frequency scores in this embodiment are: very rare (frequency 1), rare (frequency 2), infrequent (frequency 3), frequent (frequency 4) and very frequent (frequency 5).
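The coding rule and the descending ranking of predictions can be sketched as below. The label dictionary follows the five frequency classes listed above plus 0 for "no association"; rounding a continuous score to the nearest class, the function names and the example scores are assumptions made for illustration.

```python
FREQUENCY_LABELS = {
    0: "no association",
    1: "very rare",
    2: "rare",
    3: "infrequent",
    4: "frequent",
    5: "very frequent",
}

def label_for(score):
    """Map a predicted score to a frequency class (assumed: nearest class, clipped to 0..5)."""
    return FREQUENCY_LABELS[min(5, max(0, round(score)))]

def rank_side_effects(scores):
    """Sort (side effect, score) pairs from the largest score to the smallest."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical predicted scores for one new drug.
predicted = {"nausea": 3.7, "rash": 1.2, "headache": 4.4}
ranking = rank_side_effects(predicted)
labels = [(name, label_for(score)) for name, score in ranking]
```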

In other possible embodiments, depending on the application requirements, the drug side effect occurrence frequency prediction model may be configured only to predict whether a new drug is associated with a side effect, or only to predict the frequency data of a new drug and a side effect based on the coding rule. The data of the corresponding training set, i.e. the labels, are adjusted accordingly.

Example 2:

This embodiment provides a system based on the above method for predicting the occurrence frequency of side effects of a new drug, which includes: a training data set construction module, a drug similarity information acquisition module, a side effect similarity information acquisition module, a drug and side effect feature vector generation module, a drug-side effect pair interaction map construction module, a drug side effect occurrence frequency prediction model construction module and a prediction module.

The training data set construction module is configured to construct a training data set based on the known drug-side effect frequency information; the drug similarity information acquisition module is configured to acquire drug similarity information; the side effect similarity information acquisition module is configured to acquire side effect similarity information; the drug and side effect feature vector generation module is configured to construct a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and to map the similarity vectors to a feature mapping space of the same dimension to obtain the feature vectors of the drugs and the side effects; the drug-side effect pair interaction map construction module is configured to construct the drug-side effect pair interaction maps based on the feature vectors of the drugs and the feature vectors of the side effects; the drug side effect occurrence frequency prediction model construction module is configured to build the network architecture of the drug side effect occurrence frequency prediction model and to train the network using the training data set and the corresponding drug and side effect data to obtain the trained model; and the prediction module is configured to process the data of the new drug to be predicted as in steps 2 to 4 and input it into the trained drug side effect occurrence frequency prediction model to obtain the side effect occurrence frequency prediction result for the new drug.

Details are not repeated here. The division into functional module units is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. The integrated units may be implemented in the form of hardware or in the form of software functional units.

Example 3:

This embodiment provides a terminal, which includes one or more processors and a memory storing one or more computer programs. The processor invokes the computer program stored in the memory to implement the steps of the similarity-based method for predicting the occurrence frequency of side effects of a new drug. Specifically, the steps are:

Step 1: constructing a training data set based on known drug-side effect frequency information;

Step 2: acquiring drug similarity information and side effect similarity information;

Step 3: constructing a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and mapping the similarity vectors to a feature mapping space of the same dimension to obtain the feature vectors of the drugs and the side effects;

Step 4: constructing the drug-side effect pair interaction maps based on the feature vectors of the drugs and the feature vectors of the side effects;

The implementation process can refer to embodiment 1 and its extended description.

The terminal further includes a communication interface for communicating with external devices and exchanging data.

The memory may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

If the memory, the processor and the communication interface are implemented independently, they may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc.

Optionally, in a specific implementation, if the memory, the processor and the communication interface are integrated on one chip, they may communicate with each other through internal interfaces.

The specific implementation process of each step refers to the explanation of the foregoing method.

It should be understood that in the embodiments of the present invention, the processor may be a Central Processing Unit (CPU), or another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general purpose processor may be a microprocessor or any conventional processor. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

Example 4:

This embodiment provides a readable storage medium storing a computer program, which is invoked by a processor to implement the steps of the similarity-based method for predicting the occurrence frequency of side effects of a new drug. Specifically, the steps are:

Step 1: constructing a training data set based on known drug-side effect frequency information;

Step 2: acquiring drug similarity information and side effect similarity information;

Step 3: constructing a similarity vector for each drug and a similarity vector for each side effect based on the drug similarity information and the side effect similarity information, and mapping the similarity vectors to a feature mapping space of the same dimension to obtain the feature vectors of the drugs and the side effects;

Step 4: constructing the drug-side effect pair interaction maps based on the feature vectors of the drugs and the feature vectors of the side effects;

The implementation process can refer to embodiment 1 and its extended description.

The specific implementation process of each step refers to the explanation of the foregoing method.

The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller according to any of the foregoing embodiments, for example, a hard disk or a memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the controller. Further, the readable storage medium may also include both an internal storage unit of the controller and an external storage device. The readable storage medium is used for storing the computer program and other programs and data required by the controller. The readable storage medium may also be used to temporarily store data that has been output or is to be output.

Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Simulation and verification:

To verify the effectiveness of the invention, two validation schemes were adopted, following the validation standards of computational models in other related fields: (1) ten-fold cross-validation and (2) local 5-fold cross-validation. Four evaluation indices were used to evaluate the model: AUC (area under the ROC curve), AUPR (area under the precision-recall curve), RMSE (root mean squared error) and MAE (mean absolute error). In ten-fold cross-validation, the data set is randomly divided into 10 parts; each part in turn is selected as the test set with the remaining 9 parts as the training set, and this is repeated 10 times. In local 5-fold cross-validation, the drugs in the data set are randomly divided into 5 parts; each part in turn is selected, all known drug-side effect frequency pairs related to its drugs are collected as the test set, the remaining known drug-side effect frequency pairs in the data set are collected as the training set, and the test is repeated 5 times.
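The difference between the two validation schemes can be sketched as follows: the ten-fold scheme splits individual drug-side effect pairs, while the local scheme splits drugs, so that every pair of a held-out drug lands in the test set. The function names, the toy data and the fixed seeds are hypothetical, and AUC and AUPR are omitted to keep the sketch short.

```python
import math
import random

def ten_fold_splits(pairs, n_folds=10, seed=0):
    """Yield (train, test) lists obtained by splitting the pairs themselves."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]

def local_splits(pairs, n_folds=5, seed=0):
    """Yield (train, test) lists obtained by splitting the drugs."""
    drugs = sorted({drug for drug, _, _ in pairs})
    random.Random(seed).shuffle(drugs)
    drug_folds = [set(drugs[i::n_folds]) for i in range(n_folds)]
    for held_out in drug_folds:
        test = [p for p in pairs if p[0] in held_out]
        train = [p for p in pairs if p[0] not in held_out]
        yield train, test

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: (drug, side effect, frequency) triples for 10 drugs and 3 side effects.
pairs = [(f"d{i}", f"s{j}", (i + j) % 5 + 1) for i in range(10) for j in range(3)]
global_folds = list(ten_fold_splits(pairs))
local_folds = list(local_splits(pairs))
```

In the local scheme no drug appears in both the training and the test set of a fold, which simulates prediction for a new drug with no known frequency information.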

To verify the effectiveness of the proposed method (SDPred) in predicting the frequency of drug side effects, SDPred was compared with the only two existing methods for this problem, the method of Galeano et al. and MGPred. Table 1 shows the comparison of the method of Galeano et al., MGPred and SDPred. Our method improved on the second-best method, MGPred, by 1.4%, 2.0%, 7.9% and 9.4% in AUC, AUPR, RMSE and MAE, respectively. The results show that although MGPred uses a neural network, it collects information from only three perspectives and does not adequately extract the interaction information between drugs and side effects, so its performance is inferior to that of our method.

Table 1. Performance indices of the algorithms in ten-fold cross-validation

SDPred is currently the first method capable of predicting the side effect frequencies of a new drug. To verify the reliability of SDPred's side effect predictions for new drugs, SDPred was compared with the CMF, CRMF, NRLMF and TMF methods on the problem of predicting associations between new drugs and side effects, using the data set constructed by Guo et al. Table 2 shows the comparison of the 5 models; SDPred outperforms the other methods. Table 2 also shows that the AUC and AUPR values are lower than those of the ten-fold cross-validation in Table 1, because no known drug-side effect frequency relationships are available for the new drug being predicted. This indirectly illustrates the importance of known drug-side effect frequency relationships.

Table 2. Performance indices of the algorithms in local 5-fold cross-validation

To verify that SDPred can indeed predict the potential side effects of a drug and their occurrence frequencies, a case analysis was performed on one drug; the results are shown in Table 3. The case analysis shows that 8 of the top 10 unknown side effects predicted by SDPred for the drug escitalopram were found in the databases sid and PubMed, and numerous references mention these side effects of the drug, indicating that they occur relatively frequently. This further demonstrates that SDPred can help laboratory researchers discover true drug side effects and their occurrence frequencies.

Table 3. Case analysis results of SDPred for the drug escitalopram

It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.
