Pharmaceutical protein binding rate prediction method and system based on structure and grade classification

Document No.: 1629539 Publication date: 2020-01-14

Abstract: "Pharmaceutical protein binding rate prediction method and system based on structure and grade classification" was created by 相小强, 袁雅文, 张政, 常硕, 张彦春, 李治纲, 蔡卫民 and 田凌浩 on 2019-08-13. The application relates to a method and a system for predicting the drug protein binding rate based on structure and grade classification, comprising: (1) collecting data, processing the collected PPB values, and removing duplicate drug molecules; (2) dividing the drug molecules by PPB value into data sets of three grades: high-binding drugs, medium-binding drugs and low-binding drugs; (3) calculating the molecular descriptor values and performing correlation screening to select the group of molecular descriptors most relevant to the drug protein binding rate; (4) building a quantitative prediction model for each of the three grades using machine learning algorithms; (5) substituting the molecular descriptors of a drug molecule into the quantitative prediction model of the corresponding grade to predict its protein binding rate. The application can improve the accuracy of PPB prediction for high-binding drugs and addresses the low accuracy of such predictions in the prior art.

1. A method for predicting the drug protein binding rate based on structure and grade classification, characterized in that the method comprises the following steps:

(1) collecting protein binding rate data values of different drug molecules and the corresponding structure codes, and processing the collected protein binding rate data values to remove duplicate drug molecules;

(2) dividing the protein binding rate data values of the drug molecules obtained in step (1) into data sets of three grades, namely a high-binding drug data set, a medium-binding drug data set and a low-binding drug data set, and dividing each of the three data sets into a training set and a test set;

(3) calculating the molecular descriptor values of the drug molecules, encoding the molecular structures with the molecular descriptors, and performing correlation screening on the molecular descriptors to select the group of molecular descriptors most relevant to the drug protein binding rate;

(4) building a quantitative prediction model for each of the three grades using a machine learning algorithm, based on the molecular descriptors obtained in step (3);

(5) when predicting the protein binding rate of a given drug, determining the grade of its protein binding rate from its molecular descriptor parameters, and substituting the molecular descriptor parameters into the quantitative prediction model of the corresponding grade to predict the drug protein binding rate.

2. The method for predicting the drug protein binding rate based on structure and grade classification according to claim 1, wherein in step (2), a drug is assigned to the high-binding drug data set when PPB ≥ 0.8, to the medium-binding drug data set when 0.4 ≤ PPB < 0.8, and to the low-binding drug data set when PPB < 0.4.

3. The method for predicting the drug protein binding rate based on structure and grade classification according to claim 1, wherein in step (3), the molecular descriptors are calculated using PaDEL-Descriptor software.

4. The method for predicting the drug protein binding rate based on structure and grade classification according to claim 1, wherein in step (4), quantitative prediction models are built using a plurality of machine learning algorithms, and the predictions of these models are averaged to obtain an average consensus model.

5. The method according to claim 4, wherein the machine learning algorithms include random forest, boosting tree, k-nearest neighbors, support vector regression and gradient boosting regression.

6. The method for predicting the drug protein binding rate according to claim 1, wherein in step (2), each of the three graded data sets is divided into a training set and a test set at a ratio of 8:2.

7. The method for predicting the drug protein binding rate based on structure and grade classification according to claim 1, wherein step (1) comprises the following steps:

(a) processing the collected protein binding rate data values of the drug molecules, and determining a single fixed protein binding rate value for any drug molecule whose reported value is a numerical range;

(b) checking for duplicate drug molecules according to the name, structure code and properties of the drug molecules;

(c) performing simple processing of the drug molecular structures.

8. The method according to claim 7, wherein in (a), if the collected protein binding rate is given as a numerical range, the mean of the range is taken as the protein binding rate of the drug molecule;

if the collected protein binding rate is reported as greater than or less than a fixed value, the value from a more reliable data source is used as the protein binding rate of the drug molecule when such a source exists; otherwise the fixed value is used.

9. The method for predicting the drug protein binding rate based on structure and grade classification according to claim 7, wherein the method for checking duplicate drug molecules in (b) is: among duplicate drug molecules, if the PPB values are the same, the duplicates are removed; if the PPB values differ, the sources are compared and the value from the more reliable source is kept.

10. A system for predicting a drug protein binding rate, comprising:

a data processing module for processing the collected protein binding rate data values of drug molecules and removing duplicate drug molecules;

a grading module for dividing the protein binding rate data values of the drug molecules into data sets of three grades, namely a high-binding drug data set, a medium-binding drug data set and a low-binding drug data set;

a molecular descriptor calculation module for calculating the molecular descriptor values, performing correlation screening, and selecting the group of molecular descriptors most relevant to the drug protein binding rate;

a modeling module for building a quantitative prediction model for each of the three grades using a machine learning algorithm;

and a prediction module for substituting the molecular descriptors of a drug molecule into the quantitative prediction model of the corresponding grade to predict the protein binding rate of the drug molecule.

Technical Field

The application belongs to the technical field of drug design and relates to the prediction of drug protein binding rate, in particular to a method and a system for predicting drug protein binding rate based on structure and grade classification.

Background

After a drug is absorbed from the administration site into the blood, part of it binds to plasma proteins to form bound drug while the rest remains as free molecules; only the free form of the drug can exert a pharmacological effect.

The binding of a drug to plasma proteins not only influences its absorption, distribution, metabolism and excretion in vivo, but is also closely related to the strength of its pharmacological action. Studying the plasma protein binding rate of a drug therefore helps guide the design of dosing regimens and the evaluation of drug safety, and is also important for pharmaceutical research.

Drugs bind to plasma proteins to different extents, and the extent of binding affects the drug's in vivo processes (ADME), i.e. how the body handles the drug, and thus its pharmacodynamic behavior. The drug protein binding rate (PPB) is therefore an important parameter for therapeutic drug monitoring and ADME assessment.

Free drug can cross cell membranes and bind to its target, and drug binding to plasma proteins is a reversible process at equilibrium. High plasma protein binding may be associated with drug safety problems and adverse effects such as low clearance, low brain penetration, drug-drug interactions and loss of efficacy, and it can also affect the fate of enantiomers and diastereomers through stereoselective binding in vivo. Poor pharmacokinetic properties are second only to toxicity as a cause of clinical-trial failure for drug candidates. Drug design concepts based on drug-likeness and on properties emerged in the late 1990s in an attempt to address these pharmacokinetic challenges. In the overall drug design process, pharmacokinetic properties are therefore considered as important as target affinity, and a great deal of research has focused on PPB prediction.

With the development of information technology, many publications have reported methods for predicting the plasma protein binding rate. These mainly comprise ligand-based and structure-based prediction, typically using a single machine learning algorithm. Most of these methods have low accuracy for high-binding drugs, yet compared with low- and medium-binding drugs, high-binding drugs are more prone to interactions in vivo, and adverse reactions are mainly concentrated among them.

Disclosure of Invention

The technical problem to be solved by the invention is the low prediction accuracy for high-binding drugs.

To solve this technical problem, the invention provides a method for predicting the drug protein binding rate based on structure and grade classification, which can improve the prediction accuracy for high-binding drugs, reduce the risk in new drug design and development, and broaden the applicability of the prediction method.

The technical solution adopted by the invention to solve the technical problem is a method for predicting the drug protein binding rate based on structure and grade classification, comprising the following steps:

(1) collecting protein binding rate data values of different drug molecules and the corresponding structure codes, and processing the collected protein binding rate data values to remove duplicate drug molecules;

(2) dividing the protein binding rate data values of the drug molecules obtained in step (1) into data sets of three grades, namely a high-binding drug data set, a medium-binding drug data set and a low-binding drug data set, and dividing each of the three data sets into a training set and a test set;

(3) calculating the molecular descriptor values of the drug molecules, encoding the molecular structures with the molecular descriptors, and performing correlation screening on the molecular descriptors to select the group of molecular descriptors most relevant to the drug protein binding rate;

(4) building a quantitative prediction model for each of the three grades using a machine learning algorithm, based on the molecular descriptors obtained in step (3);

(5) when predicting the protein binding rate of a given drug, determining the grade of its protein binding rate from its molecular descriptor parameters, and substituting the molecular descriptor parameters into the quantitative prediction model of the corresponding grade to predict the drug protein binding rate.
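The correlation screening in step (3) can be illustrated with a minimal sketch. The source does not say which correlation measure or cut-off is used, so the Pearson coefficient, the `top_k` parameter, the descriptor names and the toy values below are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def screen_descriptors(descriptors, ppb, top_k):
    """Rank descriptor columns by |r| against PPB and keep the top_k names.

    Pearson's r and a fixed top_k are assumptions; the source only says
    that the most relevant descriptors are selected.
    """
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], ppb)),
        reverse=True,
    )
    return ranked[:top_k]

# Toy values, invented for illustration only.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "nHBDon": [3.0, 1.0, 2.0, 1.0],
    "const": [5.0, 5.0, 5.0, 5.0],
}
ppb = [0.2, 0.45, 0.7, 0.95]
selected = screen_descriptors(descriptors, ppb, top_k=2)
```

A constant descriptor carries no information about PPB, so it is scored as 0 and falls to the bottom of the ranking.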

Further preferably, in step (2) of the method for predicting the drug protein binding rate based on structure and grade classification provided by the invention, a drug is a high-binding drug when PPB ≥ 0.8, a medium-binding drug when 0.4 ≤ PPB < 0.8, and a low-binding drug when PPB < 0.4.
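The grading rule can be written as a small function. This is a sketch that assumes PPB is expressed as a fraction between 0 and 1, as the 0.8 and 0.4 thresholds suggest; the function name and the grade labels are hypothetical.

```python
def ppb_grade(ppb):
    """Assign a binding grade from a PPB value given as a fraction in [0, 1].

    The high-binding condition is checked first, so the boundary value 0.8
    is graded "high".
    """
    if not 0.0 <= ppb <= 1.0:
        raise ValueError("PPB must be a fraction between 0 and 1")
    if ppb >= 0.8:
        return "high"
    if ppb >= 0.4:
        return "medium"
    return "low"
```

For example, `ppb_grade(0.93)` gives `"high"` and `ppb_grade(0.40)` gives `"medium"`.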

Further preferably, in step (3) of the method provided by the invention, the molecular descriptors are calculated using PaDEL-Descriptor software.

Further preferably, in step (4) of the method provided by the invention, quantitative prediction models are built using a plurality of machine learning algorithms, and the predictions of these models are averaged to obtain an average consensus model.

Further preferably, in the method provided by the invention, the machine learning algorithms include random forest, boosting tree, k-nearest neighbors, support vector regression and gradient boosting regression.
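The average consensus model amounts to an unweighted mean over the individual models' predictions. The sketch below assumes each fitted model can be called on a descriptor vector; the stand-in callables and their outputs are hypothetical and replace real regressors such as a random forest or a support vector regressor.

```python
def consensus_predict(models, x):
    """Average the predictions of several fitted models for one descriptor vector."""
    if not models:
        raise ValueError("at least one model is required")
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

# Hypothetical stand-ins for fitted regressors; values are invented.
models = [lambda x: 0.90, lambda x: 0.94, lambda x: 0.92]
consensus = consensus_predict(models, [1.2, 3.4])
```

An unweighted mean is the simplest reading of "averaged"; the source does not mention weighting the models by their individual performance.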

Further preferably, in step (2) of the method provided by the invention, each of the three graded data sets is divided into a training set and a test set at a ratio of 8:2.
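An 8:2 split applied separately to each grade might look like the following sketch. Random shuffling with a fixed seed is an assumption; the source does not state how molecules are assigned to the two sets.

```python
import random

def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle one grade's records and split them into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def split_by_grade(graded):
    """Apply the same split independently to each graded data set."""
    return {grade: split_train_test(recs) for grade, recs in graded.items()}

# Toy data set: ten placeholder molecule IDs per grade.
graded = {g: [f"{g}-{i}" for i in range(10)] for g in ("high", "medium", "low")}
splits = split_by_grade(graded)
```

Splitting within each grade keeps the 8:2 ratio for every grade-specific model, even when the grades contain different numbers of molecules.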

Further preferably, in step (1) of the method provided by the invention, processing the protein binding rate data values of the drug molecules comprises the following steps:

(a) processing the collected protein binding rate data values of the drug molecules, and determining a single fixed protein binding rate value for any drug molecule whose reported value is a numerical range;

(b) checking for duplicate drug molecules according to the name, structure code and properties of the drug molecules;

(c) performing simple processing of the drug molecular structures.

Preferably, in step (a) of the method provided by the invention, if the collected protein binding rate is given as a numerical range, the mean of the range is taken as the protein binding rate of the drug molecule;

if the collected protein binding rate is reported as greater than or less than a fixed value, the value from a more reliable data source is used as the protein binding rate of the drug molecule when such a source exists; otherwise the fixed value is used.
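The two rules for non-point values can be sketched as a parser. The input format (strings such as `"0.90-0.95"` or `">0.99"`) and the `reliable_value` parameter are assumptions; the source does not describe how the raw data are recorded or how reliability is judged.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

def resolve_ppb(raw, reliable_value=None):
    """Reduce a reported PPB entry to a single number.

    - "a-b"        -> mean of the range
    - ">a" or "<a" -> reliable_value if one is supplied, else a itself
    - "a"          -> a
    """
    text = raw.strip()
    match = re.fullmatch(NUMBER + r"\s*-\s*" + NUMBER, text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2
    match = re.fullmatch(r"[<>]=?\s*" + NUMBER, text)
    if match:
        if reliable_value is not None:
            return reliable_value
        return float(match.group(1))
    return float(text)
```

For instance, `resolve_ppb("0.90-0.95")` takes the midpoint of the range, and `resolve_ppb(">0.99")` falls back to 0.99 when no better source is available.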

Further preferably, in step (b) of the method provided by the invention, duplicate drug molecules are checked as follows: among duplicate drug molecules, if the PPB values are the same, the duplicates are removed; if the PPB values differ, the sources are compared and the value from the more reliable source is kept.
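The duplicate rule can be sketched as follows. Identifying a molecule by a structure key (for example a SMILES string) and ranking sources with a numeric reliability score are both assumptions; the source says only that the more reliable source is preferred.

```python
def deduplicate(records):
    """Collapse duplicate molecules to one PPB value each.

    records: iterable of (structure_key, ppb, reliability) tuples, where a
    higher reliability score (hypothetical) marks a more trusted source.
    """
    best = {}
    for key, ppb, reliability in records:
        kept = best.get(key)
        if kept is None:
            best[key] = (ppb, reliability)
        elif ppb != kept[0] and reliability > kept[1]:
            # Conflicting values: keep the one from the more reliable source.
            best[key] = (ppb, reliability)
        # Same PPB value: the repeat is simply dropped.
    return {key: ppb for key, (ppb, _) in best.items()}

# Toy records, invented for illustration only.
records = [
    ("CCO", 0.50, 1),
    ("CCO", 0.50, 2),
    ("c1ccccc1", 0.90, 1),
    ("c1ccccc1", 0.85, 3),
]
unique = deduplicate(records)
```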

The present application also provides a system for predicting a drug protein binding rate, comprising:

a data processing module for processing the collected protein binding rate data values of drug molecules and removing duplicate drug molecules;

a grading module for dividing the protein binding rate data values of the drug molecules into data sets of three grades, namely a high-binding drug data set, a medium-binding drug data set and a low-binding drug data set;

a molecular descriptor calculation module for calculating the molecular descriptor values, performing correlation screening, and selecting the group of molecular descriptors most relevant to the drug protein binding rate;

a modeling module for building a quantitative prediction model for each of the three grades using a machine learning algorithm;

and a prediction module for substituting the molecular descriptors of a drug molecule into the quantitative prediction model of the corresponding grade to predict the protein binding rate of the drug molecule.
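The two-stage behavior of the prediction module, judging the grade first and then applying that grade's quantitative model, can be sketched as below. The source does not specify how the grade is judged from the descriptors, so `grade_classifier` is a hypothetical callable, and the toy classifier and models are stand-ins with invented outputs.

```python
def predict_ppb(descriptors, grade_classifier, grade_models):
    """Judge the grade, then apply the quantitative model for that grade."""
    grade = grade_classifier(descriptors)
    if grade not in grade_models:
        raise KeyError(f"no quantitative model for grade {grade!r}")
    return grade, grade_models[grade](descriptors)

# Hypothetical stand-ins; real ones would be trained on the graded data sets.
def toy_classifier(descriptors):
    return "high" if descriptors[0] > 3.0 else "low"

toy_models = {
    "high": lambda d: 0.95,
    "medium": lambda d: 0.60,
    "low": lambda d: 0.20,
}

grade, ppb = predict_ppb([4.2, 1.0], toy_classifier, toy_models)
```

Routing each molecule to a grade-specific model means the high-binding model is fitted only on high-binding drugs, which is where the application aims to improve accuracy.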

The beneficial effects of the invention are as follows: the method can improve the prediction accuracy for high-binding drugs, reduce the risk in new drug design and development, and broaden the applicability of the prediction method.

Drawings

The technical solution of the present application is further explained below with reference to the drawings and the embodiments.

FIG. 1 is a flow chart of a prediction method according to an embodiment of the present application;

FIGS. 2a, 2b and 2c are standard error distribution plots for PPB prediction based on molecular descriptors calculated by ADMET Predictor software, PaDEL-Descriptor software and Dragon software, respectively.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The technical solutions of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
