Method for predicting the fish bioconcentration factor (BCF) of organic compounds using a multi-parameter linear free energy relationship model

Document No.: 1393481 | Published: 2020-02-28

Abstract: This technology, "Method for predicting the fish bioconcentration factor of organic compounds using a multi-parameter linear free energy relationship model", was created by 陈景文 (Chen Jingwen), 丁蕊 (Ding Rui) and 李雪花 (Li Xuehua) on 2019-11-07. The invention belongs to the technical field of high-throughput testing strategies for the ecological risk assessment of chemicals, and discloses a method for predicting the fish bioconcentration factor (BCF) of organic compounds using a multi-parameter linear free energy relationship (pp-LFER) model. By looking up the required Abraham solute descriptors and applying the constructed pp-LFER model, the fish BCF of an organic compound can be predicted quickly and efficiently. The method is simple, fast and inexpensive, and saves manpower, material and financial resources. Modeling follows the guidelines for the construction and use of quantitative structure-activity relationship (QSAR) models and uses simple, transparent multiple linear regression, so the model is easy to understand and apply. The model has a clearly defined applicability domain and good goodness-of-fit, robustness and predictive ability; it can effectively predict the fish BCF of organic compounds within its applicability domain and provides basic data needed for the ecological risk assessment and management of compounds.

1. A method for predicting the fish bioconcentration factor (BCF) of organic compounds using a multi-parameter linear free energy relationship model, characterized by comprising the following steps:

first, collecting the logarithmic fish bioconcentration factor (logBCF) values of 510 organic compounds; dividing the logBCF values of the 510 organic compounds into a training set of 408 organic compounds and a validation set of 102 organic compounds, wherein the training-set compounds are used to construct the model and the validation-set compounds are used for external validation after the model is constructed; performing internal validation of the constructed model by the leave-one-out method;

in the model, the Abraham solute descriptors E, S, A, B and V are used as independent variables and the logBCF values of the training-set compounds as the dependent variable in a multiple linear regression analysis, giving the following linear model:

logBCF=0.954×E–0.463×S–0.739×A–2.612×B+1.955×V+2.195

$n_{tra} = 408$, $R^2_{adj,tra} = 0.839$, $RMSE_{tra} = 0.619$, $Q^2_{LOO} = 0.835$; $n_{ext} = 102$, $R^2_{adj,ext} = 0.840$, $RMSE_{ext} = 0.680$, $Q^2_{ext} = 0.848$;

wherein E is the excess molar refraction; S is the dipolarity/polarizability parameter; A and B represent respectively the hydrogen-bond proton donor capacity and proton acceptor capacity of the molecule, also called hydrogen-bond acidity and hydrogen-bond basicity; V is the McGowan molecular volume; E, S, A, B and V are also called Abraham parameters; $n_{tra}$ and $n_{ext}$ are the numbers of compounds in the training set and the validation set, respectively; $R^2_{adj}$ is the adjusted coefficient of determination; RMSE is the root mean square error; $Q^2_{LOO}$ is the leave-one-out cross-validation coefficient; $Q^2_{ext}$ is the external validation coefficient.

2. The method of claim 1, wherein the organic compounds comprise polycyclic aromatic hydrocarbons and their substituted derivatives, heterocyclic compounds and their derivatives, halogenated alkanes, halogenated alkenes, esters, ethers, ketones, alcohols, phenols, anilines, and nitro compounds.

Technical Field

The invention belongs to the technical field of high-throughput testing strategies for the ecological risk assessment of chemicals, and discloses a method for predicting the fish bioconcentration factor (BCF) of organic compounds using a multi-parameter linear free energy relationship model.

Background

Bioaccumulation is the phenomenon in which an organism accumulates an element or a poorly degradable substance from the surrounding environment and the food chain, so that its concentration in the organism exceeds its concentration in the surrounding environment. Assessing the bioaccumulation (B) of organic chemicals is one of the core steps in chemical risk management. At present, the most common index of bioaccumulation is the bioconcentration factor (BCF), defined as the ratio of the concentration of a pollutant in the organism to its concentration in the water at equilibrium.

Experimental determination is currently one way to obtain fish BCF data for compounds; the Organisation for Economic Co-operation and Development (OECD) issued the flow-through fish bioconcentration test guideline (OECD Guideline 305) in 1996. However, the experimental method takes a long time (usually 28-60 days), is expensive (the basic testing of a chemical under the European Union REACH regulation costs about 85,000 euros, and bioaccumulation is one of the most important indexes in that basic testing), and conflicts with animal protection principles (100 experimental fish are required for one test). Experimental determination alone therefore cannot meet the data needs of risk assessment and management for the large number of existing commercial chemicals, and a reliable prediction method for obtaining BCF data is needed.

Quantitative structure-activity relationship (QSAR) techniques support the precautionary principle in managing toxic and hazardous chemical pollution, can reduce or replace experiments, fill gaps in experimental data, and lower experimental costs. They are therefore widely used in the ecological risk assessment and management of toxic and hazardous chemicals around the world. In 2004 the OECD formally adopted guidelines for the development and use of QSAR models, under which a model should have: (1) a well-defined endpoint; (2) a well-defined algorithm; (3) a defined applicability domain; (4) appropriate goodness-of-fit, robustness and predictive ability; and (5) preferably, a mechanistic interpretation.

To date, many researchers have used these techniques to build predictive models of the BCF values of organic compounds. For example, the EPI Suite™ software of the U.S. Environmental Protection Agency uses a molecular fragment method to predict compound logBCF. That model collects the logBCF values of 685 compounds and divides them into four classes, namely non-ionized compounds, tin and mercury organic compounds and nitrogen-containing aromatic compounds; it then builds a Kow-based linear model for each of the four sub-datasets and introduces a set of correction parameters related to molecular structure fragments to improve the fit. The regression correlation coefficient R² of the model is 0.833, but the method is relatively cumbersome to use and not practical.

Kamlet-Taft solvatochromic parameters were used in the document "J. Chemosphere, 1993, 26: 1905-.". Although the correlation coefficient of that model is as high as 0.947, the data set used to build it is small (n = 51), its applicability domain is narrow, and the solvatochromic parameters of complex molecules are difficult to obtain, so the method is not widely used.

The document "J. Chinese Science Bulletin, 2009, 54(4): 628-634." collected fish logBCF values of 192 nonionic organic compounds, selected 4 molecular structure descriptors, mainly quantum chemical descriptors, according to LSER theory, and established a fish BCF-QSAR model of 8 compounds by the partial least squares (PLS) method. The regression correlation coefficient R² of the model is 0.868 and its predictions are good, but obtaining the quantum chemical descriptors is complicated and computationally demanding, so the method is inconvenient in practice.

The document "J. SAR QSAR Environ. Res., 2010, 7-8, (21), 671-680" selected 7 descriptors, including a hydrophobicity descriptor, hydrogen bonding and molecular topological indices, and built a model for 624 compounds with an artificial neural network (ANN); the model has no explicit expression and is difficult to interpret mechanistically.

In summary, previous models suffer from narrow applicability domains, descriptors that are difficult to obtain, and opaque algorithms; they do not fully address the requirements of the OECD guidelines and lack model validation and characterization. It is therefore necessary to construct a BCF prediction model whose data set covers many compound classes, whose algorithm is well defined, and whose descriptors are easy to obtain, so that it is convenient to apply and disseminate, and to validate and characterize the model according to the OECD guidelines.

Disclosure of Invention

The invention aims to develop an efficient, fast and simple method for predicting the BCF value of an organic compound. The method first looks up in a database the Abraham parameter values (E, S, A, B and V) required by the multi-parameter linear free energy relationship, and then uses them to predict the BCF value of the chemical. BCF measures the capacity of organisms to concentrate a chemical and is an important index of the tendency of a chemical to accumulate in organisms; a computational method for obtaining BCF quickly makes the ecological risk assessment and management of compounds more efficient.

The technical scheme of the invention is as follows:

a method for predicting the fish bioconcentration factor (BCF) of organic compounds using a multi-parameter linear free energy relationship model comprises the following steps:

first, a multi-parameter linear free energy relationship model is used to predict BCF, the model having the form:

logBCF=eE+sS+aA+bB+vV+c

wherein e, s, a, b and v are model coefficients and c is the constant term, all obtained by linear regression fitting; E is the excess molar refraction; S is the dipolarity/polarizability parameter; A and B represent respectively the hydrogen-bond proton donor capacity and proton acceptor capacity of the molecule, also called hydrogen-bond acidity and hydrogen-bond basicity; V is the McGowan molecular volume; E, S, A, B and V are also called Abraham parameters;

to use the multi-parameter linear free energy relationship model, the Abraham parameters of the compound, i.e. its E, S, A, B and V values, are first obtained by searching the UFZ-LSER database;

then, modeling is carried out with the collected logBCF values of 510 organic compounds, which are randomly divided in a 4:1 ratio into a training set of 408 compounds and a validation set of 102 compounds, wherein the training set is used to construct the prediction model and the validation set is used for external validation after modeling; the QSAR model is constructed by stepwise multiple linear regression (MLR); the finally selected model is as follows:

logBCF=0.954×E–0.463×S–0.739×A–2.612×B+1.955×V+2.195 (1)

$n_{tra} = 408$, $R^2_{adj,tra} = 0.839$, $RMSE_{tra} = 0.619$, $Q^2_{LOO} = 0.835$; $n_{ext} = 102$, $R^2_{adj,ext} = 0.840$, $RMSE_{ext} = 0.680$, $Q^2_{ext} = 0.848$.

wherein $n_{tra}$ and $n_{ext}$ are the numbers of compounds in the training set and the validation set, respectively; $R^2_{adj}$ is the adjusted coefficient of determination; RMSE is the root mean square error; $Q^2_{LOO}$ is the leave-one-out cross-validation coefficient; $Q^2_{ext}$ is the external validation coefficient;
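Once the five Abraham descriptors of a compound are known, equation (1) reduces to a single weighted sum. The following minimal Python sketch illustrates this; the function name `predict_log_bcf` and the descriptor values in the example are hypothetical and chosen only to show the arithmetic, while real values would be looked up in the UFZ-LSER database.

```python
# Coefficients and constant term of equation (1), as reported in the text.
COEFFS = {"E": 0.954, "S": -0.463, "A": -0.739, "B": -2.612, "V": 1.955}
INTERCEPT = 2.195


def predict_log_bcf(E, S, A, B, V):
    """Return logBCF from the Abraham descriptors using equation (1)."""
    return (COEFFS["E"] * E + COEFFS["S"] * S + COEFFS["A"] * A
            + COEFFS["B"] * B + COEFFS["V"] * V + INTERCEPT)


# Hypothetical descriptor values (illustration only, not a real compound).
example_log_bcf = predict_log_bcf(E=1.0, S=1.0, A=0.0, B=0.5, V=1.0)
print(f"predicted logBCF = {example_log_bcf:.3f}")
```

The signs of the coefficients show the direction of each contribution: larger E and V raise the predicted logBCF, while larger S, A and B lower it.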

the model training set contains 408 compounds, and the variance inflation factors (VIF) of the descriptors are all less than 3, indicating that there is no multicollinearity among the model variables; $R^2_{adj,tra}$ is 0.839 and $RMSE_{tra}$ is 0.619, indicating that the model fits well; $Q^2_{LOO}$ is 0.835, and the difference between $R^2$ and $Q^2$ is far less than 0.3, so the model is considered not to be overfitted and to be robust; in the external validation, the validation set contains 102 compounds, $R^2_{ext}$ is 0.853, $RMSE_{ext}$ is 0.680 and $Q^2_{ext}$ is 0.848, indicating that the model has good external predictive ability and can effectively predict BCF in fish;
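The statistics above can be computed as in the following sketch, which uses NumPy on synthetic data because the 510-compound data set is not reproduced here. It is an illustration under stated assumptions, not the authors' procedure: all descriptors are fitted at once rather than stepwise, RMSE is taken as the square root of the mean squared residual, and $Q^2_{ext}$ is referenced to the training-set mean (other definitions exist). All function names are hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares; returns [intercept, b_1, ..., b_p]."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict(beta, X):
    return beta[0] + X @ beta[1:]


def r2(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def adjusted_r2(y, y_hat, p):
    n = len(y)
    return 1.0 - (1.0 - r2(y, y_hat)) * (n - 1) / (n - p - 1)


def rmse(y, y_hat):
    # Assumption: plain root mean squared residual (no degrees-of-freedom correction).
    return np.sqrt(np.mean((y - y_hat) ** 2))


def q2_loo(X, y):
    """Leave-one-out: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta_i = fit_mlr(X[keep], y[keep])
        press += (y[i] - predict(beta_i, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def q2_ext(y_ext, y_ext_hat, y_train_mean):
    # Assumption: external Q2 referenced to the training-set mean.
    return 1.0 - np.sum((y_ext - y_ext_hat) ** 2) / np.sum((y_ext - y_train_mean) ** 2)


def vif(X):
    """VIF_j = 1 / (1 - R2_j), regressing descriptor j on the other descriptors."""
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        fitted = predict(fit_mlr(others, X[:, j]), others)
        values.append(1.0 / (1.0 - r2(X[:, j], fitted)))
    return np.array(values)


# Synthetic stand-in data: 50 "compounds", 5 descriptors, small Gaussian noise.
rng = np.random.default_rng(0)
true_beta = np.array([2.195, 0.954, -0.463, -0.739, -2.612, 1.955])
X_all = rng.uniform(0.0, 2.0, size=(50, 5))
y_all = predict(true_beta, X_all) + rng.normal(0.0, 0.1, size=50)

# Random 4:1 split into training and external validation sets.
order = rng.permutation(50)
train, ext = order[:40], order[40:]
X_tra, y_tra, X_ext, y_ext = X_all[train], y_all[train], X_all[ext], y_all[ext]

beta = fit_mlr(X_tra, y_tra)
r2_adj_tra = adjusted_r2(y_tra, predict(beta, X_tra), p=5)
rmse_tra = rmse(y_tra, predict(beta, X_tra))
q2_loo_tra = q2_loo(X_tra, y_tra)
rmse_ext = rmse(y_ext, predict(beta, X_ext))
q2_ext_val = q2_ext(y_ext, predict(beta, X_ext), y_tra.mean())
vifs = vif(X_tra)
```

Refitting the model once per left-out compound keeps the leave-one-out step easy to follow; for a 408-compound training set this is still only 408 small regressions.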

the applicability domain of the model is characterized with a Williams plot; the Williams plot (Fig. 3) is drawn from the standardized residuals δ of the organic chemicals in the model against their leverage values $h_i$; the standardized residual (δ) is calculated as:

$$\delta = \frac{y_i - \hat{y}_i}{\sqrt{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2 / (n - p - 1)}}$$

where δ is the standardized residual, $y_i$ and $\hat{y}_i$ are respectively the experimental value and the predicted value of the ith compound, n is the number of compounds in the data set, and p is the number of descriptors.
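As a small numerical illustration of the standardized residual, the sketch below assumes that δ is the raw residual divided by $\sqrt{\sum_j (y_j - \hat{y}_j)^2 / (n - p - 1)}$, which matches the symbols defined above (the original formula image is not available). The function name and the toy values are hypothetical.

```python
import numpy as np


def standardized_residuals(y, y_hat, p):
    """Residuals scaled by sqrt(sum of squared residuals / (n - p - 1))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    resid = y - y_hat
    scale = np.sqrt(np.sum(resid ** 2) / (len(y) - p - 1))
    return resid / scale


# Toy values: four "compounds" and a one-descriptor model (p = 1).
delta = standardized_residuals([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0], p=1)
print(delta)
```

In this toy case the squared residuals sum to 0.5 and n - p - 1 = 2, so the scale is 0.5 and each residual of ±0.5 maps to δ = ±1.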

$h_i$ and $h^*$ are calculated by the following formulas:

$$h_i = x_i^{T} (X^{T} X)^{-1} x_i \qquad (2)$$

$$h^* = 3(k+1)/n \qquad (3)$$

wherein $x_i$ is the descriptor vector of the ith compound; $x_i^{T}$ is the transpose of $x_i$; $X$ is the descriptor matrix of all compounds; $X^{T}$ is the transpose of $X$; $(X^{T}X)^{-1}$ is the inverse of the matrix $X^{T}X$; k is the number of variables in the model and n is the number of training set samples.
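Equations (2) and (3) can be sketched with NumPy as follows. The descriptor matrix here is synthetic and the function names are hypothetical; the sketch uses the descriptor matrix exactly as written in equation (2), and whether a column of ones for the constant term should be appended to X is left as an assumption for the reader to settle.

```python
import numpy as np


def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^{-1} x_i for every row x_i of X_query (equation 2)."""
    xtx_inv = np.linalg.inv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(k, n):
    """h* = 3(k + 1) / n (equation 3)."""
    return 3.0 * (k + 1) / n


# With k = 5 descriptors and n = 408 training compounds, as in the text:
h_star = warning_leverage(k=5, n=408)
print(round(h_star, 3))

# Synthetic descriptor matrix (20 "compounds", 5 descriptors), illustration only.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0.0, 2.0, size=(20, 5))
h_demo = leverages(X_demo, X_demo)
# A compound is usually treated as inside the domain when h_i <= h*.
inside = h_demo <= warning_leverage(k=5, n=20)
```

For k = 5 and n = 408 the warning leverage is 18/408, which rounds to the value 0.044 given for the Williams plot.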

The method can rapidly predict the logBCF value of an organic compound from its molecular structural characteristics. It is simple, fast and inexpensive, and saves the manpower, material and financial resources that experimental determination requires. The logBCF prediction method was established and validated strictly according to the OECD guidelines for the development and use of QSAR models, so its predictions can provide data support for chemical regulation and are of value for the ecological risk assessment of chemicals.

The beneficial effects of the invention are:

(1) a transparent algorithm, MLR, is used in modeling, and the prediction model is built from only 5 descriptors, so the model is simple, easy to interpret, and convenient to apply and disseminate;

(2) the model has a wide range of application, covering polycyclic aromatic hydrocarbons and their substituted derivatives, heterocyclic compounds and their derivatives, halogenated alkanes, halogenated alkenes, esters, ethers, ketones, alcohols, phenols, anilines, nitro compounds and other compounds; it can be used to predict the logBCF values of different organic compounds and provides data support for the ecological risk assessment and regulation of chemicals;

(3) the modeling process strictly follows the OECD guidelines for the construction and use of QSAR models, and the constructed model has good goodness-of-fit, robustness and predictive ability.

Drawings

Fig. 1 is a plot of measured versus predicted logBCF values for the training set (408 compounds).

Fig. 2 is a plot of measured versus predicted logBCF values for the validation set (102 compounds).

Fig. 3 is the Williams plot of the model; solid black dots represent training set compounds, open black dots represent validation set compounds, and the warning leverage value $h^*$ is 0.044.

Detailed Description

The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
