Method, system and storage medium for predicting drug-induced liver injury

Document No.: 1955340; published: 2021-12-10

Method, system and storage medium for predicting drug-induced liver injury (一种预测药物性肝损伤的方法、系统及存储介质), designed by 钟涛刘盛元何岱海王健庄子安朱利清刘守江魏巍张帆范玉铮黄垚 on 2021-09-14.

Abstract: The invention discloses a method, a system and a storage medium for predicting drug-induced liver injury, and relates to the technical field of machine learning. The method comprises the following steps: acquiring a data sample; constructing an XGBoost prediction model; inputting the data sample into the XGBoost prediction model to obtain a prediction result; and issuing an early warning according to the prediction result. The invention combines drug-induced liver injury prediction with machine learning: prediction based on the XGBoost model provides an early warning signal before drug-induced liver injury occurs in a patient, which helps clinicians adjust the medication plan in time and reduces the likelihood of drug-induced liver injury. In addition, the prediction result is evaluated with a confusion matrix, which improves the prediction accuracy of the model.

1. A method for predicting drug-induced liver injury, characterized by comprising the following steps:

acquiring a data sample;

constructing an XGBoost prediction model;

inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and issuing an early warning according to the prediction result.

2. The method of claim 1, wherein the data samples are divided into a training set and a test set.

3. The method for predicting drug-induced liver injury according to claim 2, wherein the XGBoost prediction model is constructed by the following steps:

performing stratified k-fold cross-validation, and training a single-tree XGBoost model on the training set;

obtaining a relative importance score of each feature for the single-tree XGBoost model according to the contribution of each feature in each tree of the single-tree XGBoost model;

adding the predictor variables to the single-tree XGBoost model one by one in descending order of relative importance to form candidate models;

performing stratified k-fold cross-validation to obtain the mean area under the receiver operating characteristic curve (AUC), and selecting a final model based on a backward selection method;

training the final model on the training set.

4. The method of claim 1, further comprising comparing the prediction result with the actual result, and feeding the inaccurately predicted data back into the XGBoost prediction model for further training.

5. The method of claim 1, wherein the prediction is evaluated by a confusion matrix.

6. A system for predicting drug-induced liver injury, characterized by comprising a data acquisition module, a model construction module, a prediction module and an early warning module which are connected in sequence; wherein:

the data acquisition module is used for acquiring data samples;

the model construction module is used for constructing an XGBoost prediction model;

the prediction module is used for inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and the early warning module is used for issuing an early warning according to the prediction result.

7. The system of claim 6, further comprising a result evaluation module for evaluating the result of said prediction by a confusion matrix.

8. A computer storage medium having a computer program stored thereon which, when executed by a processor, carries out the steps of the method for predicting drug-induced liver injury as claimed in any one of claims 1 to 5.

Technical Field

The invention relates to the technical field of machine learning, in particular to a method, a system and a storage medium for predicting drug-induced liver injury.

Background

Tuberculosis is a chronic infectious disease that seriously harms human health, and curing tuberculosis patients is an important measure for controlling the tuberculosis epidemic. The drugs conventionally used to treat tuberculosis at present are fixed-dose combination (FDC) antituberculosis preparations, which combine the several antituberculosis drugs of a chemotherapy regimen in fixed doses. However, antituberculosis drugs can cause drug-induced liver injury (AT-DILI). Study data show that 27% of the tuberculosis patients registered for treatment in Nanshan District between 2013 and September 2014 discontinued treatment because of drug-induced liver injury.

Drug-induced liver injury (DILI) refers to liver injury caused by a drug itself or by its metabolites; DILI may occur after drug treatment in patients with or without underlying liver disease. Drug-induced liver injury caused by antituberculosis drugs can greatly affect antituberculosis therapy and may lead to treatment failure, poor therapeutic effect or a prolonged course of treatment. In past clinical practice, even when doctors knew the causes of DILI, they could only exercise caution with patients known to have underlying liver disease or immunodeficiency. Once DILI occurred, doctors judged its severity from clinical experience and test data and intervened only after the fact, which increased the physical and economic burden on tuberculosis patients and impaired the final treatment outcome.

Machine learning (ML) is an interdisciplinary field that focuses on designing algorithms that allow computers to learn rules from data automatically and to use those rules to make predictions on unseen data. Machine learning can analyze existing disease data to find such rules and, further, the features associated with a disease, thereby effectively assisting clinical diagnosis. The prior art makes no attempt to combine drug-induced liver injury prediction with machine learning. For those skilled in the art, how to use an early warning model to predict the probability of drug-induced liver injury in pulmonary tuberculosis patients treated with antituberculosis drugs is a problem that urgently needs to be solved.

Disclosure of Invention

In view of the above, the invention provides a method, a system and a storage medium for predicting drug-induced liver injury. The invention combines drug-induced liver injury prediction with machine learning and uses an early warning model to predict the probability of drug-induced liver injury in tuberculosis patients treated with antituberculosis drugs, so that warnings are timely and accurate and patient treatment compliance and success rates can be improved as far as possible.

To achieve this purpose, the invention adopts the following technical solution. In one aspect, a method for predicting drug-induced liver injury is provided, comprising the following steps:

acquiring a data sample;

constructing an XGBoost prediction model;

inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and issuing an early warning according to the prediction result.

Optionally, the data samples are divided into a training set and a test set.

Optionally, the step of constructing the XGBoost prediction model is as follows:

performing stratified k-fold cross-validation, and training a single-tree XGBoost model on the training set;

obtaining a relative importance score of each feature for the single-tree XGBoost model according to the contribution of each feature in each tree of the single-tree XGBoost model;

adding the predictor variables to the single-tree XGBoost model one by one in descending order of relative importance to form candidate models;

performing stratified k-fold cross-validation to obtain the mean area under the receiver operating characteristic curve (AUC), and selecting a final model based on a backward selection method;

training the final model on the training set.

This technical solution has the following beneficial effects: as a prediction model, the XGBoost algorithm handles missing values well, and the high tolerance of decision trees to missing data makes the model robust when processing clinical data, so that it can be applied more conveniently in the clinical field.

Optionally, the method further comprises comparing the prediction result with the actual result, and feeding the inaccurately predicted data back into the XGBoost prediction model for further training.

Optionally, the prediction is evaluated by a confusion matrix.

This technical solution has the following beneficial effects: the prediction result is evaluated with the confusion matrix, the prediction for each patient is compared with the actual result, and the inaccurately predicted data and results are fed back into the model for further training, which improves the prediction accuracy of the model.

In another aspect, a system for predicting drug-induced liver injury is provided, comprising a data acquisition module, a model construction module, a prediction module and an early warning module which are connected in sequence; wherein:

the data acquisition module is used for acquiring data samples;

the model construction module is used for constructing an XGBoost prediction model;

the prediction module is used for inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and the early warning module is used for issuing an early warning according to the prediction result.

Optionally, the system further comprises a result evaluation module, configured to evaluate the prediction result through a confusion matrix.

Finally, a computer storage medium is provided, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for predicting drug-induced liver injury.

As the above technical solution shows, compared with the prior art, the invention discloses a method, a system and a storage medium for predicting drug-induced liver injury. Prediction is based on the XGBoost prediction model, and an early warning signal is provided before drug-induced liver injury occurs in the patient, which helps clinicians adjust the medication plan in time and reduces the likelihood of drug-induced liver injury. In addition, the prediction result is evaluated with a confusion matrix, which improves the prediction accuracy of the model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic diagram of the system structure of the present invention;

FIG. 3 is a diagram of the top 10 important variables selected by the single-tree XGBoost model of the present invention;

FIG. 4 is a diagram of a single decision tree of the model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment 1 of the invention discloses a method for predicting drug-induced liver injury which, as shown in FIG. 1, comprises the following steps:

acquiring a data sample;

constructing an XGBoost prediction model;

inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and issuing an early warning according to the prediction result.

Further, the data samples are divided into a training set and a testing set.

Specifically, in this embodiment, the data samples include:

a) Demographic and clinical data of the patient are collected, including gender, age, weight, education level, income, height, hepatitis B status and diabetes status.

b) The total cumulative dose of each antituberculosis drug taken by the patient is collected. The antituberculosis drugs to be recorded are isoniazid (INH), rifampicin (RFP), ethambutol (EMB), pyrazinamide (PZA) and streptomycin (SM). For patients who did not develop liver damage, the total amount of antituberculosis drugs taken up to the date of data collection is recorded. For patients with liver damage, the total amount of drug taken up to the time liver damage was detected is recorded. The normal range of alanine aminotransferase (ALT) is 0-35 U/L, and an ALT value above 35 U/L is judged to indicate liver damage.

c) The ALT results of liver function examinations are collected, including the ALT value when the patient first developed liver damage and the ALT value at the patient's last liver function test. In addition, the ratio of the ALT value at the last liver function test to the ALT value at the first liver damage is calculated.
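The two ALT-derived quantities in items b) and c) can be sketched as below. This is a minimal illustration, not the patent's implementation: the function names and the choice to return `None` when no liver-damage ALT value exists are assumptions; only the 0-35 U/L normal range and the ratio definition come from the text.

```python
from typing import Optional

# Upper end of the ALT normal range (U/L) stated in this embodiment.
ALT_UPPER_NORMAL = 35.0


def alt_above_normal(alt: float, upper: float = ALT_UPPER_NORMAL) -> bool:
    """Return True when an ALT value exceeds the normal range (0-35 U/L)."""
    return alt > upper


def alt_ratio(last_alt: float, first_damage_alt: Optional[float]) -> Optional[float]:
    """Ratio of the last ALT value to the ALT value at first liver damage.

    Assumption: when the patient has no recorded liver-damage ALT value
    (or it is zero), the feature is left missing (None) rather than guessed.
    """
    if first_damage_alt is None or first_damage_alt == 0:
        return None
    return last_alt / first_damage_alt
```

Leaving the ratio missing fits the description's point that tree-based models tolerate missing values, but how missing entries were actually encoded is not stated in the source.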

Further, the XGBoost prediction model is constructed by the following steps:

a) repeat stratified 10-fold cross-validation 100 times, training a single-tree XGBoost model on the training set, and obtain the relative importance score of each feature from its contribution in each tree of the model, thereby obtaining the relative importance of each input feature;

b) select the 15 most important predictors, and add the predictor variables to the model one by one in descending order of relative importance to form 15 candidate models;

c) repeat stratified 10-fold cross-validation 100 times to obtain the mean area under the receiver operating characteristic curve (AUC), and select the final model according to the AUC value using a backward selection method;

d) train the selected model on the training set to obtain the final result.
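The repeated stratified cross-validation used in steps a) and c) can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the round-robin dealing strategy and the seed handling are assumptions, and the model training that would run inside each split (XGBoost in the patent) is deliberately left out.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int, rng: random.Random) -> List[List[int]]:
    """Split sample indices into k folds that preserve the class proportions."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def repeated_stratified_cv(
    labels: Sequence[int], k: int = 10, repeats: int = 100, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for held_out in range(k):
            test_idx = sorted(folds[held_out])
            train_idx = sorted(
                i for f, fold in enumerate(folds) if f != held_out for i in fold
            )
            yield train_idx, test_idx
```

With the defaults (`k=10`, `repeats=100`) the generator yields 1000 train/test splits; a model would be fitted on each training part and scored by AUC on the held-out part, and the scores averaged.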

XGBoost is a machine learning algorithm based on a tree boosting system. It uses a sparsity-aware algorithm to handle sparse data and a weighted quantile sketch for approximate tree learning. A decision tree is a simple classifier defined by hierarchically organized binary splits, so its structure is readily interpretable; in addition, the model handles missing values well. The interpretable decision rules of the tree and its high tolerance to missing data make the model robust when processing clinical data and more convenient to apply in the clinical field.

Further, the prediction result is compared with the actual result by means of a confusion matrix, and the inaccurately predicted data are fed back into the XGBoost prediction model for further training. The confusion matrix is a contingency table that shows the relation between the predicted results and the actual results; evaluating the results in this way improves the prediction accuracy of the model.
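The comparison step can be sketched as follows: tally the four confusion-matrix cells and collect the indices of the wrongly predicted samples, which are the ones to be fed back for further training. The function names and the 0/1 label encoding are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def confusion_matrix(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count TP, TN, FP and FN for binary labels (1 = DILI, 0 = no DILI)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if truth == 1 and guess == 1:
            counts["TP"] += 1
        elif truth == 0 and guess == 0:
            counts["TN"] += 1
        elif truth == 0 and guess == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def misclassified_indices(actual: Sequence[int], predicted: Sequence[int]) -> List[int]:
    """Indices of samples whose prediction disagrees with the actual result."""
    return [i for i, (truth, guess) in enumerate(zip(actual, predicted)) if truth != guess]
```

How the misclassified samples are re-weighted or re-used during further training is not specified in the source, so that part is not sketched here.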

Furthermore, the validity of the method of the invention is verified on the real data set.

In this example, 757 tuberculosis cases registered in the HIS system of the chronic disease control hospital in Nanshan District from 2014 to 2019 were extracted. Some patients did not receive continuous treatment, or were transferred to another hospital and later returned, so that their recorded treatment times exceed the normal range and their cumulative antituberculosis doses are ambiguous. Such abnormal cases are not suitable for predicting outcomes in patients receiving conventional treatment. Therefore, 300 days was chosen as the time window according to the typical tuberculosis treatment course, and TB-DILI cases detected more than 300 days after the start of antituberculosis treatment were excluded. A total of 743 patients were finally included in the model. Patients were defined as positive DILI cases according to the American Thoracic Society standard: with hepatitis symptoms, an ALT level more than 3 times the upper limit of normal (ULN); without symptoms, an ALT level more than 5 times the ULN.
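The case definition and the time-window filter can be expressed as two small predicates. This is a sketch under stated assumptions: the ULN default of 35 U/L is taken from the ALT normal range given earlier in this embodiment, the inclusive treatment of day 300 is a guess, and the function names are hypothetical.

```python
def is_positive_dili(alt: float, has_hepatitis_symptoms: bool, uln: float = 35.0) -> bool:
    """American Thoracic Society rule as described in the text.

    ALT > 3 x ULN with hepatitis symptoms, or ALT > 5 x ULN without symptoms.
    Assumption: ULN defaults to 35 U/L, the upper end of the stated normal range.
    """
    threshold = 3 * uln if has_hepatitis_symptoms else 5 * uln
    return alt > threshold


def within_time_window(days_since_treatment_start: int, window_days: int = 300) -> bool:
    """Keep only events observed within the 300-day window.

    Assumption: day 300 itself is treated as inside the window.
    """
    return days_since_treatment_start <= window_days
```

For example, an ALT of 110 U/L counts as a positive case only when symptoms are present, because 110 exceeds 3 x 35 = 105 but not 5 x 35 = 175.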

Input data: the data include gender, age, weight, education level, income, height, hepatitis B status, diabetes status, cumulative antituberculosis drug doses and ALT test results. For patients who did not develop liver damage, the total amount of prescribed antituberculosis drugs up to the latest liver examination was collected. For patients with TB-DILI, the total amount of drug up to the time liver damage was detected was collected. In addition, the latest ALT test value before the last liver examination and the average rate of change of the last two ALT test values before the final liver function test were calculated, and the cumulative amount of each component of the combination drug ("PZA", "RFP", "EMB", "INH") was calculated separately.
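The two derived inputs in this paragraph can be sketched as below. The source does not define the "average rate of change" precisely; the sketch assumes it means the difference between the last two ALT values divided by the number of days between the tests. The record layout (a dict of per-component amounts per prescription) and the function names are likewise hypothetical.

```python
from datetime import date
from typing import Dict, Iterable, Tuple


def alt_change_rate(previous: Tuple[date, float], last: Tuple[date, float]) -> float:
    """Change in ALT per day between the last two tests (assumed definition)."""
    days = (last[0] - previous[0]).days
    if days <= 0:
        raise ValueError("the last test must be dated after the previous test")
    return (last[1] - previous[1]) / days


def cumulative_doses(prescriptions: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Sum the amount of each drug component (e.g. INH, RFP, EMB, PZA)."""
    totals: Dict[str, float] = {}
    for components in prescriptions:
        for drug, amount in components.items():
            totals[drug] = totals.get(drug, 0.0) + amount
    return totals
```

Summing per component matters for fixed-dose combinations, where one tablet contributes to several of the cumulative-dose features at once.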

Patients were divided into training and validation data sets according to when treatment started: patients admitted before April 2019 (607 patients, 186 positive cases) formed the training data set, and patients admitted from April 2019 onward (136 patients, 95 positive cases) formed the validation data set.
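A time-based split of this kind can be sketched as follows. The exact cutoff day is an assumption (the text gives only the month), and the patient record layout with an `admitted` field is hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

# Assumption: the text names only April 2019, so the first of the month is used.
CUTOFF = date(2019, 4, 1)


def split_by_admission(
    patients: List[Dict], cutoff: date = CUTOFF
) -> Tuple[List[Dict], List[Dict]]:
    """Return (training, validation): admitted before the cutoff vs. on or after it."""
    training = [p for p in patients if p["admitted"] < cutoff]
    validation = [p for p in patients if p["admitted"] >= cutoff]
    return training, validation
```

Splitting by date rather than at random means the validation patients are all later than the training patients, which is closer to how the model would be used on newly admitted cases.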

FIG. 3 shows the top 10 important variables selected by the single-tree XGBoost model; ALT is clearly the most important factor in the prediction process.

After 100 repetitions of 10-fold cross-validation training and testing, the model with four variables was found to show the maximum AUC value, as shown in Table 1. Therefore, the model with four variables (the latest ALT test value, the average rate of change of the last two ALT test values, the cumulative dose of PZA and the cumulative dose of EMB) was selected as the final model.
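The candidate-model construction and the choice of the model with the maximum mean AUC can be sketched as below. This is a simplified illustration: the details of the backward selection procedure are not given in the source, so the sketch simply takes the candidate with the highest mean AUC. The `fold_auc` callback, which would train and score a model for a given feature subset, and all names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def build_candidates(importance: Dict[str, float], top_n: int = 15) -> List[List[str]]:
    """Nested feature subsets: top 1, top 2, ..., top N by descending importance."""
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)[:top_n]
    return [ranked[:size] for size in range(1, len(ranked) + 1)]


def select_final_model(
    candidates: Sequence[List[str]],
    fold_auc: Callable[[Sequence[str]], Sequence[float]],
) -> Tuple[List[str], float]:
    """Return the candidate with the highest mean cross-validated AUC.

    Ties are resolved in favor of the smaller (earlier) candidate.
    """
    best_features: List[str] = []
    best_auc = float("-inf")
    for features in candidates:
        score = mean(fold_auc(features))
        if score > best_auc:
            best_features, best_auc = list(features), score
    return best_features, best_auc
```

In the embodiment the 15 candidates would each be scored over 100 repetitions of stratified 10-fold cross-validation, so `fold_auc` would return 1000 AUC values per candidate.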

TABLE 1

The selected final model is trained on the entire training set. A single decision tree of the model is shown in FIG. 4: the decision process starts with the most recent ALT test value, a binary decision is then made at each node of the tree, and the process ends with an output prediction (high or low risk of drug-induced liver injury).

The selected XGBoost prediction model was validated on the validation data set (136 cases). As shown in Table 2, the model correctly predicted 70 cases of drug-induced liver injury and correctly predicted 33 negative cases.
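The traversal described here can be illustrated with a tiny hand-built tree. All thresholds and the second split feature below are placeholders chosen for illustration only; they are not taken from FIG. 4 or from the trained model, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    risk: str  # "high" or "low"


@dataclass
class Node:
    feature: str
    threshold: float
    below: Union["Node", Leaf]        # branch taken when value < threshold
    at_or_above: Union["Node", Leaf]  # branch taken otherwise


def predict_risk(tree: Union[Node, Leaf], sample: Dict[str, float]) -> str:
    """Walk from the root, making one binary decision per node, until a leaf."""
    node = tree
    while isinstance(node, Node):
        value = sample[node.feature]
        node = node.below if value < node.threshold else node.at_or_above
    return node.risk


# Placeholder tree: the root tests the latest ALT value, as in the description.
# The numeric thresholds are illustrative only, not values from the patent.
EXAMPLE_TREE = Node(
    feature="latest_alt",
    threshold=40.0,
    below=Node("alt_change_rate", 1.0, below=Leaf("low"), at_or_above=Leaf("high")),
    at_or_above=Leaf("high"),
)
```

A real XGBoost model sums the leaf scores of many trees before producing a probability; the single tree here only mirrors the one-tree walk-through in the text.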

Model performance is evaluated by precision, recall, classification accuracy and balanced accuracy (the complement of the balanced error rate). The precision is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The recall is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The classification accuracy is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The balanced accuracy is:

$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

where TP is the number of true positive cases in the prediction, TN the number of true negative cases, FP the number of false positive cases, and FN the number of false negative cases. By calculation, the precision of the model is 90%, the recall is 74%, the classification accuracy is 76%, the balanced accuracy is 77%, and the F1 value is 81%.
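The reported figures can be checked with a short calculation. TP = 70 and TN = 33 are stated in the text; FN = 25 and FP = 8 are derived here from the validation set size (136 patients, 95 positive, hence 41 negative) and are therefore an inference, not values quoted from Table 2. The function name is hypothetical. Note that the 77% figure matches the mean of sensitivity and specificity.

```python
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# FN = 95 - 70 and FP = (136 - 95) - 33 are derived from the stated counts.
metrics = evaluate(tp=70, tn=33, fp=8, fn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the results are 0.90, 0.74, 0.76, 0.77 and 0.81, which agree with the percentages stated in the text.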

TABLE 2

Embodiment 2 of the invention discloses a system for predicting drug-induced liver injury which, as shown in FIG. 2, comprises a data acquisition module, a model construction module, a prediction module and an early warning module which are connected in sequence; wherein:

the data acquisition module is used for acquiring data samples;

the model construction module is used for constructing an XGBoost prediction model;

the prediction module is used for inputting the data sample into the XGBoost prediction model to obtain a prediction result;

and the early warning module is used for issuing an early warning according to the prediction result.

Further, the system also comprises a result evaluation module which is used for evaluating the prediction result through the confusion matrix.
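The sequential connection of the modules can be sketched as a small pipeline object. This is an architectural illustration only: the class name, the callable-based module interfaces, the 0.5 warning threshold and the message strings are all assumptions, and the model passed in here is a stand-in rather than an XGBoost model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]
Model = Callable[[Sample], float]  # returns a risk score between 0 and 1


@dataclass
class DiliWarningSystem:
    acquire: Callable[[], List[Sample]]          # data acquisition module
    build_model: Callable[[List[Sample]], Model]  # model construction module
    threshold: float = 0.5                        # hypothetical warning cutoff

    def run(self, new_samples: List[Sample]) -> List[str]:
        """Acquire data, build the model, predict, then issue warnings."""
        model = self.build_model(self.acquire())
        scores = [model(sample) for sample in new_samples]       # prediction module
        return [
            "warning" if score >= self.threshold else "no warning"  # early warning module
            for score in scores
        ]
```

A result evaluation module could be attached after `run` by comparing the returned warnings with the actual outcomes in a confusion matrix.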

Finally, a computer storage medium is provided, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for predicting drug-induced liver injury.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
