Device and method for constructing pressure damage risk prediction model

Document No.: 193486; Publication date: 2021-11-02

Abstract: The present technology, "一种构建压力性损伤风险预测模型的装置及方法" (Device and method for constructing pressure damage risk prediction model), was designed by 韩琳, 张红燕, 苏茜 and 蒋梦瑶 on 2021-07-29. The invention relates to a device and a method for constructing a pressure injury risk prediction model. The device comprises a processing module configured to: screen medical record data to obtain first medical record data; classify the first medical record data based on a random forest model to obtain a plurality of first risk variables causing pressure injury; perform regression on the plurality of first risk variables in the first medical record data based on a multiple logistic regression model to obtain first weight values describing the progressive relationships among the plurality of first risk variables; and divide the first medical record data based on the first weight values to form a plurality of second medical record data, and model the plurality of second medical record data with a random forest model to generate a plurality of first risk prediction models.

1. An apparatus for constructing a pressure injury risk prediction model, comprising a processing module (100), the processing module (100) being configured to:

screening medical record data to obtain first medical record data;

classifying the first medical record data based on a random forest model to obtain a plurality of first risk variables causing the pressure injury;

performing regression on a plurality of first risk variables in the first medical record data based on a multiple logistic regression model to obtain a first weight value about a progressive relation among the plurality of first risk variables;

2. The apparatus of claim 1, wherein the processing module (100) is configured to:

where the plurality of first risk prediction models are classified to generate a plurality of second risk variables, adjusting, through cross-validation of the plurality of first risk prediction models, the number of the second risk variables and a second weight value representing their degree of correlation with the occurrence of pressure injury.

3. The apparatus according to any of claims 1 or 2, wherein the processing module (100) is configured to filter the medical record data as follows:

searching the disease state in the medical record data at the time of admission, and eliminating the medical record data with pressure injury at the time of admission;

acquiring first time information of the occurrence of the pressure injury in medical record data which does not have the pressure injury during admission;

and eliminating medical record data in which no pressure injury was present at admission and the first time information is less than a first threshold, thereby obtaining the first medical record data.

4. The apparatus according to any of the preceding claims, wherein the processing module (100) is configured to construct the database in the following way:

performing module classification on medical record data and distributing a first key value pair to each module;

constructing a first hash table based on the first key-value pair;

assigning a second key-value pair to the content within the module;

and constructing a second hash table based on the second key-value pair.

5. The apparatus according to any of the preceding claims, wherein the processing module (100) is configured to:

establishing a multiple logistic regression model with the first risk variables in the first medical record data as independent variables and with whether a progressive relationship exists among the first risk variables as the dependent variable;

and acquiring progressive relationships among the plurality of first risk variables based on the multiple logistic regression model.

6. The apparatus according to any of the preceding claims, wherein the processing module (100) is configured to partition the first medical record data into a plurality of second medical record data based on the first weight as follows:

constructing a progressive relation table based on each first risk variable;

acquiring a first risk variable pair of which the first weight is smaller than a second threshold;

calculating the number of the first risk variables corresponding to the first risk variables based on the progressive relation table;

and if the number of the same first risk variables exceeds a third threshold, searching the next pair of first risk variables with the first weights smaller than the second threshold.

7. The apparatus according to any of the preceding claims, wherein the processing module (100) is configured to:

and acquiring a second risk variable and a second weight of the first risk prediction model based on the Gini coefficient as a splitting or competition rule of the random forest model, wherein the second weight is the Gini coefficient.

8. The apparatus according to any of the preceding claims, wherein in case of failure to partition the first medical record data based on the first weight, the processing module (100) is configured to:

establishing a multivariate logistic regression model by taking the first risk variables in the first medical record data as independent variables and the correlation degree between the first risk variables as dependent variables;

obtaining the correlation degree among a plurality of first risk variables based on a multiple logistic regression model;

the first medical record data is divided based on the degree of association to generate second medical record data.

9. A method for constructing a pressure injury risk prediction model is characterized by comprising the following steps:

screening medical record data to obtain first medical record data;

classifying the first medical record data based on a random forest model to obtain a plurality of first risk variables causing the pressure injury;

10. The method according to claim 9, wherein, where the plurality of first risk prediction models are classified to generate a plurality of second risk variables, the plurality of first risk prediction models are cross-validated to adjust the number of the second risk variables and the second weight value representing their degree of correlation with the occurrence of pressure injury.

Technical Field

The invention relates to the technical field of medical data processing, in particular to a device and a method for constructing a pressure injury risk prediction model.

Background

Pressure injury refers to localized damage to the skin and/or underlying soft tissue, usually over a bony prominence or related to a medical device. The injury may present as intact skin or as an open wound, and may be painful. It occurs as a result of intense and/or prolonged pressure, or pressure combined with shear. The tolerance of soft tissue to pressure and shear may also be affected by microclimate, nutrition, tissue perfusion, comorbidities, and the condition of the soft tissue. Pressure injury seriously affects the patient's quality of life, prolongs hospitalization, aggravates the patient's condition, increases the economic burden on families and society, consumes a large amount of medical resources, and can even lead to the patient's death. It has therefore become a global consensus that prevention is the most cost-effective way to deal with pressure injuries.

Risk prediction is the first step in preventing pressure injury, and the accuracy of the risk prediction result directly influences the selection and effectiveness of preventive measures.

For example, patent document CN111195180A addresses the problem that the Braden scale cannot predict pressure injuries in most individuals, and provides a system for determining a target pressure injury score and modifying a treatment plan based on it. The system includes a plurality of sensors coupled to a person support, the person support being configured to support a person on a support surface; at least one humidity sensor configured to sense a humidity level between the person and the support surface; and at least one computing device coupled to the plurality of sensors and the at least one humidity sensor. The at least one computing device includes a processor and a memory storing computer-readable and executable instructions that, when executed by the processor, cause the computing device to: receive data from the plurality of sensors and the at least one humidity sensor; obtain data from an electronic medical record associated with the person supported by the person support; calculate, based on the data from the sensors, the humidity sensor, and the electronic medical record, a pressure injury score indicating the likelihood that the person will develop a pressure injury; and modify a treatment plan for the person based on the calculated score. In particular, the pressure injury score is calculated by adjusting a baseline injury score for the facility based on the head angle of the person support, the person's activity, humidity, and the person's age and gender. The score may be computed with a non-linear regression model that constrains the probability of a pressure injury occurring to between 0 and 1, in order to determine whether the person is likely to develop a pressure injury.
The baseline pressure injury score may also be modified by weighting the data received from the various sensors. The weighting of the factors may depend, for example, on the amount of data received and the particular non-linear regression employed. Any suitable non-linear regression may be used to calculate the pressure injury score, as long as the model constrains the probability to between 0 and 1. Although that document uses a non-linear regression model to assess the likelihood of pressure injury, using non-linear regression for pressure injury prediction and for modifying the baseline pressure injury score has the following problems:

1. non-linear regression cannot solve the multicollinearity problem faced by a pressure injury risk model;

2. when no linear relationship holds between the risk factors and the risk of pressure injury, or when multiple risk factors interact, non-linear regression ignores the complex relationships among the independent variables (risk factors).

As a result, the technical solution disclosed in that patent document cannot accurately predict the occurrence of pressure injury. Its ability to assess pressure injury risk and its stability are questionable, and it is uncertain whether it generalizes to other populations or can distinguish patients who are truly at risk of pressure injury from those who are truly not.

Furthermore, on the one hand, those skilled in the art may differ in their understanding; on the other hand, the inventors studied a large number of documents and patents when making the present invention, but space does not permit listing all of their details and contents. This by no means implies that the present invention lacks these prior-art features; on the contrary, the present invention already has all the features of the prior art, and the applicant reserves the right to add related prior art to the Background section.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a device for constructing a pressure injury risk prediction model, which comprises a processing module. The processing module is configured to:

screening medical record data to obtain first medical record data;

classifying the first medical record data based on a random forest model to obtain a plurality of first risk variables causing the pressure injury;

dividing the first medical record data based on the first weight values to form a plurality of second medical record data, and modeling the plurality of second medical record data with a random forest model to generate a plurality of first risk prediction models.

Using non-linear regression for pressure injury prediction and for modifying a baseline pressure injury score has the following problems:

1. non-linear regression cannot solve the multicollinearity problem faced by a pressure injury risk model;

2. when no linear relationship holds between the risk variables and the risk of pressure injury, or when multiple risk variables interact, non-linear regression ignores the complex relationships among the risk variables.

Therefore, to address the fact that multicollinearity and interactions among the risk variables prevent non-linear regression from accurately screening out the truly effective risk variables, a random forest model can be used for regression prediction on the medical record data. However, when solving a regression prediction problem, the random forest model cannot provide continuous output. This is because a random forest model generally produces its result by an averaging method, a voting method, or a learning method. The averaging method is generally used for regression prediction: the final prediction is the average of the outputs of the individual decision trees, and the values obtained are all discrete. The voting method and the learning method output values in the same way. Consequently, in regression prediction the random forest model cannot make predictions beyond the range of the training data, and when the medical record data contain specific noise, modeling with a random forest model overfits. The invention therefore uses the random forest model to classify the medical record data, which allows the risk variables related to pressure injury, namely the first risk variables, to be screened out comprehensively.

The screened first risk variables are then modeled with a multiple logistic regression model to obtain the progressive relationships among them. Relatively isolated variables among the first risk variables are identified from these progressive relationships, and the first medical record data are divided according to the isolated variables to obtain the second medical record data. This arrangement has the following beneficial effect:

Dividing the first medical record data by the first weight values into second medical record data amounts to grouping the first medical record data by the specific noise they carry. Random forest modeling is performed after data with the same specific noise have been placed in the same group, which significantly reduces the influence of the noise, avoids overfitting, and allows the constructed risk prediction models to generalize to new medical record data.

According to a preferred embodiment, the processing module is configured to: where the plurality of first risk prediction models are classified to generate a plurality of second risk variables, adjust, through cross-validation of the plurality of first risk prediction models, the number of the second risk variables and a second weight value representing their degree of correlation with the occurrence of pressure injury.

According to a preferred embodiment, the processing module is configured to filter the medical record data as follows:

searching the disease state in the medical record data at the time of admission, and eliminating the medical record data with pressure injury at the time of admission;

acquiring first time information of the occurrence of the pressure injury in medical record data which does not have the pressure injury during admission;

According to a preferred embodiment, the processing module is configured to build the database as follows:

performing module classification on medical record data and distributing a first key value pair to each module;

constructing a first hash table based on the first key-value pair;

assigning a second key-value pair to the content within the module;

and constructing a second hash table based on the second key-value pair.
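The two-level lookup described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the module names, field names, and the use of an integer module id as the link between the two tables are all assumptions made for the example, and plain Python dictionaries stand in for the hash tables.

```python
def build_record_index(record):
    """Build a two-level hash table for one medical record.

    `record` maps a module name to a dict of field name -> value.
    The module and field names used below are hypothetical.
    """
    first_table = {}   # first hash table: module name -> module id
    second_table = {}  # second hash table: (module id, field name) -> value
    for module_id, (module_name, fields) in enumerate(record.items()):
        first_table[module_name] = module_id
        for field_name, value in fields.items():
            second_table[(module_id, field_name)] = value
    return first_table, second_table


def lookup(first_table, second_table, module_name, field_name):
    """Resolve a field through the first table, then the second table."""
    module_id = first_table[module_name]
    return second_table[(module_id, field_name)]


# Hypothetical record with two modules.
record = {
    "admission": {"pressure_injury_on_admission": False, "age": 71},
    "nursing": {"braden_score": 12},
}
first_table, second_table = build_record_index(record)
braden = lookup(first_table, second_table, "nursing", "braden_score")
```

Keeping module keys and content keys in separate tables means a query first narrows to a module and then to a field, each in expected constant time.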

According to a preferred embodiment, the processing module is configured to:

and acquiring progressive relationships among the plurality of first risk variables based on the multiple logistic regression model.

According to a preferred embodiment, the processing module is configured to divide the first medical record data based on the first weight values to form a plurality of second medical record data as follows:

constructing a progressive relation table based on each first risk variable;

acquiring a first risk variable pair of which the first weight is smaller than a second threshold;

calculating the number of the first risk variables corresponding to the first risk variables based on the progressive relation table;

and if the number of the same first risk variables exceeds a third threshold, searching the next pair of first risk variables with the first weights smaller than the second threshold.
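One possible reading of the steps above is sketched below. The text does not define the progressive relation table or the thresholds precisely, so the following are assumptions made for illustration: the table lists, for each variable, the variables linked to it with a weight at or above the second threshold; a weakly linked pair is accepted only if neither member has more linked variables than the third threshold; and all variable names and numbers are hypothetical.

```python
def find_isolated_pair(weights, second_threshold, third_threshold):
    """Return the first weakly linked pair whose members are relatively isolated.

    `weights` maps frozenset({a, b}) to the first weight of that pair.
    Returns a sorted tuple of two variable names, or None.
    """
    # Progressive relation table (assumed form): variable -> strongly linked variables.
    table = {}
    for pair, weight in weights.items():
        a, b = sorted(pair)
        table.setdefault(a, set())
        table.setdefault(b, set())
        if weight >= second_threshold:
            table[a].add(b)
            table[b].add(a)

    # Visit pairs in ascending weight order; stop once weights reach the threshold.
    ordered = sorted(weights.items(), key=lambda kv: (kv[1], sorted(kv[0])))
    for pair, weight in ordered:
        if weight >= second_threshold:
            break
        if all(len(table[v]) <= third_threshold for v in pair):
            return tuple(sorted(pair))
        # Too many linked variables for this pair: move on to the next one.
    return None


# Hypothetical first weights between risk variables.
weights = {
    frozenset({"age", "diabetes"}): 0.1,
    frozenset({"age", "icu_days"}): 0.8,
    frozenset({"age", "surgery"}): 0.7,
    frozenset({"albumin", "diabetes"}): 0.2,
    frozenset({"icu_days", "surgery"}): 0.9,
}
pair = find_isolated_pair(weights, second_threshold=0.3, third_threshold=1)
```

In this example the pair (age, diabetes) is skipped because age is strongly linked to two other variables, so the search continues to the next weakly linked pair.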

According to a preferred embodiment, the processing module is configured to:

According to a preferred embodiment, in case that the first medical record data is not divided based on the first weight, the processing module is configured to:

obtaining the correlation degree among a plurality of first risk variables based on a multiple logistic regression model;

the first medical record data is divided based on the degree of association to generate second medical record data.

The invention also provides a method for constructing a pressure injury risk prediction model, which comprises the following steps:

screening medical record data to obtain first medical record data;

classifying the first medical record data based on a random forest model to obtain a plurality of first risk variables causing the pressure injury;

According to a preferred embodiment, where the plurality of first risk prediction models are classified to generate a plurality of second risk variables, the number of the second risk variables and the second weight value representing their degree of correlation with the occurrence of pressure injury are adjusted through cross-validation of the plurality of first risk prediction models.

Drawings

FIG. 1 is a schematic block diagram of a preferred embodiment of the apparatus of the present invention;

FIG. 2 is a schematic flow chart of the steps of a preferred embodiment of the process of the present invention.

List of reference numerals

100: processing module; 200: storage module; 300: communication module

Detailed Description

The following detailed description is made with reference to the accompanying drawings.

A risk prediction model is a tool that, on the basis of multiple etiological factors, uses multi-factor analysis to predict the absolute probability that an individual currently has or will in future develop a certain disease. The pressure injury risk prediction model aims to predict the risk of pressure injury accurately, so that medical staff can take targeted measures in time. Predictive performance and consistency are the main indexes for evaluating the quality of a prediction model.

Predictive performance can be evaluated with indicators such as sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC). Sensitivity characterizes the ability of the risk prediction model to identify patients who are truly sick. Specificity characterizes its ability to exclude patients who are truly not sick. The AUC generally lies between 0.5 and 1 and is a comprehensive index of the predictive performance of the risk prediction model. A larger AUC indicates higher validity.

To further explain the AUC, the confusion matrix is introduced here. Predictions are either positive or negative; a prediction is true if it is correct and false if it is wrong. The confusion matrix therefore contains true positives, false positives, true negatives, and false negatives, as shown in Table 1.

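TABLE 1 Confusion matrix

                      Predicted 1 (sick)      Predicted 0 (disease-free)
Actual 1 (sick)       True positive (TP)      False negative (FN)
Actual 0 (healthy)    False positive (FP)     True negative (TN)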

True positives are denoted TP. The number of true positive samples is the number of truly sick patients classified as sick, i.e. the actual value is 1 and the predicted value is 1.

False positives are denoted FP. The number of false positive samples is the number of healthy patients classified as sick, i.e. the actual value is 0 and the predicted value is 1.

True negatives are denoted TN. The number of true negative samples is the number of healthy patients classified as disease-free, i.e. the actual value and the predicted value are both 0.

False negatives are denoted FN. The number of false negative samples is the number of truly sick patients classified as disease-free, i.e. the actual value is 1 and the predicted value is 0.

Sensitivity can be expressed as the true positive rate, i.e. the probability that a sick patient is classified as sick, and can be characterized by the following formula:

S_E = TP / (TP + FN)

Specificity can be expressed as the true negative rate, i.e. the probability that a healthy patient is classified as disease-free, and can be characterized by the following formula:

S_P = TN / (TN + FP)
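As a small worked example of these two measures, the sketch below counts the four confusion-matrix cells from labels and predictions and then applies the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The label vectors are made-up illustration data, and labels are assumed to be strictly 0 or 1.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = sick, 0 = healthy)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:  # a == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: share of sick patients classified as sick."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of healthy patients classified as disease-free."""
    return tn / (tn + fp)


# Made-up labels for ten patients.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
s_e = sensitivity(tp, fn)
s_p = specificity(tn, fp)
```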

AUC denotes the area under the receiver operating characteristic (ROC) curve. The vertical axis of the ROC curve is the sensitivity S_E. The horizontal axis is 1 - S_P, i.e. the false positive rate. The ROC curve is characterized by the function S_E = F(1 - S_P). The AUC is the area under the curve S_E = F(1 - S_P) within the square bounded by S_E and 1 - S_P. An AUC of 1 is the ideal case, in which no truly sick patient is misclassified as disease-free and no healthy patient is misclassified as sick. The AUC thus characterizes the discriminative power of the pressure injury risk prediction model.
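To make the area computation concrete, the following sketch builds ROC points by sweeping a decision threshold over predicted scores and integrates them with the trapezoidal rule. It is an illustration only: the scores and labels are invented, and the trapezoidal rule is one common way to compute the area, not necessarily the one intended by the applicant.

```python
def roc_points(actual, scores):
    """Return (false positive rate, sensitivity) points for a threshold sweep."""
    positives = sum(actual)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with the same score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented example: three sick (1) and three healthy (0) patients.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(actual, scores))
```

Here eight of the nine sick/healthy pairs are ranked correctly, so the area is 8/9.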

Preferably, consistency can be evaluated by goodness of fit (GOF). When the P value of the goodness-of-fit test for the risk prediction model is greater than 0.05, the model has fully extracted the information in the data and the goodness of fit is high. The P value is the probability of obtaining the observed result, or a more extreme one, assuming the null hypothesis is true.

The random forest model combines classification trees into a random forest. Randomness is used twice in constructing each decision tree: first, the training data for the tree are drawn at random from the original data by the bootstrap method; second, the explanatory variables used by the tree are drawn at random from the original feature set. A plurality of classification trees is generated in this way, and their results are then aggregated.
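The two sources of randomness can be illustrated without a machine learning library. The sketch below only draws the inputs for one tree and aggregates labels by majority vote; it does not train trees. It follows the description in the text by choosing the feature subset once per tree (many implementations instead choose it at every split), and the feature names are hypothetical.

```python
import random


def draw_tree_inputs(n_samples, feature_names, n_features_per_tree, rng):
    """Draw the training rows and explanatory variables for one decision tree."""
    # First randomness: bootstrap, i.e. sample row indices with replacement.
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Second randomness: a random subset of features, without replacement.
    features = rng.sample(feature_names, n_features_per_tree)
    return rows, features


def majority_vote(tree_predictions):
    """Aggregate per-tree class labels by simple voting."""
    counts = {}
    for label in tree_predictions:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
feature_names = ["age", "icu_days", "albumin", "diabetes", "surgery"]  # hypothetical
rows, features = draw_tree_inputs(100, feature_names, 2, rng)
vote = majority_vote([1, 0, 1])
```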

The multiple logistic regression model equation can be expressed as:

logit(P)＝β₀+β₁X₁+…+β_nX_n

where logit(·) denotes the logit function, logit(P) = ln(P / (1 - P)); P denotes the probability that the outcome occurs; n denotes the number of independent variables X₁, …, X_n; β₀ denotes the intercept; and β₁, …, β_n denote the regression coefficients.
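As a short numerical illustration of this equation, the sketch below evaluates the linear predictor and converts it to a probability with the inverse of the logit function. It assumes that P is the probability of the outcome and that logit(P) = ln(P / (1 - P)); the coefficient and input values are invented.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_probability(intercept, coefficients, x):
    """Invert logit(P) = b0 + b1*x1 + ... + bn*xn to obtain P."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


# Invented coefficients and inputs: z = -2.0 + 0.5*2.0 + 1.0*1.0 = 0.0
p = predict_probability(-2.0, [0.5, 1.0], [2.0, 1.0])
```

With a linear predictor of zero the predicted probability is 0.5, and applying `logit` to it returns the linear predictor.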

Example 1

The invention provides a device for constructing a pressure injury risk prediction model. Referring to fig. 1, the apparatus includes a processing module 100, a storage module 200, and a communication module 300.

Preferably, the Processing module 100 may be a Central Processing Unit (CPU), a general purpose Processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Graphics Processing Unit (GPU), or other Programmable logic device, transistor logic device, hardware component, or any combination thereof.

Preferably, the storage module 200 may be a magnetic disk, a hard disk, an optical disk, a removable hard disk, a solid state disk, a flash memory, etc.

Preferably, the communication module 300 is used for accessing a network and connecting devices. The devices can be sensors, memory, mobile devices, devices storing medical record data, and the like. The communication module 300 can be connected to the medical records database by wire and/or wirelessly. The medical records database can be a database of medical records kept by a hospital. The database may be configured within a server. The communication module 300 may access the internet, the internet of things, a mobile network, an Ethernet network, and the like in a wired and/or wireless manner. The communication module 300 may be an RJ-45 interface for Ethernet, a BNC interface for thin coaxial cable, an AUI interface for thick coaxial cable, an FDDI interface, an ATM interface, etc. The communication module 300 may also be a Wi-Fi module, a Bluetooth module, a Zigbee module, etc. Preferably, the communication module 300 may also be a combination of an RJ-45 interface, a BNC interface, an AUI interface for thick coaxial cable, an FDDI interface, an ATM interface, a Wi-Fi module, a Bluetooth module, and a Zigbee module.

Preferably, the processing module 100 is configured to construct the pressure injury risk prediction model according to the following steps:

screening medical record data to obtain first medical record data;

classifying the first medical record data based on a random forest model to obtain a plurality of first risk variables causing the pressure injury;

dividing the first medical record data based on the first weight values to form a plurality of second medical record data, and modeling the plurality of second medical record data with a random forest model to generate a plurality of first risk prediction models.

Using non-linear regression for pressure injury prediction and for modifying a baseline pressure injury score has the following problems:

1. non-linear regression cannot solve the multicollinearity problem faced by a pressure injury risk model;

2. when no linear relationship holds between the risk variables and the risk of pressure injury, or when multiple risk variables interact, non-linear regression ignores the complex relationships among the risk variables.

Therefore, to address the fact that multicollinearity and interactions among the risk variables prevent non-linear regression from accurately screening out the truly effective risk variables, a random forest model can be used for regression prediction on the medical record data. However, when solving a regression prediction problem, the random forest model cannot provide continuous output. This is because a random forest model generally produces its result by an averaging method, a voting method, or a learning method. The averaging method is generally used for regression prediction: the final prediction is the average of the outputs of the individual decision trees, and the values obtained are all discrete. The voting method and the learning method output values in the same way. Consequently, in regression prediction the random forest model cannot make predictions beyond the range of the training data, and when the medical record data contain specific noise, modeling with a random forest model overfits. The invention therefore uses the random forest model to classify the medical record data, which allows the risk variables related to pressure injury, namely the first risk variables, to be screened out comprehensively.

The screened first risk variables are then modeled with a multiple logistic regression model to obtain the progressive relationships among them. Relatively isolated variables among the first risk variables are identified from these progressive relationships, and the first medical record data are divided according to the isolated variables to obtain the second medical record data. This arrangement has the following beneficial effect:

Dividing the first medical record data by the first weight values into second medical record data amounts to grouping the first medical record data by the specific noise they carry. Random forest modeling is performed after data with the same specific noise have been placed in the same group, which significantly reduces the influence of the noise, avoids overfitting, and allows the constructed risk prediction models to generalize to new medical record data. For ease of understanding, pressure injury risk prediction is taken as an example below:

The first medical record data cover many different patients, both with and without pressure injury, and the pressure injury patients are themselves of different types: for example, patients who develop pressure injury after surgical treatment, patients who develop it during a long stay in an intensive care unit (ICU), and patients who develop it alongside diabetic complications. When a random forest model is used for regression prediction on the first medical record data, the set of first risk variables it covers is therefore relatively comprehensive, but for any given patient in the first medical record data it also includes first risk variables that are irrelevant to that patient. Because the output of the random forest model is a discrete variable, these irrelevant first risk variables are also taken into the calculation and act as specific noise. The random forest model learns this specific noise from the training data, the mean square error of its output becomes large, and the fitted result is a distorted, continuously fluctuating curve. In other words, overfitting occurs and the resulting pressure injury prediction model cannot be applied to new data samples. In the present invention, binary regression prediction is performed on the first risk variables with the multiple logistic regression model to obtain the first weight values of the progressive relationships among the first risk variables. The first weight values quantify these progressive relationships, so that relatively isolated first risk variables can be identified.

The degree of isolation of each first risk variable is evaluated from the first weight values, and the first medical record data are divided according to that degree of isolation to obtain the second medical record data. The records within each set of second medical record data then have similar degrees of association among their risk variables and the same or similar progressive relationships. This reduces the specific noise to a large extent, that is, it reduces the interference caused by particular first risk variables, and avoids the overfitting problem of the random forest model.

Preferably, the processing module 100 is configured to:

model each of the divided second medical record data sets to obtain a plurality of first risk prediction models, each suited to second medical record data with different characteristics. Thus, in risk prediction, the first risk variables of a patient's medical record must be identified before the record is assigned to the corresponding first risk prediction model.

However, the following problems exist in practical application:

1. risk variables of the first risk prediction model are not characterized, so that risk factors or variables which can obviously characterize the first risk prediction model cannot be obtained, and the matching of patients is inconvenient;

2. the first risk prediction model is not subjected to cross validation, the capability of resisting other irrelevant risk variables cannot be guaranteed, and the problem of poor stability may exist;

In the present invention, the first risk prediction model is classified again to obtain second risk variables, which characterize the features of that model, together with their second weights. A second weight represents the degree of correlation between a second risk variable and the occurrence of pressure injury in the first risk prediction model. In actual use, a patient's medical record data can be matched to a model according to the second weights of the second risk variables. Moreover, the number of second risk variables and their second weights are adjusted through cross validation across the different first risk prediction models, which improves the stability of the models and, on that basis, the accuracy with which the second risk variables characterize each first risk prediction model.

Preferably, the processing module 100 is configured to acquire medical record data of an external organization via the communication module 300. The external institution may be a hospital, a disease center, or an associated institution that stores patient medical records. Preferably, the processing module 100 can connect to a database of an external organization through the communication module 300 to request medical record data. The medical record data transmitted by the communication module 300 can be temporarily or permanently stored in the storage module 200. Since there are many databases of external institutions and many types of people, it is necessary to process externally accessed medical record data. The medical record data in the storage module 200 is filtered by the processing module 100.

Preferably, the processing module 100 is configured to filter the medical record data as follows:

searching the disease state in the medical record data at the time of admission, and eliminating the medical record data with pressure injury at the time of admission;

acquiring first time information of the occurrence of the pressure injury in medical record data which does not have the pressure injury during admission;

Preferably, excluding the medical record data with pressure injury at admission yields the data of patients who had no pressure injury at admission. The first time information is the time after admission at which the pressure injury occurs. Medical record data whose first time information is less than a first threshold is also excluded, thereby obtaining the first medical record data. The first threshold may be set as desired, for example 24 hours, 10 days, or 20 days. To ensure that the medical record data is valid for training, time-dependent factors must be considered. For example, the records of patients who develop a pressure injury within 24 hours of admission need to be excluded, because a pressure injury occurring shortly after admission is likely related to factors unrelated to the hospital stay.
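The screening rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Record` fields, the function name `screen_records`, and the choice of hours as the time unit are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical record layout, for illustration only.
    patient_id: str
    injury_at_admission: bool
    hours_to_injury: Optional[float]  # None: no pressure injury during the stay


def screen_records(records: List[Record], first_threshold_hours: float = 24.0) -> List[Record]:
    """Keep records without injury at admission whose injury (if any)
    occurred no earlier than the first threshold after admission."""
    kept = []
    for r in records:
        if r.injury_at_admission:
            continue  # pressure injury already present at admission
        if r.hours_to_injury is not None and r.hours_to_injury < first_threshold_hours:
            continue  # injury too soon after admission to attribute to the stay
        kept.append(r)
    return kept


records = [
    Record("p1", True, None),
    Record("p2", False, 6.0),
    Record("p3", False, 240.0),
    Record("p4", False, None),
]
first_records = screen_records(records)
print([r.patient_id for r in first_records])
```

Records of patients who never develop a pressure injury are kept here, since the first medical record data is described as containing both pressure injury and non-pressure injury patients.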

Preferably, to simplify data processing and to speed up building and training the model, the invention processes the data in the form of a heterogeneous database. Preferably, the processing module 100 is configured to build a database based on the storage module 200. The processing module 100 is configured to build the database as follows:

performing module classification on medical record data and distributing a first key value pair to each module;

constructing a first hash table based on the first key-value pair;

assigning a second key-value pair to the content within the module;

and constructing a second hash table based on the second key-value pair.

Preferably, the modules include patient basic information, laboratory examination information, medication, disease conditions, and pressure injury risk factors. The first hash table stores the modules, and the second hash table stores the specific values within each module. For example, the first key-value pair assigned to the patient basic information is A-1, and the first key-value pair assigned to the laboratory examination information is B-2. The contents of the patient basic information include, for example, sex, age, and time of admission. The second key-value pair assigned to the sex field of the patient basic information may be represented as Aa-(0,1), where 0 represents male and 1 represents female.
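As a rough illustration of the two-level hash table layout, Python dictionaries (which are hash tables) can hold both levels. The dictionary shapes, the field names, and the `decode` helper are assumptions made for this sketch; only the keys A-1, B-2, and Aa-(0,1) come from the text.

```python
# First hash table: module name -> first key-value pair (module key, index).
first_table = {
    "patient_basic_information": ("A", 1),
    "laboratory_examination_information": ("B", 2),
}

# Second hash table: module key + content key -> field name and coded values.
second_table = {
    "Aa": {"field": "sex", "codes": {0: "male", 1: "female"}},
}


def decode(second_key: str, value: int):
    """Translate a coded value stored under a second key into (field, meaning)."""
    entry = second_table[second_key]
    return entry["field"], entry["codes"][value]


print(first_table["patient_basic_information"])
print(decode("Aa", 1))
```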

Preferably, the processing module 100 is configured to:

digitizing characters in the first medical record data;

and carrying out dimension normalization processing on the digitized first medical record data. Preferably, because patient information in medical record data may not be numerical, such information needs to be converted into numerical values that the model can recognize. For example, binary, octal, or other number-base representations may be employed. The patient information may be a first risk variable, a second risk variable, or another risk variable related to pressure injury. For example, the eating condition may be coded as 0 for poor eating and 1 for normal eating. Incontinence can be coded as 1 for total control, 2 for occasional incontinence, 3 for urinary incontinence, and 4 for fecal incontinence. The skin type may be coded as 1 for normal, 2 for thin, 3 for dry, 4 for edema, 5 for moist, 6 for discolored, 7 for dehiscence (broken skin), and so on.
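The digitizing step can be sketched as a lookup against per-field code books. The field names, the `CODEBOOKS` structure, and the `digitize` function are hypothetical, and only a subset of the codes listed above is shown.

```python
# Hypothetical code books built from some of the example codes in the text.
CODEBOOKS = {
    "eating": {"poor": 0, "normal": 1},
    "skin_type": {"normal": 1, "thin": 2, "dry": 3, "edema": 4, "moist": 5},
}


def digitize(record: dict, codebooks: dict) -> dict:
    """Replace textual values with their numeric codes; pass other fields through."""
    out = {}
    for field, value in record.items():
        book = codebooks.get(field)
        out[field] = book[value] if book is not None else value
    return out


raw = {"eating": "poor", "skin_type": "edema", "age": 71}
print(digitize(raw, CODEBOOKS))
```

An unknown category raises a `KeyError` in this sketch, which makes gaps in a code book visible rather than silently producing a wrong code.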

Preferably, International System of Units (SI) conversion factors can be used in the processing. For example, converting creatinine to micromoles per liter requires multiplying by 88.4, and converting glucose to millimoles per liter requires multiplying by 0.0555. Preferably, the dimension normalization process normalizes all variables to the range 0-10: the minimum value of the variable in the medical record data is subtracted from the current value, the result is divided by the difference between the maximum and minimum values of the variable, and the quotient is then scaled up by a factor of 10. This arrangement has the following beneficial effect:

In the prior art, data for random forest models, multiple logistic regression models, support vector machine algorithms, and the like is generally normalized to the range 0-1. That approach produces many decimal fractions in subsequent computation, so the computer must perform a large number of floating-point operations, which consumes considerable computational overhead. Normalizing to the range 0-10 is intended to reduce this overhead.
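A minimal sketch of the unit conversion and the 0-10 normalization follows. The constant and function names are hypothetical, and the source unit mg/dL for both conversion factors is an assumption, since the text gives only the target units.

```python
# Conversion factors from the text; mg/dL as the source unit is an assumption.
CREATININE_MGDL_TO_UMOLL = 88.4
GLUCOSE_MGDL_TO_MMOLL = 0.0555


def normalize_0_10(values):
    """Min-max normalize a variable, then scale it to the range 0-10."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant variable: no spread to normalize, so map everything to 0.
        return [0.0 for _ in values]
    return [10.0 * (v - lo) / (hi - lo) for v in values]


ages = [40, 60, 80]
scaled = normalize_0_10(ages)
creatinine_umol_l = 1.0 * CREATININE_MGDL_TO_UMOLL
print(scaled, creatinine_umol_l)
```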

Preferably, a first risk variable is a risk variable for the occurrence of pressure injury. The first risk variables may include the hospital department, length of hospital stay, sex, age, obesity (BMI), atherosclerosis, time of surgery, medication, malnutrition, mobility, and so on. Typically, sex, age, BMI, and time of surgery are common relevant variables. The first medical record data obtained by screening usually covers a variety of conditions, and if a random forest model is used directly for regression prediction, specific noise leads to overfitting. For example, in the first medical record data, the risk variables of pressure injuries arising from surgery and those of pressure injuries arising without surgery act as noise for each other, so the output contains more discrete noise, which in turn causes overfitting.

Preferably, the processing module 100 is configured to:

acquire the progressive relationships among the plurality of first risk variables based on the multiple logistic regression model. Preferably, a first risk variable is chosen at random, and the progressive relationships between it and the other first risk variables are calculated based on the multiple logistic regression model. Preferably, the progressive relationship indicates whether first risk variable A leads to first risk variable B. Alternatively, the progressive relationship represents the probability that first risk variable A leads to first risk variable B. For example, the first risk variable obesity may lead to the first risk variable diabetes. Preferably, the progressive relationship may also be a chain in which first risk variable A leads to first risk variable B, which in turn leads to first risk variable C. For example, the first risk variable surgical procedure leads to the first risk variable bleeding volume, which leads to the first risk variable compression hemostasis time. Preferably, the first weight is the probability that a first risk variable leads to another first risk variable. When the multiple logistic regression model is used for prediction, the first weight is the predicted probability calculated by the model. Preferably, the first weight can also be represented as a pair (x, y), where x is the number of intermediate variables through which the first risk variable leads to the other first risk variable. For example, if first risk variable A leads directly to first risk variable B, then x equals 0 and y equals the predicted probability.
If first risk variable A leads to first risk variable C through first risk variable B, then x equals 1 and y equals the product of the probability that A leads to B and the probability that B leads to C.
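The (x, y) form of the first weight can be illustrated with a short helper. The function name `chain_weight` and the example probabilities are hypothetical; in the described method the per-link probabilities would come from the multiple logistic regression model.

```python
from math import prod


def chain_weight(link_probabilities):
    """Return the first weight (x, y) for a chain A -> B -> C -> ...

    link_probabilities holds one predicted probability per link, in order.
    x is the number of intermediate variables; y is the product of the
    link probabilities.
    """
    if not link_probabilities:
        raise ValueError("at least one link is required")
    x = len(link_probabilities) - 1
    y = prod(link_probabilities)
    return x, y


direct = chain_weight([0.6])        # A leads directly to B
via_b = chain_weight([0.5, 0.5])    # A leads to C through B
print(direct, via_b)
```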

Preferably, the processing module 100 is configured to divide the first medical record data based on the first weight to form a plurality of second medical record data as follows:

constructing a progressive relation table based on each first risk variable;

acquiring a first risk variable pair of which the first weight is smaller than a second threshold;

calculating the number of the first risk variables corresponding to the first risk variables based on the progressive relation table;

and if the number of the same first risk variables exceeds a third threshold, searching for the next pair of first risk variables whose first weight is smaller than the second threshold. Preferably, if the number of the same first risk variables is less than or equal to the third threshold, the first risk variable in the pair that leads to the fewest other first risk variables is selected as the isolated first risk variable. The processing module 100 is configured to select, from the first medical record data, the medical record data containing the isolated first risk variable as second medical record data. The second threshold may be chosen as a value close to zero, or set according to the first weights actually obtained. Preferably, the second threshold may be a value less than 20% of the average of the first weights. Preferably, the third threshold may be set according to the number of first risk variables involved, for example 40% of the total number of first risk variables.
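The threshold choices above can be made concrete with a small sketch. It covers only the computation of the second and third thresholds and the selection of weakly linked pairs, not the full isolation procedure. The function names, the dictionary of pairwise first weights, and the use of the 20% bound itself as the second threshold are assumptions for illustration.

```python
from statistics import mean


def thresholds(first_weights: dict, n_variables: int):
    """Second threshold: 20% of the mean first weight (used here as the bound).
    Third threshold: 40% of the number of first risk variables."""
    second = 0.2 * mean(first_weights.values())
    third = 0.4 * n_variables
    return second, third


def weak_pairs(first_weights: dict, second_threshold: float):
    """Pairs of first risk variables whose first weight is below the threshold."""
    return sorted(pair for pair, w in first_weights.items() if w < second_threshold)


# Hypothetical first weights for three variable pairs.
weights = {("A", "B"): 0.8, ("A", "C"): 0.05, ("B", "C"): 0.65}
second, third = thresholds(weights, n_variables=5)
print(second, third, weak_pairs(weights, second))
```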

Preferably, the processing module 100 is configured to:

obtain the second risk variables and second weights of the first risk prediction model, using the Gini coefficient as the splitting (competition) rule of the random forest model. Preferably, the second weight is a Gini coefficient. The second weight represents the degree of association between the second risk variable and pressure injury. The random forest algorithm draws N samples from the second medical record data by bootstrap sampling and builds a decision tree model for each of the N samples. Each decision tree consists of a root node, leaf nodes, and branches; at each node, 4 variable attributes are selected at random and the node is split on the best of these 4 features. Each tree is grown fully without pruning, and the trees together form a combined classifier. Each test sample is classified by the N decision tree models to obtain N classification results, and the final classification is determined by voting over these N results. Preferably, the Gini coefficient g(t) before grouping is expressed as follows:

$$g(t) = 1 - \sum_{j} p^{2}(j \mid t)$$

Preferably, p(j | t) represents the normalized probability that the output variable takes the j-th class at node t. When all samples at the node belong to the same class, the difference in the values of the output variable is smallest and the Gini coefficient is 0. When all classes are equally probable, the difference is largest and the Gini coefficient is also at its maximum.
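The two boundary cases described above can be checked numerically. The sketch below uses the standard Gini impurity, one minus the sum of squared class proportions at a node, which is assumed here to be the intended expression; the function name `gini` is hypothetical.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity of the class labels at a node:
    1 - sum over classes j of p(j|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure_node = gini([1, 1, 1, 1])    # all samples in one class
mixed_node = gini([0, 1, 0, 1])   # two equally likely classes
print(pure_node, mixed_node)
```

For two classes the maximum is 0.5, reached when both classes are equally likely; a pure node gives 0.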

Preferably, the classification tree measures the decrease in heterogeneity by the reduction Δg(t) of the Gini coefficient. Preferably, simple majority voting may be used to decide the final classification result. The final classification decision is as follows:

$$H(x) = \arg\max_{Y} \sum_{i} I\bigl(h_i(x) = Y\bigr)$$

where H(x) represents the combined classification model, h_i(x) represents a single decision tree classification model, Y represents the target variable, and I(·) represents the indicator function. The whole process is repeated k times. Samples that are never drawn are referred to as out-of-bag data. Preferably, the performance of the model can be measured by the mean squared residual of the predictions on the out-of-bag data.
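The simple majority vote can be sketched as follows: each tree casts one vote and the class with the most votes is returned. The stub trees and the names `majority_vote` and `forest_predict` are hypothetical stand-ins for trained decision tree models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class that receives the most votes."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Combined classifier: collect one prediction per tree, then vote."""
    return majority_vote([h(x) for h in trees])


# Hypothetical stub trees; each maps a sample to class 0 or 1.
trees = [
    lambda x: int(x["age"] > 65),
    lambda x: int(x["bmi"] > 30),
    lambda x: int(x["age"] > 70),
]
print(forest_predict(trees, {"age": 72, "bmi": 24}))
```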

Preferably, the processing module 100 is configured to perform the following steps:

calculating the out-of-bag prediction error rate $e_i$ of the i-th decision tree;

randomly permuting the order of the out-of-bag values of the j-th input variable;

re-establishing the i-th classification and regression tree and predicting the out-of-bag observations;

recalculating the prediction error $e_i^{j}$ of the i-th classification and regression tree. Preferably, $e_i^{j} - e_i$ represents the change in the prediction error of the i-th classification and regression tree caused by adding noise to the j-th input variable. Preferably, repeating the above steps yields M changes in prediction error. Their mean, $\frac{1}{M}\sum_{i=1}^{M}\bigl(e_i^{j} - e_i\bigr)$, is the average change in the overall prediction error of the random forest caused by adding noise to the j-th input variable. From this average change, an average Gini coefficient can be obtained. Preferably, the second weight may be characterized by the average Gini coefficient.
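The permutation steps above can be sketched without any machine learning library. Two simplifications are assumed and should be read as such: each tree is represented by a hypothetical prediction function, and the tree is not rebuilt after the permutation but simply re-evaluated on the permuted out-of-bag rows, which is the common form of permutation importance. All names are illustrative.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of out-of-bag rows that the tree misclassifies."""
    wrong = sum(1 for r, y in zip(rows, labels) if predict(r) != y)
    return wrong / len(rows)


def permutation_change(predict, oob_rows, oob_labels, j, rng):
    """Change in one tree's error after shuffling input variable j."""
    e_i = error_rate(predict, oob_rows, oob_labels)
    column = [r[j] for r in oob_rows]
    rng.shuffle(column)  # add noise by permuting variable j only
    permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(oob_rows, column)]
    e_ij = error_rate(predict, permuted, oob_labels)
    return e_ij - e_i


def mean_change(trees_with_oob, j, seed=0):
    """Average error change over all trees for input variable j."""
    rng = random.Random(seed)
    changes = [permutation_change(p, rows, labels, j, rng)
               for p, rows, labels in trees_with_oob]
    return sum(changes) / len(changes)


# One hypothetical tree that uses only variable 0.
tree = lambda r: int(r[0] > 0.5)
rows = [[0.1, 5], [0.9, 7], [0.2, 1], [0.8, 3]]
labels = [0, 1, 0, 1]
print(mean_change([(tree, rows, labels)], j=1))
```

Shuffling a variable the tree never uses leaves its error unchanged, so that variable receives an importance of zero.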

It should be noted that, when the first medical record data is divided based on the first weight, the first medical record data may contain few isolated first risk variables, leaving too little data for the division. Preferably, the processing module 100 is configured to:

obtaining the correlation degree among a plurality of first risk variables based on a multiple logistic regression model;

the first medical record data is divided based on the degree of association to generate second medical record data. This arrangement has the following beneficial effect:

although the isolated first risk variables cannot be accurately obtained by calculating the degree of association between the first risk variables, and thus the specific noise cannot be eliminated to the maximum extent, the risk of division failure due to a small amount of related data can be avoided by dividing through the degree of association between the first risk variables.

Preferably, a first risk variable is chosen at random, and the degree of association between it and the other first risk variables is calculated based on a multiple logistic regression model. Preferably, the degree of association may be characterized by a regression coefficient. For example, first risk variable A is randomly selected, and its regression coefficients with the other first risk variables are calculated. A regression coefficient characterizes how much another first risk variable changes when first risk variable A changes. For example, if first risk variable A changes by 1 unit and the associated first risk variable B changes by 1 unit, the degree of association is 1. If first risk variable A changes by 1 unit and the associated first risk variable B changes by 0.1 units, the degree of association is 0.1. Preferably, the first risk variables whose degree of association is greater than a fourth threshold are selected. The fourth threshold may be set based on the actual number of first risk variables and the medical record data. Preferably, the fourth threshold may be the median of the degrees of association.
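The "change per unit change" reading of the degree of association can be illustrated with an ordinary least-squares slope. This is a deliberate simplification swapped in for the sketch, not the multiple logistic regression model named in the text. The variable values and function names are hypothetical.

```python
from statistics import mean, median


def slope(xs, ys):
    """Least-squares slope: change in y per unit change in x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


# Hypothetical observations of first risk variable A and three others.
a = [1, 2, 3, 4]
others = {
    "B": [1, 2, 3, 4],
    "C": [0.1, 0.2, 0.3, 0.4],
    "D": [2, 4, 6, 8],
}
degrees = {name: slope(a, ys) for name, ys in others.items()}

# Fourth threshold: median of the degrees of association.
fourth_threshold = median(degrees.values())
selected = sorted(n for n, d in degrees.items() if d > fourth_threshold)
print(degrees, fourth_threshold, selected)
```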

averaging the data volume of the second risk variable in the second medical record data;

the second risk variable is partitioned based on the degree of association, thereby generating a plurality of third risk variables. Preferably, a second risk prediction model is generated by modeling based on the plurality of third risk variables. Preferably, each type of third risk variable obtained by the division contains the same number of second risk variables. This arrangement has the following beneficial effect:

since the generated first risk prediction model needs to have the expansion capability of incorporating new risk variables, the first risk prediction model is required to ensure the stability of the prediction when the new risk variables are incorporated. However, since the first risk prediction model is constructed based on the random forest model, if a new risk variable is included and the data size thereof is large, the output of the first risk prediction model may be inclined to the side with a large data size/data record, and thus, the prediction result can be prevented from being skewed by averaging the data size of the second risk variable in the second medical record data. In addition, if there are many associated risk variables in the second risk variables, the output of the first risk prediction model also inclines to the side of the associated risk variables, so the present invention obtains a plurality of third risk variables by the association degree division, and the plurality of third risk variables include the same number of second risk variables, so that the classification number of the risk variables is balanced, thereby avoiding the inclination of the risk prediction result.

Example 2

The invention provides a method for constructing a pressure injury risk prediction model. The method may be carried out by the apparatus of the invention and/or by other alternative components. Where no conflict or contradiction arises, the method of the present embodiment can be implemented by the apparatus provided in Example 1.

As shown in fig. 2, the method includes the following steps.

S100: screen the medical record data to obtain first medical record data. Preferably, excluding the medical record data with pressure injury at admission yields the data of patients who had no pressure injury at admission. The first time information is the time after admission at which the pressure injury occurs. The first threshold may be set as desired, for example 24 hours, 10 days, or 20 days. To ensure that the medical record data is valid for training, time-dependent factors must be considered. For example, the records of patients who develop a pressure injury within 24 hours of admission need to be excluded, because a pressure injury occurring shortly after admission is likely related to factors unrelated to the hospital stay.

performing module classification on medical record data and distributing a first key value pair to each module;

constructing a first hash table based on the first key-value pair;

assigning a second key-value pair to the content within the module;

and constructing a second hash table based on the second key-value pair.

Preferably, the characters in the first medical record data are digitized, and dimension normalization processing is carried out on the digitized first medical record data. Preferably, because patient information in medical record data may not be numerical, such information needs to be converted into numerical values that the model can recognize. For example, binary, octal, or other number-base representations may be employed. The patient information may be a first risk variable, a second risk variable, or another risk variable related to pressure injury. For example, the eating condition may be coded as 0 for poor eating and 1 for normal eating. Incontinence can be coded as 1 for total control, 2 for occasional incontinence, 3 for urinary incontinence, and 4 for fecal incontinence. The skin type may be coded as 1 for normal, 2 for thin, 3 for dry, 4 for edema, 5 for moist, 6 for discolored, 7 for dehiscence (broken skin), and so on.

S200: regress the plurality of first risk variables in the first medical record data based on the multiple logistic regression model to obtain first weights describing the progressive relationships among them. Preferably, the medical record data is searched for the disease state at admission, and medical record data with pressure injury at admission is excluded. First time information on the occurrence of pressure injury is acquired from the medical record data without pressure injury at admission. Of these, the medical record data whose first time information is less than the first threshold is excluded, thereby obtaining the first medical record data. Preferably, a first risk variable is a risk variable for the occurrence of pressure injury. The first risk variables may include the hospital department, length of hospital stay, sex, age, obesity (BMI), atherosclerosis, time of surgery, medication, malnutrition, mobility, and so on. Typically, sex, age, BMI, and time of surgery are common relevant variables. The first medical record data obtained by screening usually covers a variety of conditions, and if a random forest model is used directly for regression prediction, specific noise leads to overfitting. For example, in the first medical record data, the risk variables of pressure injuries arising from surgery and those of pressure injuries arising without surgery act as noise for each other, so the output contains more discrete noise, which in turn causes overfitting.

Preferably, the multiple logistic regression model is established with the first risk variables in the first medical record data as independent variables and with whether a progressive relationship exists between first risk variables as the dependent variable. The progressive relationships among the plurality of first risk variables are acquired based on the multiple logistic regression model. Preferably, a first risk variable is chosen at random, and the progressive relationships between it and the other first risk variables are calculated based on the multiple logistic regression model. Preferably, the progressive relationship indicates whether first risk variable A leads to first risk variable B. Alternatively, the progressive relationship represents the probability that first risk variable A leads to first risk variable B. For example, the first risk variable obesity may lead to the first risk variable diabetes. Preferably, the progressive relationship may also be a chain in which first risk variable A leads to first risk variable B, which in turn leads to first risk variable C. For example, the first risk variable surgical procedure leads to the first risk variable bleeding volume, which leads to the first risk variable compression hemostasis time. Preferably, the first weight is the probability that a first risk variable leads to another first risk variable. When the multiple logistic regression model is used for prediction, the first weight is the predicted probability calculated by the model. Preferably, the first weight can also be represented as a pair (x, y), where x is the number of intermediate variables through which the first risk variable leads to the other first risk variable.
For example, if first risk variable A leads directly to first risk variable B, then x equals 0 and y equals the predicted probability. If first risk variable A leads to first risk variable C through first risk variable B, then x equals 1 and y equals the product of the probability that A leads to B and the probability that B leads to C.

S300: the first medical record data are divided based on the first weight values to form a plurality of second medical record data, and a random forest model is adopted to model the plurality of second medical record data to generate a plurality of first risk prediction models. Preferably, the processing module 100 is configured to divide the first medical record data based on the first weight to form a plurality of second medical record data as follows:

constructing a progressive relation table based on each first risk variable;

acquiring a first risk variable pair of which the first weight is smaller than a second threshold;

calculating the number of the first risk variables corresponding to the first risk variables based on the progressive relation table;

Preferably, the second risk variables and second weights of the first risk prediction model are obtained using the Gini coefficient as the splitting (competition) rule of the random forest model. Preferably, the second weight is a Gini coefficient. The second weight represents the degree of association between the second risk variable and pressure injury. The random forest algorithm draws N samples from the second medical record data by bootstrap sampling and builds a decision tree model for each of the N samples. Each decision tree consists of a root node, leaf nodes, and branches; at each node, 4 variable attributes are selected at random and the node is split on the best of these 4 features. Each tree is grown fully without pruning, and the trees together form a combined classifier. Each test sample is classified by the N decision tree models to obtain N classification results, and the final classification is determined by voting over these N results. Preferably, the Gini coefficient g(t) before grouping is expressed as $g(t) = 1 - \sum_{j} p^{2}(j \mid t)$, where p(j | t) is the normalized probability that the output variable takes the j-th class at node t.

Preferably, the method further comprises performing the steps of:

calculating the out-of-bag prediction error rate $e_i$ of the i-th decision tree;

randomly permuting the order of the out-of-bag values of the j-th input variable;

re-establishing the i-th classification and regression tree and predicting the out-of-bag observations;

Preferably, using non-linear regression to predict pressure injury and to revise the baseline pressure injury score has the following problems:

1. non-linear regression cannot solve the multicollinearity problem faced by a pressure injury risk model;

2. non-linear regression ignores the complex relationships between risk variables when no linear relationship holds between the risk variables and the risk of pressure injury, or when multiple risk variables interact. Therefore, to address the fact that non-linear regression cannot accurately screen out the truly effective risk variables because of multicollinearity and interactions among the risk variables, a random forest model can be used for regression prediction on the medical record data. However, when solving a regression prediction problem, the random forest model cannot provide continuous output. This is because a random forest model generally produces its output by averaging, voting, or learning. Averaging is generally used for regression prediction: the final prediction is the average over the decision trees, and the values obtained are all discrete. Voting and learning output values in the same way. As a result, in regression prediction the random forest model cannot make predictions outside the range of the training data, and when specific noise is present in the medical record data, modeling with the random forest model overfits. The invention therefore first classifies the medical record data with the random forest model, which comprehensively screens out the risk variables related to pressure injury, namely the first risk variables.
The screened first risk variables are then modeled with a multiple logistic regression model to obtain the progressive relationships among them. The relatively isolated variables among the first risk variables are identified from these progressive relationships, and the first medical record data is classified according to the isolated variables to obtain the second medical record data. This arrangement has the following beneficial effect:

Classifying the first medical record data by the first weights to obtain the second medical record data is equivalent to sorting the data in the first medical record data by its specific noise. Because data with the same specific noise is placed in the same group before the random forest model is built, the influence of the noise is markedly reduced, overfitting is avoided, and the constructed risk prediction model can be generalized to new medical record data. For ease of understanding, the following is described in terms of pressure injury risk prediction:

Preferably, when the plurality of first risk prediction models are classified to generate the plurality of second risk variables, the first risk prediction models are cross-validated to adjust the number of second risk variables and the second weights, which represent their degree of correlation with the occurrence of pressure injury. This arrangement has the following beneficial effect:

However, the following problems exist in practical application:

2. the first risk prediction model is not subjected to cross validation, the capability of resisting other irrelevant risk variables cannot be guaranteed, and the problem of poor stability may exist;

obtaining the degree of association among the plurality of first risk variables based on a multiple logistic regression model;

dividing the first medical record data based on the degree of association to generate the second medical record data. The beneficial effect of this arrangement is:

It should be noted that a patient's medical record data may be of a composite type, that is, two or more first risk prediction models may fit the same medical record data. The first risk prediction models therefore need to remain associable (combinable) with one another, or need to be extensible so that new risk variables can be incorporated. Preferably, the data volume of the second risk variables in the second medical record data is averaged. The second risk variables are partitioned based on the degree of association, thereby generating a plurality of third risk variables. Preferably, each type of third risk variable obtained by the partition contains the same number of second risk variables. The beneficial effect of this arrangement is:

The present specification encompasses multiple inventive concepts, and the applicant reserves the right to file divisional applications for each of them. Paragraphs introduced by expressions such as "preferably", "according to a preferred embodiment" or "optionally" each disclose a separate concept.

It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.
