Medical text data processing method and device, computer equipment and storage medium

Document No.: 1087406 Publication date: 2020-10-20

Reading note: This technology, "Medical text data processing method and device, computer equipment and storage medium" (医疗文本数据的处理方法、装置、计算机设备和存储介质), was designed by 许水琴 on 2020-06-23. Its main content is as follows. The application relates to the technical field of artificial intelligence and provides a method, a device, computer equipment and a storage medium for processing medical text data. The method comprises: acquiring medical text data; inputting the data into a first recognition model, a second recognition model and a third recognition model respectively; predicting, through the first recognition model, the second recognition model and the third recognition model respectively, a first labeling result, a second labeling result and a third labeling result corresponding to each character in the medical text data; judging whether the first labeling result, the second labeling result and the third labeling result are the same; when the labeling results are the same, determining the first labeling result as the labeling result corresponding to the character; and extracting the named entities in the medical text data and performing payment measurement and calculation processing. The application improves the accuracy of model prediction through the prediction consistency of multiple models, and thereby improves the accuracy of named entity recognition. The scheme can be applied in the smart healthcare field, thereby promoting the construction of smart cities.

1. A method for processing medical text data, comprising the steps of:

acquiring medical text data;

inputting the medical text data into a first recognition model, a second recognition model and a third recognition model respectively; wherein the first recognition model is obtained by training a BiLSTM-CRF model based on a public data set, the second recognition model is obtained by training the BiLSTM-CRF model based on a medical field data set, and the third recognition model is obtained by training the BiLSTM-CRF model based on both the public data set and the medical field data set;

predicting a first probability that each character in the medical text data corresponds to each label through the first recognition model; predicting a second probability that each character in the medical text data corresponds to each label through the second recognition model; predicting a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as a first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as a second labeling result of the character predicted by the second recognition model, and the label with the highest third probability is used as a third labeling result of the character predicted by the third recognition model;

respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same;

if they are the same, determining the first labeling result as the labeling result corresponding to the character;

and according to the labeling result, extracting the named entity in the medical text data, and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

2. The method for processing medical text data according to claim 1, wherein the step of respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same is followed by:

if not, calculating the total probability of the character being predicted as the third labeling result according to the first probability of the first recognition model predicting the character as the third labeling result, the second probability of the second recognition model predicting the character as the third labeling result, the third probability of the third recognition model predicting the character as the third labeling result, and preset weights corresponding to the predicted results of the first recognition model, the second recognition model and the third recognition model;

judging whether the total probability is greater than a threshold value, if so, taking the third labeling result as a labeling result corresponding to the character;

and according to the labeling result, extracting the named entity in the medical text data, and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

3. The method for processing medical text data according to claim 2, wherein the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively is preceded by:

sequentially inputting the sample data in the medical field data set into the first recognition model, the second recognition model and the third recognition model for prediction to obtain a predicted labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

and calculating the ratio of the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model, and determining the preset weight corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.

4. The method for processing medical text data according to claim 1, wherein the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively is preceded by:

training a BiLSTM-CRF model based on the public data set to obtain the first recognition model, training the BiLSTM-CRF model based on the medical field data set to obtain the second recognition model, and training the BiLSTM-CRF model based on the public data set and the medical field data set to obtain the third recognition model;

randomly selecting two models from the first recognition model, the second recognition model and the third recognition model, and sequentially selecting one piece of unlabeled target data from an unlabeled data set and inputting it into the two selected models for prediction, to obtain the predicted labeling results of the two models;

and if the predicted labeling results of the two models are the same, adding the predicted labeling result to the unlabeled target data, and inputting the resulting labeled data into the remaining, unselected model for iterative training.

5. The method for processing medical text data according to claim 1, wherein the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively is preceded by:

acquiring a preset target text; wherein the target text is text data of a medical field;

adding each sample in the public data set into the target text respectively to generate a corresponding public data training text, and inputting all the generated public data training texts into the BiLSTM-CRF model in sequence for training to obtain the first recognition model;

adding each sample in the medical field data set into the target text respectively to generate a corresponding medical data training text, and inputting all the generated medical data training texts into the BiLSTM-CRF model in sequence for training to obtain the second recognition model;

and iteratively selecting one sample from the public data set and one sample from the medical field data set, adding the two samples to the target text together to generate a corresponding target data training text, and inputting all the generated target data training texts into the BiLSTM-CRF model in sequence for training to obtain the third recognition model.

6. An apparatus for processing medical text data, comprising:

a first acquisition unit configured to acquire medical text data;

a first input unit, configured to input the medical text data into a first recognition model, a second recognition model and a third recognition model respectively; wherein the first recognition model is obtained by training a BiLSTM-CRF model based on a public data set, the second recognition model is obtained by training the BiLSTM-CRF model based on a medical field data set, and the third recognition model is obtained by training the BiLSTM-CRF model based on both the public data set and the medical field data set;

a prediction unit, configured to predict a first probability that each character in the medical text data corresponds to each label through the first recognition model; predict a second probability that each character in the medical text data corresponds to each label through the second recognition model; and predict a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as a first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as a second labeling result of the character predicted by the second recognition model, and the label with the highest third probability is used as a third labeling result of the character predicted by the third recognition model;

the judging unit is used for respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same or not;

a first determining unit, configured to determine the first labeling result as a labeling result corresponding to the character if the first labeling result, the second labeling result, and the third labeling result are the same;

and the first processing unit is used for extracting the named entity in the medical text data according to the labeling result and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

7. The apparatus for processing medical text data according to claim 6, further comprising:

a first calculation unit, configured to, if the first labeling result, the second labeling result and the third labeling result are not the same, calculate the total probability of the character being predicted as the third labeling result according to the first probability of the first recognition model predicting the character as the third labeling result, the second probability of the second recognition model predicting the character as the third labeling result, the third probability of the third recognition model predicting the character as the third labeling result, and preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model;

a second determining unit, configured to determine whether the total probability is greater than a threshold, and if so, take the third labeling result as a labeling result corresponding to the character;

and the second processing unit is used for extracting the named entity in the medical text data according to the labeling result and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

8. The apparatus for processing medical text data according to claim 7, further comprising:

the second input unit is used for sequentially inputting the sample data in the medical field data set into the first recognition model, the second recognition model and the third recognition model for prediction to obtain a labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

the second calculation unit is used for respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

and the third calculating unit is used for calculating the ratio of the accuracy rates of the prediction results of the first recognition model, the second recognition model and the third recognition model and determining the preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 5.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing medical text data, a computer device, and a storage medium.

Background

The traditional payment measurement and calculation process mainly comprises the following steps: manually collecting historical data, namely the information on the first page of inpatient medical records and the expense details from different medical institutions in the implementation area over the last three years; manually entering the data into an Excel spreadsheet; manually analyzing, screening and reprocessing the Excel data; and manually screening the payment data to calculate the related index data, predict future payment standards and generate the corresponding measurement and calculation results. This conventional approach has a number of drawbacks: 1. the procedure is complicated and the results lag considerably; 2. it consumes manpower and material resources; 3. manual operation is error-prone, and different people use different calculation methods with no uniform standard, so the measurement and calculation results are inaccurate; 4. the work is hard to reuse, resulting in a large amount of repetitive labor.

Thus, tools for automated payment measurement and calculation, such as payment measurement and calculation tools based on DRG payment, are now emerging. Such a tool requires that the named entities contained in medical text data, such as hospital names, regions and departments, be identified accurately; current recognition accuracy is low, which hinders payment measurement and calculation.

Disclosure of Invention

The present application mainly aims to provide a method, an apparatus, a computer device, and a storage medium for processing medical text data, and aims to overcome the defect that named entities included in medical text data cannot be accurately identified at present.

In order to achieve the above object, the present application provides a method for processing medical text data, comprising the following steps:

acquiring medical text data;

inputting the medical text data into a first recognition model, a second recognition model and a third recognition model respectively; wherein the first recognition model is obtained by training a BiLSTM-CRF model based on a public data set, the second recognition model is obtained by training the BiLSTM-CRF model based on a medical field data set, and the third recognition model is obtained by training the BiLSTM-CRF model based on both the public data set and the medical field data set;

predicting a first probability that each character in the medical text data corresponds to each label through the first recognition model; predicting a second probability that each character in the medical text data corresponds to each label through the second recognition model; predicting a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as a first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as a second labeling result of the character predicted by the second recognition model, and the label with the highest third probability is used as a third labeling result of the character predicted by the third recognition model;

respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same;

if they are the same, determining the first labeling result as the labeling result corresponding to the character;

and according to the labeling result, extracting the named entity in the medical text data, and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.
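The consensus and extraction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: each model's output is assumed to be a list of per-character dictionaries mapping a label to its probability, the labels are assumed to follow a B/I/E/O/S scheme (begin, inside, end, outside, single), and the names `argmax_label`, `consensus_labels` and `extract_entities` are hypothetical. Characters on which the models disagree are left as `None` here.

```python
from typing import Dict, List, Optional

Probs = Dict[str, float]  # label -> probability for one character (assumed format)


def argmax_label(probs: Probs) -> str:
    """Return the label with the highest probability for one character."""
    return max(probs, key=probs.get)


def consensus_labels(
    p1: List[Probs], p2: List[Probs], p3: List[Probs]
) -> List[Optional[str]]:
    """Keep a character's label only when all three models agree, else None."""
    result: List[Optional[str]] = []
    for a, b, c in zip(p1, p2, p3):
        first, second, third = argmax_label(a), argmax_label(b), argmax_label(c)
        result.append(first if first == second == third else None)
    return result


def extract_entities(text: str, labels: List[Optional[str]]) -> List[str]:
    """Collect B..I..E spans and S singletons as named entities."""
    entities: List[str] = []
    start: Optional[int] = None  # index of the currently open B tag, if any
    for i, label in enumerate(labels):
        if label == "S":
            entities.append(text[i])
            start = None
        elif label == "B":
            start = i
        elif label == "I" and start is not None:
            continue  # still inside an open entity
        elif label == "E" and start is not None:
            entities.append(text[start:i + 1])
            start = None
        else:
            start = None  # O, None, or a malformed sequence closes the span
    return entities
```

The extracted strings would then be passed to the payment measuring and calculating tool, whose interface is not described in this passage.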

Further, after the step of respectively determining whether the first labeling result, the second labeling result, and the third labeling result corresponding to each character are the same, the method includes:

if not, calculating the total probability of the character being predicted as the third labeling result according to the first probability of the first recognition model predicting the character as the third labeling result, the second probability of the second recognition model predicting the character as the third labeling result, the third probability of the third recognition model predicting the character as the third labeling result, and preset weights corresponding to the predicted results of the first recognition model, the second recognition model and the third recognition model;

judging whether the total probability is greater than a threshold value, if so, taking the third labeling result as a labeling result corresponding to the character;

and according to the labeling result, extracting the named entity in the medical text data, and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.
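The fallback for disagreeing models can be sketched as follows. The text above does not give the exact formula for the total probability; this sketch assumes a weighted sum of the three models' probabilities for the third labeling result. The function name `resolve_disagreement` and the dictionary format (label to probability) are hypothetical.

```python
from typing import Dict, Optional, Sequence


def resolve_disagreement(
    p1: Dict[str, float],
    p2: Dict[str, float],
    p3: Dict[str, float],
    weights: Sequence[float],
    threshold: float,
) -> Optional[str]:
    """Accept the third model's label if its weighted total probability exceeds the threshold."""
    candidate = max(p3, key=p3.get)  # the third labeling result
    # Assumed combination rule: weighted sum of each model's probability for the candidate.
    total = (
        weights[0] * p1[candidate]
        + weights[1] * p2[candidate]
        + weights[2] * p3[candidate]
    )
    return candidate if total > threshold else None
```

For example, with probabilities 0.4, 0.7 and 0.8 for the candidate label and weights 0.2, 0.3 and 0.5, the total is 0.69, so the label is accepted at a threshold of 0.5 and rejected at a threshold of 0.7.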

Further, before the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively, the method includes:

sequentially inputting the sample data in the medical field data set into the first recognition model, the second recognition model and the third recognition model for prediction to obtain a predicted labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

and calculating the ratio of the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model, and determining the preset weight corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.
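The weight determination above can be sketched as follows. The passage says only that the weights are determined according to the ratio of the accuracies; this sketch assumes that accuracy is the fraction of characters labeled correctly and that each weight is the model's accuracy divided by the sum of all three. Both function names are hypothetical.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of characters whose predicted label matches the correct label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def weights_from_accuracies(accuracies: Sequence[float]) -> List[float]:
    """Normalize the accuracies so the weights keep their ratio and sum to 1 (assumed rule)."""
    total = sum(accuracies)
    return [a / total for a in accuracies]
```

Under this assumption, accuracies of 0.6, 0.9 and 0.5 give weights of 0.3, 0.45 and 0.25.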

Further, before the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively, the method includes:

training a BiLSTM-CRF model based on the public data set to obtain the first recognition model, training the BiLSTM-CRF model based on the medical field data set to obtain the second recognition model, and training the BiLSTM-CRF model based on the public data set and the medical field data set to obtain the third recognition model;

randomly selecting two models from the first recognition model, the second recognition model and the third recognition model, and sequentially selecting one piece of unlabeled target data from an unlabeled data set and inputting it into the two selected models for prediction, to obtain the predicted labeling results of the two models;

and if the predicted labeling results of the two models are the same, adding the predicted labeling result to the unlabeled target data, and inputting the resulting labeled data into the remaining, unselected model for iterative training.
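The iterative training procedure above can be sketched as follows. The `Tagger` interface with `predict` and `fit_one` methods is a hypothetical stand-in for a BiLSTM-CRF model, and the sketch assumes that the two models are re-drawn at random for each piece of unlabeled data; the passage does not state whether the selection is made once or per item.

```python
import random
from typing import List, Protocol, Sequence


class Tagger(Protocol):
    """Hypothetical interface for a trained recognition model."""

    def predict(self, text: str) -> List[str]: ...

    def fit_one(self, text: str, labels: List[str]) -> None: ...


def co_label(models: Sequence[Tagger], unlabeled: Sequence[str], rng: random.Random) -> int:
    """Use agreement between two models to create training data for the third.

    Expects exactly three models. Returns how many texts were pseudo-labeled.
    """
    used = 0
    for text in unlabeled:
        i, j = rng.sample(range(3), 2)  # the two randomly selected models
        k = 3 - i - j                   # index of the unselected model
        predicted = models[i].predict(text)
        if predicted == models[j].predict(text):
            # Agreement: treat the shared prediction as the label and train the third model.
            models[k].fit_one(text, predicted)
            used += 1
    return used
```

Texts on which the two selected models disagree are simply skipped in this sketch.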

Further, before the step of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively, the method includes:

acquiring a preset target text; wherein the target text is text data of a medical field;

adding each sample in the public data set into the target text respectively to generate a corresponding public data training text, and inputting all the generated public data training texts into the BiLSTM-CRF model in sequence for training to obtain the first recognition model;

adding each sample in the medical field data set into the target text respectively to generate a corresponding medical data training text, and inputting all the generated medical data training texts into the BiLSTM-CRF model in sequence for training to obtain the second recognition model;

and iteratively selecting one sample from the public data set and one sample from the medical field data set, adding the two samples to the target text together to generate a corresponding target data training text, and inputting all the generated target data training texts into the BiLSTM-CRF model in sequence for training to obtain the third recognition model.

The present application also provides a processing apparatus for medical text data, including:

a first acquisition unit configured to acquire medical text data;

a first input unit, configured to input the medical text data into a first recognition model, a second recognition model and a third recognition model respectively; wherein the first recognition model is obtained by training a BiLSTM-CRF model based on a public data set, the second recognition model is obtained by training the BiLSTM-CRF model based on a medical field data set, and the third recognition model is obtained by training the BiLSTM-CRF model based on both the public data set and the medical field data set;

a prediction unit, configured to predict a first probability that each character in the medical text data corresponds to each label through the first recognition model; predict a second probability that each character in the medical text data corresponds to each label through the second recognition model; and predict a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as a first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as a second labeling result of the character predicted by the second recognition model, and the label with the highest third probability is used as a third labeling result of the character predicted by the third recognition model;

the judging unit is used for respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same or not;

a first determining unit, configured to determine the first labeling result as a labeling result corresponding to the character if the first labeling result, the second labeling result, and the third labeling result are the same;

and the first processing unit is used for extracting the named entity in the medical text data according to the labeling result and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

Further, the processing device of the medical text data further comprises:

a first calculation unit, configured to, if the first labeling result, the second labeling result and the third labeling result are not the same, calculate the total probability of the character being predicted as the third labeling result according to the first probability of the first recognition model predicting the character as the third labeling result, the second probability of the second recognition model predicting the character as the third labeling result, the third probability of the third recognition model predicting the character as the third labeling result, and preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model;

a second determining unit, configured to determine whether the total probability is greater than a threshold, and if so, take the third labeling result as a labeling result corresponding to the character;

and the second processing unit is used for extracting the named entity in the medical text data according to the labeling result and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing.

Further, the processing device of the medical text data further comprises:

the second input unit is used for sequentially inputting the sample data in the medical field data set into the first recognition model, the second recognition model and the third recognition model for prediction to obtain a labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

the second calculation unit is used for respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

and the third calculating unit is used for calculating the ratio of the accuracy rates of the prediction results of the first recognition model, the second recognition model and the third recognition model and determining the preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.

Further, the processing device of medical text data further comprises:

a pre-training unit, configured to train a BiLSTM-CRF model based on the public data set to obtain the first recognition model, train the BiLSTM-CRF model based on the medical field data set to obtain the second recognition model, and train the BiLSTM-CRF model based on the public data set and the medical field data set to obtain the third recognition model;

a selection unit, configured to randomly select two models from the first recognition model, the second recognition model and the third recognition model, and sequentially select one piece of unlabeled target data from an unlabeled data set and input it into the two selected models for prediction, to obtain the predicted labeling results of the two models;

and an iterative training unit, configured to, if the predicted labeling results of the two models are the same, add the predicted labeling result to the unlabeled target data and input the resulting labeled data into the remaining, unselected model for iterative training.

Further, the processing device of medical text data further comprises:

the second acquisition unit is used for acquiring a preset target text; wherein the target text is text data of a medical field;

a first training unit, configured to add each sample in the public data set into the target text respectively to generate a corresponding public data training text, and input all the generated public data training texts into the BiLSTM-CRF model in sequence for training to obtain the first recognition model;

a second training unit, configured to add each sample in the medical field data set into the target text respectively to generate a corresponding medical data training text, and input all the generated medical data training texts into the BiLSTM-CRF model in sequence for training to obtain the second recognition model;

and a third training unit, configured to iteratively select one sample from the public data set and one sample from the medical field data set, add the two samples to the target text together to generate a corresponding target data training text, and input all the generated target data training texts into the BiLSTM-CRF model in sequence for training to obtain the third recognition model.

The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.

The application provides a medical text data processing method, a medical text data processing device, computer equipment and a storage medium, wherein the method comprises: acquiring medical text data; inputting the medical text data into a first recognition model, a second recognition model and a third recognition model respectively, wherein the training samples of the three models are different; predicting a first probability that each character in the medical text data corresponds to each label through the first recognition model; predicting a second probability that each character in the medical text data corresponds to each label through the second recognition model; predicting a third probability that each character in the medical text data corresponds to each label through the third recognition model; respectively judging whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same; when the labeling results are the same, determining the first labeling result as the labeling result corresponding to the character; and, according to the labeling result, extracting the named entity in the medical text data and inputting the named entity into a payment measuring and calculating tool for payment measuring and calculating processing. In the present application, the accuracy of model prediction is improved through the prediction consistency of a plurality of models, so that the accuracy of named entity recognition is improved and the payment measuring and calculating tool can perform accurate measurement and calculation.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a method for processing medical text data according to an embodiment of the present application;

FIG. 2 is a block diagram of a processing apparatus for medical text data according to an embodiment of the present application;

FIG. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to FIG. 1, an embodiment of the present application provides a method for processing medical text data, including the following steps:

step S1, acquiring medical text data;

step S2, inputting the medical text data into a first recognition model, a second recognition model, and a third recognition model, respectively; the first identification model is obtained by training a BiLSTM-CRF model based on a public data set, the second identification model is obtained by training a BiLSTM-CRF model based on a medical field data set, and the third identification model is obtained by training the BiLSTM-CRF model based on the public data set and the medical field data set;

step S3, predicting a first probability that each character in the medical text data corresponds to each label through the first recognition model; predicting a second probability that each character in the medical text data corresponds to each label through the second recognition model; predicting a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as the first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as the second labeling result of the character predicted by the second recognition model, the label with the highest third probability is used as the third labeling result of the character predicted by the third recognition model, and the labels are B, I, E, O and S;

step S4, determining whether the first labeling result, the second labeling result and the third labeling result corresponding to each character are the same;

step S5, if the first labeling result, the second labeling result and the third labeling result are the same, determining the first labeling result as the labeling result corresponding to the character;

and step S6, extracting the named entities in the medical text data according to the labeling result, and inputting the named entities into a payment calculation tool for payment calculation processing.

In this embodiment, the method is applied to a smart medical scenario of a smart city, so as to promote the construction of smart cities. In particular, the method can be applied to the medical information scenario of digital healthcare. In payment calculation in the current medical scenario, the data acquired in the data acquisition stage is usually medical text data from each medical institution, typically the front-page information of the medical records and the expense details. This data contains many named entities, such as medical institution names, department names, names of attending physicians, locations of medical institutions, and drug names in the expense details. When the DRG (diagnosis-related group) payment calculation tool performs payment calculation based on the medical text data, each named entity in the medical text data must be identified for classification. Therefore, after the medical text data is input into the system, named entity recognition is performed first.

Specifically, as described in the above step S1, the medical text data may be obtained from electronic systems of various medical institutions, and is a text file in which a large amount of medical information is recorded.

As described in step S2, three usable recognition models, namely a first recognition model, a second recognition model and a third recognition model, are trained in advance. All three are obtained by training a BiLSTM-CRF model; they differ in the training samples used, and models trained on different samples also differ in their prediction results.

The public data set is a large, publicly available collection of data with named entity labels; its data volume is large, its sources are broad, and it is easy to obtain. The first recognition model, obtained by training the BiLSTM-CRF model on the public data set, is therefore more robust because of the larger volume of training samples.

Because the same word can have different meanings in different fields, named entities must be labeled specifically for each field to obtain training samples. The medical field data set is a data set whose named entities are labeled specifically for the medical field; it is highly domain-specific but small. The second recognition model, obtained by training the BiLSTM-CRF model on the medical field data set, therefore has strong domain-specific recognition capability for named entities in the medical field, but poorer robustness.

The third recognition model is obtained by training the BiLSTM-CRF model on both the public data set and the medical field data set. Because its training samples come from both data sets, the third recognition model has both strong robustness and strong domain-specific recognition capability, which improves the generalization capability of the model.

In this embodiment, the medical text data is input into the first recognition model, the second recognition model and the third recognition model respectively for prediction. The prediction result of each model is the probability that each character in the medical text data corresponds to each label; the label with the maximum probability is taken as the label of the character. The labels are B, I, E, O and S: B represents the beginning of an entity, I the inside of an entity, E the end of an entity, O a non-entity, and S a single-character entity. For example, if a medical text reads "cef, 25 yuan per box" in the original Chinese, its characters may be labeled in sequence as head-B, spore-I, 2-I, 5-I, yuan-E, per-O, box-O (the glosses "head" and "spore" render, character by character, the two Chinese characters of the drug name translated here as "cef"); the characters from label B through label E are then combined into a whole, which is the named entity extracted from the text. In the medical field, single-character named entities are not usually used, so entities labeled S need not be extracted in this embodiment.
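As a minimal sketch of how a label sequence can be turned into named entities, the following Python function joins the characters from a B label through the next E label and skips S labels, as described for this embodiment. The function name `extract_entities` is hypothetical, and the handling of malformed sequences (for example, a B that is never closed by an E) is an assumption: such spans are simply discarded.

```python
from typing import List


def extract_entities(chars: List[str], labels: List[str]) -> List[str]:
    """Join characters from each B label through the matching E label."""
    entities: List[str] = []
    current: List[str] = []  # characters of the entity currently being built
    for ch, label in zip(chars, labels):
        if label == "B":
            current = [ch]              # start a new entity
        elif label == "I" and current:
            current.append(ch)          # continue the open entity
        elif label == "E" and current:
            current.append(ch)          # close the open entity
            entities.append("".join(current))
            current = []
        else:
            current = []                # O, S, or a malformed sequence: drop any open span
    return entities
```

For a seven-character text `list("ab25cde")` labeled `["B", "I", "I", "I", "E", "O", "O"]`, the function returns `["ab25c"]`.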

The first recognition model, the second recognition model and the third recognition model are integrated with the same word embedding model, such as the commonly used word2vec model, so as to construct word vectors for the medical text data.

As described in step S3, the first recognition model, the second recognition model and the third recognition model are different from each other and have different attention dimensions in the medical text data, so that the predicted results may be different from each other.

As described in the foregoing steps S4-S5, whether the first annotation result, the second annotation result, and the third annotation result corresponding to each character are the same or not is respectively determined, if the predicted results are consistent, the predicted result is determined to be correct, and any one of the first annotation result, the second annotation result, and the third annotation result is taken as the annotation result corresponding to the character; if the prediction results are different, the prediction results have deviation, and the accuracy is not high.
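The consistency check of steps S3 to S5 can be sketched as follows. This is only an illustration under stated assumptions: each model's output is represented as one dictionary per character mapping labels to probabilities, and the names `top_label` and `consensus_labels` are hypothetical. Characters on which the models disagree are marked `None` here so that they can be handled separately.

```python
from typing import Dict, List, Optional

LABELS = ("B", "I", "E", "O", "S")


def top_label(probs: Dict[str, float]) -> str:
    """Return the label with the highest predicted probability."""
    return max(LABELS, key=lambda label: probs.get(label, 0.0))


def consensus_labels(
    first: List[Dict[str, float]],
    second: List[Dict[str, float]],
    third: List[Dict[str, float]],
) -> List[Optional[str]]:
    """For each character, keep the label only if all three models agree."""
    results: List[Optional[str]] = []
    for p1, p2, p3 in zip(first, second, third):
        l1, l2, l3 = top_label(p1), top_label(p2), top_label(p3)
        # Agreement of all three models is treated as a reliable prediction.
        results.append(l1 if l1 == l2 == l3 else None)
    return results
```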

In this embodiment, the three recognition models each predict a result, and the principle of voting consistency is used to express the confidence of the prediction. This improves the reliability of the prediction result, the recognition of named entities in the medical text data, and the generalization capability of model recognition.

Finally, as described in step S6, the named entities in the medical text data can be extracted according to the labeling result; the extracted named entities are then classified and input into the corresponding areas of the payment calculation tool for subsequent processing. Because the named entity extraction method of this embodiment improves extraction accuracy, the subsequent payment calculation statistics are facilitated. Specifically, in this embodiment, the payment calculation processing based on DRG payment includes:

importing the named entities in the medical text data; creating a code matching task to perform code matching; if the code matching succeeds, adding a quality control task to perform quality control; if the quality control succeeds, adding a grouping task to perform grouping; if the grouping succeeds, adding a cutting task to perform cutting; if the cutting succeeds, adding a calculation task to perform payment calculation; and if the calculation succeeds, adding a simulation task to perform simulation. The DRG payment calculation tool provides users with a fast and intelligent calculation service, and the system mainly pursues the following aims: simplicity, adaptability and scalability. In practical application, a user only needs to import the relevant data and click the buttons of the above processes; the processes then flow automatically, and code matching, quality control, grouping, calculation and analysis are completed automatically, which is simple and convenient and avoids a large amount of repetitive work.
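The task flow above, in which each task is added only if the previous one succeeds, can be sketched as a simple sequential runner. This is a hedged illustration only: the application does not describe the tool's internals, so the `run_pipeline` function, the `Task` type and the boolean success convention are all assumptions.

```python
from typing import Callable, List, Tuple

# A task is a (name, callable) pair; the callable returns True on success.
Task = Tuple[str, Callable[[dict], bool]]


def run_pipeline(data: dict, tasks: List[Task]) -> List[str]:
    """Run tasks in order and stop at the first task that reports failure.

    Returns the names of the tasks that completed successfully.
    """
    completed: List[str] = []
    for name, task in tasks:
        if not task(data):
            break  # later tasks are never added after a failure
        completed.append(name)
    return completed
```

In such a sketch the task list would hold, in order, the code matching, quality control, grouping, cutting, calculation and simulation tasks.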

In an embodiment, the named entity, the first recognition model, the second recognition model, and the third recognition model extracted from the medical text data may be stored in a block chain. The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In an embodiment, when the first annotation result, the second annotation result, and the third annotation result are different, it may be that some models predict inaccurately, and the other models predict accurately; therefore, when the first annotation result, the second annotation result, and the third annotation result are different, the annotation result corresponding to the character can be further determined as follows.

After the step S4 of determining whether the first labeling result, the second labeling result, and the third labeling result corresponding to each character are the same, the method includes:

step S5a, if they are different, acquiring the first probability that the character is the third labeling result as predicted by the first recognition model, the second probability that the character is the third labeling result as predicted by the second recognition model, and the third probability that the character is the third labeling result as predicted by the third recognition model, and calculating, according to preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model, the total probability that the character is predicted as the third labeling result;

step S5b, judging whether the total probability is greater than a threshold value, if so, taking the third labeling result as a labeling result corresponding to the character;

and step S5c, extracting the named entities in the medical text data according to the labeling result, and inputting the named entities into a payment calculation tool for payment calculation processing.

In this embodiment, since the training samples adopted by the third recognition model are the public data set and the medical field data set, the accuracy of the model recognition can be correspondingly improved, and the accuracy of the prediction result of the third recognition model in the three recognition models is the highest. Therefore, the third labeling result predicted by the third recognition model can be used as a to-be-selected labeling result, and the other two recognition models also have probabilities of correspondingly predicting the character as the third labeling result; therefore, the probabilities that the characters are predicted to be the third labeling results by the three recognition models can be weighted to obtain the total probability that the characters are the third labeling results predicted by the three recognition models. It is understood that the preset weight used in the above-mentioned weighting calculation is preset during the model training.

After the total probability that the character is the third labeling result, as predicted by the three recognition models, is obtained, it is judged whether the total probability is greater than a threshold. If it is, the confidence is high, so the third labeling result can be used as the labeling result corresponding to the character. If the total probability is not greater than the threshold, the confidence is low; in this case, the prediction ranked second by probability among the prediction results of the third recognition model can be selected as the candidate labeling result, and the step of calculating the total probability is performed again to obtain the labeling result corresponding to the character.
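The weighted fallback for characters on which the models disagree can be sketched as below. The function name `resolve_label` is hypothetical and the threshold value is left to the caller, since the application does not state one. One further assumption: if the second-ranked candidate also fails the threshold, this sketch keeps trying lower-ranked candidates of the third model and returns `None` when none passes, whereas the text describes only the second-ranked fallback.

```python
from typing import Dict, Optional, Sequence


def resolve_label(
    p1: Dict[str, float],
    p2: Dict[str, float],
    p3: Dict[str, float],
    weights: Sequence[float],
    threshold: float,
) -> Optional[str]:
    """Pick a label for one character when the three models disagree.

    Candidates are tried in descending order of the third model's
    probability; the first whose weighted total exceeds the threshold wins.
    """
    w1, w2, w3 = weights
    candidates = sorted(p3, key=p3.get, reverse=True)
    for label in candidates:
        # Total probability: weighted sum of the three models' probabilities
        # for the same candidate label.
        total = w1 * p1.get(label, 0.0) + w2 * p2.get(label, 0.0) + w3 * p3[label]
        if total > threshold:
            return label
    return None  # no candidate is confident enough
```

For example, with weights `(0.3, 0.33, 0.37)` and a threshold of `0.5`, probabilities of `{"B": 0.2, "O": 0.8}`, `{"B": 0.3, "O": 0.7}` and `{"B": 0.6, "O": 0.4}` give a total of 0.381 for the third model's top label B, which fails the threshold, and 0.619 for the second-ranked label O, which is therefore selected.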

In an embodiment, before the step S2 of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively, the method includes:

a. sequentially inputting the sample data in the medical field data set into the first identification model, the second identification model and the third identification model for prediction to obtain a labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

b. respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

In this embodiment, since the three recognition models have different recognition accuracy rates, sample data from a known medical field data set can be input into the first recognition model, the second recognition model and the third recognition model for prediction, and each predicted result is checked against the correct labeling result; the accuracy of the prediction results of each model is then determined from the number of predictions that match the correct labeling results and the total number of sample data.

c. And calculating the ratio of the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model, and determining the preset weight corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.

The ratio of the preset weights of the prediction results of the first recognition model, the second recognition model and the third recognition model is equal to the ratio of the accuracy rates of their prediction results. For example, if the accuracy rates of the prediction results of the first recognition model, the second recognition model and the third recognition model are 0.7, 0.75 and 0.85 respectively, the ratio of the accuracy rates is 0.7:0.75:0.85; the ratio of the preset weights is then also 0.7:0.75:0.85, and after normalization so that the weights sum to 1, the final preset weights of the prediction results of the first recognition model, the second recognition model and the third recognition model are approximately 0.30, 0.33 and 0.37 respectively.

The first recognition model, the second recognition model and the third recognition model have different prediction results with different accuracy rates, and it can be understood that the prediction results have a higher weight ratio as the accuracy rates are higher.
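The weight calculation of steps a to c can be sketched as follows. The function names are hypothetical, and dividing each accuracy by the sum of the accuracies is an assumption about the normalization; it is consistent with the example values in the text, since 0.7, 0.75 and 0.85 divided by their sum of 2.3 round to 0.30, 0.33 and 0.37.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Fraction of predictions that match the correct labeling results."""
    matches = sum(p == c for p, c in zip(predicted, correct))
    return matches / len(correct)


def weights_from_accuracies(accuracies: Sequence[float]) -> List[float]:
    """Normalize accuracies so the weights keep their ratio and sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]
```

Because the weights are proportional to the accuracies, a more accurate model contributes more to the total probability.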

In one embodiment, the step S2 of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively comprises:

S21, training a BiLSTM-CRF model based on the public data set to obtain the first recognition model, training the BiLSTM-CRF model based on the medical field data set to obtain the second recognition model, and training the BiLSTM-CRF model based on the public data set and the medical field data set to obtain the third recognition model;

S22, randomly selecting two models from the first recognition model, the second recognition model and the third recognition model, sequentially selecting one label-free (that is, unlabeled) target data from a label-free data set, and inputting the label-free target data into the two selected models for prediction to obtain the prediction labeling results corresponding to the two models;

and S23, if the corresponding prediction labeling results of the two models are the same, adding the corresponding prediction labeling results to the label-free target data, and inputting the label-free target data to a third model which is not selected for iterative training.

In this embodiment, in order to continue training the first recognition model, the second recognition model, and the third recognition model and make the prediction results of the first recognition model, the second recognition model, and the third recognition model consistent for the same text data, after the first recognition model, the second recognition model, and the third recognition model are obtained by training, two models are randomly selected from the first recognition model, the second recognition model, and the third recognition model, and one label-free target data is sequentially selected from one label-free data set (i.e., an unknown data set without labels added) and input into the two selected models for prediction, so as to obtain the prediction labeling results corresponding to the two models; when the corresponding prediction labeling results of the two models are the same, the confidence degrees of the prediction results of the two models are high; at this time, after adding the corresponding prediction labeling result to the selected label-free target data, inputting the selected label-free target data into a third unselected model for iterative training until the label-free target data in the label-free data set is not updated any more, and then finishing training. After the training, the prediction results of the first recognition model, the second recognition model, and the third recognition model for the same text data may be made to be the same. Moreover, in the training mode, the confidence coefficient of the model is expressed by the voting consistency of the three models, so that the reliability of the model is improved, and the training effect of the model is better; meanwhile, a label-free data set is added to the model training, so that the training data volume is increased, and the model training effect is improved. 
Preferably, after the three models have performed named entity recognition on the medical text data, the medical text data may further be used as training samples for iterative training of the three models. The training method in this embodiment uses a partially unlabeled data set (that is, an unknown data set) for training, which is an innovative semi-supervised training method that increases the volume of training data; meanwhile, iterative training based on the voting consistency of the three models improves the confidence of the models.
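One pass of the agreement-based iterative training described above might look like the following sketch. The `Tagger` interface with `predict` and `train` methods is an assumption, since the application does not define a programming interface for the models, and drawing a fresh random pair of models for every sample is one possible reading of the text. The sketch assumes exactly three models.

```python
import random
from typing import List, Protocol, Sequence


class Tagger(Protocol):
    """Assumed model interface; the application does not define one."""

    def predict(self, text: str) -> List[str]: ...
    def train(self, text: str, labels: List[str]) -> None: ...


def co_training_round(
    models: Sequence[Tagger], unlabeled: Sequence[str], rng: random.Random
) -> int:
    """One pass over the unlabeled set; returns how many samples were used."""
    used = 0
    for text in unlabeled:
        # Randomly pick two of the three models; the remaining one is trained.
        i, j = rng.sample(range(3), 2)
        k = 3 - i - j
        labels_i = models[i].predict(text)
        labels_j = models[j].predict(text)
        if labels_i == labels_j:
            # Agreement of the two models is treated as a confident pseudo-label.
            models[k].train(text, labels_i)
            used += 1
    return used
```

Repeating such rounds until no further samples receive new pseudo-labels corresponds to the stopping condition described for this embodiment.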

In an embodiment, before the step S2 of inputting the medical text data into the first recognition model, the second recognition model and the third recognition model respectively, the method includes:

S201, acquiring a preset target text; wherein the target text is text data of the medical field;

S202, adding each sample in the public data set into the target text respectively to generate a corresponding public data training text, and inputting all the generated public data training texts into the BiLSTM-CRF model in sequence for training to obtain the first recognition model;

S203, adding each sample in the medical field data set into the target text respectively to generate a corresponding medical data training text, and inputting all the generated medical data training texts into the BiLSTM-CRF model in sequence for training to obtain the second recognition model;

and S204, iteratively selecting one sample from each of the public data set and the medical field data set, adding the two samples into the target text together to generate a corresponding target data training text, and inputting all the generated target data training texts into the BiLSTM-CRF model in sequence for training to obtain the third recognition model.

In this embodiment, when the first recognition model, the second recognition model and the third recognition model are trained, in order to further improve the labeling accuracy of the models on the medical text data, the training samples of the first recognition model, the second recognition model and the third recognition model are respectively added to the text data of one medical field, and then the text data of the medical field added with the training samples is input into the BiLSTM-CRF model for iterative training to obtain corresponding models; due to the fact that the characteristics of the training samples in the text data of the medical field are mixed in the training process, the model obtained through training has stronger generalization capability in the follow-up prediction of the medical text data, and the model prediction effect is improved.

Referring to FIG. 2, an embodiment of the present application further provides a processing apparatus for medical text data, including:

a first acquisition unit 10 for acquiring medical text data;

a first input unit 20 for inputting the medical text data into a first recognition model, a second recognition model and a third recognition model, respectively; the first identification model is obtained by training a BiLSTM-CRF model based on a public data set, the second identification model is obtained by training a BiLSTM-CRF model based on a medical field data set, and the third identification model is obtained by training the BiLSTM-CRF model based on the public data set and the medical field data set;

a prediction unit 30, configured to predict a first probability that each character in the medical text data corresponds to each label through the first recognition model; predict a second probability that each character in the medical text data corresponds to each label through the second recognition model; and predict a third probability that each character in the medical text data corresponds to each label through the third recognition model; wherein the label with the highest first probability is used as the first labeling result of the character predicted by the first recognition model, the label with the highest second probability is used as the second labeling result of the character predicted by the second recognition model, and the label with the highest third probability is used as the third labeling result of the character predicted by the third recognition model;

a determining unit 40, configured to determine whether the first labeling result, the second labeling result, and the third labeling result corresponding to each character are the same;

a first determining unit 50, configured to determine the first labeling result as a labeling result corresponding to the character if the first labeling result, the second labeling result, and the third labeling result are the same;

the first processing unit 60 is configured to extract a named entity in the medical text data according to the labeling result, and input the named entity into a payment calculation tool for payment calculation processing.

In one embodiment, the processing device of medical text data further comprises:

a first calculating unit, configured to, if the first labeling result, the second labeling result and the third labeling result are different, acquire the first probability that the character is the third labeling result as predicted by the first recognition model, the second probability that the character is the third labeling result as predicted by the second recognition model, and the third probability that the character is the third labeling result as predicted by the third recognition model, and calculate, according to preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model, the total probability that the character is predicted as the third labeling result;

a second determining unit, configured to determine whether the total probability is greater than a threshold, and if so, take the third labeling result as a labeling result corresponding to the character;

and a second processing unit, configured to extract the named entities in the medical text data according to the labeling result and input the named entities into a payment calculation tool for payment calculation processing.

In one embodiment, the processing device of medical text data further comprises:

the second input unit is used for sequentially inputting the sample data in the medical field data set into the first recognition model, the second recognition model and the third recognition model for prediction to obtain a labeling result corresponding to each sample data; wherein the sample data comprises a correct labeling result;

the second calculation unit is used for respectively calculating the accuracy of the prediction results of the first recognition model, the second recognition model and the third recognition model according to the predicted labeling results corresponding to all the sample data and the correct labeling result of the sample data;

and the third calculating unit is used for calculating the ratio of the accuracy rates of the prediction results of the first recognition model, the second recognition model and the third recognition model and determining the preset weights corresponding to the prediction results of the first recognition model, the second recognition model and the third recognition model according to the ratio.

In one embodiment, the apparatus for processing medical text data further includes:

a pre-training unit, configured to train a BiLSTM-CRF model based on the public data set to obtain the first recognition model, train the BiLSTM-CRF model based on the medical field data set to obtain the second recognition model, and train the BiLSTM-CRF model based on the public data set and the medical field data set to obtain the third recognition model;

a selection unit, configured to randomly select two models from the first recognition model, the second recognition model and the third recognition model, sequentially select one label-free (that is, unlabeled) target data from a label-free data set, and input the label-free target data into the two selected models for prediction to obtain the prediction labeling results corresponding to the two models;

and the iterative training unit is used for adding the corresponding prediction labeling result to the label-free target data and inputting the label-free target data to a third unselected model for iterative training if the prediction labeling results corresponding to the two models are the same.

In one embodiment, the apparatus for processing medical text data further includes:

the second acquisition unit is used for acquiring a preset target text; wherein the target text is text data of a medical field;

a first training unit, configured to add each sample in the public data set into the target text respectively to generate a corresponding public data training text, and input all the generated public data training texts into the BiLSTM-CRF model in sequence for training to obtain the first recognition model;

a second training unit, configured to add each sample in the medical field data set into the target text respectively to generate a corresponding medical data training text, and input all the generated medical data training texts into the BiLSTM-CRF model in sequence for training to obtain the second recognition model;

and a third training unit, configured to iteratively select one sample from each of the public data set and the medical field data set, add the two samples into the target text together to generate a corresponding target data training text, and input all the generated target data training texts into the BiLSTM-CRF model in sequence for training to obtain the third recognition model.

In this embodiment, please refer to the method described in the above embodiment for specific implementation of each unit in the above apparatus embodiment, which is not described herein again.

Referring to FIG. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing medical text data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of processing medical text data.

Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.

An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a method of processing medical text data. It is to be understood that the computer-readable storage medium in the present embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.

In summary, the method, the apparatus, the computer device and the storage medium for processing medical text data provided in the embodiments of the present application include: acquiring medical text data; inputting the medical text data into a first recognition model, a second recognition model and a third recognition model respectively, wherein the training samples of the three models are different; predicting, through the first recognition model, a first probability that each character in the medical text data corresponds to each label; predicting, through the second recognition model, a second probability that each character in the medical text data corresponds to each label; predicting, through the third recognition model, a third probability that each character in the medical text data corresponds to each label; judging, for each character, whether the first labeling result, the second labeling result and the third labeling result are the same; when the labeling results are the same, determining the first labeling result as the labeling result corresponding to the character; and, according to the labeling results, extracting the named entities in the medical text data and inputting the named entities into a payment measurement and calculation tool for payment measurement and calculation processing. In the present application, the accuracy of model prediction is improved through the prediction consistency of a plurality of models, so that the accuracy of named entity recognition is improved and the payment measurement and calculation tool can perform its calculations accurately.
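The consistency check summarized above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this application: the function names are hypothetical, each labeling result is assumed to be the label with the highest predicted probability, labels are assumed to follow a BIO scheme (`B-X`, `I-X`, `O`), and a character on which the three models disagree is assumed to fall back to `O`.

```python
from typing import Dict, List, Tuple

Probs = Dict[str, float]  # label -> predicted probability for one character


def argmax_label(probs: Probs) -> str:
    # Assumption: the labeling result is the label with the highest probability.
    return max(probs, key=probs.get)


def consensus_labels(p1: List[Probs], p2: List[Probs], p3: List[Probs],
                     fallback: str = "O") -> List[str]:
    # Keep the first model's label only when all three models agree.
    # Assumption: disagreement falls back to the "outside" label.
    labels = []
    for a, b, c in zip(p1, p2, p3):
        l1, l2, l3 = argmax_label(a), argmax_label(b), argmax_label(c)
        labels.append(l1 if l1 == l2 == l3 else fallback)
    return labels


def extract_entities(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    # Collect (entity text, entity type) pairs from BIO labels.
    entities = []
    start, kind = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        continues = label.startswith("I-") and kind == label[2:]
        if start is not None and not continues:
            entities.append((text[start:i], kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return entities
```

The entities returned by `extract_entities` correspond to the named entities that would be passed on to the payment measurement and calculation tool.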

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.

The above description covers only the preferred embodiments of the present application and is not intended to limit the scope of the present application. All equivalent structures and equivalent process modifications made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present application.
