GPT-2 model-based Chinese electronic medical record entity identification method

Document No.: 1614261 | Publication date: 2020-01-10

Reading note: this technology, "GPT-2 model-based Chinese electronic medical record entity recognition method", was designed by 朱国胜, 吴善超, 刘飞鸿, 祁小云, and 吴梦宇 on 2019-10-06. Summary: The invention relates to a Chinese electronic medical record entity recognition method based on the GPT-2 model. A GPT-2 pre-trained model extracts feature vectors from the electronic medical records, a CRF (conditional random field) model serves as the output layer to obtain recognition probabilities, and the named entities of the Chinese electronic medical records are finally obtained. The method comprises the following steps: 1) dividing the Chinese electronic medical record data into a training set and a test set and labeling both parts uniformly, wherein the labeled data comprises the original Chinese electronic medical record text and the entity labels; 2) taking the GPT-2 pre-trained model as the basis and introducing a CRF model, establishing a GPT2-CRF-based Chinese electronic medical record entity recognition model, and training it on the training set data to obtain a trained Chinese electronic medical record entity recognition model; 3) inputting the test set data into the Chinese electronic medical record entity recognition model and obtaining the optimal label sequence for entity recognition through the evaluation score. The method is not restricted by text form, is easy to implement, and has low development and operating costs.

1. A Chinese electronic medical record entity recognition method based on the GPT-2 model, characterized in that a GPT-2 pre-trained model is used to extract feature vectors from electronic medical records, a CRF model is used as the output layer to obtain recognition probabilities, and the named entities of the Chinese electronic medical records are finally obtained, the method comprising the following steps:

1) dividing the Chinese electronic medical record data into a training set and a test set, and labeling both parts uniformly, wherein the labeled data comprises the original Chinese electronic medical record text and the entity labels;

1.1) the entity categories of the labels are set as: body part, symptom/sign, examination/test, and disease/diagnosis;

1.2) setting up a plurality of labeling groups, and manually labeling all medical records of the training set and the test set according to the entity categories to obtain the training set and the test set of the experiment, wherein the first column of each labeling result is the entity word, the second column is the start position of the word in the medical record, the third column is the end position of the word in the medical record, and the last column is the entity category;

1.3) the original data of the Chinese electronic medical record is $x = (x_1, x_2, x_3, \ldots, x_n)$ and the entity labeling is $y = (y_1, y_2, y_3, \ldots, y_n)$, wherein x is the original medical record text, y is the sequence of entity category labels that corresponds to the original text and has the same length, and n is the corresponding data index;

1.4) outputting the label text for body part, symptom/sign, examination/test, and disease/diagnosis, wherein the label symbols are P, S, T, and D respectively, referred to as PSTD labels for short;

2) taking the GPT-2 pre-trained model as the basis and introducing a CRF (conditional random field) model, establishing a GPT2-CRF-based Chinese electronic medical record entity recognition model, and training it on the training set data to obtain a trained Chinese electronic medical record entity recognition model;

2.1) downloading a GPT-2 pre-trained model, obtaining the semantic representation of the input text through the GPT-2 pre-trained model while performing supervised training, and finally outputting the label sequence with the maximum probability;

2.2) defining the prediction of the language model as $p(s_{n-k}, \ldots, s_n \mid s_1, s_2, \ldots, s_{n-k-1})$, wherein s represents the prediction result for the original data, k represents the index offset in the original data, and n represents the index of the predicted value in the original data;

2.3) estimating with the CRF model to obtain the recognition probability, i.e., the final supervised task result p(output | input), and then modeling the task as p(output | input), wherein output refers to the model output and input refers to the model input;

2.4) in general, the input and output of NLP (natural language processing) tasks of the same type are represented as vectors, and the input and output of the task described herein are represented in the same way;

2.5) obtaining, according to the above steps, a prediction with definite probability values, thereby demonstrating that a single model can be trained in a supervised manner on data in this form;

2.6) obtaining the trained Chinese electronic medical record entity recognition model from the above steps;

3) inputting the test set data into the Chinese electronic medical record entity recognition model, and obtaining the optimal label sequence for entity recognition through the evaluation score;

3.1) after the test set data is input into the entity recognition model obtained in the above steps, the optimal sequence is further obtained through the evaluation score formula;

3.2) given a sequence $x = (x_1, x_2, x_3, \ldots, x_n)$ and the corresponding label sequence $y = (y_1, y_2, y_3, \ldots, y_n)$, the evaluation score is defined by the following formula:

where W is the transition matrix, $W_{i,j}$ is the label transition score, and $P_{i,y_i}$ is the score of the $y_i$-th label of the character; $P_i$ is defined as:

$P_i = w_s h^{(t)} + b_s$

where $h^{(t)}$ is the hidden state of the previous layer for the input data $x^{(t)}$ at time t, the parameter $w_s$ is a weight matrix, and $b_s$ is a bias parameter;

the CRF is trained by maximum conditional likelihood estimation; for the training set $\{(x_i, y_i)\}$, the likelihood is given by:

where P is the probability of the predicted sequence given the original sequence:

where λ represents a given probability distribution and θ represents a distribution parameter;

the standard evaluation metrics for entity recognition are adopted: precision P, recall R, and F value:

where $T_p$ is the number of entities correctly identified by the model, $F_p$ is the number of irrelevant entities identified by the model, and $F_n$ is the number of relevant entities not detected by the model;

3.3) finally obtaining the named entities of the Chinese electronic medical record.

Technical Field

The invention relates to the technical field of Chinese language processing and recognition, in particular to a GPT-2 model-based Chinese electronic medical record entity recognition method.

Background

In recent years, with the support and drive of national policy, smart healthcare has entered a period of rapid development, backed by advanced technologies such as the internet, big data, and artificial intelligence. National major projects on new-generation artificial intelligence and on brain science and brain-inspired research are being launched and implemented step by step, and research and industrial development in intelligent medical technology have entered a new stage. Meanwhile, as the economy develops, people pay increasing attention to their own health and to the medical services that society provides. At present, limited medical resources and the current level of medical care cannot meet people's demand for consultation and treatment. Consider, for example, the following text from an electronic medical record: "The patient presents with fever and left lower abdominal pain, and CT examination indicates choledocholithiasis." In this sentence, "CT" is the medical examination method, "fever" and "left lower abdominal pain" are the patient's symptoms, and "choledocholithiasis" is the diagnosed disease. These three kinds of entities are called named entities in entity recognition. The relationship among them is that "fever" and "left lower abdominal pain" lead to "CT" being chosen as the examination item, and the "CT" examination confirms "choledocholithiasis"; that is, "choledocholithiasis" manifests as "fever" and "left lower abdominal pain" and is confirmed by the "CT" examination. The entities recognized in electronic medical records are used as a training set in which each entity and the relationships among entities are labeled, and ultimately serve clinical decision-making and intelligent consultation systems.

Disclosure of Invention

The purpose of the invention is to provide a Chinese electronic medical record entity recognition method based on the GPT-2 model that improves on the accuracy of existing entity recognition techniques by introducing an unsupervised pre-trained model. Compared with the prior art, the method extracts feature vectors from Chinese electronic medical records more effectively and handles each text input flexibly within the overall recognition task. It is not restricted by text form, is easy to implement, has low development and operating costs, can provide large-scale Chinese electronic medical record entity recognition as a service from a single server, and recognizes entities quickly and accurately.

To achieve this purpose, the invention adopts the following technical solution:

a Chinese electronic medical record entity recognition method based on the GPT-2 model, characterized in that a GPT-2 pre-trained model is used to extract feature vectors from electronic medical records, a CRF model is used as the output layer to obtain recognition probabilities, and the named entities of the Chinese electronic medical records are finally obtained, the method comprising the following steps:

1.1) the entity categories of the labels are set as: body part, symptom/sign, examination/test, and disease/diagnosis;

1.4) outputting the label text for body part, symptom/sign, examination/test, and disease/diagnosis, wherein the label symbols are P, S, T, and D respectively, referred to as PSTD labels for short;

2.5) obtaining, according to the above steps, a prediction with definite probability values, thereby demonstrating that a single model can be trained in a supervised manner on data in this form;

2.6) obtaining the trained Chinese electronic medical record entity recognition model from the above steps;

3) inputting the test set data into the Chinese electronic medical record entity recognition model, and obtaining the optimal label sequence for entity recognition through the evaluation score;

3.1) after the test set data is input into the entity recognition model obtained in the above steps, the optimal sequence is further obtained through the evaluation score formula;

3.2) given a sequence $x = (x_1, x_2, x_3, \ldots, x_n)$ and the corresponding label sequence $y = (y_1, y_2, y_3, \ldots, y_n)$, the evaluation score is defined by the following formula:

where W is the transition matrix, $W_{i,j}$ is the label transition score, and $P_{i,y_i}$ is the score of the $y_i$-th label of the character; $P_i$ is defined as:

$P_i = w_s h^{(t)} + b_s$

where $h^{(t)}$ is the hidden state of the previous layer for the input data $x^{(t)}$ at time t, the parameter $w_s$ is a weight matrix, and $b_s$ is a bias parameter;

the CRF is trained by maximum conditional likelihood estimation; for the training set $\{(x_i, y_i)\}$, the likelihood is given by:

where P is the probability of the predicted sequence given the original sequence:

where λ represents a given probability distribution and θ represents a distribution parameter;

the standard evaluation metrics for entity recognition are adopted: precision P, recall R, and F value:

where $T_p$ is the number of entities correctly identified by the model, $F_p$ is the number of irrelevant entities identified by the model, and $F_n$ is the number of relevant entities not detected by the model;

3.3) finally obtaining the named entities of the Chinese electronic medical record.

The beneficial effects of the invention are as follows: the GPT-2 model-based Chinese electronic medical record entity recognition method converts input text into named entity labels. The text to be recognized is input into the trained Chinese electronic medical record entity recognition model, the model converts the text into the corresponding label text, and the entities are then marked out in the electronic medical record according to the labels. The method is not restricted by text form, is easy to implement, has low development and operating costs, can advance clinical diagnosis and AI-based patient guidance systems, and can contribute to knowledge graph construction and semantic web research.

Drawings

FIG. 1 is a flow chart of the GPT2-CRF model.

FIG. 2 is a schematic diagram of the GPT-2 model structure.

FIG. 3 is a schematic diagram of a CRF linear chain structure.

Detailed Description

The invention is further illustrated but not limited by the following figures and examples.

The invention provides a Chinese electronic medical record entity recognition method based on the GPT-2 model, in which a GPT-2 pre-trained model extracts feature vectors from the electronic medical records and a CRF model serves as the output layer to obtain recognition probabilities, finally yielding the named entities of the Chinese electronic medical record. The training flow chart of the whole model is shown in FIG. 1. The method comprises the following steps:

1.1) A small-scale corpus of 1200 medical record texts is organized and labeled. It comprises 300 electronic medical record texts from each of four sections: general items, discharge status, medical history features, and diagnosis and treatment course. The corpus covers 30 diseases, including tumors, digestive system diseases, and nervous system diseases. There are essentially no repeated sentences across the corpus texts.

1.2) The entity categories of the labels are set as: body part, symptom/sign, examination/test, and disease/diagnosis.

1.3) Three groups of doctors are set up: the first consists of 5 expert physicians, the second of 5 mid-level physicians, and the third of 5 intern physicians. The 1200 medical record texts are manually labeled according to the entity categories to obtain the training set and the test set of the experiment. The first column of each labeling result is the entity word, the second column is the start position of the word in the medical record, the third column is the end position of the word in the medical record, and the last column is the entity category.

1.4) The original data of the Chinese electronic medical record is $x = (x_1, x_2, x_3, \ldots, x_n)$ and the entity labeling is $y = (y_1, y_2, y_3, \ldots, y_n)$, where x is the original medical record text, y is the sequence of entity category labels that corresponds to the original text and has the same length, and n is the corresponding data index.

1.5) The label text for body part, symptom/sign, examination/test, and disease/diagnosis is output, wherein the label symbols are P, S, T, and D respectively, referred to as PSTD labels for short.
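Steps 1.3 to 1.5 can be illustrated with a minimal sketch that turns four-column annotation rows into one PSTD label per character. Several details are assumptions rather than facts from the source: the `"O"` label for characters outside any entity, 0-based offsets with an inclusive end position, the names `TAGS` and `spans_to_labels`, and the sample sentence, which is a hypothetical Chinese rendering of the example given in the Background.

```python
# Hypothetical tag map: the source names only the four symbols P, S, T, D.
TAGS = {
    "body part": "P",
    "symptom/sign": "S",
    "examination/test": "T",
    "disease/diagnosis": "D",
}
OUTSIDE = "O"  # assumption: the source does not say how non-entity characters are tagged


def spans_to_labels(text, annotations):
    """Turn (entity, start, end, category) rows into one label per character."""
    labels = [OUTSIDE] * len(text)
    for entity, start, end, category in annotations:
        # Assumption: 0-based offsets with an inclusive end position.
        if text[start:end + 1] != entity:
            raise ValueError(f"offsets {start}-{end} do not match {entity!r}")
        for i in range(start, end + 1):
            labels[i] = TAGS[category]
    return labels


# Illustrative sentence and annotations (not taken from the patent's corpus).
text = "患者出现发热，CT检查提示胆总管结石"
annotations = [
    ("发热", 4, 5, "symptom/sign"),
    ("CT", 7, 8, "examination/test"),
    ("胆总管结石", 13, 17, "disease/diagnosis"),
]
labels = spans_to_labels(text, annotations)
print("".join(labels))
```

The resulting label sequence has the same length as the text, which matches the requirement in step 1.4 that y corresponds to x position by position.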

2) Taking the GPT-2 pre-trained model as the basis and introducing a CRF model, a GPT2-CRF-based Chinese electronic medical record entity recognition model is established and trained on the training set data to obtain the trained Chinese electronic medical record entity recognition model. FIG. 2 shows the structure of the pre-trained model as a whole, and FIG. 3 shows the linear-chain structure of the relationships within the CRF.

2.1) A GPT-2 pre-trained model is downloaded, the semantic representation of the input text is obtained through the GPT-2 pre-trained model while supervised training is performed, and the label sequence with the maximum probability is finally output.

2.2) The prediction of the language model is defined as $p(s_{n-k}, \ldots, s_n \mid s_1, s_2, \ldots, s_{n-k-1})$, where s represents the prediction result for the original data, k represents the index offset in the original data, and n represents the index of the predicted value in the original data.

2.3) The recognition probability is estimated with the CRF model, i.e., the final supervised task result p(output | input) is obtained, and the task is then modeled as p(output | input), where output refers to the model output and input refers to the model input.

2.4) In general, the input and output of NLP (natural language processing) tasks of the same type are represented as vectors, and the input and output of the task described herein are represented in the same way.

2.5) According to the above steps, a prediction with definite probability values is obtained, thereby demonstrating that a single model can be trained in a supervised manner on data in this form.

2.6) The trained Chinese electronic medical record entity recognition model is obtained from the above steps.
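The conditional probability in step 2.2 can be expanded with the chain rule of probability, which is how an autoregressive language model such as GPT-2 assigns a probability to a span one symbol at a time. The expansion below is a general identity, not a formula quoted from the source, and the index i is introduced here only for illustration.

```latex
p(s_{n-k}, \ldots, s_n \mid s_1, \ldots, s_{n-k-1})
  = \prod_{i=n-k}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
```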

3) The test set data is input into the Chinese electronic medical record entity recognition model, and the optimal label sequence for entity recognition is obtained through the evaluation score.

3.1) After the test set data is input into the entity recognition model obtained in the above steps, the optimal sequence is obtained through the evaluation score formula.

3.2) Given a sequence $x = (x_1, x_2, x_3, \ldots, x_n)$ and the corresponding label sequence $y = (y_1, y_2, y_3, \ldots, y_n)$, the evaluation score is defined by the following formula:

where W is the transition matrix, $W_{i,j}$ is the label transition score, and $P_{i,y_i}$ is the score of the $y_i$-th label of the character. $P_i$ is defined as:

$P_i = w_s h^{(t)} + b_s$

where $h^{(t)}$ is the hidden state of the previous layer for the input data $x^{(t)}$ at time t, the parameter $w_s$ is a weight matrix, and $b_s$ is a bias parameter;

the CRF is trained by maximum conditional likelihood estimation; for the training set $\{(x_i, y_i)\}$, the likelihood is given by:

where P is the probability of the predicted sequence given the original sequence:

where λ represents a given probability distribution and θ represents a distribution parameter;
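The formulas that step 3.2 refers to do not appear in this text. As a hedged reconstruction only, a standard linear-chain CRF formulation consistent with the symbols defined above would read as follows. The summation ranges, the omission of start and end transitions, and the set $Y_x$ of all candidate label sequences for x are assumptions and are not taken from the source. In the last line, i indexes training pairs, following the training-set notation $\{(x_i, y_i)\}$, whereas in the first line it indexes positions within one sequence.

```latex
% Assumed reconstruction; not the patent's verbatim formulas.
\begin{align}
\operatorname{score}(x, y)
  &= \sum_{i=1}^{n} P_{i,y_i} + \sum_{i=2}^{n} W_{y_{i-1},\,y_i} \\
P(y \mid x; \lambda, \theta)
  &= \frac{\exp\bigl(\operatorname{score}(x, y)\bigr)}
          {\sum_{y' \in Y_x} \exp\bigl(\operatorname{score}(x, y')\bigr)} \\
L(\lambda, \theta)
  &= \sum_{i} \log P(y_i \mid x_i; \lambda, \theta)
\end{align}
```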

the standard evaluation metrics for entity recognition are adopted: precision (P), recall (R), and F value:

where $T_p$ is the number of entities correctly identified by the model, $F_p$ is the number of irrelevant entities identified by the model, and $F_n$ is the number of relevant entities not detected by the model.
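As an illustration of these metrics, the sketch below computes precision, recall, and F from two sets of entities using the usual definitions P = T_p / (T_p + F_p), R = T_p / (T_p + F_n), and F = 2PR / (P + R). These definitions, the exact-match comparison of (word, start, end, category) tuples, and the function name `precision_recall_f` are assumptions; the source names the metrics but the formulas themselves are not present in this text.

```python
def precision_recall_f(gold, predicted):
    """Entity-level precision, recall, and balanced F under exact matching."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # T_p: entities correctly identified
    fp = len(predicted - gold)   # F_p: identified entities that are not in the gold set
    fn = len(gold - predicted)   # F_n: gold entities the model did not detect
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data only: the model finds two of three gold entities
# and reports one partial span that does not match exactly.
gold = {("发热", 4, 5, "S"), ("CT", 7, 8, "T"), ("胆总管结石", 13, 17, "D")}
pred = {("发热", 4, 5, "S"), ("CT", 7, 8, "T"), ("结石", 16, 17, "D")}
p, r, f = precision_recall_f(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Exact matching is the strictest choice; a relaxed scheme that credits partially overlapping spans would give different numbers for the same predictions.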

3.3) The named entities of the Chinese electronic medical record are finally obtained.
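One common way to obtain the highest-scoring label sequence from emission scores P and transition scores W is Viterbi decoding. The source does not name a decoding algorithm, so the sketch below is an assumption, as are the label set (including an `"O"` label for non-entity characters), the toy scores, the sample text, and the helper names `sequence_score`, `viterbi`, and `labels_to_entities`. A path is scored as the sum of its emission and transition scores without start or end transitions, and consecutive identical labels are then grouped into entities.

```python
LABELS = ["O", "S", "D"]  # hypothetical label subset for a small example


def sequence_score(emissions, transitions, path):
    """Sum of emission scores P[i][y_i] and transition scores W[y_{i-1}][y_i]."""
    score = emissions[0][path[0]]
    for i in range(1, len(path)):
        score += transitions[path[i - 1]][path[i]] + emissions[i][path[i]]
    return score


def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score."""
    n_labels = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for row in emissions[1:]:
        step_best, step_back = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + row[j])
        best = step_best
        backpointers.append(step_back)
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]


def labels_to_entities(text, labels, outside="O"):
    """Group runs of identical non-outside labels into (word, start, end, label)."""
    entities, start = [], None
    for i in range(len(labels) + 1):
        current = labels[i] if i < len(labels) else outside
        if start is not None and current != labels[start]:
            entities.append((text[start:i], start, i - 1, labels[start]))
            start = None
        if start is None and current != outside:
            start = i
    return entities


# Toy scores for a three-character text; rows are positions, columns are LABELS.
text = "无发热"
emissions = [[2.0, 1.0, 0.0], [0.5, 2.0, 0.0], [0.0, 1.5, 1.0]]
transitions = [[0.5, 0.25, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]

path, score = viterbi(emissions, transitions)
decoded = [LABELS[j] for j in path]
print(decoded, score, labels_to_entities(text, decoded))
```

Because the PSTD scheme as described has no begin/inside distinction, two adjacent entities of the same category would be merged by this grouping; that limitation follows from the assumed label scheme, not from anything stated in the source.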
