NLP-based automatic medicine business card extraction method

Document No.: 1087401; Published: 2020-10-20

Abstract: This technology, an NLP-based automatic medicine business card extraction method (一种基于NLP的药品名片自动提取方法), was designed by 邵志宇, 傅建强, 黄艳, 陈龙彪, 蔡晓海, 游建议, 林志铭, 李灵琦 and 伊丽 on 2020-06-30. The invention discloses an NLP-based method for automatically extracting a medicine business card, which specifically comprises the following steps: step S1, preprocessing the drug specifications (package inserts) and constructing a training set, a verification set and a test set; step S2, loading the training set and the verification set, and performing data encapsulation and data reading; step S3, loading the configuration data and model characteristics of the BERT model to initialize the parameters of the target model, constructing the BERT-BILSTM-CRF model, and training the BERT-BILSTM-CRF model; and step S4, loading the test set with the trained BERT-BILSTM-CRF model, and predicting and outputting the medicine business card field contents for the test set data. The method effectively improves extraction efficiency and accuracy and greatly reduces labor cost.

1. An automatic extraction method of a medicine business card based on NLP comprises the following steps:

step S1, preprocessing the drug specification, and constructing a training set, a verification set and a test set;

step S2, loading a training set and a verification set, and performing data encapsulation and data reading;

step S3, loading configuration data and model characteristics of the BERT model to perform a parameter initialization process of the target model, constructing the BERT-BILSTM-CRF model, and training the BERT-BILSTM-CRF model;

and step S4, loading a test set by using the trained BERT-BILSTM-CRF model, and predicting and outputting the field content of the medicine business card for the test set data.

2. The method for automatically extracting NLP-based medicine business cards according to claim 1, wherein the step S1 includes:

S11, storing the contents of each medicine specification in separate texts, divided into blocks by keyword;

S12, merging the texts that share the same keyword to construct a data set;

S13, labeling the data set according to the BIO tagging scheme to obtain a training set, a verification set and a test set;

and S14, performing data cleaning on the training set, the verification set and the test set.

3. The method of claim 2, wherein the keywords include at least two of: drug name, indications, usage and dosage, pharmacological action, adverse reactions, precautions and contraindications.

4. The method for automatically extracting NLP-based medicine business cards according to claim 1, wherein the step S2 includes:

S21, loading the training set and the verification set to obtain the input samples of the data, each consisting of a sample, a sample code and a label;

S22, constructing an evaluation controller;

S23, packaging all input samples into data in the tf_record format, to be used as the model input data;

and S24, reading the data in the tf_record format to form batch data.

5. The method as claimed in claim 4, wherein the step S23 specifically includes: establishing a mapping dictionary between labels and codes, storing the dictionary, segmenting the data into words, truncating the sequence, adding the [CLS] and [SEP] symbols at the head and tail of the sequence, and structuring the result into a feature set object of the data.

6. The method for automatically extracting NLP-based medicine business cards according to claim 1, wherein the step S3 includes:

S31, constructing a model, loading the configuration data and model characteristics of the BERT model, and obtaining the word vector of each word;

S32, loading a BILSTM-CRF model object, and constructing the BERT-BILSTM-CRF model;

and S33, training the model on the training set by using the evaluation controller.

7. The method for automatically extracting NLP-based medicine business cards according to claim 1, wherein the step S4 includes:

S41, restoring the model from the BERT-BILSTM-CRF model parameters, and loading the mapping dictionary between labels and codes;

S42, performing word segmentation on the test set text data, converting the words into word vectors, converting the labels into the corresponding codes, and structuring them into feature set objects of the data;

S43, acquiring the word codes, input masks, segment codes and labels from the input samples of each text, and running a session on the word codes and input masks of the input samples to obtain the currently predicted label codes;

S44, converting the coded result into the real sequence label result according to the loaded label-to-code mapping dictionary;

and S45, combining the real sequence label result with the input sequence to obtain the labeling result, and outputting the extracted medicine business card fields.

Technical Field

The invention relates to the field of information processing, in particular to an NLP-based automatic medicine business card extraction method.

Background

The medicine specification is a legal document that sets out the important information about a medicine. It is the legal guide for selecting a medicine, the basic source of its usage rules and drug information, and the scientific basis on which doctors, pharmacists, nursing staff and patients rely during treatment and medication. It has medical authority and legal effect, and contains basic scientific information such as the safety and effectiveness of the medicine. The medicine business card is a medicine knowledge card extracted with the medicine specifications as the underlying database, and is the fastest and most effective way to learn about a medicine.

With the national emphasis on internet technology, hospital medicine maintenance systems have entered the intelligent era, and hospitals have accumulated a large number of medicine specifications. A medicine specification lists the ingredients and properties of the medicine, the usage and dosage, contraindications, the persons to be inoculated, pharmacological actions, indications and precautions. Building and maintaining medicine business cards from this material is very important.

In recent years, deep learning has advanced rapidly and achieved great results in speech recognition, image processing, natural language processing and other fields. Existing medicine business card maintenance, however, relies mainly on pharmacists using their professional knowledge to identify the fields in a medicine specification and fill in the business card contents by hand, which is inefficient, time-consuming and laborious. Rule-based methods also exist, but the rules are difficult to design and still consume a large amount of labor. In both cases, constructing medicine business cards from medicine specifications carries a huge labor cost.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides an automatic extraction method of a medicine business card based on NLP (natural language processing), which effectively improves the extraction efficiency and accuracy and greatly reduces the labor cost.

The technical scheme adopted by the invention for solving the technical problems is as follows:

An automatic extraction method of a medicine business card based on NLP comprises the following steps:

step S1, preprocessing the drug specification, and constructing a training set, a verification set and a test set;

step S2, loading a training set and a verification set, and performing data encapsulation and data reading;

step S3, loading configuration data and model characteristics of the BERT model to perform a parameter initialization process of the target model, constructing the BERT-BILSTM-CRF model, and training the BERT-BILSTM-CRF model;

and step S4, loading a test set by using the trained BERT-BILSTM-CRF model, and predicting and outputting the field content of the medicine business card for the test set data.

Preferably, the step S1 includes:

storing the contents of each medicine specification in separate texts, divided into blocks by keyword;

merging the texts that share the same keyword to construct a data set;

labeling data of the data set according to a BIO representation method to obtain a training set, a verification set and a test set;

and performing data cleaning on the training set, the verification set and the test set.
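
The BIO labeling step can be illustrated with a small sketch. It assumes each annotated field is given as a token span `(start, end, field)` with an exclusive end; the function name `bio_tags` and the field names `NAME` and `INDICATION` are hypothetical and do not come from the patent. For Chinese text the tokens would typically be single characters.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per token.

    spans holds (start, end_exclusive, field) triples (an assumed format).
    The first token of a span gets B-<field>, the rest get I-<field>,
    and every token outside all spans gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, field in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, field)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, field)}")
        tags[start] = f"B-{field}"
        for i in range(start + 1, end):
            tags[i] = f"I-{field}"
    return tags


# Toy example with hypothetical field names.
tokens = ["Aspirin", "relieves", "fever", "and", "mild", "pain"]
spans = [(0, 1, "NAME"), (2, 6, "INDICATION")]
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Rejecting overlapping spans keeps every token on exactly one tag, which is what a single-label sequence tagger expects.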

Preferably, the keywords include, but are not limited to: the name of the medicine, indications, usage and dosage, pharmacological action, adverse reactions, cautions and contraindications.
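
As a rough illustration of the keyword-based blocking and merging, the sketch below assumes that each section of a specification starts with its keyword in square brackets, for example `[Indications]`. That heading format, the English keyword strings and the helper names `split_by_keywords` and `merge_blocks` are assumptions made for the example, not details given in the patent.

```python
import re
from typing import Dict, List

# English renderings of the keywords listed above (assumed spellings).
KEYWORDS = [
    "Drug Name", "Indications", "Usage and Dosage", "Pharmacological Action",
    "Adverse Reactions", "Precautions", "Contraindications",
]


def split_by_keywords(text: str, keywords: List[str] = KEYWORDS) -> Dict[str, str]:
    """Cut one specification into blocks, keyed by the heading keyword."""
    pattern = re.compile(r"\[(" + "|".join(re.escape(k) for k in keywords) + r")\]")
    matches = list(pattern.finditer(text))
    blocks: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A block runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        key = match.group(1)
        blocks[key] = (blocks.get(key, "") + " " + body).strip()
    return blocks


def merge_blocks(per_document: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect the blocks that share a keyword across all specifications."""
    merged: Dict[str, List[str]] = {}
    for blocks in per_document:
        for key, body in blocks.items():
            merged.setdefault(key, []).append(body)
    return merged


doc_a = ("[Drug Name] Aspirin tablets [Indications] Fever and mild pain. "
         "[Contraindications] Active peptic ulcer.")
doc_b = "[Drug Name] Ibuprofen capsules [Indications] Mild to moderate pain."
merged = merge_blocks([split_by_keywords(doc_a), split_by_keywords(doc_b)])
print(merged["Indications"])
```

Each list in `merged` then plays the role of the per-keyword text from which the data set is built.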

Preferably, the step S2 includes:

S21, loading the training set and the verification set to obtain the input samples of the data, each consisting of a sample, a sample code and a label;

S22, constructing an evaluation controller;

S23, packaging all input samples into data in the tf_record format, to be used as the model input data;

and S24, reading the data in the tf_record format to form batch data.

Preferably, in S23, a mapping dictionary between labels and codes is created and stored, the data is segmented into words, the sequence is truncated to the maximum sequence length minus 2, and the [CLS] and [SEP] symbols are added at the beginning and the end of the sequence, which is then structured into the feature set object of the data.
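
The feature construction in S23 might look roughly like the sketch below. It reads the "-2" as room reserved for the [CLS] and [SEP] symbols, which is an interpretation rather than something the text states. The toy vocabulary, the padding id 0, and the names `InputFeatures`, `build_label_map` and `convert_example` are likewise assumptions for illustration; writing the result to the tf_record format is left out.

```python
from typing import Dict, List, NamedTuple


class InputFeatures(NamedTuple):
    """Hypothetical container standing in for the feature set object."""
    tokens: List[str]
    input_ids: List[int]
    input_mask: List[int]
    segment_ids: List[int]
    label_ids: List[int]


def build_label_map(labels: List[str]) -> Dict[str, int]:
    # Codes start at 1 so that 0 can serve as the padding code (an assumption).
    return {label: i for i, label in enumerate(labels, start=1)}


def convert_example(tokens: List[str], tags: List[str], vocab: Dict[str, int],
                    label_map: Dict[str, int], max_seq_length: int) -> InputFeatures:
    if max_seq_length < 3:
        raise ValueError("max_seq_length must leave room for [CLS] and [SEP]")
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # Truncate to max_seq_length - 2, then add the two special symbols.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    tags = ["[CLS]"] + tags[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    label_ids = [label_map[tag] for tag in tags]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # single-sentence input
    pad = max_seq_length - len(input_ids)
    return InputFeatures(tokens, input_ids + [0] * pad, input_mask + [0] * pad,
                         segment_ids + [0] * pad, label_ids + [0] * pad)


vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "Aspirin": 4, "relieves": 5, "fever": 6}
label_map = build_label_map(
    ["O", "B-NAME", "B-INDICATION", "I-INDICATION", "[CLS]", "[SEP]"])
tokens = ["Aspirin", "relieves", "fever", "and", "mild", "pain"]
tags = ["B-NAME", "O", "B-INDICATION", "I-INDICATION", "I-INDICATION", "I-INDICATION"]

short = convert_example(tokens, tags, vocab, label_map, max_seq_length=6)
padded = convert_example(tokens, tags, vocab, label_map, max_seq_length=10)
print(short.tokens)
print(padded.input_ids, padded.input_mask)
```

With `max_seq_length=6` only four of the six tokens survive truncation, while with `max_seq_length=10` all six fit and two padding positions are appended.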

Preferably, the step S3 includes:

S31, constructing a model, loading the configuration data and model characteristics of the BERT model, and obtaining the word vector of each word;

S32, loading a BILSTM-CRF model object, and constructing the BERT-BILSTM-CRF model;

and S33, training the model on the training set by using the evaluation controller.
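
The text does not say how the CRF layer of the BERT-BILSTM-CRF model selects its output labels. A common choice for linear-chain CRFs is Viterbi decoding, which picks the label sequence with the highest total of per-token scores (here assumed to come from the BILSTM) plus label-to-label transition scores. The sketch below shows that technique in plain Python with made-up scores; it is not taken from the patent.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            # Best previous tag for ending at `cur` on this step.
            best_prev = max(range(n_tags),
                            key=lambda prev: score[prev] + transitions[prev][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda tag: score[tag])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Tags: 0 = O, 1 = B, 2 = I. The O -> I transition is heavily penalised
# because an I tag should not directly follow an O tag in the BIO scheme.
emissions = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
greedy = [max(range(3), key=lambda k: row[k]) for row in emissions]
decoded = viterbi_decode(emissions, transitions)
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this toy case the per-token argmax yields O, I, I, which is not a valid BIO sequence, whereas the decoded path is O, B, I. Taking transition scores into account in this way is the usual reason for placing a CRF on top of a BILSTM.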

Preferably, the step S4 includes:

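S41, restoring the model from the BERT-BILSTM-CRF model parameters, and loading the mapping dictionary between labels and codes;

S42, performing word segmentation on the test set text data, converting the words into word vectors, converting the labels into the corresponding codes, and structuring them into feature set objects of the data;

S43, acquiring the word codes, input masks, segment codes and labels from the input samples of each text, and running a session on the word codes and input masks of the input samples to obtain the currently predicted label codes;

S44, converting the coded result into the real sequence label result according to the loaded label-to-code mapping dictionary;

and S45, combining the real sequence label result with the input sequence to obtain the labeling result, and outputting the extracted medicine business card fields.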
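
Steps S44 and S45 can be sketched as follows: predicted label codes are mapped back to tags through the stored dictionary, and the tags are then combined with the input tokens to collect each field. The dictionary contents, the field names and the helper names `ids_to_tags` and `extract_fields` are hypothetical.

```python
from typing import Dict, List


def ids_to_tags(label_ids: List[int], id2label: Dict[int, str]) -> List[str]:
    """S44: turn predicted label codes back into tag strings."""
    return [id2label[i] for i in label_ids]


def extract_fields(tokens: List[str], tags: List[str],
                   joiner: str = " ") -> Dict[str, List[str]]:
    """S45: group tokens into fields according to their BIO tags."""
    fields: Dict[str, List[str]] = {}
    current_field = None
    current_tokens: List[str] = []

    def flush() -> None:
        nonlocal current_field, current_tokens
        if current_field is not None:
            fields.setdefault(current_field, []).append(joiner.join(current_tokens))
        current_field, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_field, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_field == tag[2:]:
            current_tokens.append(token)
        else:
            # O, [CLS], [SEP], or an I- tag with no matching B- closes the span.
            flush()
    flush()
    return fields


id2label = {1: "O", 2: "B-NAME", 3: "B-INDICATION", 4: "I-INDICATION",
            5: "[CLS]", 6: "[SEP]"}
tokens = ["[CLS]", "Aspirin", "relieves", "fever", "and", "mild", "pain", "[SEP]"]
predicted_ids = [5, 2, 1, 3, 4, 4, 4, 6]

card = extract_fields(tokens, ids_to_tags(predicted_ids, id2label))
print(card)
```

For character-level Chinese input, `joiner=""` would rebuild each field without spaces.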

Compared with the prior art, the invention has the following beneficial effects:

1. Compared with the manual extraction used by existing methods, the method offers excellent extraction efficiency and accuracy and greatly reduces labor cost.

The invention is further explained in detail with the accompanying drawings and the embodiments; however, the method for automatically extracting the medicine business card based on NLP of the present invention is not limited to the embodiment.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a sample schematic of a pharmaceutical instruction according to an embodiment of the present invention;

FIG. 3 is a sample diagram of a training set according to an embodiment of the present invention;

FIG. 4 is a graphical representation of the results of test set accuracy for an embodiment of the present invention.

Detailed Description
