Medical corpus labeling method

Document No.: 1465983 Publication date: 2020-02-21

Reading note: this technology, Medical corpus labeling method (一种医疗语料标注方法), was designed by 孙广阳, 程岚, and 祝伟 on 2019-11-13. Its main content is as follows. The invention provides a medical corpus labeling method comprising: registering an account and assigning tasks to a user; presenting the original corpus entry on the annotation interface, where the user finds the corresponding option and makes a single selection; if none of the options is the standard result, entering a value below the options that must be identical to an entry in the standard ICD9 or ICD10 coding dictionary; allowing the user to mark the selected result; if the corpus entry is marked as a compound corpus, splitting it on the splitting interface and placing it in the split list; if it is marked as a normal corpus, placing it in the normal annotation list, and otherwise placing it in the discard list; if it cannot be identified, marking it as a question corpus; after the check is finished and the result is confirmed to be correct, submitting the task; and sharing and exporting the check result. The invention standardizes medical corpora, and the resulting data provides a convenient method, a uniform workflow, and efficient progress for the informatization of medical data.

1. A medical corpus labeling method, characterized by comprising the following steps:

step S1, registering an account and assigning tasks to a user, wherein the assigned tasks comprise ICD9 and ICD10 tasks and a specified number of tasks is assigned; the user logs in to the annotation interface and selects an ICD9 or ICD10 annotation task;

step S2, the annotation interface presents the original corpus entry, and the user finds the option corresponding to the original corpus entry and makes a single selection;

step S3, during annotation, if none of the options is the standard result, the user enters a value below the options, and the entered value must be identical to an entry in the standard ICD9 or ICD10 coding dictionary; the user is allowed to mark the selected result, wherein the marking states comprise: compound corpus, normal corpus, and question corpus;

step S4, if the corpus entry is marked as a compound corpus, it is split on the splitting interface and placed in the split list; if the corpus entry is marked as a normal corpus, it is placed in the normal annotation list; otherwise it is judged to be a useless corpus and placed in the discard list; if the corpus entry cannot be identified, it is marked as a question corpus;

step S5, after the user finishes the annotation task, the annotated content is checked on the check interface, where the normal annotation list, the split list, the question list, and the discard list are reviewed, and entries may be reselected or split during the check;

step S6, after the check is finished and the result is confirmed to be correct, the task is submitted; the submitted data is matched once against all data, and identical corpus entries are annotated automatically;

step S7, the question list and the discard list are shared, and the check result is exported.

2. The medical corpus labeling method according to claim 1, wherein an NLP algorithm is applied to each corpus entry in the database to obtain a plurality of most similar codes, which are stored in a new column, and the result is used as the data source.

3. The medical corpus labeling method according to claim 1, wherein when a corpus entry is found to be a compound corpus, the splitting interface is entered and the entry is split; similarity matching is performed for the split parts in the ICD dictionary database through an NLP algorithm, similar options are found for the user to select, and the split entries are stored in the database.

4. The medical corpus labeling method according to claim 1, wherein useless corpus entries are discarded, and a discarded entry is not deleted directly: it is only removed from the task list, and is actually deleted after confirmation on the check interface.

5. The medical corpus labeling method according to claim 1, wherein during annotation an interface for viewing the total number of tasks and a remaining-tasks bar is provided to the user.

6. The medical corpus labeling method according to claim 1, wherein if the user logs out during annotation, the position reached in annotation is displayed directly at the next login.

7. The medical corpus labeling method according to claim 1, wherein during annotation, if a corpus entry is determined to be a useless corpus, it is marked as "discarded"; an entry marked as discarded is not deleted directly but is further checked on the check interface; if it can be recycled, it is re-annotated, and otherwise it is deleted.

Technical Field

The invention relates to the technical field of data processing, in particular to a medical corpus labeling method.

Background

As healthcare becomes increasingly digitized, medical information must be standardized before it can be incorporated into big data. Because many disease names are hard to remember and awkward to pronounce, doctors often abbreviate or colloquialize them when entering patient information, so the information is non-standard, poorly portable, and inconvenient to query. Medical record coders standardize this irregular information, but manual work is slow, the task volume is large, classification is unclear, and efficiency is low.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical drawbacks mentioned above.

Therefore, the invention aims to provide a medical corpus labeling method.

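In order to achieve the above object, an embodiment of the present invention provides a medical corpus labeling method, including the following steps: step S1, registering an account and assigning tasks to a user, wherein the assigned tasks comprise ICD9 and ICD10 tasks and a specified number of tasks is assigned; the user logs in to the annotation interface and selects an ICD9 or ICD10 annotation task;

step S2, the annotation interface presents the original corpus entry, and the user finds the option corresponding to the original corpus entry and makes a single selection;

step S3, during annotation, if none of the options is the standard result, the user enters a value below the options, and the entered value must be identical to an entry in the standard ICD9 or ICD10 coding dictionary; the user is allowed to mark the selected result, wherein the marking states comprise: compound corpus, normal corpus, and question corpus;

step S4, if the corpus entry is marked as a compound corpus, it is split on the splitting interface and placed in the split list; if the corpus entry is marked as a normal corpus, it is placed in the normal annotation list; otherwise it is judged to be a useless corpus and placed in the discard list; if the corpus entry cannot be identified, it is marked as a question corpus;

step S5, after the user finishes the annotation task, the annotated content is checked on the check interface, where the normal annotation list, the split list, the question list, and the discard list are reviewed, and entries may be reselected or split during the check;

step S6, after the check is finished and the result is confirmed to be correct, the task is submitted; the submitted data is matched once against all data, and identical corpus entries are annotated automatically;

step S7, the question list and the discard list are shared, and the check result is exported.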

Further, an NLP algorithm is applied to each corpus entry in the database to obtain a plurality of most similar codes, which form a new column, and the result is used as the data source.

Further, when a corpus entry is found to be a compound corpus, the splitting interface is entered and the entry is split; similarity matching is performed for the split parts in the ICD dictionary database through an NLP algorithm, similar options are found for the user to select, and the split entries are stored in the database.

Further, useless corpus entries are discarded; a discarded entry is not deleted directly but is only removed from the task list, and is actually deleted after confirmation on the check interface.

Further, during annotation, an interface for viewing the total number of tasks and a remaining-tasks bar is provided to the user.

Further, if the user logs out during annotation, the position reached is displayed directly at the next login.

Further, during annotation, if a corpus entry is determined to be a useless corpus, it is marked as "discarded"; an entry marked as discarded is not deleted directly but is further checked on the check interface; if it can be recycled, it is re-annotated, and otherwise it is deleted.

According to the medical corpus labeling method of the embodiment of the invention, a login and registration function, a user task allocation function, an annotation function, a check function, a sharing function, a submission function, and an export function can be realized. First, an NLP algorithm is applied to each corpus entry in the database to obtain the 10 most similar codes, which form a new column; the resulting table is used as the data source. The data is displayed on the Web end and annotated through radio boxes, input boxes, and other form controls. A correctly annotated result can then be used to automatically annotate identical entries in the remaining corpus, which improves annotation efficiency and ultimately standardizes the non-standard diagnosis names in the corpus. The invention standardizes medical corpora, and the resulting data provides a convenient method, a uniform workflow, and efficient progress for the informatization of medical data.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a block diagram of the process of a medical corpus labeling method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a medical corpus labeling method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a medical corpus labeling method according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.

The invention provides a medical corpus labeling method in which medical corpus entries are annotated and classified on the Web end by single selection. It addresses the problems that medical record coders currently face: heavy workload, no effective tool, slow checking, repeated tasks, and low efficiency. The invention provides the following main functions: login and registration, user task allocation, annotation, checking, and export.

As shown in fig. 1, the method for labeling medical corpus according to the embodiment of the present invention includes the following steps:

Step S1, registering an account and assigning tasks to a user, wherein the assigned tasks comprise ICD9 and ICD10 tasks and a specified number of tasks is assigned; the user logs in to the annotation interface and selects an ICD9 or ICD10 annotation task.

In this step, the login and registration function lays the groundwork for subsequent task allocation according to each user's field. The user task allocation function assigns suitable corpus tasks to different users and handles newly added data.

Specifically, an administrator registers an account and assigns tasks to users. The tasks are divided into ICD9 and ICD10, and a specified number of tasks can be assigned to each user. The user logs in to the annotation interface and can select an ICD9 or ICD10 annotation task, as shown in FIG. 2.
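The allocation described in step S1 can be sketched as follows. This is a minimal illustration only: the function name `allocate_tasks`, the queue layout, and the sample user names are assumptions, since the text does not specify how tasks are stored or assigned.

```python
from collections import deque


def allocate_tasks(pending, users, per_user):
    """Give each user up to `per_user` corpus ids from the matching queue.

    `pending` maps a code system ("ICD9" or "ICD10") to a list of corpus ids;
    `users` maps a user name to the code system that user chose to annotate.
    Both structures are hypothetical.
    """
    queues = {system: deque(ids) for system, ids in pending.items()}
    assignments = {}
    for user, system in users.items():
        queue = queues.get(system, deque())
        batch = []
        while queue and len(batch) < per_user:
            batch.append(queue.popleft())
        assignments[user] = batch
    return assignments


pending = {"ICD9": [1, 2, 3], "ICD10": [10, 11, 12, 13]}
users = {"alice": "ICD10", "bob": "ICD10", "carol": "ICD9"}
assignments = allocate_tasks(pending, users, per_user=3)
```

Each corpus id is handed out at most once, so two users never receive the same entry in this sketch.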

Step S2, the annotation interface presents the original corpus entry, and the user finds the option corresponding to the original corpus entry and makes a single selection.

Specifically, the annotation interface presents the original corpus entry together with 10 options, and the user finds the option corresponding to the entry and makes a single selection. The options come from the data source: an NLP algorithm is applied to each corpus entry in the database to obtain a plurality of most similar codes, which form a new column. The user annotates corpus entries in the Web browser interface, and because each entry has already been similarity-matched against the ICD dictionary by the NLP algorithm, the user's lookup burden is greatly reduced.
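The candidate generation described above can be sketched as below. The text does not say which NLP algorithm is used, so this sketch substitutes plain string similarity from Python's standard `difflib` module as a stand-in; the function name `top_candidates` and the three-entry dictionary are illustrative assumptions.

```python
from difflib import SequenceMatcher


def top_candidates(text, icd_dictionary, k=10):
    """Return the k dictionary entries whose names are most similar to `text`.

    Similarity here is difflib's character-based ratio, used only as a
    placeholder for the unspecified NLP algorithm.
    """
    scored = [
        (SequenceMatcher(None, text.lower(), name.lower()).ratio(), code, name)
        for code, name in icd_dictionary.items()
    ]
    # Highest similarity first; the sort is stable, so ties keep dictionary order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(code, name) for _, code, name in scored[:k]]


# Tiny illustrative dictionary; a real ICD dictionary has thousands of entries.
icd_dictionary = {
    "B35.2": "tinea manuum",
    "B35.3": "tinea pedis",
    "J45": "asthma",
}
options = top_candidates("tinea of the foot", icd_dictionary, k=2)
```

Computing the candidates once and storing them in a column, as the text describes, means the annotation interface only has to read them back rather than run matching on every page load.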

In addition, the user makes the selection through a radio box in the Web browser interface and clicks the annotate button to annotate the current corpus entry, which departs from the traditional approach of typing codes by hand.

Step S3, during annotation, if none of the options is the standard result, the user enters a value below the options, and the entered value must be identical to an entry in the standard ICD9 or ICD10 coding dictionary; the user is allowed to mark the selected result, wherein the marking states comprise: compound corpus, normal corpus, and question corpus.

During annotation, the user chooses among the 10 single-selection options that are most similar to the original corpus entry. If none of the 10 options is the standard result, the user can type a value into a custom input box, and the typed content must be identical to an entry in the standard ICD9/ICD10 coding dictionary. To improve the user experience, the input box offers automatic suggestions, so the user can easily find the code to enter. If the original corpus entry is a compound corpus (for example, AB, which can be divided into A and B), it can be split on the splitting interface. If the user is unsure of the selected result, there are two options: 1. make a selection first and then mark it as a "question selection" for later processing; 2. mark the entry directly as "question". If the corpus entry is useless, the user can click the discard button to discard it.
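The custom input box described above combines two behaviours: suggesting entries as the user types and accepting only values that match the dictionary. A minimal sketch follows, assuming prefix matching for the suggestions; the function names `suggest` and `validate_custom_entry` and the sample dictionary are hypothetical, and the text does not describe how the suggestion feature is implemented.

```python
def suggest(prefix, icd_dictionary, limit=5):
    """Return entries whose code or name starts with the typed prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    matches = [
        (code, name)
        for code, name in sorted(icd_dictionary.items())
        if code.lower().startswith(needle) or name.lower().startswith(needle)
    ]
    return matches[:limit]


def validate_custom_entry(entry, icd_dictionary):
    """Accept a typed entry only if it equals a dictionary code or name exactly."""
    value = entry.strip()
    return value in icd_dictionary or value in icd_dictionary.values()


icd_dictionary = {
    "B35.2": "tinea manuum",
    "B35.3": "tinea pedis",
    "J45": "asthma",
}
suggestions = suggest("tinea", icd_dictionary)
is_valid = validate_custom_entry("tinea pedis", icd_dictionary)
```

Rejecting anything that is not an exact dictionary entry keeps free-text input from reintroducing the non-standard names the method is meant to remove.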

Step S4, if the corpus entry is marked as a compound corpus, it is split on the splitting interface and placed in the split list; if the corpus entry is marked as a normal corpus, it is placed in the normal annotation list; otherwise it is judged to be a useless corpus and placed in the discard list; if the corpus entry cannot be identified, it is marked as a question corpus.

Referring to FIG. 3, if the user encounters a compound corpus, a question corpus, or a useless corpus during annotation, it is handled as follows:

(1) Compound corpus

When a corpus entry is found to be a compound corpus, the user can click the split button to enter the splitting interface and split it. After splitting, the system uses the NLP algorithm to perform similarity matching for each split part in the ICD dictionary database, quickly finds similar options for the user to select, and finally stores the split entries together in the database.

For example, if during annotation the corpus entry is a compound corpus such as "tinea manuum and tinea pedis", it should be split into "tinea manuum" and "tinea pedis". The user clicks the split button to enter the splitting interface, where the entry is split into its component entries and each is annotated with its corresponding code.
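In the method as described, the user performs the split on the splitting interface. The sketch below only illustrates how a split could be pre-suggested and how each part then becomes its own entry to be matched; the helper `split_compound` and its delimiter set (commas, semicolons, slashes, and the word "and") are assumptions, not part of the described method.

```python
import re


def split_compound(text):
    """Split a compound entry on common delimiters and drop empty parts."""
    parts = re.split(r"\s*(?:,|;|/|\band\b)\s*", text)
    return [part.strip() for part in parts if part.strip()]


parts = split_compound("tinea manuum and tinea pedis")
# Each part is stored as its own entry, linked back to the original text,
# so that it can be similarity-matched and annotated separately.
split_records = [
    {"original": "tinea manuum and tinea pedis", "part": part, "code": None}
    for part in parts
]
```

A delimiter-based suggestion like this will mis-split names that legitimately contain "and", which is one reason the final decision stays with the user.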

(2) Question corpus

For a question corpus, the user has two options: 1. click the question button, which stores the entry in the question list and leaves it unprocessed for the time being; 2. if the user is unsure about a particular option, click that option and then click the question mark button. These operations are recorded and sorted so that they can be reviewed on the final check interface.

(3) Useless corpus

For a useless corpus entry, the user can click the discard button to discard it. A discarded entry is not deleted directly; it is only removed from the task list, and is actually deleted after confirmation on the check interface.

That is, if a corpus entry is determined during annotation to be useless, it can be marked as "discarded". An entry marked as discarded is not deleted directly but is further checked on the check interface: if it can be recycled, it is re-annotated; otherwise it is deleted.
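The two-stage discard described above (remove from the task list first, delete only after confirmation) is a soft-delete pattern. A minimal in-memory sketch follows; the class `TaskStore` and its method names are hypothetical, and a real system would persist these lists in the database.

```python
class TaskStore:
    """Holds a user's task list and the discard list awaiting review."""

    def __init__(self, corpora):
        self.task_list = list(corpora)
        self.discard_list = []

    def discard(self, corpus):
        """Hide the entry from the task list but keep it for later review."""
        self.task_list.remove(corpus)
        self.discard_list.append(corpus)

    def review(self, corpus, recyclable):
        """Resolve a discarded entry on the check interface."""
        self.discard_list.remove(corpus)
        if recyclable:
            # Recycled entries go back to the task list for re-annotation.
            self.task_list.append(corpus)
            return "recycled"
        # Otherwise the entry is dropped for good.
        return "deleted"


store = TaskStore(["tinea pedis", "xx??", "asthma"])
store.discard("xx??")
outcome = store.review("xx??", recyclable=False)
```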

During annotation, each time a corpus entry is annotated its current state is displayed for the user to view, and an interface showing the total number of tasks and a remaining-tasks bar is provided.

If the user logs out during annotation, the position reached is displayed directly at the next login, which improves the user experience.
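Progress display and resuming after logout both reduce to remembering, per user, how many entries have been annotated. A minimal sketch under that assumption follows; the class `ProgressTracker` is hypothetical and keeps state in memory, whereas the described system would store it in the server-side database.

```python
class ProgressTracker:
    """Tracks how far each user has progressed through a task of fixed size."""

    def __init__(self, total):
        self.total = total
        self.position = {}  # user name -> index of the next entry to annotate

    def mark_done(self, user):
        """Advance the user's position by one entry, capped at the total."""
        self.position[user] = min(self.position.get(user, 0) + 1, self.total)

    def resume_index(self, user):
        """Index to show at the next login (0 for a user who has not started)."""
        return self.position.get(user, 0)

    def remaining(self, user):
        """Number of entries the user still has to annotate."""
        return self.total - self.resume_index(user)


tracker = ProgressTracker(total=5)
tracker.mark_done("alice")
tracker.mark_done("alice")
```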

Step S5, after the user finishes the annotation task, the annotated content is checked on the check interface, where the normal annotation list, the split list, the question list, and the discard list are reviewed, and entries may be reselected or split during the check.

The checking function of this step mainly reviews the annotated corpus entries and processes the entries in the "question" state, so as to ensure the accuracy of the final result.

During the check, an entry in the question list that is re-annotated moves to the normal annotation list; entries in the discard list are handled in the same way.

The user can also enter the check interface during annotation. The check interface is divided into four list areas for the user to review: the normal annotation list, the split list, the question list, and the discard list.
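The four list areas can be derived from a single status per corpus entry, so that re-annotating an entry during the check moves it between lists automatically. The sketch below assumes that design; the `Status` enum, the list names, and the sample entries are illustrative and not taken from the text.

```python
from enum import Enum


class Status(Enum):
    NORMAL = "normal"
    COMPOUND = "compound"
    QUESTION = "question"
    DISCARDED = "discarded"


# Which check-interface list each status is shown in.
LIST_FOR_STATUS = {
    Status.NORMAL: "normal_list",
    Status.COMPOUND: "split_list",
    Status.QUESTION: "question_list",
    Status.DISCARDED: "discard_list",
}


def build_check_lists(annotations):
    """Group corpus entries into the four check-interface lists by status."""
    lists = {name: [] for name in LIST_FOR_STATUS.values()}
    for corpus, status in annotations.items():
        lists[LIST_FOR_STATUS[status]].append(corpus)
    return lists


annotations = {
    "tinea pedis": Status.NORMAL,
    "tinea manuum and tinea pedis": Status.COMPOUND,
    "hand fungus": Status.QUESTION,
}
before = build_check_lists(annotations)
annotations["hand fungus"] = Status.NORMAL  # re-annotated during the check
after = build_check_lists(annotations)
```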

Step S6, after the check is finished and the result is confirmed to be correct, the task is submitted; the submitted data is matched once against all data, and identical corpus entries are annotated automatically.

With the submission function of this step, after the annotation task has been checked the user clicks the submit task button, and the back end matches the submitted entries against the corpus entries in the remaining tasks and completes their annotation automatically.

Once the check is finished and the result is confirmed to be correct, the task can be submitted. Only the content of the normal annotation list is submitted; the other lists cannot be submitted. The submitted data is matched against the task lists of all users, and identical corpus entries are annotated automatically, which reduces the task volume and spares users repeated work.

The server is connected to a database that contains user login and registration information, ICD dictionaries of different types, medical record corpus data of different types, the result data produced by user annotation, and user operation records. After logging in, a user annotates the N corpus entries assigned for annotation, checks them once annotation is finished, and submits the task after confirming that it is correct. The submitted task data is matched against all subsequent corpus entries, and identical entries are annotated automatically.
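The automatic annotation on submission can be sketched as an exact-match lookup of submitted entries against every user's pending tasks. The function name `propagate_labels` and the data layout are assumptions, and "identical" is taken here to mean exact text equality, which the text does not define further.

```python
def propagate_labels(submitted, pending_tasks):
    """Apply submitted annotations to identical entries in pending tasks.

    `submitted` maps corpus text to its confirmed code; `pending_tasks` maps
    a user name to that user's list of unannotated corpus texts.
    Returns (auto_labeled, still_pending), both keyed by user name.
    """
    auto_labeled = {}
    still_pending = {}
    for user, corpora in pending_tasks.items():
        auto_labeled[user] = {c: submitted[c] for c in corpora if c in submitted}
        still_pending[user] = [c for c in corpora if c not in submitted]
    return auto_labeled, still_pending


submitted = {"tinea pedis": "B35.3"}
pending_tasks = {
    "bob": ["tinea pedis", "asthma attack"],
    "carol": ["tinea pedis"],
}
auto_labeled, still_pending = propagate_labels(submitted, pending_tasks)
```

Because free-text diagnosis names repeat often across records, every submission of this kind shrinks the remaining workload for all users.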

Step S7, the question list and the discard list are shared, and the check result is exported.

The sharing function of this step shares the annotated lists on the check interface, so that other users can consult them or be asked for help.

The export function of this step supports exporting the content of the check interface to formats such as Excel and a database, so that the results can be stored persistently in an Excel table or a database.
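An export step of this kind can be sketched with the Python standard library alone. The sketch below writes CSV, which Excel can open, as a stand-in for a native Excel file, and uses SQLite as a stand-in for the unspecified database; the column names and the table name `annotations` are assumptions.

```python
import csv
import io
import sqlite3


def export_csv(rows):
    """Serialize (corpus, code, status) rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["corpus", "code", "status"])
    writer.writerows(rows)
    return buffer.getvalue()


def export_sqlite(rows, connection):
    """Append the rows to an `annotations` table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS annotations (corpus TEXT, code TEXT, status TEXT)"
    )
    connection.executemany("INSERT INTO annotations VALUES (?, ?, ?)", rows)
    connection.commit()


rows = [("tinea pedis", "B35.3", "normal")]
csv_text = export_csv(rows)
connection = sqlite3.connect(":memory:")  # in-memory database for the example
export_sqlite(rows, connection)
```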

It should be noted that the question list and the discard list generated in the present invention can be shared with other users, who can help resolve the open entries so that the task is completed quickly.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.
