Disease name code matching method and device, computer equipment and storage medium

Document No.: 1185115. Publication date: 2020-09-22.

Reading note: This technology, "Disease name code matching method and device, computer equipment and storage medium", was designed by 金晓辉, 阮晓雯 and 徐亮 on 2020-04-26. Abstract: The embodiments of the application belong to the field of artificial intelligence and relate to a disease name code matching method and apparatus, a computer device and a storage medium. The method comprises: acquiring a disease name list from electronic medical records; de-duplicating repeated disease names in the disease name list to obtain a de-duplicated disease name list; inputting the de-duplicated disease name list into an exact matching model and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names; inputting the candidate code matching disease names into a fuzzy matching model and performing code matching according to the standard disease classification table to obtain a second code matching result; and generating a disease name code matching list according to the first code matching result and the second code matching result. The application performs multi-dimensional, multi-mode code matching on disease names, which improves the accuracy of disease name code matching; the disease name list can also be stored in a blockchain to improve data privacy and security.

1. A disease name code matching method, characterized by comprising the following steps:

acquiring a disease name list from an electronic medical record;

de-duplicating repeated disease names in the disease name list to obtain a de-duplicated disease name list;

inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names;

inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result;

and generating a disease name code matching list according to the first code matching result and the second code matching result.

2. The disease name code matching method according to claim 1, wherein the exact matching model is composed of a plurality of exact matching sub-models arranged in order; the step of inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names specifically comprises:

inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model;

querying, through the current exact matching sub-model, the standard disease classification table for a standard disease name that matches the input disease name;

when a matching standard disease name is found, taking the found standard disease name and the disease code corresponding to the standard disease name as the first code matching result of the disease name;

when the current exact matching sub-model does not find a matching standard disease name, inputting the disease name into the next exact matching sub-model to continue matching;

and if none of the exact matching sub-models matches the disease name, marking the disease name as a candidate code matching disease name.

3. The disease name code matching method according to claim 2, wherein the step of inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model specifically comprises:

inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of four exact matching sub-models in the exact matching model; the four exact matching sub-models comprise a complete matching sub-model, a stop word sub-model, a primary and secondary separation sub-model and a synonym recognition sub-model.

4. The disease name code matching method according to claim 1, wherein the fuzzy matching model is composed of a plurality of fuzzy matching sub-models; the step of inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result specifically comprises:

inputting the obtained candidate code matching disease names into each fuzzy matching sub-model in the fuzzy matching model;

calculating, based on each fuzzy matching sub-model, the similarity between the candidate code matching disease name and each standard disease name in the standard disease classification table;

and generating the second code matching result according to the similarities calculated by the fuzzy matching sub-models.

5. The disease name code matching method according to claim 4, wherein the step of inputting the obtained candidate code matching disease names into each fuzzy matching sub-model in the fuzzy matching model specifically comprises:

inputting the obtained candidate code matching disease names into four fuzzy matching sub-models in the fuzzy matching model, wherein the four fuzzy matching sub-models comprise a word frequency matching sub-model, an N-Gram sub-model, an edit distance sub-model and a cosine calculation sub-model.

6. The disease name code matching method according to claim 5, wherein when the fuzzy matching sub-model is the edit distance sub-model, the step of calculating the similarity between the candidate code matching disease name and each standard disease name in the standard disease classification table specifically comprises:

calculating the text edit distance between the candidate code matching disease name and each standard disease name in the standard disease classification table;

and normalizing each text edit distance, and taking each normalized text edit distance as the similarity between the candidate code matching disease name and the corresponding standard disease name.

7. The disease name code matching method according to claim 4, wherein the step of generating the second code matching result according to the similarities calculated by the fuzzy matching sub-models specifically comprises:

for each candidate code matching disease name, selecting, from the similarities calculated by each fuzzy matching sub-model, the standard disease name and disease code corresponding to the maximum similarity, and performing HardVoting fusion to obtain the second code matching result;

or,

performing SoftVoting fusion according to the similarities calculated by the fuzzy matching sub-models to obtain the second code matching result.

8. A disease name code matching apparatus, comprising:

the list acquisition module is used for acquiring a disease name list from the electronic medical record;

the list de-duplication module is used for de-duplicating repeated disease names in the disease name list to obtain a de-duplicated disease name list;

the exact matching module is used for inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names;

the fuzzy matching module is used for inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result;

and the list generating module is used for generating a disease name code matching list according to the first code matching result and the second code matching result.

9. A computer device comprising a memory and a processor, the memory having a computer program stored therein, wherein the processor, when executing the computer program, implements the steps of the disease name code matching method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the disease name code matching method according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a disease name code matching method and apparatus, a computer device, and a storage medium.

Background

With the development of internet technology, electronic medical records have become increasingly common. An electronic medical record is individual health information recorded in digital form during medical activities. It is created, collected, managed and consulted by medical staff, and contains rich, multi-dimensional individual health information spanning a long period. ICD-10 (International Classification of Diseases, 10th Revision) is an internationally unified disease classification established by the World Health Organization; it groups diseases into categories and represents each category with a code, so that the categories form an ordered system. ICD-10 records approximately 26,000 disease types, covering all categories of disease. In the medical field, research on applications of electronic medical records is generally based on ICD-10 codes, so it is important to perform code matching on the disease names in electronic medical records, that is, to map the disease names in the electronic medical records to ICD-10 codes.

Existing dedicated ICD-10 code query database systems provide a lookup from a Chinese disease name to an ICD-10 code, but can only handle medical record text that contains a standard disease name. Because medical staff in medical institutions differ in education level and in how they habitually describe diseases, electronic medical records contain a large number of non-standard expressions, such as colloquial expressions, abbreviated disease names, and different expressions for the same disease. As a result, such systems leave a large amount of data unprocessed when performing disease name code matching, and the accuracy of the code matching is low.

Disclosure of Invention

An object of the embodiments of the present application is to provide a method, an apparatus, a computer device and a storage medium for code matching of disease names, so as to solve the problem of low accuracy of code matching of disease names.

In order to solve the above technical problem, an embodiment of the present application provides a disease name code matching method, which adopts the following technical solutions:

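acquiring a disease name list from an electronic medical record;

de-duplicating repeated disease names in the disease name list to obtain a de-duplicated disease name list;

inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names;

inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result;

and generating a disease name code matching list according to the first code matching result and the second code matching result.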

Further, the exact matching model consists of a plurality of exact matching sub-models arranged in order; the step of inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names specifically comprises:

inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model;

querying, through the current exact matching sub-model, the standard disease classification table for a standard disease name that matches the input disease name;

when a matching standard disease name is found, taking the found standard disease name and the disease code corresponding to the standard disease name as the first code matching result of the disease name;

when the current exact matching sub-model does not find a matching standard disease name, inputting the disease name into the next exact matching sub-model to continue matching;

and if none of the exact matching sub-models matches the disease name, marking the disease name as a candidate code matching disease name.

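Further, the step of inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model specifically includes:

inputting each disease name in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of four exact matching sub-models in the exact matching model; the four exact matching sub-models comprise a complete matching sub-model, a stop word sub-model, a primary and secondary separation sub-model and a synonym recognition sub-model.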

Further, the fuzzy matching model consists of a plurality of fuzzy matching submodels; the step of inputting the obtained candidate code matching disease name into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result specifically comprises:

inputting the obtained candidate code matching disease names into each fuzzy matching sub-model in the fuzzy matching model;

calculating the similarity between the candidate code matching disease name and each standard disease name in the standard disease classification table based on each fuzzy matching sub-model;

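and generating the second code matching result according to the similarities calculated by the fuzzy matching sub-models.

Further, the step of inputting the obtained candidate code matching disease names into each fuzzy matching sub-model in the fuzzy matching model specifically includes:

inputting the obtained candidate code matching disease names into four fuzzy matching sub-models in the fuzzy matching model, wherein the four fuzzy matching sub-models comprise a word frequency matching sub-model, an N-Gram sub-model, an edit distance sub-model and a cosine calculation sub-model.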

Further, when the fuzzy matching sub-model is the edit distance sub-model, the step of calculating the similarity between the candidate code matching disease name and each standard disease name in the standard disease classification table specifically includes:

calculating the text edit distance between the candidate code matching disease name and each standard disease name in the standard disease classification table;

and normalizing each text edit distance, and taking each normalized text edit distance as the similarity between the candidate code matching disease name and the corresponding standard disease name.
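The edit distance sub-model can be illustrated with a minimal Python sketch. The source does not state which normalization is used; the formula below, 1 - distance / max(len(a), len(b)), is one common choice and is an assumption, as are the function names `edit_distance` and `edit_similarity`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the distance into [0, 1]; 1.0 means identical strings.

    Assumed normalization: divide by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

With this normalization, a candidate name that differs from a standard name by a single character scores close to 1, while names sharing no characters score 0.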

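Further, the step of generating the second code matching result according to the similarities calculated by the fuzzy matching sub-models specifically includes:

for each candidate code matching disease name, selecting, from the similarities calculated by each fuzzy matching sub-model, the standard disease name and disease code corresponding to the maximum similarity, and performing HardVoting fusion to obtain the second code matching result;

or,

performing SoftVoting fusion according to the similarities calculated by the fuzzy matching sub-models to obtain the second code matching result.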
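The difference between the two fusion modes can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: hard voting is taken to mean a majority vote over each sub-model's top-scoring standard name, and soft voting is taken to mean averaging the similarities across sub-models. The sub-model keys, standard names "A" and "B", and score values are hypothetical.

```python
from collections import Counter, defaultdict


def hard_vote(scores_by_model):
    """Each sub-model votes for its highest-scoring standard name; majority wins.

    Ties are broken by first occurrence, which is an assumption of this sketch.
    """
    votes = Counter(max(scores, key=scores.get)
                    for scores in scores_by_model.values())
    return votes.most_common(1)[0][0]


def soft_vote(scores_by_model):
    """Average the similarities per standard name and pick the highest mean."""
    totals = defaultdict(float)
    for scores in scores_by_model.values():
        for name, similarity in scores.items():
            totals[name] += similarity
    n_models = len(scores_by_model)
    return max(totals, key=lambda name: totals[name] / n_models)


# Hypothetical similarities for one candidate name against two standard names.
scores = {
    "edit_distance": {"A": 0.6, "B": 0.5},
    "ngram":         {"A": 0.6, "B": 0.5},
    "cosine":        {"A": 0.1, "B": 0.9},
}
hard_choice = hard_vote(scores)  # two of three sub-models prefer "A"
soft_choice = soft_vote(scores)  # "B" has the higher average similarity
```

The example is chosen so that the two modes disagree: hard voting ignores how confident each sub-model is, while soft voting lets one strongly confident sub-model outweigh two weakly confident ones.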

In order to solve the above technical problem, an embodiment of the present application further provides a disease name code matching apparatus, including:

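the list acquisition module is used for acquiring a disease name list from the electronic medical record;

the list de-duplication module is used for de-duplicating repeated disease names in the disease name list to obtain a de-duplicated disease name list;

the exact matching module is used for inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names;

the fuzzy matching module is used for inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result;

and the list generating module is used for generating a disease name code matching list according to the first code matching result and the second code matching result.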

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the disease name code matching method when executing the computer program.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the disease name code matching method described above.

Compared with the prior art, the embodiments of the application mainly have the following beneficial effects. First, the disease name list is de-duplicated to reduce the amount of calculation. The de-duplicated disease name list is then input into an exact matching model to obtain a first code matching result, and the disease names that cannot be exactly matched are input, as candidate code matching disease names, into a fuzzy matching model to obtain a second code matching result; both rounds of code matching are performed against a standard disease classification table. Finally, a disease name code matching list is generated from the first and second code matching results. Combining exact matching and fuzzy matching performs multi-dimensional, multi-mode code matching on the disease names, which improves the accuracy of disease name code matching.

Drawings

In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a disease name code matching method according to the present application;

FIG. 3 is a flowchart of one embodiment of step S203 in FIG. 2;

FIG. 4 is a flowchart of one embodiment of step S204 of FIG. 2;

FIG. 5 is a schematic diagram of one embodiment of a disease name code-matching apparatus according to the present application;

FIG. 6 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the disease name code matching method provided in the embodiments of the present application is generally executed by a server, and accordingly, the disease name code matching apparatus is generally disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow diagram of one embodiment of a disease name code matching method according to the present application is shown. The disease name code matching method comprises the following steps:

Step 201, a disease name list is obtained from an electronic medical record.

In this embodiment, the electronic device on which the disease name code matching method runs (for example, the server shown in FIG. 1) may communicate with a terminal or a server through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.

The disease name list may be a list of disease names described in the electronic medical record.

Specifically, the information described in the electronic medical record is structured, and for example, the electronic medical record includes disease name information, symptom record information, and treatment information. The server reads a large number of disease names from the structured electronic medical records to obtain a disease name list. In the disease name list, the electronic medical record identification is stored corresponding to the disease name. The electronic medical record identifier is an identifier of an electronic medical record, and the electronic medical record identifier can be a character string combined by letters, numbers, special symbols and the like.

In one embodiment, the electronic medical records read by the server can come from various terminals or a preset database.

In one embodiment, the server may set a scheduled task to perform disease name code matching periodically, for example once per month or once per quarter, with the code matching task started at a specific time in each month or quarter. The server may use Cron (a scheduled-task facility) in Linux to trigger the information synchronization command; Cron executes a specified task at a scheduled time.

Step 202, performing deduplication processing on the repeated disease names in the disease name list to obtain a deduplicated disease name list.

Specifically, the disease name list may contain a large number of identical disease names. For example, during a peak influenza season, many patients with influenza visit a hospital, so the electronic medical records obtained by the hospital contain many occurrences of the disease name "influenza". Without de-duplication, the server would have to perform a large amount of repeated calculation, which increases the amount of calculation and reduces code matching efficiency.

The server first identifies the disease names that appear repeatedly in the disease name list, and then removes the duplicates to obtain the de-duplicated disease name list.

For each group of repeated disease names, only one is retained and the rest are deleted. The electronic medical record identifiers corresponding to the deleted disease names are stored in association with the electronic medical record identifier corresponding to the retained disease name, so that the code matching results of all disease names can be restored at the end. The electronic medical record identifier corresponding to the retained disease name may be the original identifier, or it may be reset.
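The de-duplication step, including the bookkeeping needed to restore results later, can be sketched in Python. This is a simplified illustration: instead of associating deleted identifiers with a retained identifier, it groups all identifiers under the disease name, which serves the same purpose. The function name `deduplicate` and the identifier format are hypothetical.

```python
def deduplicate(records):
    """Collapse repeated disease names while remembering which records had them.

    records: list of (emr_id, disease_name) pairs.
    Returns (unique_names, ids_by_name), where unique_names keeps first-seen order.
    """
    ids_by_name = {}
    for emr_id, name in records:
        ids_by_name.setdefault(name, []).append(emr_id)
    return list(ids_by_name), ids_by_name


# Hypothetical records read from structured electronic medical records.
records = [
    ("EMR-001", "influenza"),
    ("EMR-002", "influenza"),
    ("EMR-003", "hypertension"),
]
unique_names, ids_by_name = deduplicate(records)
```

Only `unique_names` needs to go through the matching models; `ids_by_name` is kept aside until the final list is assembled.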

Step 203, inputting the de-duplicated disease name list into an exact matching model, and performing code matching according to a standard disease classification table to obtain a first code matching result and candidate code matching disease names.

Specifically, the exact matching model matches disease names exactly at the text level. It performs code matching on the disease names according to the standard disease classification table, that is, it exactly matches each disease name against the standard disease names in the table. The standard disease classification table stores standard disease names and the disease code corresponding to each standard disease name; it may be the 10th revision of the International Classification of Diseases (ICD), ICD-10.

When a disease name can be exactly matched with a standard disease name, the standard disease name and its corresponding disease code are used as the first code matching result. Disease names that the exact matching model cannot match are used as candidate code matching disease names and go through a second round of code matching.
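The split into a first code matching result and candidate names can be sketched in Python. The sketch assumes the standard disease classification table is available as a dictionary from standard disease name to code; `ICD_TABLE` here is a two-entry toy table for illustration only, and `exact_match` is a hypothetical name.

```python
# Toy stand-in for the standard disease classification table (illustrative only).
ICD_TABLE = {
    "Essential (primary) hypertension": "I10",
    "Asthma": "J45",
}


def exact_match(names, icd_table):
    """Return (first_results, candidates).

    first_results maps each matched disease name to (standard name, disease code);
    candidates lists the names left for the fuzzy matching round.
    """
    first_results = {}
    candidates = []
    for name in names:
        if name in icd_table:
            first_results[name] = (name, icd_table[name])
        else:
            candidates.append(name)
    return first_results, candidates


first_results, candidates = exact_match(["Asthma", "high blood pressure"], ICD_TABLE)
```

Here "Asthma" is resolved in the first round, while "high blood pressure" has no literal entry in the table and is passed on as a candidate.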

Step 204, inputting the obtained candidate code matching disease names into a fuzzy matching model, and performing code matching according to the standard disease classification table to obtain a second code matching result.

Specifically, the fuzzy matching model performs fuzzy matching on the candidate code matching disease names through similarity calculation, and takes the standard disease names that fuzzily match the candidate code matching disease names, together with the disease codes corresponding to those standard disease names, as the second code matching result.

In one embodiment, the server may calculate similarities between the candidate code-matching disease names and the standard disease names in the standard disease classification table by using different types of fuzzy matching methods, determine standard disease names that implement fuzzy matching with the candidate code-matching disease names by combining the similarities calculated by the different types of fuzzy matching methods, and use the standard disease names and the disease codes corresponding to the standard disease names as the second code-matching result.
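As a rough illustration of computing similarities with different fuzzy matching methods, the Python sketch below scores one candidate name against every standard name using two measures: Jaccard overlap of character bigrams (standing in for an N-Gram method) and cosine similarity of character-count vectors. The exact formulas are assumptions, since the text above does not specify them, and all function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Set of character bigrams; very short strings fall back to the string itself."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def ngram_similarity(a, b):
    """Jaccard overlap of the two bigram sets, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb)


def cosine_similarity(a, b):
    """Cosine of the angle between character-count vectors, in [0, 1]."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def score_against_table(candidate, standard_names):
    """Similarity of one candidate to every standard name, per method."""
    return {
        "ngram": {s: ngram_similarity(candidate, s) for s in standard_names},
        "cosine": {s: cosine_similarity(candidate, s) for s in standard_names},
    }


scores = score_against_table("asthmaa", ["asthma", "gout"])
```

The per-method score tables produced this way are the input that a later fusion step would combine into a single decision.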

Step 205, generating a disease name code matching list according to the first code matching result and the second code matching result.

Specifically, the server combines the first code matching result and the second code matching result into one list. In the new list, the electronic medical record identifier, the disease name, the standard disease name matched with the disease name, and the disease code corresponding to the standard disease name are stored in correspondence with one another. For a disease name deleted during de-duplication, the server uses the first or second code matching result of the retained disease name associated with it as the code matching result of the deleted disease name, thereby obtaining a complete disease name code matching list.

In one embodiment, the server may also upload the generated disease name code matching list to the blockchain to improve the privacy and security of the list.
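Assembling the final list and restoring the entries removed during de-duplication can be sketched in Python. The sketch assumes both result sets are dictionaries from disease name to (standard name, code), and that the de-duplication step kept a mapping from disease name to record identifiers; these structures and the name `build_code_list` are hypothetical.

```python
def build_code_list(first_results, second_results, ids_by_name):
    """Merge both result sets and expand them to one row per medical record.

    first_results / second_results: disease name -> (standard name, disease code)
    ids_by_name: disease name -> list of electronic medical record identifiers
    Returns rows of (emr_id, disease name, standard name, disease code).
    """
    merged = {**first_results, **second_results}
    rows = []
    for name, emr_ids in ids_by_name.items():
        # Names matched by neither round keep empty result fields.
        standard_name, code = merged.get(name, (None, None))
        for emr_id in emr_ids:
            rows.append((emr_id, name, standard_name, code))
    return rows


# Hypothetical inputs: one exact match, one fuzzy match, one repeated name.
first = {"Asthma": ("Asthma", "J45")}
second = {"asthmaa": ("Asthma", "J45")}
ids_by_name = {"Asthma": ["EMR-001", "EMR-002"], "asthmaa": ["EMR-003"]}
rows = build_code_list(first, second, ids_by_name)
```

Every record identifier that shared a de-duplicated name receives the same result, so each name is matched once but reported once per record.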

In this embodiment, the disease name list is first de-duplicated to reduce the amount of calculation. The de-duplicated disease name list is then input into an exact matching model to obtain a first code matching result, and the disease names that cannot be exactly matched are input, as candidate code matching disease names, into a fuzzy matching model to obtain a second code matching result; both rounds of code matching are performed against the standard disease classification table. Finally, a disease name code matching list is generated from the first and second code matching results. Combining exact matching and fuzzy matching performs multi-dimensional, multi-mode code matching on the disease names, which improves the accuracy of disease name code matching.

Further, the exact match model is composed of several ordered exact match submodels, as shown in fig. 3, and the step 203 may include:

Step 2031, inputting the disease names in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model.

Specifically, the exact matching model may be composed of a plurality of different exact matching sub-models arranged in order. An exact matching sub-model may first perform simple text-level preprocessing on the input disease name and then perform exact matching. The preprocessing may operate on characters or phrases in the disease name, for example correcting wrongly written characters, removing repeated characters or phrases, converting synonyms, and removing meaningless characters. Different exact matching sub-models may apply different text-level preprocessing to the disease name; it will be appreciated that there may also be exact matching sub-models that do not preprocess the disease name.

The server first inputs the disease names in the de-duplicated disease name list into the exact matching sub-models according to the arrangement order of the exact matching sub-models in the exact matching model.

Step 2032, the standard disease name matched with the inputted disease name is searched in the standard disease classification table through the current exact matching sub-model.

Specifically, the current precise matching sub-model preprocesses the input disease names according to a preprocessing program, obtains a standard disease classification table after preprocessing, compares the disease names with the standard disease names in the standard disease classification table one by one in sequence, and queries the standard disease names capable of being matched.

Step 2033, when the matched standard disease name is inquired, the inquired standard disease name and the disease code corresponding to the standard disease name are used as the first code matching result of the disease name.

Specifically, when a standard disease name matching the disease name is found, the matched standard disease name and the disease code corresponding to it are used as the first code matching result of the disease name.

And after the accurate matching sub-model finishes code matching of one disease name, processing the next input disease name. When the disease name can be matched by a certain exact match submodel, the processing of the disease name is finished, and the disease name is not matched by the rest exact match submodels.

Step 2034, when the matching standard disease name is not found in the current exact matching submodel, inputting the disease name to the next exact matching submodel to continue matching.

Specifically, if the current exact matching submodel fails to find the standard disease name matching the disease name in the standard disease classification table, the disease name is input to the next exact matching submodel according to the arrangement order of the exact matching submodel to continue to perform matching.

Step 2035, if the disease name is not matched by each exact match submodel, the disease name is marked as a candidate match disease name.

Specifically, when the exact match submodel cannot match the disease name, the disease name is input into the next exact match submodel for matching. And when the disease names cannot be matched by each precise matching submodel, marking the disease names as candidate code matching disease names.

In this embodiment, each disease name in the de-duplicated disease name list is input into the exact match submodels according to their arrangement order. If a submodel matches the disease name, a first code matching result is generated; if not, the disease name is input into the next exact match submodel to continue matching. Because the exact match submodels differ from one another, the disease names are exactly matched from multiple dimensions, which improves the accuracy of disease name code matching.
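Steps 2031 to 2035 can be sketched as a cascade over an ordered list of preprocessing functions. This is a minimal sketch under stated assumptions: each submodel is modeled as "preprocess, then look up in a dictionary", and the names `exact_match_cascade`, `STANDARD`, `identity` and `drop_left`, as well as the placeholder code, are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Preprocess = Callable[[str], str]


def exact_match_cascade(
    name: str,
    submodels: List[Preprocess],
    standard_table: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Try each submodel in order; return (standard name, code) or None."""
    for preprocess in submodels:
        cleaned = preprocess(name)
        if cleaned in standard_table:
            # Matched: stop here, later submodels never see this name.
            return cleaned, standard_table[cleaned]
    # No submodel matched: the caller marks the name as a candidate.
    return None


# Hypothetical toy table (placeholder code) and two toy submodels.
STANDARD = {"metatarsal fracture": "C001"}


def identity(name: str) -> str:
    return name


def drop_left(name: str) -> str:
    return name.replace("left ", "")


hit = exact_match_cascade("left metatarsal fracture", [identity, drop_left], STANDARD)
miss = exact_match_cascade("broken foot", [identity, drop_left], STANDARD)
```

Here `hit` is produced by the second submodel, while `miss` is `None` and would be passed on to fuzzy matching.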

In an embodiment, the step 203 may specifically include: inputting each disease name in the de-duplicated disease name list into the exact match submodels according to the arrangement order of the four exact match submodels in the exact match model. The four exact match submodels comprise a complete matching submodel, a stop word removal submodel, a primary and secondary separation submodel and a synonym recognition submodel.

Specifically, the exact match model comprises four exact match submodels, which are, in sequence, a complete matching submodel, a stop word removal submodel, a primary and secondary separation submodel and a synonym recognition submodel. The server first inputs each disease name in the de-duplicated disease name list into the complete matching submodel.

Complete matching submodel: used for complete matching of disease names. The input disease name is compared in sequence with the standard disease names in the standard disease classification table; if the disease name is completely identical to a standard disease name, the two are determined to match completely. The complete matching submodel takes the matched standard disease name and the disease code corresponding to it as the first code matching result. Disease names that are not matched by the complete matching submodel are input to the stop word removal submodel.

Stop word removal submodel: removes stop words from the disease name as preprocessing and then matches it. First, punctuation symbols (e.g., ";") are removed from the disease name; then a pre-constructed medical disease-specific stop word library is accessed, in which numbers, direction words and some specific expressions are recorded; the stop word library is called to remove the stop words in the disease name (for example, for the disease name "left metatarsal fracture", the stop word "left" is removed); and the disease name without stop words is matched in sequence against the standard disease names in the standard disease classification table, the matched standard disease name and its corresponding disease code being used as the first code matching result. Unmatched disease names are input to the primary and secondary separation submodel.

Primary and secondary separation submodel: separates the primary and secondary diseases in the disease name as preprocessing and then matches them. A disease name may consist of several disease names joined together, and the primary and secondary separation submodel extracts the primary disease name and the secondary disease name (for example, for the disease name "1. diabetes 2. hypertension", the primary disease name "diabetes" and the secondary disease name "hypertension" are extracted). The primary and secondary disease names are matched in sequence against the standard disease names in the standard disease classification table to obtain the first code matching result. The primary disease name may be the first disease name recognized, and the secondary disease name may be the second disease name recognized. If the disease name consists of several disease names joined together, several code matching results can be obtained, where the primary disease name corresponds to the primary code matching result and the secondary disease name corresponds to the secondary code matching result. Disease names that are not matched by the primary and secondary separation submodel are input to the synonym recognition submodel.

Synonym recognition submodel: performs synonym conversion on the disease name and then matches it. The synonym recognition submodel accesses a pre-constructed disease synonym library that records different expressions of the same body part, different expressions of the same symptom, different expressions of the same disease, and the like. The synonym library is called to replace the synonyms in the disease name; for example, "malignant tumor" is replaced by "cancer", and an alternative name for hyperthyroidism is replaced by its standard form. The disease name after synonym replacement is matched in sequence against the standard disease names in the standard disease classification table to obtain the first code matching result.

Disease names for which none of the four exact match submodels match are labeled as candidate match-code disease names.

It will be appreciated that the four exact match submodels described above may also be arranged in any order.

In this embodiment, the disease names are input to the exact match submodels according to the arrangement order of the four exact match submodels in the exact match model, the four submodels being, in sequence, a complete matching submodel, a stop word removal submodel, a primary and secondary separation submodel and a synonym recognition submodel. Because each of the four exact match submodels matches the disease names by a different method, the accuracy of disease name code matching is improved.
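The preprocessing performed by the stop word removal, primary and secondary separation, and synonym recognition submodels might look roughly like the sketch below. It is illustrative only and makes several assumptions: names are English strings tokenized on whitespace (the original disease names are Chinese, where this does not apply directly), the punctuation set and numbering pattern are guesses, and the function names and toy word lists are hypothetical.

```python
import re
from typing import Dict, List, Set


def remove_stop_words(name: str, stop_words: Set[str]) -> str:
    """Drop punctuation, then drop whitespace-separated stop words."""
    no_punct = re.sub(r"[;,.!?]", " ", name)
    return " ".join(word for word in no_punct.split() if word not in stop_words)


def split_primary_secondary(name: str) -> List[str]:
    """Split a numbered compound name; the first item is the primary disease."""
    parts = re.split(r"\s*\d+\s*[.、]\s*", name)
    return [part.strip() for part in parts if part.strip()]


def replace_synonyms(name: str, synonyms: Dict[str, str]) -> str:
    """Replace every known variant with its standard expression."""
    for variant, standard in synonyms.items():
        name = name.replace(variant, standard)
    return name


cleaned = remove_stop_words("left metatarsal fracture;", {"left"})
diseases = split_primary_secondary("1. diabetes 2. hypertension")
converted = replace_synonyms("lung malignant tumor", {"malignant tumor": "cancer"})
```

Each preprocessed string would then be looked up in the standard disease classification table in the same way as in the complete matching submodel.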

Further, the fuzzy matching model is composed of a plurality of fuzzy matching submodels, as shown in fig. 4, the step 204 may specifically include:

step 2041, inputting the obtained candidate code matching disease names into each fuzzy matching sub-model in the fuzzy matching model.

Specifically, the fuzzy matching model may be composed of a plurality of fuzzy matching submodels, the candidate pair code disease name is input to each fuzzy matching submodel in the fuzzy matching model, and each fuzzy matching submodel may match the candidate pair code disease name by a different fuzzy matching method.

In one embodiment, the step of inputting the obtained candidate code matching disease names into each fuzzy matching submodel in the fuzzy matching model specifically includes: inputting the obtained candidate code matching disease names into four fuzzy matching submodels in the fuzzy matching model, wherein the four fuzzy matching submodels comprise a word frequency matching submodel, an N-Gram submodel, an edit distance submodel and a cosine calculation submodel.

Specifically, the fuzzy matching model consists of four fuzzy matching submodels: a word frequency matching submodel, an N-Gram submodel, an edit distance submodel and a cosine calculation submodel. Each candidate code matching disease name is input into all four fuzzy matching submodels, each of which performs a different kind of fuzzy matching.

The word frequency matching submodel resolves the candidate code matching disease name and each standard disease name in the standard disease classification table into sets of single characters (for example, "diabetes" is resolved into {"sugar", "urine", "disease"}). The Jaccard coefficient is used as the similarity between the character set of the candidate code matching disease name and the character set of each standard disease name, and control parameters are added to the calculation for smoothing. (The Jaccard coefficient, also called the Jaccard index or Jaccard similarity coefficient, compares the similarity and difference between finite sample sets; the larger its value, the more similar the samples.)

For example, if the candidate code matching disease name is "diabetes", the word frequency matching submodel calculates, one by one, the Jaccard coefficient between "diabetes" and each of the 26000 standard disease names in ICD-10, and the calculation formula is as follows:

where A is the character set of the candidate code matching disease name and B is the character set of the standard disease name; Jaccard(A, B) is the similarity between A and B; lenA is the length of set A, i.e., the number of characters in set A; lenB is the length of set B, i.e., the number of characters in set B; len(A ∩ B) is the number of characters common to set A and set B; and α and β are control parameters that are set manually, for example, α may be set to 1 and β to 0.5.

Then the Jaccard coefficient between the candidate code matching disease name "diabetes" and the standard disease name "diabetic foot" is calculated as:

N-Gram submodel: N-Gram is commonly used in natural language processing; the N-Grams of a text are the phrases obtained by segmenting the text into pieces of length N, where N is generally 2 or 3. The N-Gram submodel resolves the candidate code matching disease name and each standard disease name into sets of phrases; for example, with N = 2, "diabetes" is resolved into {"$sugar", "sugar urine", "urine disease", "disease$"}, where $ is a fill character. The similarity between the phrase set of the candidate code matching disease name and the phrase set of each standard disease name is then calculated according to the following formula:

where M is the phrase set of the candidate code matching disease name and N is the phrase set of the standard disease name; Jaccard(M, N) is the similarity between M and N; lenM is the length of set M, i.e., the number of phrases in set M; lenN is the length of set N, i.e., the number of phrases in set N; len(M ∩ N) is the number of phrases common to set M and set N; and the control parameters are set manually.
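The set-based similarity used by the word frequency matching submodel and the N-Gram submodel can be illustrated with the plain Jaccard coefficient. The sketch below deliberately omits the control parameters α and β, so it is not the smoothed formula of this embodiment; the function names and the padding scheme for N greater than 2 are assumptions. The Chinese strings are the ordinary terms for "diabetes" and "diabetic foot".

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Plain Jaccard coefficient |A ∩ B| / |A ∪ B| (no smoothing terms)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def char_set(name: str) -> Set[str]:
    """Resolve a name into its set of single characters."""
    return set(name)


def ngram_set(name: str, n: int = 2, pad: str = "$") -> Set[str]:
    """Resolve a name into its set of N-Grams, padded with a fill character."""
    padded = pad * (n - 1) + name + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


candidate = "糖尿病"    # "diabetes"
standard = "糖尿病足"   # "diabetic foot"

char_similarity = jaccard(char_set(candidate), char_set(standard))
bigram_similarity = jaccard(ngram_set(candidate), ngram_set(standard))
```

With these inputs the character sets share 3 of 4 distinct characters, and the bigram sets share 3 of 6 distinct bigrams.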

Edit distance submodel: used for calculating the Levenshtein distance between the candidate code matching disease name and each standard disease name; the smaller the distance, the higher the similarity.

The Levenshtein distance (also called text edit distance) refers to the minimum number of operations required to convert one string into another, including insertion, deletion, and replacement.

For example: converting "eeba" to "abac":

eeba (delete first e) → eba

eba (replace the remaining e by a) → aba

aba (insert c at the end) → abac

The Levenshtein distance for "eeba" and "abac" is 3.
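The distance in this example can be checked with the standard dynamic-programming formulation of the Levenshtein distance. The sketch below is a generic textbook implementation, not code from the application; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and replacements."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


distance = levenshtein("eeba", "abac")
```

Only two rows of the table are kept at a time, so memory use grows with the length of the target string rather than with the product of both lengths.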

Cosine calculation submodel: the cosine calculation submodel needs to be trained first. A medical corpus is first constructed by crawling medically related data from the network (for example, data from Wikipedia, Baidu Encyclopedia and medical encyclopedias), and a Word2Vec model, which is a model for generating word vectors, is trained on the crawled data. The cosine calculation submodel first segments the candidate code matching disease name into words, then converts each word into a word vector using the trained Word2Vec model, and takes the average of the word vectors as the disease name vector. For example, the disease name "upper respiratory tract infection" is segmented into the words "upper", "respiratory tract" and "infection", and its disease name vector may be represented by the average of the word vectors of these three words. The disease name vector of each standard disease name is calculated in the same way. The disease name vectors are reduced in dimension by a PCA model and translated so that they are centered on the origin, which increases the difference between the vectors. The cosine similarity between the PCA-corrected disease name vector of the candidate code matching disease name and that of each standard disease name is then calculated. PCA (principal component analysis) is mainly used for data dimensionality reduction.
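The vector averaging, centering and cosine steps can be sketched in plain Python as below. This is an illustration under stated assumptions: the two-dimensional word vectors in `WORD_VECTORS` are made-up toy values standing in for trained Word2Vec output, the PCA dimension reduction is omitted, and only the translation to the origin is shown. All function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean_vector(words: List[str], word_vectors: Dict[str, Vector]) -> Vector:
    """Average the word vectors of a segmented disease name."""
    vectors = [word_vectors[word] for word in words]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def center(vectors: List[Vector]) -> List[Vector]:
    """Translate a set of vectors so that their mean sits at the origin."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy vectors (assumed values, not trained embeddings).
WORD_VECTORS = {
    "upper": [1.0, 0.0],
    "respiratory tract": [0.0, 1.0],
    "infection": [1.0, 1.0],
}

name_vector = mean_vector(["upper", "respiratory tract", "infection"], WORD_VECTORS)
similarity = cosine_similarity(name_vector, [1.0, 1.0])
centered = center([[1.0, 2.0], [3.0, 4.0]])
```

In practice the centering would be applied to the whole collection of disease name vectors before the cosine similarities are computed.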

In this embodiment, the candidate code matching disease names are input into the four fuzzy matching submodels in the fuzzy matching model, namely a word frequency matching submodel, an N-Gram submodel, an edit distance submodel and a cosine calculation submodel. Each fuzzy matching submodel matches the candidate code matching disease names, which ensures the accuracy of code matching for the candidate code matching disease names.

It can be understood that the database of the server stores the standard disease classification table, and the storage addresses of the standard disease classification table are prestored in each precise matching sub-model and each fuzzy matching sub-model; and when matching is carried out, each accurate matching sub-model and each fuzzy matching sub-model acquire a standard disease classification table according to the storage address, and matching is carried out according to the standard disease classification table.

Step 2042, based on each fuzzy matching submodel, calculating the similarity between the candidate code-matching disease name and each standard disease name in the standard disease classification table.

Specifically, for each fuzzy matching submodel, the similarity between the input candidate code-matching disease name and each standard disease name in the standard disease classification table is calculated.

In one embodiment, when the fuzzy matching sub-model is the edit distance sub-model, the step of calculating the similarity between the candidate match-code disease name and each standard disease name in the standard disease classification table specifically includes: calculating the text editing distance between the candidate code-matching disease name and each standard disease name in the standard disease classification table; and normalizing each text editing distance, and taking each text editing distance after normalization as the similarity between the candidate code-matching disease name and each standard disease name.

Specifically, for each candidate code-matching disease name, the edit distance sub-model calculates the text edit distance between the candidate code-matching disease name and each standard disease name. The text editing distance is an integer, and the smaller the text editing distance is, the higher the representative similarity is; in order to subsequently perform operation with the similarity obtained by calculation of other fuzzy matching submodels, the text editing distance needs to be normalized, the numerical value of the text editing distance is compressed to an interval [0,1], and the normalized text editing distance is used as the similarity between the candidate code-matching disease name and each standard disease name.

The edit distance submodel may normalize the text edit distance by linear normalization, standard-deviation normalization, non-linear normalization, and the like.

In this embodiment, the edit distance submodel calculates the text edit distance between the candidate code matching disease name and each standard disease name, and uses the normalized text edit distance as the similarity between the candidate code matching disease name and each standard disease name, so that a second code matching result can be generated in combination with the similarities calculated by the remaining fuzzy matching submodels.
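One possible linear normalization is sketched below. It is an assumption rather than the formula of this embodiment: the distances are min-max scaled into [0, 1] and then inverted so that a smaller edit distance yields a higher similarity, and the case where all distances are equal is arbitrarily mapped to 1.0. The function name is hypothetical.

```python
from typing import List


def distances_to_similarities(distances: List[int]) -> List[float]:
    """Min-max scale edit distances to [0, 1], inverted so small = similar."""
    lowest, highest = min(distances), max(distances)
    if highest == lowest:
        # All standard names are equally distant; assumed convention.
        return [1.0 for _ in distances]
    span = highest - lowest
    return [1.0 - (d - lowest) / span for d in distances]


similarities = distances_to_similarities([0, 2, 4])
```

After this step the edit distance submodel produces values on the same [0, 1] scale as the other fuzzy matching submodels, so they can be combined.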

Step 2043, generating a second code matching result according to the similarity calculated by each fuzzy matching submodel.

Specifically, from the similarities calculated by a fuzzy matching submodel, the server may select the standard disease name with the highest similarity and its disease code as the sub code matching result of that fuzzy matching submodel. The sub code matching result that occurs most often among the sub code matching results of the fuzzy matching submodels is taken as the second code matching result.

In one embodiment, each fuzzy matching submodel is preset with a corresponding weight. After the sub code matching results are obtained, the total weight of each distinct sub code matching result is calculated from the weights of the fuzzy matching submodels that produced it, and the sub code matching result with the highest total weight is selected as the second code matching result. For example, assume that there are 4 fuzzy matching submodels, where the sub code matching results of two fuzzy matching submodels are both X and the sub code matching results of the other two are both Y; the weights of the two fuzzy matching submodels whose result is X are both 0.2, and the weights of the two whose result is Y are both 0.3. The total weight of Y (0.6) is greater than the total weight of X (0.4), so Y is taken as the second code matching result.
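The weighted selection in this example can be sketched as follows; the function name `weighted_vote` and the input layout (one pair of result and weight per fuzzy matching submodel) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(votes: List[Tuple[str, float]]) -> str:
    """Sum the submodel weights per result and return the heaviest result."""
    totals: Dict[str, float] = defaultdict(float)
    for result, weight in votes:
        totals[result] += weight
    return max(totals, key=totals.get)


# Two submodels answer X with weight 0.2 each, two answer Y with weight 0.3 each.
winner = weighted_vote([("X", 0.2), ("X", 0.2), ("Y", 0.3), ("Y", 0.3)])
```

With equal weights this reduces to the plain majority selection described in the preceding paragraph.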

In one embodiment, the step of generating the second code matching result according to the similarity calculated by each fuzzy matching submodel specifically includes: for each candidate code matching disease name, screening, from the similarities calculated by each fuzzy matching submodel, the standard disease name and the disease code corresponding to the maximum similarity, and performing HardVoting fusion to obtain the second code matching result; or performing SoftVoting fusion according to the similarities calculated by each fuzzy matching submodel to obtain the second code matching result.

HardVoting fusion selects, from the similarities calculated by each fuzzy matching submodel, the standard disease name with the highest similarity and its disease code, and determines the second code matching result by majority rule. SoftVoting fusion calculates, for each standard disease name, the average of the similarities to the candidate code matching disease name output by the fuzzy matching submodels, and selects the standard disease name with the highest average similarity and its disease code as the second code matching result.

When HardVoting fusion is adopted, for each candidate code matching disease name, the standard disease name with the highest similarity and the corresponding disease code which are obtained by calculation of each fuzzy matching submodel are firstly selected to obtain a plurality of groups of sub code matching results, and the standard disease name with the most occurrence times and the corresponding disease code in the plurality of groups of sub code matching results are used as a second code matching result.

For example, in the calculation result of the word frequency matching submodel, the standard disease name with the highest similarity to the candidate code matching disease name is "peripheral neuropathy", with a similarity of 90%; for the N-Gram submodel it is "peripheral neuropathy", with a similarity of 80%; for the edit distance submodel it is "peripheral neuropathy", with a similarity of 100%; and for the cosine calculation submodel it is "peripheral neuritis", with a similarity of 85%. Among the four sub code matching results, "peripheral neuropathy" occurs 3 times and "peripheral neuritis" occurs once; since "peripheral neuropathy" occurs more often, "peripheral neuropathy" and its corresponding disease code are used as the second code matching result.
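HardVoting fusion for this example can be sketched as below. The function name `hard_vote` and the dictionary layout are assumptions; only the top-scoring entry of each submodel from the example is listed, and ties, which the text does not address, are resolved here in favor of the result encountered first.

```python
from collections import Counter
from typing import Dict


def hard_vote(similarities: Dict[str, Dict[str, float]]) -> str:
    """Take each submodel's top-scoring name, then return the majority name."""
    top_choices = [max(scores, key=scores.get) for scores in similarities.values()]
    return Counter(top_choices).most_common(1)[0][0]


# Submodel -> {standard disease name -> similarity}; top entries only.
example = {
    "word_frequency": {"peripheral neuropathy": 0.90},
    "n_gram": {"peripheral neuropathy": 0.80},
    "edit_distance": {"peripheral neuropathy": 1.00},
    "cosine": {"peripheral neuritis": 0.85},
}

second_result = hard_vote(example)
```

The disease code would then be looked up from the standard disease classification table for the winning standard disease name.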

When SoftVoting fusion is adopted, for each candidate code matching disease name, the similarities between the candidate code matching disease name and all standard disease names are obtained from each fuzzy matching submodel. When there are four fuzzy matching submodels, 4 × 26000 similarities are obtained. The results of the fuzzy matching submodels are then combined: the weighted average of the similarities between each standard disease name and the candidate code matching disease name is calculated, and the standard disease name with the highest average similarity and its disease code are taken as the second code matching result.

By way of example (each fuzzy matching sub-model lists only two standard disease names as examples):

word frequency matching submodel: peripheral neuropathy-similarity 99%; peripheral neuritis-similarity 1%;

N-Gram submodel: peripheral neuropathy-similarity 49%; peripheral neuritis-similarity 51%;

editing the distance submodel: peripheral neuropathy-similarity 40%; peripheral neuritis-similarity 60%;

cosine calculation submodel: peripheral neuropathy-similarity 90%; peripheral neuritis-similarity 10%;

when the weights of the fuzzy matching submodels are the same, the following steps are provided:

weighted average of "peripheral neuropathy" similarity: (99% + 49% + 40% + 90%)/4 = 69.5%;

weighted average of "peripheral neuritis" similarity: (1% + 51% + 60% + 10%)/4 = 30.5%;

Since the weighted average similarity of "peripheral neuropathy" is larger than that of "peripheral neuritis", "peripheral neuropathy" and its corresponding disease code are used as the second code matching result.
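The SoftVoting calculation in this example can be reproduced with the sketch below. The function names and the dictionary layout are assumptions; when no weights are given, all fuzzy matching submodels are weighted equally, as in the example.

```python
from typing import Dict, Optional

Scores = Dict[str, Dict[str, float]]  # submodel -> {standard name -> similarity}


def average_similarities(
    scores: Scores, weights: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Weighted average similarity of each standard name across submodels."""
    if weights is None:
        weights = {model: 1.0 for model in scores}
    total = sum(weights[model] for model in scores)
    averages: Dict[str, float] = {}
    for model, per_name in scores.items():
        for name, similarity in per_name.items():
            averages[name] = averages.get(name, 0.0) + weights[model] * similarity / total
    return averages


def soft_vote(scores: Scores, weights: Optional[Dict[str, float]] = None) -> str:
    """Return the standard name with the highest weighted average similarity."""
    averages = average_similarities(scores, weights)
    return max(averages, key=averages.get)


example = {
    "word_frequency": {"peripheral neuropathy": 0.99, "peripheral neuritis": 0.01},
    "n_gram": {"peripheral neuropathy": 0.49, "peripheral neuritis": 0.51},
    "edit_distance": {"peripheral neuropathy": 0.40, "peripheral neuritis": 0.60},
    "cosine": {"peripheral neuropathy": 0.90, "peripheral neuritis": 0.10},
}

averages = average_similarities(example)
second_result = soft_vote(example)
```

Note that two of the four submodels individually prefer "peripheral neuritis", yet the averaged similarities still favor "peripheral neuropathy", which is the difference between SoftVoting and HardVoting fusion.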

In this embodiment, HardVoting fusion or SoftVoting fusion is performed according to the similarities calculated by each fuzzy matching submodel, so that the result of every fuzzy matching submodel is taken into account when the second code matching result is generated, which improves the accuracy of the second code matching result.

In this embodiment, the candidate code matching disease names are input into each fuzzy matching submodel in the fuzzy matching model, each fuzzy matching submodel uses a different method to calculate the similarity between the candidate code matching disease name and each standard disease name, and the similarities calculated by the fuzzy matching submodels are then combined to generate a second code matching result, which improves the accuracy of code matching for the candidate code matching disease names.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a disease name code matching apparatus, which corresponds to the embodiment of the method shown in fig. 2 and can be applied to various electronic devices.

As shown in fig. 4, the disease name code matching apparatus 300 according to the present embodiment includes: a list acquisition module 301, a list deduplication module 302, an exact match module 303, a fuzzy match module 304, and a list generation module 305, wherein:

the list acquiring module 301 is configured to acquire a disease name list from an electronic medical record.

The list deduplication module 302 is configured to perform deduplication processing on duplicate disease names in the disease name list to obtain a deduplicated disease name list.

The exact match module 303 is configured to input the de-duplicated disease name list into an exact match model, and perform code matching according to the standard disease classification table to obtain a first code matching result and candidate code matching disease names.

The fuzzy matching module 304 is configured to input the obtained candidate code matching disease names into a fuzzy matching model, and perform code matching according to the standard disease classification table to obtain a second code matching result.

The list generating module 305 is configured to generate a disease name code matching list according to the first code matching result and the second code matching result.

In some optional implementations of this embodiment, the exact matching module 303 includes: the name input submodule, the name inquiry submodule, the first generation submodule and the name marking submodule, wherein:

and the name input submodule is used for inputting each disease name in the de-duplicated disease name list to the precise matching sub-model according to the arrangement sequence of the precise matching sub-models in the precise matching model.

And the name query submodule is used for querying the standard disease name matched with the input disease name in the standard disease classification table through the current accurate matching submodel.

The first generation submodule is used for taking the queried standard disease name and the disease code corresponding to the standard disease name as the first code matching result of the disease name when a matching standard disease name is found.

And the name input sub-module is also used for inputting the disease name to the next precise matching sub-model to continue to perform matching when the matched standard disease name is not inquired by the current precise matching sub-model.

And the name marking submodule is used for marking the disease name as a candidate code matching disease name if the disease name is not matched by each accurate matching submodel.

In some optional implementations of this embodiment, the name input submodule is further configured to: input each disease name in the de-duplicated disease name list into the exact match submodels according to the arrangement order of the four exact match submodels in the exact match model; the four exact match submodels comprise a complete matching submodel, a stop word removal submodel, a primary and secondary separation submodel and a synonym recognition submodel.

In some optional implementations of the present embodiment, the fuzzy matching module 304 includes: the input submodule, the calculation submodule and the second generation submodule, wherein:

and the input submodule is used for inputting the obtained candidate code matching disease names into each fuzzy matching submodel in the fuzzy matching model.

And the calculating submodule is used for calculating the similarity between the candidate code matching disease name and each standard disease name in the standard disease classification table based on each fuzzy matching submodel.

The second generation submodule is used for generating a second code matching result according to the similarity calculated by each fuzzy matching submodel.

In some optional implementations of the present embodiment, the input submodule is further configured to: input the obtained candidate code matching disease names into four fuzzy matching submodels in the fuzzy matching model, wherein the four fuzzy matching submodels comprise a word frequency matching submodel, an N-Gram submodel, an edit distance submodel and a cosine calculation submodel.

In some optional implementation manners of this embodiment, when the fuzzy matching sub-model is the edit distance sub-model, the calculating sub-module includes: a distance calculation unit and a distance normalization unit, wherein:

the distance calculation unit is used for calculating the text editing distance between the candidate code matching disease name and each standard disease name in the standard disease classification table;

and the distance normalization unit is used for normalizing each text editing distance and taking each text editing distance after normalization as the similarity between the candidate code-matching disease name and each standard disease name.

In some optional implementations of this embodiment, the second generating sub-module includes: a HardVoting unit or a SoftVoting unit, wherein:

the HardVoting unit is used for screening, for each candidate code matching disease name, the standard disease name and the disease code corresponding to the maximum similarity from the similarities calculated by each fuzzy matching submodel, and performing HardVoting fusion to obtain a second code matching result.

The SoftVoting unit is used for performing SoftVoting fusion according to the similarities calculated by each fuzzy matching submodel to obtain a second code matching result.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having components 41-43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device may be a desktop computer, a notebook computer, a handheld computer, a cloud server, or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device, or the like.

The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. The memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is generally used for storing the operating system and the various types of application software installed on the computer device 4, such as the program code of the disease name code matching method. The memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 42 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run the program code of the disease name code matching method.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.

The computer device provided in this embodiment may perform the steps of the disease name code matching method described above, that is, the steps of the method in any of the foregoing embodiments.

The present application further provides another embodiment, namely a computer-readable storage medium storing a disease name code matching program, the program being executable by at least one processor to cause the at least one processor to perform the steps of the disease name code matching method described above.

From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented in hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present application.

It should be understood that the above-described embodiments are merely some, not all, of the embodiments of the present application, and that the appended drawings show preferred embodiments without limiting the scope of the application. The application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
