Standard entity text determination method and device based on BiLSTM model and storage medium

Document No.: 1953456  Publication date: 2021-12-10

Reading note: This technique, "Standard entity text determination method, device and storage medium based on BiLSTM model", was designed and created by Wen Tiancai, Zhou Xuezhong, Zhu Qiang and Li Mingyang on 2021-08-31. Its main content is as follows: The invention provides a method, a device and a storage medium for determining standard entity text based on a BiLSTM model, wherein the method comprises the following steps: for a received text entity to be matched, selecting a candidate entity set corresponding to the text entity to be matched; for each candidate entity in the candidate entity set, respectively forming a text entity pair with the text entity to be matched; for each text entity pair, calculating a first similarity feature vector of the pair with a preset neural matching network, and calculating a second similarity feature vector of the pair with a text statistical method and a fully connected network; splicing the first similarity feature vector and the second similarity feature vector of each text entity pair with a splicing network to form the similarity vector of each entity pair, and outputting the similarity of the two entity texts in each pair according to the similarity vector of each text entity pair; and determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

1. A method for determining standard entity text based on a BiLSTM model, characterized by comprising the following steps:

for a received text entity to be matched, selecting a candidate entity set corresponding to the text entity to be matched;

for each candidate entity in the candidate entity set, respectively forming a text entity pair with the text entity to be matched;

for each text entity pair, calculating a first similarity feature vector of the text entity pair by adopting a preset neural matching network, and calculating a second similarity feature vector of the text entity pair by adopting a text statistical method and a fully connected network;

splicing the first similarity feature vector and the second similarity feature vector of each text entity pair by adopting a splicing network to form a similarity vector of each entity pair, and outputting the similarity of the two entity texts in each entity pair according to the similarity vector of each text entity pair;

and determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

2. The method for determining standard entity text based on the BiLSTM model as claimed in claim 1, wherein said calculating the first similarity feature vector of the text entity pair by using the preset neural matching network comprises:

encoding the text entity to be matched and the candidate entity in a text entity pair with an RNN model and a CNN network respectively, to form a text entity RNN code to be matched, a text entity CNN code to be matched, a candidate entity RNN code and a candidate entity CNN code;

calculating the forward attention weight of the text entity RNN code to be matched relative to the candidate entity RNN code and the reverse attention weight of the candidate entity RNN code relative to the text entity RNN code to be matched;

determining a candidate entity maximum pooling vector, a candidate entity average pooling vector, a text entity maximum pooling vector to be matched and a text entity average pooling vector to be matched according to the forward attention weight, the reverse attention weight, the candidate entity RNN code, the text entity RNN code to be matched, the candidate entity CNN code and the text entity CNN code to be matched;

based on a fully connected network, determining the similarity between the text to be matched and the candidate text in the text entity pair according to the candidate entity maximum pooling vector, the candidate entity average pooling vector, the text entity maximum pooling vector to be matched and the text entity average pooling vector to be matched, and determining the corresponding first similarity feature vector according to the similarity of each text entity pair.

3. The method of claim 1, wherein the loss function of the preset neural matching network is:

loss(x_i, x_j) = y · ‖f_i − f_j‖² + (1 − y) · max(m − ‖f_i − f_j‖, 0)²

wherein the inputs of the loss are an input entity pair x_i and x_j; f_i and f_j respectively represent the vectors of the input entity pair after encoding and mapping; m represents a distance boundary value between input samples and is a preset hyper-parameter; and y is the label of the input.

4. The method of claim 1, wherein the entity to be matched is a text entity, an English abbreviation entity or a Chinese-English mixed entity.

5. The method of claim 4, wherein if the text to be matched is a text entity, the selecting a candidate entity set corresponding to the received text entity to be matched comprises:

calculating Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database;

selecting the entities to be candidate whose Jaccard coefficient is not less than a preset value to form an entity set to be candidate;

and screening the entities to be candidate with the same semantic as the text entities to be matched from the entity sets to be candidate to form a candidate entity set.

6. The method of claim 5, wherein the calculating Jaccard coefficients of the text entity to be matched and the entities to be candidate stored in the database comprises:

calculating Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database by adopting a first mathematical model; the first mathematical model is:

J(A, B_i) = |A ∩ B_i| / |A ∪ B_i|

wherein A is the set of characters or letters forming the entity to be matched, and B_i is the set of characters or letters forming the ith candidate entity.

7. The method as claimed in claim 4, wherein if the entity to be matched is an English abbreviation entity or a Chinese-English mixed entity, the selecting a candidate entity set corresponding to the received text entity to be matched comprises:

adopting a trained third neural network, taking the entity to be matched as input and the candidate entities matched with the text to be matched as output, and forming the candidate entities corresponding to the same entity to be matched into a candidate entity set.

8. A device for determining standard entity text based on a BiLSTM model, comprising:

the selection module is used for selecting a candidate entity set corresponding to the received text entity to be matched;

the pairing module is used for forming, for each candidate entity in the candidate entity set, a text entity pair with the text entity to be matched respectively;

the feature vector module is used for calculating a first similarity feature vector of each text entity pair by adopting a preset neural matching network and calculating a second similarity feature vector of each text entity pair by adopting a text statistical method and a fully connected network;

the similarity module is used for splicing the first similarity feature vector and the second similarity feature vector of each text entity pair by adopting a splicing network to form a similarity vector of each entity pair, and outputting the similarity of the two entity texts in each entity pair according to the similarity vector of each text entity pair;

and the entity determining module is used for determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

9. A device for determining standard entity text based on a BiLSTM model, comprising: a memory and a processor communicatively connected to each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform the method for determining standard entity text based on the BiLSTM model according to any one of claims 1-7.

10. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement a method for BiLSTM model-based standard entity text determination as claimed in any one of claims 1-7.

Technical Field

The invention relates to the technical field of natural language text information processing and medical big data mining, and in particular to a method, a device and a storage medium for determining standard entity text based on a BiLSTM model.

Background

The problem of entity-name ambiguity is pervasive in natural language processing. A medical disease diagnosis record contains information such as the name of the patient's primary diagnosed disease, the names of secondary diagnoses (accompanying diseases), and the operations performed for the diagnosed disease. Because of the variety of disease types and differences in physicians' experience, the same disease often has many different surface forms, which poses a great challenge to the standardization of electronic medical record data. Moreover, since medical texts are at present entered mainly by hand, input errors inevitably occur, making it difficult to match the erroneous terms to standard terms. At the same time, diversified expression produces a great number of irregular candidate names for the same diagnosis entity: for example, "Bartter syndrome" and "juxtaglomerular cell hyperplasia" may literally be taken as two completely different clinical disease entities, yet it can be determined from the medical standard knowledge base that both should uniquely correspond to the standard entity "Bartter syndrome".

To solve the above problem, normalization is often performed manually when the amount of data is small, but for the large numbers of terms that need processing this is time-consuming and labor-intensive. The task of candidate entity disambiguation is to establish a mapping between a given entity mention in text (the name by which an entity is referred to in an article or domain) and the corresponding entity in a knowledge base. Candidate entity disambiguation aims to resolve the name ambiguity that is widespread in text; it plays an important role in natural language processing applications and can effectively support tasks such as semantic networks, information retrieval, information extraction, and automatic question answering. Standardizing medical terms with computer models is therefore an effective means of disambiguating medical candidate entities at scale.

At present, most algorithms for candidate entity disambiguation target English; research on Chinese is relatively scarce, and work on medical entity disambiguation is rarer still. For the disambiguation of disease entities in the medical field, common candidate entity disambiguation methods cannot be used directly, because disease names are diverse and diagnosis information is incomplete. If the standard entity text in a medical entity standardization task is determined with an existing candidate entity disambiguation method, the applicability is low, the obtained result differs substantially from the actual standard entity text, and both the accuracy and the efficiency of standard entity text determination are poor.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, and a storage medium for determining standard entity text based on a BiLSTM model, so as to solve the problems that, when the standard entity text in a medical entity standardization task is determined with an existing candidate entity disambiguation method, the applicability is low, the obtained result differs substantially from the actual standard entity text, and the accuracy and efficiency of standard entity text determination are low.

In a first aspect, a method for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention includes:

for a received text entity to be matched, selecting a candidate entity set corresponding to the text entity to be matched;

for each candidate entity in the candidate entity set, respectively forming a text entity pair with the text entity to be matched;

for each text entity pair, calculating a first similarity feature vector of the text entity pair by adopting a preset neural matching network, and calculating a second similarity feature vector of the text entity pair by adopting a text statistical method and a fully connected network;

splicing the first similarity feature vector and the second similarity feature vector of each text entity pair by adopting a splicing network to form a similarity vector of each entity pair, and outputting the similarity of the two entity texts in each entity pair according to the similarity vector of each text entity pair;

and determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

Preferably, the calculating a first similarity feature vector of the text entity pair by using a preset neural matching network includes:

encoding the text entity to be matched and the candidate entity in a text entity pair with an RNN model and a CNN network respectively, to form a text entity RNN code to be matched, a text entity CNN code to be matched, a candidate entity RNN code and a candidate entity CNN code;

calculating the forward attention weight of the text entity RNN code to be matched relative to the candidate entity RNN code and the reverse attention weight of the candidate entity RNN code relative to the text entity RNN code to be matched;

determining a candidate entity maximum pooling vector, a candidate entity average pooling vector, a text entity maximum pooling vector to be matched and a text entity average pooling vector to be matched according to the forward attention weight, the reverse attention weight, the candidate entity RNN code, the text entity RNN code to be matched, the candidate entity CNN code and the text entity CNN code to be matched;

based on a fully connected network, determining the similarity between the text to be matched and the candidate text in the text entity pair according to the candidate entity maximum pooling vector, the candidate entity average pooling vector, the text entity maximum pooling vector to be matched and the text entity average pooling vector to be matched, and determining the corresponding first similarity feature vector according to the similarity of each text entity pair.
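The forward and reverse attention weights between the two RNN encodings in the steps above can be sketched in plain Python as below. This is a minimal illustration only: dot-product scoring with row-wise softmax normalization is an assumption, since the text does not fix the exact scoring function.

```python
import math

def _softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def attention_weights(H_m, H_e):
    """Forward/reverse attention between two RNN-encoded sequences.

    H_m: list of hidden vectors of the text entity to be matched.
    H_e: list of hidden vectors of the candidate entity.
    Dot-product scoring is an illustrative assumption.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    scores = [[dot(hm, he) for he in H_e] for hm in H_m]     # len(H_m) x len(H_e)
    forward = [_softmax(row) for row in scores]              # to-be-matched attends to candidate
    reverse = [_softmax(list(col)) for col in zip(*scores)]  # candidate attends to to-be-matched
    return forward, reverse
```

Each row of `forward` weights the candidate entity's positions for one position of the text to be matched; `reverse` does the opposite, and both weight matrices then feed the pooling step.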

Preferably, the loss function of the preset neural matching network is as follows:

loss(x_i, x_j) = y · ‖f_i − f_j‖² + (1 − y) · max(m − ‖f_i − f_j‖, 0)²

wherein the inputs of the loss are an input entity pair x_i and x_j; f_i and f_j respectively represent the vectors of the input entity pair after encoding and mapping; m represents a distance boundary value between input samples and is a preset hyper-parameter; and y is the label of the input. When y_i is not equal to y_j, the input entities do not match and the loss function value is the right half of the formula: the loss is larger the closer the Euclidean distance of the sample pair is to zero, and it falls to zero once the distance exceeds the margin m.
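A minimal executable sketch of this contrastive loss, assuming the standard form with y = 1 for a matched pair and y = 0 otherwise:

```python
import math

def contrastive_loss(f_i, f_j, y, m=1.0):
    """Contrastive loss of a Siamese pair: pull matched pairs together,
    push mismatched pairs apart up to the distance margin m."""
    d = math.dist(f_i, f_j)  # Euclidean distance of the encoded pair
    return y * d ** 2 + (1 - y) * max(m - d, 0.0) ** 2
```

A mismatched pair contributes no loss once its distance exceeds the margin m, so the network only separates pairs that are still too close.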

Preferably, the entity to be matched is a text entity, an English abbreviation entity or a Chinese-English mixed entity.

Preferably, if the text to be matched is a text entity, the selecting a candidate entity set corresponding to the received text entity to be matched includes:

calculating Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database;

selecting the entities to be candidate whose Jaccard coefficient is not less than a preset value to form an entity set to be candidate;

and screening the entities to be candidate with the same semantic as the text entities to be matched from the entity sets to be candidate to form a candidate entity set.

Preferably, the calculating the Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database includes:

calculating Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database by adopting a first mathematical model; the first mathematical model is:

J(A, B_i) = |A ∩ B_i| / |A ∪ B_i|

wherein A is the set of characters or letters forming the entity to be matched, and B_i is the set of characters or letters forming the ith entity to be candidate.
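The first mathematical model can be sketched over character sets as follows, treating each entity string as the set of its characters; returning 1.0 for two empty strings is a convention chosen here for the illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient J(A, B) = |A n B| / |A u B| of two entity strings,
    where A and B are the sets of characters forming the entities."""
    A, B = set(a), set(b)
    if not A and not B:
        return 1.0  # convention for two empty strings
    return len(A & B) / len(A | B)
```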

Preferably, if the entity to be matched is an English abbreviation entity or a Chinese-English mixed entity, the selecting a candidate entity set corresponding to the received text entity to be matched includes:

adopting a trained third neural network, taking the entity to be matched as input and the candidate entities matched with the text to be matched as output, and forming the candidate entities corresponding to the same entity to be matched into a candidate entity set.

In a second aspect, an apparatus for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention includes:

the selection module is used for selecting a candidate entity set corresponding to the received text entity to be matched;

the pairing module is used for forming, for each candidate entity in the candidate entity set, a text entity pair with the text entity to be matched respectively;

the feature vector module is used for calculating a first similarity feature vector of each text entity pair by adopting a preset neural matching network and calculating a second similarity feature vector of each text entity pair by adopting a text statistical method and a fully connected network;

the similarity module is used for splicing the first similarity feature vector and the second similarity feature vector of each text entity pair by adopting a splicing network to form a similarity vector of each entity pair, and outputting the similarity of the two entity texts in each entity pair according to the similarity vector of each text entity pair;

and the entity determining module is used for determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

In a third aspect, an apparatus for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention includes: a memory and a processor communicatively connected with each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method for determining standard entity text based on the BiLSTM model described above.

In a fourth aspect, a non-transitory computer-readable storage medium is provided according to an embodiment of the present invention, and stores computer instructions, which when executed by a processor, implement any one of the above methods for determining standard entity text based on a BiLSTM model.

The method, the device and the storage medium for determining the standard entity text based on the BiLSTM model provided by the embodiments of the invention have at least the following beneficial effects:

according to the standard entity text determining method, device and storage medium based on the BilSTM model, the candidate entity set corresponding to the received text entity to be matched can be selected, each candidate entity in the candidate entity set and the text entity to be matched respectively form a text entity pair, model training is performed on the text entity pairs, the use rate of text data is improved, the situations that training data is single and a training result is inaccurate due to the fact that only terms are trained per se are avoided, and applicability is improved. Training a text entity pair through a preset neural matching neural network, synchronously training each data in the text entity pair through a twin network architecture model of the preset neural matching neural network, conveniently measuring semantic similarity of the text entity pair through a similarity vector mode to obtain a first similarity feature vector, easily splicing and integrating encoded vectors later, reducing complexity of the model, improving training efficiency, and calculating through a text statistical method and a full-connection network to obtain a second similarity feature vector of the text entity pair; and splicing and integrating the first similarity characteristic vector and the second similarity characteristic vector, calculating the similarity of each text entity pair, comparing the similarity of each text entity pair, and determining the standard text corresponding to the text entity pair with the maximum similarity. The consistency rate of the standard entity corresponding to the entity text pair with the maximum similarity pair and the actual standard text result is higher, the accuracy of the standard text entity result in the medical entity standardization task is improved, and the efficiency of the standard entity text determination is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a method for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention;

FIG. 2 is a flowchart of another method for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for determining a standard entity text based on a BiLSTM model according to an embodiment of the present invention;

FIG. 5 is a diagram of a BiLSTM-based twin network model according to an embodiment of the present invention;

FIG. 6 is a model diagram of a fusion attention mechanism based on a twin network architecture according to an embodiment of the present invention;

FIG. 7 is a diagram of a fusion depth matching model according to an embodiment of the present invention;

fig. 8 is a partial english abbreviation comparison diagram provided in the embodiment of the present invention;

FIG. 9 is a block diagram of a standard entity text determination apparatus based on the BiLSTM model according to an embodiment of the present invention;

fig. 10 is a schematic diagram of a standard entity text determination apparatus based on a BiLSTM model according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1

In the candidate entity recall stage, a candidate entity set is constructed from the literal similarity of the character strings, text statistical features, Elasticsearch engine retrieval, and similar means; text matching at this stage amounts only to coarse screening. Non-standard text data has diversified forms of expression: different words in Chinese text can carry the same meaning, and some diagnosis original words with similar expression differ in word order. Because the text matching used in coarse screening has low accuracy and cannot meet practical requirements, semantic similarity matching is performed on top of it. In the candidate entity disambiguation stage, text semantic matching information can then be used to improve the quality of entity standardization.

At present, semantic similarity matching based on deep learning mainly follows two frameworks: the Siamese (twin) network, which models the input entity pair with a symmetric, parameter-sharing network, and the interactive matching framework, which is generally more complex but strengthens the learning of interactive representations between the input text entity pair. Building on the text matching ideas of these two frameworks, this application proposes a semantic similarity matching and classification model adapted to the clinical term standardization task.

BERT-based models are now widely applied in the field of text matching, but their parameter size and inference time are huge, a cost that is not negligible in an actual production environment; this application therefore first considers a model based on the BiLSTM network and the twin network architecture. Researchers such as Paul et al. have used twin-network-based models to solve a standardization problem for job title entities on recruitment websites, for example mapping the string "Software Engineer Java/J2EE" to a pre-specified code "Java Developer" as needed. The present application introduces this framework into medical entity standardization. The general flow of the twin network model is to encode the two input entities separately while sharing model parameters between them, which reduces model complexity, then to splice and integrate the encoded vectors, and finally to measure the semantic similarity of the two input entities with a similarity metric.

The BiLSTM-based twin network structure is shown in FIG. 5 and mainly comprises an input layer, a word embedding representation layer, an encoding layer, an integration layer and a similarity prediction layer. The input of the model is an entity pair consisting of a diagnosis original word m and a diagnosis standard coded word e; their embedding vectors are obtained through the word embedding representation layer, using the word vector model trained by Li et al. on Baidu Encyclopedia text data. The encoding layer uses a BiLSTM: the encoder first encodes the input into a feature vector, and since the bidirectional LSTM propagates state in both the forward and backward directions, it captures the information of the medical entity better.

The embodiment of the invention provides a method for determining standard entity text based on a BiLSTM model, shown in FIG. 1, FIG. 6 and FIG. 7, comprising the following steps:

step S11, aiming at the received text entities to be matched, selecting a candidate entity set corresponding to the text entities to be matched; there are a number of conventional algorithms for calculating text similarity for generating a set of candidate entities. The method comprises the following steps: a traditional text statistics-based method, such as a Dice distance algorithm for fusion co-occurrence evaluation; mapping the text to a vector space, and then utilizing methods such as cosine similarity calculation and the like; similarity calculation method based on edit distance, character level and character string sequence. In order to extract richer and comprehensive feature information, the embodiment utilizes the result obtained by using the traditional statistical feature method in the candidate entity recall stage, and the result is firstly coded as the traditional text feature information and then introduced into the deep matching network model. And selecting a candidate entity set corresponding to the received text entity to be matched.

Step S12, for each candidate entity in the candidate entity set, respectively forming a text entity pair with the text entity to be matched;

step S13, aiming at each text entity pair, calculating a first similarity feature vector of the text entity pair by adopting a preset neural matching neural network, and calculating a second similarity feature vector of the text entity pair by adopting a text statistical method and a full-connection network; for each candidate entity in the input candidate entity set, a text entity pair is respectively formed with a text entity to be matched, on one hand, a general deep learning semantic matching model is continuously used to extract semantic matching features of the input entity pair, and the model can use any deep semantic matching model to obtain a feature vector v1. On the other hand, the traditional text features refer to the traditional feature information of the diagnosis original words and the standard coded text pairs, extraction and normalization are completed, the results are stored after offline calculation, and the traditional language features comprise cosine similarity, Jacard similarity and BM25 similarity based on vector space. Then normalizing the results through a full connection layer to obtain a nonlinear feature vector v2

Step S14, splicing the first similarity feature vector and the second similarity feature vector of each text entity pair with a splicing network to form the similarity vector of each entity pair, and outputting the similarity of the two entity texts in each pair according to that similarity vector. The feature vectors v1 and v2 can be spliced directly and then passed through a fully connected layer; finally, classification is performed and the semantic similarity score is obtained through a softmax layer. In this embodiment, the result of entity normalization is evaluated with accuracy, which refers to the ratio of the number of combinations of diagnosis original words and standard codes predicted correctly to the size of the set to be predicted in the test set; the accuracy formula is as follows:

accuracy = (1/n) · Σ_{(m,e)∈D} I[f(m, e) = label]

wherein f represents the model, n represents the size of the test set, D is the test set, label is the gold label of each pair, and I[·] is the indicator function, equal to 1 when the prediction matches the label and 0 otherwise.
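A minimal sketch of this evaluation, assuming `model(m, e)` returns the predicted label for an (original word, standard code) pair:

```python
def accuracy(model, test_set):
    """Ratio of (diagnosis original word, standard code) pairs predicted
    correctly to the size of the test set."""
    correct = sum(1 for m, e, label in test_set if model(m, e) == label)
    return correct / len(test_set)
```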

Step S15, determining the candidate text entity in the text entity pair with the highest similarity as the standard text entity corresponding to the text entity to be matched.

In the general entity linking problem, there is often enough relevant description and context information for the entity to be standardized or the standard encoding part, which can be used to help the training of the model. However, in the standardization task of the medical entity, most texts in the experimental data only have the term name per se, and no other information can be provided for use. The method comprises the steps of selecting a candidate entity set corresponding to a received text entity to be matched, aiming at each candidate entity in the candidate entity set, forming a text entity pair with the text entity to be matched respectively, carrying out model training on the text entity pair, improving the utilization rate of text data, avoiding the conditions of single training data and inaccurate training result caused by training only terms per se, and improving the applicability. Training a text entity pair through a preset neural matching neural network, synchronously training each data in the text entity pair through a twin network architecture model of the preset neural matching neural network, conveniently measuring semantic similarity of the text entity pair through a similarity vector mode to obtain a first similarity feature vector, easily splicing and integrating encoded vectors later, reducing complexity of the model, improving training efficiency, and calculating through a text statistical method and a full-connection network to obtain a second similarity feature vector of the text entity pair; and splicing and integrating the first similarity characteristic vector and the second similarity characteristic vector, calculating the similarity of each text entity pair, comparing the similarity of each text entity pair, and determining the standard text corresponding to the text entity pair with the maximum similarity. 
The standard entity corresponding to the entity text pair with the maximum similarity agrees with the actual standard text result at a higher rate, which improves the accuracy of the standard text entity result in the medical entity standardization task and the efficiency of standard entity text determination.
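The selection step described above (step S15) reduces to an argmax over pair similarities. The sketch below is illustrative: `score_pair` is a hypothetical stand-in for the full matching network, and the character-overlap scorer is a toy substitute, not the model described in this document.

```python
# Minimal sketch of step S15: among all <mention, candidate> pairs,
# pick the candidate whose pair with the mention scores highest.
# `score_pair` stands in for the full matching network (hypothetical).
def pick_standard_entity(mention, candidates, score_pair):
    """Return the candidate whose pair with `mention` scores highest."""
    return max(candidates, key=lambda c: score_pair(mention, c))

# Toy scorer: character-set overlap ratio as a stand-in similarity.
def char_overlap(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

standard = pick_standard_entity(
    "acute appendicitis", ["appendicitis", "acute bronchitis"], char_overlap)
```

With a trained model, `char_overlap` would be replaced by the similarity output of the spliced feature vectors.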

In combination with the above embodiments, in the embodiments of the present invention, the twin network framework focuses on modeling the basic information of each sentence and ignores the interaction between the two sentences during encoding. Based on this, the embodiment provides a semantic similarity matching model that fuses representation learning and interactive learning. The model uses a twin network based on multi-layer CNNs to extract key information from the input entity pairs, and adopts attention-based RNNs to capture the interaction between the two sentences. Compared with traditional sequential coding, introducing twin-network CNNs reduces computational complexity and captures finer-grained features; combining the CNNs and RNNs then better captures the similarities and differences between the two entities; finally, a fusion layer combines the representations of the two input entities to compute the final similarity. Referring to fig. 2 and 6, in step S13, calculating the first similarity feature vector of the text entity pair by using the preset neural matching network includes:

Step S131, sequentially encoding the text entity to be matched and the candidate entity in each text entity pair with an RNN model and a CNN network, respectively forming an RNN code of the text entity to be matched, a CNN code of the text entity to be matched, an RNN code of the candidate entity, and a CNN code of the candidate entity. The input coding layer is structurally divided into two parts: an RNN encoder and a CNN encoder. The RNN mainly captures the sequence information of the text, and the CNN mainly captures the keyword information of the text. Both encoders are briefly described below:

1) RNN encoder

A BiLSTM encoder encodes the input diagnosis original word to be standardized and the standard entity to capture the feature information of the sentence sequence. Given the diagnosis original word a = (a1, …, a_la) and a possibly corresponding standard entity b = (b1, …, b_lb), the BiLSTM encoder produces the hidden-layer states āi and b̄j at time nodes i and j respectively. The calculation formula (reconstructed here in standard BiLSTM form, consistent with the ESIM-style model referenced below) is:

āi = BiLSTM(a, i), i ∈ {1, …, la}

b̄j = BiLSTM(b, j), j ∈ {1, …, lb}

2) CNN encoder

The model uses a CNN for secondary encoding on top of the RNN encoding, capturing word-granularity feature information with the CNN convolution kernels to obtain new coding information. The improved CNN borrows the idea of NIN (Network in Network), replacing the generalized linear model with a multilayer perceptron and thereby improving the abstract expression capability of the features.

By combining BiLSTM and CNN in the input coding layer, the model can more fully capture the fine-grained feature information of the two diagnostic texts to be compared. The enhanced input encoder then uses both the RNN coding and the RNN-plus-CNN coding to capture interaction information between the input text pair in the subsequent interaction modeling layer.

Step S132, calculating the forward attention weight of the RNN code of the text entity to be matched relative to the RNN code of the candidate entity, and the reverse attention weight of the RNN code of the candidate entity relative to the RNN code of the text entity to be matched. The model is similar to the ESIM model in that both are interactive text matching models; therefore, after input coding is completed, direct interaction information between the two entity texts to be linked is captured through an attention mechanism, and soft attention alignment is used to obtain the sentence representation of each entity. For soft attention alignment, the RNN codes output by the BiLSTM are first used to compute the soft attention weight e_ij between the diagnosis original word to be compared and the standard coding entity. For the two entity texts to be compared, two different attention-weighted representations are then obtained: ã, for entity mention a aligned against standard coding entity b, and conversely b̃, for entity b aligned against entity a, each computed by weighted summation. In this way the interaction information between the two entities is captured more comprehensively in both directions. The calculation formulas (reconstructed here in the standard ESIM form) are:

e_ij = āi^T b̄j

ãi = Σ_j [exp(e_ij) / Σ_k exp(e_ik)] · b̄j

b̃j = Σ_i [exp(e_ij) / Σ_k exp(e_kj)] · āi
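The soft-attention alignment step can be sketched in plain Python. The 2-dimensional toy "hidden states" and all function names below are illustrative stand-ins for the BiLSTM outputs, not the patent's actual implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_align(a_states, b_states):
    """Return attention-aligned representations (a_tilde, b_tilde).

    e[i][j] is the dot product of hidden states a_i and b_j; each
    token of `a` is re-represented as a softmax-weighted sum over the
    states of `b`, and vice versa (bidirectional attention).
    """
    e = [[sum(x * y for x, y in zip(ai, bj)) for bj in b_states]
         for ai in a_states]
    a_tilde = []
    for i in range(len(a_states)):
        w = softmax(e[i])  # attention of a_i over all b_j
        a_tilde.append([sum(wj * bj[d] for wj, bj in zip(w, b_states))
                        for d in range(len(b_states[0]))])
    b_tilde = []
    for j in range(len(b_states)):
        w = softmax([e[i][j] for i in range(len(a_states))])
        b_tilde.append([sum(wi * ai[d] for wi, ai in zip(w, a_states))
                        for d in range(len(a_states[0]))])
    return a_tilde, b_tilde
```

With real BiLSTM states the same two loops produce the forward and reverse attention-weighted sentence representations.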

Step S133, determining a candidate entity maximum pooling vector, a candidate entity average pooling vector, a maximum pooling vector of the text entity to be matched and an average pooling vector of the text entity to be matched, according to the forward attention weight, the reverse attention weight, the candidate entity RNN code, the RNN code of the text entity to be matched, the candidate entity CNN code and the CNN code of the text entity to be matched. After the entity interaction sentence representations ã and b̃ are obtained by soft attention alignment, maximum pooling and mean pooling are used to further capture the feature information of the text, which is then combined with the RCNN coding information. The specific calculation process is as follows:

where v_a and v_b are formed from ā, ã and b̄, b̃ together with their element-wise differences and products, and ave and max denote mean pooling and maximum pooling, respectively.
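The pooling step is straightforward to illustrate: mean- and max-pooling over the token dimension, concatenated into one vector. The function names here are our own, not from the patent text.

```python
# Illustrative pooling over a sequence of token vectors: the pooled
# representation concatenates the average-pooled and max-pooled
# vectors, i.e. [ave(states); max(states)].
def mean_pool(states):
    n = len(states)
    return [sum(s[k] for s in states) / n for k in range(len(states[0]))]

def max_pool(states):
    return [max(s[k] for s in states) for k in range(len(states[0]))]

def pooled_repr(states):
    """Concatenate average- and max-pooled vectors."""
    return mean_pool(states) + max_pool(states)
```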

Unlike previous interactive-based text matching methods, the model uses both RNN and CNN for interactive modeling to obtain interactive representations of two texts. By combining the advantages of RNN and CNN, finer grained features can be captured. Meanwhile, due to a specific parameter sharing mechanism of the CNN convolution kernel, the parameter quantity of the model can be further reduced.

Step S134, based on the fully connected network, determining the similarity between the text to be matched and the candidate text in the text entity pair according to the candidate entity maximum pooling vector, the candidate entity average pooling vector, the maximum pooling vector of the text entity to be matched and the average pooling vector of the text entity to be matched, and determining the corresponding first feature vector according to the similarity of each text entity pair. This embodiment uses a dedicated integration layer to fuse the vector representations of the two texts in global similarity modeling. The RCNN input codes and the mean- or max-pooled soft attention outputs are fed to the integration layer, and a gating (threshold) mechanism is introduced for global similarity modeling. The integration layer mainly aims to better fuse the interactive representations of the two entities to be compared and to facilitate the subsequent similarity calculation of the two input entities to be matched. Here P and Q denote the text representations of the two texts, ∘ denotes the element-wise product at corresponding positions of the two matrices, the element-wise difference and product are used to combine the two text representations, and W_f and b_f are trainable parameters:

m(P, Q) = tanh(W_f [P; Q; P∘Q; P−Q] + b_f)

The integration layer also performs some higher-order interaction modeling, where g denotes the gating (threshold) mechanism and m the gated fusion defined above; finally, the integration layer concatenates the two outputs:

o′_a = g(o_a, o_b) · m(o_a, o_b) + (1 − g(o_a, o_b)) · o_a

o′_b = g(o_b, o_a) · m(o_b, o_a) + (1 − g(o_b, o_a)) · o_b

m_out = [o′_a, o′_b]

Further, at the final prediction layer, the model feeds the output of the previous step into a two-layer fully connected MLP to compute the probability that the two texts are similar. The whole model is trained end to end, using a cross-entropy loss function during training.
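The integration-layer equations above can be exercised numerically. Everything below is a toy sketch: W_f, b_f and the sigmoid gate are illustrative placeholders, not trained parameters, and the vectors are 1-dimensional for readability.

```python
import math

def fuse(P, Q, Wf, bf):
    """m(P, Q) = tanh(W_f [P; Q; P*Q; P-Q] + b_f), element-wise toy version."""
    feats = (P + Q + [p * q for p, q in zip(P, Q)]
             + [p - q for p, q in zip(P, Q)])       # [P; Q; P∘Q; P−Q]
    return [math.tanh(sum(w * f for w, f in zip(row, feats)) + b)
            for row, b in zip(Wf, bf)]

def gate(P, Q):
    # Toy scalar gate: sigmoid of the dot product of P and Q.
    return 1.0 / (1.0 + math.exp(-sum(p * q for p, q in zip(P, Q))))

def integrate(oa, ob, Wf, bf):
    """m_out = [o'_a; o'_b], with o'_x = g*m + (1-g)*o_x."""
    ga = gate(oa, ob)
    o_a = [ga * mi + (1 - ga) * oi for mi, oi in zip(fuse(oa, ob, Wf, bf), oa)]
    gb = gate(ob, oa)
    o_b = [gb * mi + (1 - gb) * oi for mi, oi in zip(fuse(ob, oa, Wf, bf), ob)]
    return o_a + o_b
```

With zero weights the fused term vanishes and the gate simply down-weights the original representation, which makes the mixing behavior of the gate easy to verify by hand.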

After the input entity pairs are mapped into the feature space, each entity sequence has a feature vector representation, and the similarity between them can then be evaluated by similarity calculation. For the twin network model, this embodiment uses a margin-based contrastive loss: the diagnostic data contain many very similar items, and the loss measures the difference between them by Euclidean distance. During training, the goal is to decrease the spatial distance between similar targets and to increase the distance between dissimilar ones. The calculation formula of the loss (reconstructed here in its standard form) is:

L(x_i, x_j, y) = y · d(f_i, f_j)² + (1 − y) · max(m − d(f_i, f_j), 0)²,  where d(f_i, f_j) = ‖f_i − f_j‖

where the inputs of the loss are the entity pair x_i and x_j, f_i and f_j respectively denote the vectors of the input entity pair after encoding and mapping, m denotes the distance margin between input samples and is a preset hyper-parameter, and y is the input label. When y_i ≠ y_j, i.e. the input entities do not match, the loss value is the right half of the formula; the smaller the Euclidean distance of such a sample pair within the margin, the larger the loss value.
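The pairwise loss the text describes (Euclidean distance with a margin m) is usually called a contrastive loss; a minimal sketch under that assumption:

```python
import math

# Sketch of the pairwise twin-network loss: for matched pairs (y=1)
# the loss grows with the Euclidean distance between the encoded
# vectors f_i and f_j; for unmatched pairs (y=0) only pairs closer
# than the margin m are penalised, pushing them apart.
def contrastive_loss(fi, fj, y, m=1.0):
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(fi, fj)))
    if y == 1:                      # matched: pull together
        return d ** 2
    return max(m - d, 0.0) ** 2     # unmatched: push beyond margin m
```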

With reference to the foregoing embodiments, in the embodiment of the present invention, as shown in fig. 3, in step S11, the entity to be matched is a Chinese character entity, an English abbreviation entity, or a mixed Chinese-English entity. If the text to be matched is a Chinese character entity, selecting the candidate entity set corresponding to the received text entity to be matched comprises:

step S111, calculating Jaccard coefficients of the text entity to be matched and the entity to be candidate stored in the database;

Step S112, selecting the entities to be candidate whose Jaccard coefficient is not less than a preset value to form the set of entities to be candidate;

and S113, screening entities to be candidate with the same semantic as the text entities to be matched from the entity set to be candidate to form a candidate entity set.

If the entity to be matched is an English abbreviation entity or a Chinese-English mixed entity, selecting a candidate entity set corresponding to the received text entity to be matched, wherein the candidate entity set comprises:

Step S114, using a trained third neural network that takes the entity to be matched as input and outputs the candidate entities matched with the text to be matched, the candidate entities corresponding to the same entity to be matched forming a candidate entity set.

With reference to the foregoing embodiment, in the embodiment of the present invention, as shown in fig. 4, in step S111, the calculating a Jaccard coefficient between the text entity to be matched and the entity to be candidate stored in the database includes:

Step S1111, calculating the Jaccard coefficients between the text entity to be matched and the entities to be candidate stored in the database by using a first mathematical model; the first mathematical model is:

J(A, B_i) = |A ∩ B_i| / |A ∪ B_i|

where A is the set of characters or letters forming the entity to be matched, and B_i is the set of characters or letters forming the i-th entity to be candidate.
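The first mathematical model is the standard Jaccard coefficient over character sets, which is a few lines of Python:

```python
# Jaccard coefficient between the character set A of the mention and
# the character set B_i of the i-th candidate: |A ∩ B_i| / |A ∪ B_i|.
def jaccard(text_a, text_b):
    A, B = set(text_a), set(text_b)
    return len(A & B) / len(A | B) if A | B else 0.0
```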

In this embodiment, data preprocessing and candidate entity recall are first completed on data derived from electronic medical records, yielding a diagnosis original word data set and its corresponding candidate entity set. The data were manually labeled by professional medical researchers against the standard coding vocabulary of the International Classification of Diseases ICD-10, giving an initial <diagnosis original word, standard coding word> data set. At this point only positive sample data exist, which is insufficient for the experiment, and the quality of the constructed negative samples is an important factor affecting model training in the candidate entity disambiguation stage. Since the manually labeled samples lack negative samples, this embodiment first constructs a negative sample data set from the existing data and finally expands the number of positive samples as needed, thereby addressing the imbalance between positive and negative samples.

(1) Data sources of the negative sample training set: for negative samples in the training set, the experimental results of the candidate entity recall stage are used for reference, and the specific training set data comprises:

recall data of TOP20 in the recall phase, in which data other than positive samples are selected as negative samples;

selecting part of the synonymous disease word data captured from a preset database (such as a general clinical diagnosis and treatment knowledge base) as hard samples;

randomly extracting a part of labels as negative samples according to the data labeled by the experts and a standard coding library;

and increasing the number of positive samples by adopting a random replacement deletion mode according to the number of the constructed negative samples so as to balance the number of the samples.

(2) Constructing the training set based on the Jaccard coefficient. Researchers usually construct negative sample data sets manually, but manual construction not only requires solid professional knowledge and involves a large workload, it also inevitably introduces a certain number of labeling errors. This embodiment therefore proposes an automatic method for constructing a standardized medical term training set; the pseudocode of the algorithm is shown in Table 1.

TABLE 1 negative sample construction algorithm based on Jaccard coefficient

The construction method provided by the invention is based on the Jaccard coefficient, two medical diagnosis entities are given, the character string sets of the two medical diagnosis entities are A and B respectively, and the Jaccard coefficients of the two entities are calculated by adopting a first mathematical model.

Specifically, during construction, the ICD-10 standard coding table is first traversed; the Jaccard coefficients between each entry and the current diagnosis original word, and between the standard word and the other standard coding words, are then calculated; finally, entity pairs whose Jaccard coefficient is larger than a given threshold are added to the negative sample data set, improving the quality of the training data set.
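The traversal above (Table 1) can be sketched as follows. Function and variable names are our own; the threshold default mirrors the t = 0.7 used in the experiments.

```python
# Sketch of the Jaccard-based negative-sample construction: for a
# <mention, gold standard code> pair, scan the standard coding table
# and keep non-gold codes whose Jaccard coefficient with the mention
# exceeds threshold t. These are "hard" negatives: lexically similar
# but incorrect codes.
def jaccard(a, b):
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def build_negatives(mention, gold, coding_table, t=0.7):
    return [(mention, code) for code in coding_table
            if code != gold and jaccard(mention, code) > t]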

(3) Entity mention expansion.

Entity mention expansion is an important step for improving the coverage of candidate and standard entities. Electronic medical record data often contain abbreviations and personal habitual expressions of terms, which can be handled by constructing ambiguity and alias vocabularies, abbreviation-to-full-name mapping tables, and the like. This embodiment expands the candidate entity set by constructing an ambiguity and alias vocabulary and an abbreviation-to-full-name mapping table.

1) An English abbreviation comparison table is constructed; part of the table is shown in fig. 8.

2) A synonym table is constructed; the synonym data in the table were captured from web page data of a clinical diagnosis and treatment knowledge base, yielding 1210 synonym pairs. Part of the synonyms are shown in Table 2.

TABLE 2 synonym table
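The mention expansion step amounts to table lookups. The sketch below is illustrative only: the table contents are invented placeholders, not entries from the patent's fig. 8 or Table 2.

```python
# Sketch of mention-item expansion: abbreviations and synonyms are
# expanded through lookup tables so the candidate set covers surface
# variants of the same term. Table contents here are placeholders.
ABBREV = {"COPD": "chronic obstructive pulmonary disease"}
SYNONYM = {"renal calculus": "kidney stone"}

def expand_mention(mention):
    """Return the mention plus all table-derived variants."""
    variants = {mention}
    for table in (ABBREV, SYNONYM):
        if mention in table:
            variants.add(table[mention])
    return sorted(variants)
```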

(4) Data set construction

In addition to the <diagnosis original word, standard code> positive samples labeled manually by experts, this embodiment performs random replacement using the various mapping vocabularies, constructs a negative sample set through the recall-stage filtering and the Jaccard coefficient, and submits the result to experts again for review, finally obtaining a standardized medical entity data set of 17,905 entries, as shown in Table 3.

TABLE 3 medical entity standardized data set

Experimental setup:

In the experiment constructing negative samples based on the Jaccard coefficient, the threshold t was set to 0.7. For the semantic similarity matching model, the development environment was the Ubuntu 18.04 system, developed with Python 3.6 and PyTorch 1.4. The BM25 algorithm, a classical probabilistic retrieval algorithm, was selected as the baseline model. Experiments were carried out with the twin network and with the interaction-enhanced semantic matching model, and traditional statistical language features were fused into each model for comparison. The word vector dimension of the models was 128; in the matching model fused with statistical features, the fully connected layer for the traditional language features had dimension 100. The result with the highest similarity score was taken as the predicted linked standard entity, and accuracy results are reported in percent (%).
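For reference, the BM25 baseline scores a query against a document as a sum of IDF-weighted, length-normalized term frequencies. The sketch below uses the standard Okapi BM25 formula with the usual default parameters (k1 = 1.5, b = 0.75); parameter values are not taken from the patent.

```python
import math

# Minimal Okapi BM25 scorer: score(q, d) sums, over query terms, the
# term's IDF times a saturated, length-normalized term frequency.
def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```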

And (3) analyzing an experimental result:

In the experimental setting, the result with the highest similarity score is selected as the predicted linked standard entity. Inspection of the results shows that some clinical terms have multiple implications (see Table 5), which affects the accuracy of the results; this experiment therefore considered only single-implication cases (of the 1500 test items, the standardized prediction results of 1230 singly linked entities were used). BM25 is the baseline model; the comparison of experimental results is shown in Table 4. Under otherwise identical conditions, the deep semantic matching models improve significantly over the BM25 baseline. Among the deep models, the twin-network result is relatively low, presumably because that model encodes the basic information of the input entity pair while ignoring the interaction information between the two entities during encoding. The attention-based interaction enhancement model improves the result by 5.56%, and after fusing traditional language features, the accuracy of both network structures improves by more than 1.5%. However, the improvement for the interaction-enhanced matching model with the attention mechanism is relatively small, possibly because its own feature extraction capability is already strong, so manually extracted features handed to the model add little; future research will therefore focus mainly on improving the feature extraction capability of the model itself.

TABLE 4 Single implication entity disambiguation accuracy

For results not predicted accurately, some representative error samples were extracted, as shown in Table 5. In most cases the core disease symptom is the same but the modifying site differs, such as "frontal bone" versus "skull". Most erroneous results involve diagnostic texts with multiple implications, confirming that the multiple-implication problem cannot be ignored.

TABLE 5 Experimental error samples

According to the BiLSTM-model-based standard entity text determination method described above, the twin network model and the attention-based, interaction-enhanced text matching model are fused, and traditional language feature information is fused into the BiLSTM deep learning model. The experimental results demonstrate the effectiveness of the method, which improves both the efficiency of standard entity text determination and the accuracy of its results.

Example 2

Fig. 9 is a block diagram of a standard entity text determination device based on the BiLSTM model according to an embodiment of the present invention; this embodiment is described by applying the device to the standard entity text determination method based on the BiLSTM model shown in fig. 1. The device comprises at least the following modules:

a selecting module 51, configured to select, for a received text entity to be matched, a candidate entity set corresponding to the text entity to be matched;

a pair forming module 52, configured to form, for each candidate entity in the candidate entity set, a text entity pair with the text entity to be matched respectively;

the feature vector module 53 is configured to, for each text entity pair, calculate a first similarity feature vector of the text entity pair by using a preset neural matching network, and calculate a second similarity feature vector of the text entity pair by using a text statistical method and a full-connection network;

a similarity module 54, configured to splice the first similarity feature vector and the second similarity feature vector of each text entity pair to form a similarity vector of each entity pair by using a splicing network, and output a similarity between two entity texts in each entity pair according to the similarity vector of each text entity pair;

and an entity determining module 55, configured to determine a candidate text entity in the text entity pair with the highest similarity as a standard text entity corresponding to the text entity to be matched.

The standard entity text determination device based on the BiLSTM model provided in this embodiment of the application can perform the method of Embodiment 1; for relevant details, refer to the method embodiment. The implementation principles and technical effects are similar and are not repeated here.

It should be noted that: in the foregoing embodiment, when the standard entity text determining apparatus based on the BiLSTM model performs the standard entity text determining method based on the BiLSTM model, only the division of the functional modules is used for illustration, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the standard entity text determining apparatus based on the BiLSTM model is divided into different functional modules to complete all or part of the functions described above. In addition, the standard entity text determination device based on the BiLSTM model and the standard entity text determination method based on the BiLSTM model provided by the above embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments and is not described herein again.

Example 3

An embodiment of the present invention also provides an electronic device for determining standard entity text based on the BiLSTM model. As shown in fig. 10, the electronic device includes a processor 1001 and a memory 1002, which may be connected by a bus or in another manner; fig. 10 takes a bus connection as an example.

The Processor 1001 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU), an embedded Neural Network Processor (NPU), other dedicated deep learning coprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like, or a combination thereof.

The memory 1002, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to a standard entity text determination method based on the BiLSTM model in the embodiments of the present invention. The processor 1001 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions and modules stored in the memory 1002, that is, implements one of the standard entity text determination methods based on the BiLSTM model in the above method embodiment 1.

The memory 1002 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 1001, and the like. Further, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1002 may optionally include memory located remotely from the processor 1001, which may be coupled to the processor 1001 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 1002 and, when executed by the processor 1001, perform the standard entity text determination method based on the BiLSTM model, as shown in fig. 1.

An embodiment of the present invention further provides a non-transitory computer-readable storage medium, where computer-executable instructions are stored in the non-transitory computer-readable storage medium, and the computer-executable instructions may execute a method for determining a standard entity text based on a BiLSTM model in any of the above method embodiments. The non-transitory computer readable storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid-State Drive (SSD), or the like; the non-transitory computer readable storage medium may also include a combination of memories of the above kind.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, an apparatus, or a non-transitory computer-readable storage medium, each relating to or comprising a computer program product.

Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Obviously, the above-mentioned embodiments express only several embodiments of the present application; their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the present application, and such obvious variations or modifications fall within the scope of the invention. The description is neither required to cover, nor exhaustive of, all embodiments. Therefore, the protection scope of this patent shall be subject to the appended claims.
