Method for combining named entity identification and entity relation extraction

Document No. 661728 · Published 2021-04-27

Reading note: This technology, "A joint method for named entity recognition and entity relation extraction" (Method for combining named entity identification and entity relation extraction), was designed and created by He Binbin, Wu Jun, Fan Zhaolei, Zhang Bozheng, and Sang Bo on 2021-01-05. Its main content: a joint method for named entity recognition and entity relation extraction that solves the problem of entity recognition and relation extraction being isolated from each other in the Pipeline method. Unlike the partial parameter sharing between entity recognition and relation extraction in the Joint method, the invention fuses the entity recognition results and the relation extraction results repeatedly in a multi-step iterative process so that they influence each other, allowing named entity recognition and entity relation extraction to proceed synchronously and further improving recognition accuracy.

1. A method for combining named entity identification and entity relationship extraction is characterized by comprising the following steps:

a) processing an input electronic medical record text S, an entity position mark tensor Entity_label, and a text entity relation mark Relation_label;

b) pre-training a language model, inputting the electronic medical record text S into the pre-trained language model to obtain the tensor representation H of the text, generating the entity position mark embedding Entity, and outputting Entity ∈ R^(entity_size×E), where entity_size is the number of entity types and E is the dimension of word embeddings in the language model;

c) setting the number of iterations K, where K ≥ 1 and K is a positive integer; in the k-th iteration, if k = 1, then H_k = H, where H_k is the initial input of the k-th round; if 1 < k ≤ K, then H_k is the iteration tensor output by round k−1;

d) inputting H_k and the entity position mark tensor Entity_label into the entity recognition model, and outputting the probability tensor P_k and the entity identification marking result B_k; performing an inner product between the probability tensor P_k and the entity position mark embedding Entity, and outputting the entity position probability of the text as PE_k ∈ R^(L×E), where L is the maximum character length of the electronic medical record text S;

e) inputting H_k and the text entity relation mark Relation_label into the relation model, and outputting the relation loss Relation_loss_k and the entity relation probability matrix PR_k ∈ R^(L×L);

f) inputting the entity position probability PE_k of the text and the entity relation probability matrix PR_k into the attention model, and outputting the k-th-round iteration tensor H_(k+1);

g) increasing the iteration counter by letting k = k + 1; if k > K, ending the iteration and executing step h); if k ≤ K, executing step c);

h) for the input electronic medical record text S, taking the K-th-round entity identification matrix B_K ∈ R^(L×entity_size) as the entity recognition result and the entity relation probability matrix PR_K ∈ R^(L×L×relation_size) as the relation extraction result, where relation_size is the number of relation types.

2. The method of claim 1, wherein the processing of the electronic medical record text S in step a) comprises: truncating any part of the electronic medical record text S whose length exceeds L, and padding any electronic medical record text S whose length is less than L up to length L.

3. The method of claim 2, wherein the processing of the entity position mark tensor Entity_label in step a) comprises: marking entity positions with the tensor Entity_label, whose dimension is R^L, where R is the real-number space and the number of entity types is entity_size.

4. The method of claim 2, wherein the processing of the text entity relation mark Relation_label in step a) comprises: marking the relations between entities with the entity relation mark Relation_label, whose dimension is R^(L×L), where the number of relation types is relation_size.

5. The method of claim 1, wherein step b) comprises the steps of:

b-1) pre-training the language model with the Bert, Albert, or GPT method, wherein the dimension of the text tensor H is R^(L×E), and L and E are positive integers;

b-2) passing the entity position mark tensor Entity_label through an Embedding layer to output the entity position mark embedding Entity, using the formula Entity = Embedding(Entity_label).

6. The method of claim 1, wherein in step d) the entity recognition model combines a Transformer module with a conditional random field (CRF) to output the CRF loss function, the probability tensor P_k, and the entity identification marking result B_k.

7. The method of claim 6, wherein step d) comprises the steps of:

d-1) using the formula G_k = Transformer(H_k) to compute the output G_k of the text tensor H_k after the Transformer network layer; using the formula B_k = MLP(G_k) to compute from G_k the entity identification marking result B_k output by an MLP neural network; using the formula P_k = Forward_backward(B_k) to compute from B_k the probability tensor P_k with the forward-backward algorithm of the CRF model;

d-2) inputting the probability tensor P_k and the entity position mark tensor Entity_label into the CRF model, computing the loss CRF_loss_k by the formula CRF_loss_k = CRF(P_k, Entity_label); performing an inner product between the probability tensor P_k and the entity position mark embedding Entity by the formula PE_k = P_k·Entity ∈ R^(L×E) to obtain the entity position probability PE_k of the text;

8. The method of claim 4, wherein step e) comprises the steps of:

e-1) inputting H_k into two different Transformer models to obtain Y_1 ∈ R^(L×E) and Y_2 ∈ R^(L×E) respectively; computing the entity relation probability matrix PR_k by the formula PR_k = Sigmoid(Y_k), where Y_k = Y_1·Y_2^T, Y_2^T is the transpose of Y_2, and Y_k ∈ R^(L×L×relation_size);

e-2) outputting the relation loss Relation_loss_k by the formula Relation_loss_k = Cross_entropy(PR_k, Relation_label), where Relation_label ∈ R^(L×L).

9. The method of claim 1, wherein step f) comprises the steps of:

f-1) computing ME_k by the formula ME_k = PE_k + H_k ∈ R^(L×E);

f-2) splitting ME_k along the embedding dimension into head_num parts ME_k^j, j ∈ {0, 1, ..., head_num − 1}, where head_num is a positive integer that evenly divides the embedding dimension E; computing the split embedding dimension F by the formula F = E / head_num, so that ME_k^j ∈ R^(L×F);

f-3) computing the attention vectors Q_j, K_j, and V_j by the formulas Q_j = MLP_1(ME_k^j), K_j = MLP_2(ME_k^j), and V_j = MLP_3(ME_k^j), where MLP_1, MLP_2, and MLP_3 are MLP neural networks with different weight parameters;

f-4) using the entity relation probability matrix PR_k ∈ R^(L×L), computing the output O_j of the attention mechanism by the formula O_j = Softmax(Q_j·K_j^T / √F + λM)·V_j, where O_j ∈ R^(L×F), λ ∈ R is a real number, M is selected from {MLP(PR_k), MLP(PR_k)^T, ONE}, ONE ∈ R^(L×L) is the identity matrix, MLP(PR_k) ∈ R^(L×L), and MLP(PR_k)^T is the transpose of MLP(PR_k);

f-5) concatenating the O_j along the embedding dimension by the formula H_(k+1) = Concat(O_0, O_1, ..., O_(head_num−1)) to obtain the iteration tensor H_(k+1).

10. The method for combining named entity identification and entity relationship extraction according to claim 7, wherein in step h) the total loss of the model is computed by the formula Loss = Σ_i α_i·CRF_loss_i + Σ_m β_m·Relation_loss_m, where the α_i and β_m are weight coefficients and are all real numbers, and the total loss is optimized by stochastic gradient descent for step-by-step training.

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method for combining named entity identification and entity relationship extraction.

Background

Natural language processing is a branch of artificial intelligence research. With the development of Internet informatization, natural-language data accumulates at an ever-increasing pace. How to digest these data, extract knowledge from them, and then perform reasoning is the focus of natural language processing research, of which named entity recognition and entity relation extraction are the most representative tasks.

Traditional named entity identification and entity relationship extraction mainly follow two technical directions:

Pipeline method: this method first identifies named entities in a text and then identifies the relationships between the entities by combining the text with the named entity recognition results. Because named entity identification and entity relation extraction are carried out separately, errors accumulate between the two steps, so the final recognition result has a larger error.

Joint method: this method identifies named entities and entity relations in the text at the same time. Although it superficially avoids error accumulation, in reality different named entity recognition schemes influence the corresponding entity relation recognition scheme, and different entity relation recognition schemes in turn influence the named entity recognition scheme. Existing joint methods only share some model parameters between named entity identification and relation extraction; they do not truly let the two sets of recognition results reference and influence each other and then adjust accordingly. Current joint methods avoid error accumulation, but in practice their final accuracy is not higher than that of the Pipeline method.

Disclosure of Invention

In order to overcome the defects of the above technologies, the invention provides a combined method that carries out named entity identification and entity relation extraction simultaneously, lets the results of named entity identification and relation extraction influence and reference each other, and thereby improves the final accuracy.

The technical scheme adopted by the invention for overcoming the technical problems is as follows:

a combined method of named entity identification and entity relationship extraction comprises the following steps:

a) processing an input electronic medical record text S, an entity position mark tensor Entity_label, and a text entity relation mark Relation_label;

b) pre-training a language model, inputting the electronic medical record text S into the pre-trained language model to obtain the tensor representation H of the text, generating the entity position mark embedding Entity, and outputting Entity ∈ R^(entity_size×E), where entity_size is the number of entity types and E is the dimension of word embeddings in the language model;

c) setting the number of iterations K, where K ≥ 1 and K is a positive integer; in the k-th iteration, if k = 1, then H_k = H, where H_k is the initial input of the k-th round; if 1 < k ≤ K, then H_k is the iteration tensor output by round k−1;

d) inputting H_k and the entity position mark tensor Entity_label into the entity recognition model, and outputting the probability tensor P_k and the entity identification marking result B_k; performing an inner product between the probability tensor P_k and the entity position mark embedding Entity, and outputting the entity position probability of the text as PE_k ∈ R^(L×E), where L is the maximum character length of the electronic medical record text S;

e) inputting H_k and the text entity relation mark Relation_label into the relation model, and outputting the relation loss Relation_loss_k and the entity relation probability matrix PR_k ∈ R^(L×L);

f) inputting the entity position probability PE_k of the text and the entity relation probability matrix PR_k into the attention model, and outputting the k-th-round iteration tensor H_(k+1);

g) increasing the iteration counter by letting k = k + 1; if k > K, ending the iteration and executing step h); if k ≤ K, executing step c);

h) for the input electronic medical record text S, taking the K-th-round entity identification matrix B_K ∈ R^(L×entity_size) as the entity recognition result and the entity relation probability matrix PR_K ∈ R^(L×L×relation_size) as the relation extraction result, where relation_size is the number of relation types.

Further, the processing of the electronic medical record text S in step a) comprises: truncating any part of the electronic medical record text S whose length exceeds L, and padding any electronic medical record text S whose length is less than L up to length L.

Further, the processing of the entity position mark tensor Entity_label in step a) comprises: marking entity positions with the tensor Entity_label, whose dimension is R^L, where R is the real-number space and the number of entity types is entity_size.

Further, the processing of the text entity relation mark Relation_label in step a) comprises: marking the relations between entities with the entity relation mark Relation_label, whose dimension is R^(L×L), where the number of relation types is relation_size.

Further, step b) comprises the following steps:

b-1) pre-training the language model with the Bert, Albert, or GPT method, wherein the dimension of the text tensor H is R^(L×E), and L and E are positive integers;

b-2) passing the entity position mark tensor Entity_label through an Embedding layer to output the entity position mark embedding Entity, using the formula Entity = Embedding(Entity_label).

Further, in step d) the entity recognition model combines a Transformer module with a conditional random field (CRF) to output the CRF loss function, the probability tensor P_k, and the entity identification marking result B_k.

Further, step d) comprises the following steps:

d-1) using the formula G_k = Transformer(H_k) to compute the output G_k of the text tensor H_k after the Transformer network layer; using the formula B_k = MLP(G_k) to compute from G_k the entity identification marking result B_k output by an MLP neural network; using the formula P_k = Forward_backward(B_k) to compute from B_k the probability tensor P_k with the forward-backward algorithm of the CRF model;

d-2) inputting the probability tensor P_k and the entity position mark tensor Entity_label into the CRF model, computing the loss CRF_loss_k by the formula CRF_loss_k = CRF(P_k, Entity_label); performing an inner product between the probability tensor P_k and the entity position mark embedding Entity by the formula PE_k = P_k·Entity ∈ R^(L×E) to obtain the entity position probability PE_k of the text;

Further, step e) comprises the steps of:

e-1) inputting H_k into two different Transformer models to obtain Y_1 ∈ R^(L×E) and Y_2 ∈ R^(L×E) respectively; computing the entity relation probability matrix PR_k by the formula PR_k = Sigmoid(Y_k), where Y_k = Y_1·Y_2^T, Y_2^T is the transpose of Y_2, and Y_k ∈ R^(L×L×relation_size);

e-2) outputting the relation loss Relation_loss_k by the formula Relation_loss_k = Cross_entropy(PR_k, Relation_label), where Relation_label ∈ R^(L×L).

Further, step f) comprises the steps of:

f-1) computing ME_k by the formula ME_k = PE_k + H_k ∈ R^(L×E);

f-2) splitting ME_k along the embedding dimension into head_num parts ME_k^j, j ∈ {0, 1, ..., head_num − 1}, where head_num is a positive integer that evenly divides the embedding dimension E; computing the split embedding dimension F by the formula F = E / head_num, so that ME_k^j ∈ R^(L×F);

f-3) computing the attention vectors Q_j, K_j, and V_j by the formulas Q_j = MLP_1(ME_k^j), K_j = MLP_2(ME_k^j), and V_j = MLP_3(ME_k^j), where MLP_1, MLP_2, and MLP_3 are MLP neural networks with different weight parameters;

f-4) using the entity relation probability matrix PR_k ∈ R^(L×L), computing the output O_j of the attention mechanism by the formula O_j = Softmax(Q_j·K_j^T / √F + λM)·V_j, where O_j ∈ R^(L×F), λ ∈ R is a real number, M is selected from {MLP(PR_k), MLP(PR_k)^T, ONE}, ONE ∈ R^(L×L) is the identity matrix, MLP(PR_k) ∈ R^(L×L), and MLP(PR_k)^T is the transpose of MLP(PR_k);

f-5) concatenating the O_j along the embedding dimension by the formula H_(k+1) = Concat(O_0, O_1, ..., O_(head_num−1)) to obtain the iteration tensor H_(k+1).

Further, in step h) the total loss of the model is computed by the formula Loss = Σ_i α_i·CRF_loss_i + Σ_m β_m·Relation_loss_m, where the α_i and β_m are weight coefficients and are all real numbers, and the total loss is optimized by stochastic gradient descent for step-by-step training.

The beneficial effects of the invention are: the combined method of named entity identification and entity relationship extraction solves the problem that entity identification and relationship extraction are isolated from each other in the Pipeline method. Unlike the partial parameter sharing between entity identification and relationship extraction in the Joint method, the invention fuses the entity identification results and the relationship extraction results repeatedly in a multi-step iterative process so that they influence each other, allowing named entity identification and entity relationship extraction to proceed synchronously and further improving recognition accuracy.

Detailed Description

The present invention is further explained below.

A combined method of named entity identification and entity relationship extraction comprises the following steps:

a) processing an input electronic medical record text S, an entity position mark tensor Entity_label, and a text entity relation mark Relation_label;

b) pre-training a language model, inputting the electronic medical record text S into the pre-trained language model to obtain the tensor representation H of the text, generating the entity position mark embedding Entity, and outputting Entity ∈ R^(entity_size×E), where entity_size is the number of entity types and E is the dimension of word embeddings in the language model;

c) setting the number of iterations K, where K ≥ 1 and K is a positive integer; in the k-th iteration, if k = 1, then H_k = H, where H_k is the initial input of the k-th round; if 1 < k ≤ K, then H_k is the iteration tensor output by round k−1;

d) inputting H_k and the entity position mark tensor Entity_label into the entity recognition model, and outputting the probability tensor P_k and the entity identification marking result B_k; performing an inner product between the probability tensor P_k and the entity position mark embedding Entity, and outputting the entity position probability of the text as PE_k ∈ R^(L×E), where L is the maximum character length of the electronic medical record text S;

e) inputting H_k and the text entity relation mark Relation_label into the relation model, and outputting the relation loss Relation_loss_k and the entity relation probability matrix PR_k ∈ R^(L×L);

f) inputting the entity position probability PE_k of the text and the entity relation probability matrix PR_k into the attention model, and outputting the k-th-round iteration tensor H_(k+1);

g) increasing the iteration counter by letting k = k + 1; if k > K, ending the iteration and executing step h); if k ≤ K, executing step c);

h) for the input electronic medical record text S, taking the K-th-round entity identification matrix B_K ∈ R^(L×entity_size) as the entity recognition result and the entity relation probability matrix PR_K ∈ R^(L×L×relation_size) as the relation extraction result, where relation_size is the number of relation types. The method solves the problem that entity identification and relationship extraction are isolated from each other in the Pipeline method. Unlike the partial parameter sharing between entity identification and relationship extraction in the Joint method, it fuses the entity identification results and the relationship extraction results repeatedly in a multi-step iterative process so that they influence each other, allowing named entity identification and entity relationship extraction to proceed synchronously and further improving recognition accuracy.

Further, the processing of the electronic medical record text S in step a) comprises: truncating any part of the electronic medical record text S whose length exceeds L, and padding any electronic medical record text S whose length is less than L up to length L.
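As a minimal illustration of this truncate-and-pad preprocessing, the sketch below fixes a small hypothetical maximum length L and a hypothetical padding character; the values and the function name `preprocess` are assumptions for illustration, not part of the invention:

```python
L_MAX = 8  # hypothetical maximum character length L

def preprocess(text, max_len=L_MAX, pad_char="\0"):
    """Truncate text longer than max_len; pad shorter text to max_len."""
    if len(text) > max_len:
        return text[:max_len]                           # cut off the part exceeding L
    return text + pad_char * (max_len - len(text))      # pad up to length L

assert len(preprocess("abcdefghij")) == L_MAX  # truncated
assert len(preprocess("abc")) == L_MAX         # padded
```

Every processed text then has the fixed length L assumed by the tensors in the following steps.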

Further, the processing of the entity position mark tensor Entity_label in step a) comprises: marking entity positions with the tensor Entity_label, whose dimension is R^L, where R is the real-number space and the number of entity types is entity_size.

Further, the processing of the text entity relation mark Relation_label in step a) comprises: marking the relations between entities with the entity relation mark Relation_label, whose dimension is R^(L×L), where the number of relation types is relation_size.

Further, step b) comprises the following steps:

b-1) pre-training the language model with the Bert, Albert, or GPT method, wherein the dimension of the text tensor H is R^(L×E), and L and E are positive integers;

b-2) passing the entity position mark tensor Entity_label through an Embedding layer to output the entity position mark embedding Entity, using the formula Entity = Embedding(Entity_label).
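A minimal NumPy sketch of step b-2), treating the Embedding layer as a random lookup table: one E-dimensional vector per entity type, so the entity position mark embedding Entity lies in R^(entity_size×E). The sizes entity_size = 5 and E = 16 and the label values are hypothetical:

```python
import numpy as np

entity_size, E = 5, 16                      # hypothetical sizes
rng = np.random.default_rng(0)

# Embedding layer as a lookup table over entity-type ids.
Entity = rng.normal(size=(entity_size, E))  # Entity ∈ R^(entity_size×E)

# Looking up a hypothetical per-character Entity_label sequence:
Entity_label = np.array([0, 2, 1, 0])
embedded = Entity[Entity_label]             # one E-dim vector per position
assert Entity.shape == (entity_size, E)
assert embedded.shape == (4, E)
```

In training the lookup table would be a learned parameter rather than a fixed random matrix.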

Further, in step d) the entity recognition model combines a Transformer module with a conditional random field (CRF) to output the CRF loss function, the probability tensor P_k, and the entity identification marking result B_k.

Further, step d) comprises the following steps:

d-1) using the formula G_k = Transformer(H_k) to compute the output G_k of the text tensor H_k after the Transformer network layer; using the formula B_k = MLP(G_k) to compute from G_k the entity identification marking result B_k output by an MLP neural network; using the formula P_k = Forward_backward(B_k) to compute from B_k the probability tensor P_k with the forward-backward algorithm of the CRF model;

d-2) inputting the probability tensor P_k and the entity position mark tensor Entity_label into the CRF model, computing the loss CRF_loss_k by the formula CRF_loss_k = CRF(P_k, Entity_label); performing an inner product between the probability tensor P_k and the entity position mark embedding Entity by the formula PE_k = P_k·Entity ∈ R^(L×E) to obtain the entity position probability PE_k of the text;
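The inner product PE_k = P_k·Entity in step d-2) is an ordinary matrix product over the entity-type axis: per-character type probabilities weight the type embeddings. A shape-only NumPy sketch, with hypothetical sizes and random stand-ins for P_k and Entity:

```python
import numpy as np

L, entity_size, E = 8, 5, 16                # hypothetical sizes
rng = np.random.default_rng(1)

P_k = rng.random((L, entity_size))
P_k /= P_k.sum(axis=1, keepdims=True)       # each row: entity-type probabilities
Entity = rng.normal(size=(entity_size, E))  # entity position mark embedding

PE_k = P_k @ Entity                         # (L×entity_size)·(entity_size×E) → (L×E)
assert PE_k.shape == (L, E)
```

Each row of PE_k is thus a probability-weighted mixture of the entity-type embeddings for that character position.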

Further, step e) comprises the steps of:

e-1) inputting H_k into two different Transformer models to obtain Y_1 ∈ R^(L×E) and Y_2 ∈ R^(L×E) respectively; computing the entity relation probability matrix PR_k by the formula PR_k = Sigmoid(Y_k), where Y_k = Y_1·Y_2^T, Y_2^T is the transpose of Y_2, and Y_k ∈ R^(L×L×relation_size);

e-2) outputting the relation loss Relation_loss_k by the formula Relation_loss_k = Cross_entropy(PR_k, Relation_label), where Relation_label ∈ R^(L×L).
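Steps e-1) and e-2) can be sketched in NumPy as a pairwise score matrix squashed by a sigmoid, followed by element-wise binary cross-entropy; the sizes, the random tensors, and the restriction to a single relation type are assumptions for illustration:

```python
import numpy as np

L, E = 8, 16                                # hypothetical sizes
rng = np.random.default_rng(2)
Y1 = rng.normal(size=(L, E))                # stand-in for the first Transformer's output
Y2 = rng.normal(size=(L, E))                # stand-in for the second Transformer's output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

PR_k = sigmoid(Y1 @ Y2.T)                   # pairwise relation probabilities, (L, L)

Relation_label = rng.integers(0, 2, size=(L, L)).astype(float)
eps = 1e-9                                  # numerical floor for the logs
Relation_loss = -np.mean(
    Relation_label * np.log(PR_k + eps)
    + (1 - Relation_label) * np.log(1 - PR_k + eps)
)
assert PR_k.shape == (L, L)
assert Relation_loss > 0
```

With multiple relation types the score tensor would carry an extra relation_size axis, as the dimensions in step h) indicate.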

Further, step f) comprises the steps of:

f-1) computing ME_k by the formula ME_k = PE_k + H_k ∈ R^(L×E);

f-2) splitting ME_k along the embedding dimension into head_num parts ME_k^j, j ∈ {0, 1, ..., head_num − 1}, where head_num is a positive integer that evenly divides the embedding dimension E; computing the split embedding dimension F by the formula F = E / head_num, so that ME_k^j ∈ R^(L×F);

f-3) computing the attention vectors Q_j, K_j, and V_j by the formulas Q_j = MLP_1(ME_k^j), K_j = MLP_2(ME_k^j), and V_j = MLP_3(ME_k^j), where MLP_1, MLP_2, and MLP_3 are MLP neural networks with different weight parameters;

f-4) using the entity relation probability matrix PR_k ∈ R^(L×L), computing the output O_j of the attention mechanism by the formula O_j = Softmax(Q_j·K_j^T / √F + λM)·V_j, where O_j ∈ R^(L×F), λ ∈ R is a real number, M is selected from {MLP(PR_k), MLP(PR_k)^T, ONE}, ONE ∈ R^(L×L) is the identity matrix, MLP(PR_k) ∈ R^(L×L), and MLP(PR_k)^T is the transpose of MLP(PR_k);

f-5) concatenating the O_j along the embedding dimension by the formula H_(k+1) = Concat(O_0, O_1, ..., O_(head_num−1)) to obtain the iteration tensor H_(k+1).
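Steps f-1) through f-5) amount to multi-head self-attention whose logits are biased by the relation matrix. The NumPy sketch below uses random matrices in place of MLP_1, MLP_2, and MLP_3, takes M = PR_k directly (one of the three allowed choices), and assumes illustrative sizes and λ:

```python
import numpy as np

L, E, head_num = 8, 16, 4                   # hypothetical sizes; head_num divides E
F = E // head_num                           # split embedding dimension F = E/head_num
lam = 0.5                                   # hypothetical λ
rng = np.random.default_rng(3)

ME = rng.normal(size=(L, E))                # stands in for ME_k = PE_k + H_k
PR = rng.random((L, L))                     # stands in for PR_k

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerically stable row softmax
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for j in range(head_num):
    ME_j = ME[:, j * F:(j + 1) * F]         # f-2): split along embedding dimension
    Wq, Wk, Wv = (rng.normal(size=(F, F)) for _ in range(3))  # stand-ins for MLP_1..3
    Q, K, V = ME_j @ Wq, ME_j @ Wk, ME_j @ Wv                 # f-3)
    O_j = softmax(Q @ K.T / np.sqrt(F) + lam * PR) @ V        # f-4), with M = PR
    heads.append(O_j)

H_next = np.concatenate(heads, axis=1)      # f-5): iteration tensor H_(k+1) ∈ R^(L×E)
assert H_next.shape == (L, E)
```

The λM term is what lets the relation-extraction result steer the attention weights, and hence the next round's entity recognition.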

Further, in step h) the total loss of the model is computed by the formula Loss = Σ_i α_i·CRF_loss_i + Σ_m β_m·Relation_loss_m, where the α_i and β_m are weight coefficients and are all real numbers, and the total loss is optimized by stochastic gradient descent for step-by-step training.
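As a worked arithmetic example of the total loss Σ_i α_i·CRF_loss_i + Σ_m β_m·Relation_loss_m, with hypothetical per-round losses and weights (all values below are illustrative only, not from the invention):

```python
crf_losses = [1.2, 0.9, 0.7]        # hypothetical CRF_loss per round (K = 3)
relation_losses = [0.8, 0.6, 0.5]   # hypothetical Relation_loss per round
alpha = [1.0, 1.0, 1.0]             # hypothetical weights α_i
beta = [0.5, 0.5, 0.5]              # hypothetical weights β_m

total_loss = (sum(a * l for a, l in zip(alpha, crf_losses))
              + sum(b * l for b, l in zip(beta, relation_losses)))
assert abs(total_loss - 3.75) < 1e-9   # 2.8 + 0.95
```

Summing the losses of every round, rather than only the last, supervises each iteration of the fusion loop directly.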

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
