Domain map entity and relationship combined extraction method and system based on pre-training model

Document No.: 1952821    Publication date: 2021-12-10

Note: This technology, "Domain map entity and relationship combined extraction method and system based on a pre-training model" (一种基于预训练模型的领域图谱实体和关系联合抽取方法及系统), was created by 朱静丹 (Zhu Jingdan) and 姚俊杰 (Yao Junjie) on 2021-08-12. The invention discloses a domain-graph entity and relation joint extraction method based on a pre-training model, comprising the following steps. Step A: crawl insurance-domain text from insurance company websites, clean and label the data, and establish an initial data set U and a candidate relation set V. Step B: construct a joint learning framework for relation discrimination and entity pair extraction on top of a pre-training model, and train and test the model. Step C: screen the newly extracted data produced during testing and use it to augment the training set. Step D: iterate with the updated data set until the model is stable. Step E: export and process the triple data and construct the domain knowledge graph. The invention also provides a system for realizing the method. The method interacts the target relation with every word of the text to accurately generate all possible entity pairs, naturally avoiding the entity overlap problem while extracting multiple relations and multiple entity pairs.

1. A domain map entity and relation combined extraction method based on a pre-training model is characterized by comprising the following steps:

step A: acquiring original data, labeling it, dividing it into a training set and a test set, and establishing a preliminary small-scale insurance-labeled data set U and a candidate relation set V;

step B: constructing a joint learning framework for relation discrimination and entity pair extraction based on a pre-training model, and training and testing the joint learning framework model;

step C: screening the newly extracted data generated during testing, adding the screened data to the training set, and augmenting and updating the training set;

step D: iterating with the updated data set until the model is stable;

step E: exporting and cleaning the triple data, and constructing a domain knowledge graph.

2. The method of claim 1, wherein step A further comprises the steps of:

step A1: crawling insurance-domain text from insurance company websites, cleaning and labeling the data, dividing it into a training set and a test set at a 7:3 ratio, and establishing a small-scale insurance-labeled data set U;

step A2: retaining common relations from the semi-structured text according to rules to form a candidate relation set V.

3. The method according to claim 2, wherein in step A1, a crawler is used to capture semi-structured data from specific insurance websites; the semi-structured data is ultimately retained uniformly in text form, and data cleaning and labeling are performed at the same time, including effective text paragraph screening and sentence-level triple data labeling, to construct the small-scale insurance-labeled data set U; the semi-structured data comprises product introductions and comparative analysis objects; each product introduction comprises a product name and product clauses; the comparative analysis objects comprise the guarantee years, the claim payment proportion, and the claim-free amount.

4. The method of claim 2, wherein in step A2, the rules refer to template-matching extraction based on manually summarized templates combined with synonyms and the re module; the common relations comprise dangerous varieties, guarantee types, insurance channels, payment years, guarantee responsibilities, payment types, insurance budgets, annual insurance fees, insurance ages, waiting periods, hesitation periods, payment periods, accidents/casualties, exemption responsibilities, occupation grades, insurable occupation ranges, highest insurance amounts, special rights and interests, health notice, normal underwriting, policy years, terminal diseases, guarantee years, payment proportions, non-loss amounts, and insurance companies; the candidate relation set V provides assistance for relation discrimination.

5. The method of claim 1, wherein step B further comprises the steps of:

step B1: taking the sentence as the input of a pre-training model to obtain a coding vector of the whole sequence;

step B2: discriminating relations using binary classifiers constructed with convolutional neural networks (CNNs);

step B3: extracting all possible entity pairs for the relations produced by the binary classifiers, using an attention mechanism and a long short-term memory network (LSTM);

step B4: performing joint training, calculating the loss, and iterating the model.

6. The method of claim 5, wherein in step B1, a Transformer-based network is used so that the pre-training model encoding module efficiently captures context semantic information; the sentence S = [w_1, …, w_n], where n denotes the length of the sentence, is used as input to obtain a feature vector representation of the sentence sequence; using the pre-training model BERT as the basic encoder, the context representation x_i of each word w_i is obtained, and the BERT output is as follows:

{x_1, …, x_n} = BERT({w_1, …, w_n})

wherein the feature code x_i of each word in the sentence is the sum of the corresponding token, segment, and position embeddings.

7. The method according to claim 5, wherein in step B2, the relation classification part of the binary classifier is used to identify the relation types contained in the text, and the output of the binary classifier represents the probability distribution over whether the corresponding relation is a possible relation:

P=Softmax(MaxPool(Conv(X))),

where P is the output probability distribution, Softmax(·) is the activation function, MaxPool(·) is the max-pooling operation, Conv(·) is the convolution operation, and X = [x_1, …, x_n] is the encoded representation of the sentence.

8. The method of claim 5, wherein in step B3, given a text and the target relation types output by the binary classifier, all possible entity pairs are extracted; an entity is determined by identifying the start and end position indices of its words in the text, based on the attention weight α_t of the current word in the sentence obtained by the attention mechanism and the hidden state d_t of the LSTM decoder; the model explores all possible relations at once, predicting all possible entity pairs for a given relation.

9. The method of claim 5, wherein in step B4, the whole model is constructed in an end-to-end fashion, and joint training is implemented from text input to the final relation and entity pair output.

10. The method of claim 1, wherein in step C, the newly extracted data generated during testing is screened and then used to augment the training set; the screening comprises filtering out erroneous data and selecting representative or first-appearing data for addition.

11. The method of claim 1, wherein in step D, the model is retrained and retested with the updated data set; training stops when either of the following two conditions occurs, at which point the model is stable and finally tends to its optimum, and otherwise training continues: 1) the joint loss L is less than or equal to 0.1, or the F1 score is greater than or equal to 0.8; or 2) after the training data is updated, the model's performance fails to improve twice in a row;

the F1 score is a measure of the classification problem and is a harmonic mean of the accuracy and the recall rate, the maximum is 1, and the minimum is 0;

the joint loss is calculated by the following formula:

L = λ·L_rel + (1 − λ)·L_ent

where λ is a hyperparameter used to balance relation discrimination and entity pair recognition; L_rel is the relation discrimination loss; L_ent is the entity pair recognition loss; each part is computed with a cross-entropy loss function.

12. The method of claim 1, wherein in step E, the triple data is represented as <head entity, relation, tail entity>; the cleaning operation refers to error correction, deduplication, and denoising of the data, assisted by manual processing; the domain knowledge graph allows the extracted graph data to be perceived visually, facilitating further analysis.

13. A domain graph entity and relationship joint extraction system based on a pre-trained model, the system being configured to implement the method according to any one of claims 1 to 12, the system comprising:

the data acquisition module is used for acquiring data information of the public insurance website, and screening and marking the data information to form a small-scale insurance marking data set U and a candidate relationship set V;

the relation judging module is used for judging the relation existing in each input sentence;

the entity pair identification module is used for identifying all entity pairs in the sentence according to the relationship obtained by judgment;

the data amplification module is used for continuously adding training data and updating a training set of the model;

the map construction module is used for finishing triple data export and insurance map construction;

a BERT encoding module, configured to effectively capture context semantic information and take the sentence as the input of the pre-training model to obtain the feature vector representation of the sentence sequence.

Technical Field

The invention belongs to the technical field of big data and relates to a domain-graph entity and relation joint extraction method and system based on a pre-training model, applying deep learning to the research and analysis involved in acquiring domain-graph triple data.

Background

With the development of the mobile internet, everything can be interconnected, and the data generated by these interconnections is growing explosively; such data is exactly the raw material for analyzing relationships. In the mobile internet era, the relationships between individuals are bound to become an important part of the deep analysis people require, and wherever relationship analysis is needed, knowledge graphs are likely to find a use. From the early days of Google search to today's chatbots, big-data risk control, securities investment, intelligent healthcare, adaptive education, and recommender systems, knowledge graphs are involved everywhere. A knowledge graph is a special kind of graph data that is semantic and reusable: once acquired, knowledge-graph data can be reused by applications in many domains, which is the motivation for building knowledge-graph services. Owing to the particularities of its structure, its popularity in the technical field has also risen year by year.

Therefore, the problem of acquiring graph data is very important. The usual criterion for judging whether a knowledge graph works well is its data diversity and data scale. The process of data acquisition, cleaning, extraction, and even matching and fusion is an important part of constructing a knowledge graph, and completing data extraction well is a particularly critical step.

The development of deep learning greatly helps the analysis of such problems. Because graph data comes in many types and from diverse sources, with implicit associations among the data, traditional methods are ill-suited to modeling multi-feature, multi-source scenarios; deep learning, with its distinctive multilayer network structure, is good at modeling and analyzing multi-feature, multi-source data, thereby yielding graph data with greater information content and higher research value.

Existing research focuses more on entity recognition and relation prediction as two separate sub-problems: the extraction of the whole triple is divided into two independent sub-problems and the models are trained separately. But this ignores the important joint characteristics between the steps, complicates the graph construction process, cannot realize joint training, and fails to finish the extraction work as a single problem.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention aims to provide a domain map entity and relationship combined extraction method and system based on a pre-training model.

In the data joint extraction method, based on domain knowledge text, high-quality text paragraphs and common relations are retained through data cleaning after the original data is obtained. In general, some labeled data is inevitably needed to supervise the model, and the training set is augmented during actual training. Of course, model selection and tuning also greatly influence the final result.

The invention provides a domain map entity and relation combined extraction method based on a pre-training model, which comprises the following steps of:

step A: obtaining original data, labeling it, dividing it into a training set and a test set, and establishing a preliminary small-scale insurance-labeled data set U and a candidate relation set V, specifically comprising the following steps:

step A1: crawling insurance-domain text from insurance company websites, cleaning and labeling the data, dividing it into a training set and a test set at a 7:3 ratio, and establishing a small-scale insurance-labeled data set U;

step A2: retaining common relations from the semi-structured text according to rules to form a candidate relation set V;

step B: constructing a joint learning framework for relation discrimination and entity pair extraction based on a pre-training model, and training and testing the model, specifically comprising the following steps:

step B1: taking the sentence as the input of a pre-training model to obtain a coding vector of the whole sequence;

step B2: discriminating relations using binary classifiers constructed with convolutional neural networks (CNNs);

step B3: extracting all possible entity pairs for the relations produced by the binary classifiers, using an attention mechanism (Attention) and a long short-term memory network (LSTM);

step B4: performing joint training, calculating the loss, and iterating the model;

step C: screening the newly extracted data generated during testing, adding the screened data to the training set, and augmenting and updating the training set;

step D: repeating iteration by using the updated data set until the model is stable;

step E: and exporting and cleaning the triple data, and constructing a domain knowledge graph.

In step A1, a crawler is used to capture semi-structured data such as product introductions and comparative analysis objects from specific insurance websites, which are ultimately retained uniformly in text form. A product introduction comprises the product name, product clauses, and the like; the comparative analysis objects comprise the guarantee years, the claim payment proportion, the claim-free amount, and the like.

At the same time, data cleaning and labeling are performed, including screening of effective text paragraphs and sentence-level triple data labeling, to construct the small-scale insurance-labeled data set U.
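The labeled data set U is split into training and test sets at the 7:3 ratio described above. A minimal sketch of that split is shown below; the helper name and sample data are illustrative, not part of the original disclosure.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Shuffle the labeled samples and split them into a training set
    and a test set at the given ratio (7:3 by default, as in step A1)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy example: 10 labeled sentences -> 7 for training, 3 for testing.
data = [f"sentence_{i}" for i in range(10)]
train, test = split_dataset(data)
```

A fixed seed keeps the split reproducible across iterations of the training loop.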

In step A2, the rules refer to template-matching extraction based on manually summarized templates combined with synonyms and the re module; the common relations comprise dangerous varieties, guarantee types, insurance channels, payment years, guarantee responsibilities, payment types, insurance budgets, annual insurance fees, insurance ages, waiting periods, hesitation periods, payment periods, accidents/casualties, exemption responsibilities, occupation grades, insurable occupation ranges, highest insurance amounts, special rights and interests, health notice, normal underwriting, policy years, terminal diseases, guarantee years, payment proportions, non-loss amounts, and insurance companies; the candidate relation set V assists relation discrimination and raises the confidence of relation discrimination in subsequent steps, avoiding an excess of identified relations.
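The template-matching extraction with the re module can be sketched as follows; the regex patterns and product name are hypothetical stand-ins for the manually summarized templates, which in the actual system would cover the full relation list and Chinese text.

```python
import re

# Hypothetical templates mapping a candidate relation to a regex that
# captures its value from semi-structured product text (step A2 sketch).
TEMPLATES = {
    "waiting period": re.compile(r"waiting period[:\s]*(\d+\s*days)"),
    "claim payment proportion": re.compile(r"claim payment proportion[:\s]*(\d+%)"),
}

def match_relations(product, text):
    """Return <head entity, relation, tail entity> triples found by
    template matching; unmatched templates are simply skipped."""
    triples = []
    for relation, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            triples.append((product, relation, m.group(1)))
    return triples

triples = match_relations(
    "An e Life",
    "waiting period: 30 days; claim payment proportion: 100%")
```

Synonym lists would normally be folded into each pattern as alternations before compiling.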

in the method, the structure and the application of the model are the key points of the invention.

Specifically, web page data has certain rules and structures; for example, under the page of a given insurance product, each block introduces a specific relation, including the guarantee period, the claim payment proportion, the claim-free amount, and so on. The relevant part of each paragraph can be extracted according to rules, and different data formats can be given different rules. In step B1 of the invention, a Transformer-based network is used so that the pre-training model encoding module can effectively capture context semantic information; the sentence S = [w_1, …, w_n], where n denotes the length of the sentence, is used as the input of the pre-training model to obtain the feature vector representation of the sentence sequence. To obtain the context representation x_i of each word w_i, different Transformer-based networks can be used; in the present invention the pre-training model BERT is used as the basic encoder, and the BERT output is as follows:

{x_1, …, x_n} = BERT({w_1, …, w_n})

Here, as is common, the feature code x_i of each word in the sentence is the sum of the corresponding token, segment, and position embeddings.
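The summed input representation can be sketched in pure Python with toy dimensions; real BERT additionally applies layer normalization and dropout, and the embedding tables here are illustrative values, not trained weights.

```python
# Minimal stand-in for the BERT input layer: the code vector of each word
# is the element-wise sum of its token, segment, and position embeddings.
DIM = 4  # toy embedding dimension (BERT-base uses 768)

def embed(token_emb, segment_emb, position_emb, token_ids, segment_ids):
    sequence = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        x = [token_emb[tok][d] + segment_emb[seg][d] + position_emb[pos][d]
             for d in range(DIM)]
        sequence.append(x)
    return sequence

token_emb = {0: [0.1] * DIM, 1: [0.2] * DIM}   # two-word toy vocabulary
segment_emb = {0: [0.01] * DIM}                # single-sentence input
position_emb = [[float(p)] * DIM for p in range(8)]
X = embed(token_emb, segment_emb, position_emb, [0, 1], [0, 0])
```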

In step B2 of the invention, binary classifiers constructed with convolutional neural networks (CNNs) are used to discriminate relations; the relation classification part of the binary classifier identifies the relation types contained in the text. A binary classifier is constructed with the CNN, and its output is the probability distribution over whether the corresponding relation is a possible relation:

P=Softmax(MaxPool(Conv(X)))

where P is the output probability distribution, Softmax(·) is the activation function, MaxPool(·) is the max-pooling operation, Conv(·) is the convolution operation, and X = [x_1, …, x_n] is the encoded representation of the sentence.
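The pipeline P = Softmax(MaxPool(Conv(X))) can be sketched in pure Python with toy dimensions; a real implementation would use a deep learning framework, and the filter weights below are illustrative, not trained parameters.

```python
import math

def conv1d(X, filt, bias=0.0):
    """Valid 1-D convolution over a token sequence X (n x d) with a
    window-w filter (w x d); returns one feature per window position."""
    w = len(filt)
    feats = []
    for i in range(len(X) - w + 1):
        s = bias
        for j in range(w):
            s += sum(a * b for a, b in zip(X[i + j], filt[j]))
        feats.append(s)
    return feats

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]

def relation_probs(X, filters):
    """One filter per class ('relation present' / 'relation absent'):
    max-pool each filter's feature map over time, then softmax."""
    logits = [max(conv1d(X, f)) for f in filters]
    return softmax(logits)

X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy encoded sentence, n=3, d=2
filters = [[[1.0, 0.0], [0.0, 1.0]],       # window size 2, toy weights
           [[0.5, 0.5], [0.5, 0.5]]]
P = relation_probs(X, filters)
```

In the actual model one such binary classifier is evaluated per candidate relation in V.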

In step B3, all possible entity pairs are extracted for the relations produced by the discriminator, using an attention mechanism (Attention) and a long short-term memory network (LSTM). Given a text and the target relation types output by the binary classifier, all possible entity pairs are extracted. As in most recognition methods, an entity is determined by identifying the start and end position indices of its words in the text, based on the attention weight α_t of the current word in the sentence obtained by the attention mechanism and the hidden state d_t of the LSTM decoder; the model can explore all possible relations at once, predicting all possible entity pairs for a given relation.

in step B4, the invention performs joint training, calculates the loss and iterates the model. And constructing the whole model into an end-to-end block mode, and realizing joint training from text input to final relation and entity pair output.

In step C, the newly extracted data generated during testing is screened and then used to augment the training set; screening comprises filtering out erroneous data and selecting representative or first-appearing data for addition.

In step D of the invention, the model is retrained and retested with the updated data set; when either of the following two conditions occurs, training stops, at which point the model is stable and finally tends to its optimum, and otherwise the model continues to be trained: 1) the joint loss L is less than or equal to 0.1, or the F1 score is greater than or equal to 0.8; or 2) after the training data is updated, the model's performance fails to improve twice in a row;

the F1 score is a measure of the classification problem and is a harmonic mean of the accuracy and the recall rate, the maximum is 1, and the minimum is 0;

the joint loss is calculated by the following formula:

L = λ·L_rel + (1 − λ)·L_ent

where λ is a hyperparameter used to balance relation discrimination and entity pair recognition; L_rel is the relation discrimination loss; L_ent is the entity pair recognition loss; each part is computed with a cross-entropy loss function.
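The joint loss and the stopping conditions of step D can be sketched as follows; the cross-entropy here is reduced to the probability assigned to the true class, and λ = 0.4 follows the parameter settings given later in the description.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy loss given the probability assigned to the true class."""
    return -math.log(p_true)

def joint_loss(p_rel, p_ent, lam=0.4):
    """L = lambda * L_rel + (1 - lambda) * L_ent."""
    return lam * cross_entropy(p_rel) + (1 - lam) * cross_entropy(p_ent)

def should_stop(loss, f1, no_improve_rounds):
    """Stop training when L <= 0.1, F1 >= 0.8, or the model fails to
    improve twice in a row after a training-set update (step D)."""
    return loss <= 0.1 or f1 >= 0.8 or no_improve_rounds >= 2

# Toy check: confident predictions on both sub-tasks give a small joint loss.
L = joint_loss(p_rel=0.9, p_ent=0.95)
```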

In step E, the triple data is exported and the domain knowledge graph is constructed; the triple data is expressed as <head entity, relation, tail entity>, for example <Darwin No. 3, insurance company, Believable Life> and <Safe e Life, waiting period, 30 days>;

the cleaning operation means that the data are subjected to error correction, duplicate removal and denoising for better display effect and data reuse and assisted with manual processing because the extracted data always have partial data errors; the domain knowledge graph can be used for visually sensing the extraction condition of graph data under the visual condition, and further analysis is facilitated.

During implementation, unlike the prior art that divides relation and entity extraction into two independent tasks, the method provides a novel lightweight framework and establishes a joint relation and entity extraction model, with a notable effect on triple extraction for domain knowledge. Meanwhile, existing methods either do not consider the entity overlap problem or cannot generate all entity pairs. The method of the present invention interacts the target relation with each word of the text to accurately generate all possible entity pairs, naturally avoiding the entity overlap problem, while extracting multiple relations and multiple entity pairs.

Entity overlap means that one entity in a sentence can be matched to multiple relations. For example, in "Zongzi originates from China, whose capital is Beijing", the triples Zongzi-origin-China and China-capital-Beijing both occur, so "China" is extracted repeatedly.
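The overlap example above can be made concrete: "China" participates in two triples, once as tail and once as head. A relation-first extractor queries each candidate relation separately, so sharing an entity across triples poses no conflict.

```python
# The two triples from the Zongzi example; "China" appears in both,
# which is exactly the entity overlap case described in the text.
triples = {
    ("Zongzi", "origin", "China"),
    ("China", "capital", "Beijing"),
}
entities = [e for (h, _, t) in triples for e in (h, t)]
overlapping = {e for e in entities if entities.count(e) > 1}
```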

The invention initially uses BERT as the encoder to obtain context representations rich in semantic information, without training a pre-training model from scratch. The CNN can effectively discriminate relations without imposing an excessive parameter load on the model. Ordinary models treat all parts as equally important, whereas the attention-based model assigns different importance to different parts and is therefore more adaptive in identifying entity pairs.

Based on the method, the invention also provides a domain map entity and relationship combined extraction system based on the pre-training model, and the system comprises:

the data acquisition module is used for acquiring data information of the public insurance website, and screening and marking the data information to form a small-scale insurance marking data set U and a candidate relationship set V;

the relation judging module is used for judging the relation existing in each input sentence;

the entity pair identification module is used for identifying all entity pairs in the sentence according to the relationship obtained by judgment;

the data amplification module is used for continuously adding training data and updating a training set of the model;

and the map construction module is used for finishing triple data export and insurance map construction.

The system further comprises a BERT encoding module, used to effectively capture context semantic information and take the sentence as the input of the pre-training model to obtain the feature vector representation of the sentence sequence.

The beneficial effects of the invention are as follows: through data crawling, cleaning, and small-scale data set construction, open-domain data can be acquired while avoiding excessive labor cost in the initial stage; model selection and joint training are realized, and by using the ideas and methods of data mining and deep learning, a high-quality model is finally obtained and triple data usable for graph construction is successfully extracted. Compared with existing research, the method focuses more on joint training, makes full use of the interaction between relation discrimination and entity recognition, avoids splitting one problem into two independent ones, and reduces complexity.

Compared with the prior art, the method avoids feature engineering tasks that require professional knowledge and expert experience, automatically extracts triple data with a more scientific and reasonable data-driven approach, indirectly reduces labor cost, and is easy to understand; the prediction performance of the entity-relation extraction model has been verified to reach an advanced level.

The innovation of the method is to learn relation and entity extraction jointly, making full use of the semantic understanding capability of the pre-training model; it can scale up from a small amount of data, updating the training set during training to gradually improve the model's extraction capability. Finally, experiments in an actual knowledge-graph construction scenario verify the effectiveness of the method.


Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a diagram of model extraction examples according to the present invention.

FIG. 3 is a chart of the relation class proportions according to the present invention.

Fig. 4 is an exemplary illustration of a domain map of the present invention.

FIG. 5 is a schematic diagram of the system of the present invention.

Detailed Description

The invention is further described in detail with reference to the following specific examples and the accompanying drawings. Except for the contents specifically mentioned below, the procedures, conditions, and experimental methods for carrying out the invention are general and common knowledge in the art, and the invention is not particularly limited thereto.

In the course of the implementation of the present invention,

1) Evaluation metrics: the model evaluates extraction results using standard precision, recall, and F1 scores. A triple is regarded as correctly identified when both its relation type and its entity pair are correctly identified; judging correctness is in essence judging whether the classification is correct.

2) Parameter settings: word embeddings use the BERT-base pre-training model. The number of LSTM units and the number of filters in the CNN classifier are both 100, the convolution window size is 3, the dense layer has a 100-dimensional hidden layer, and the dropout probability is set to 0.6. The learning rate is set to 0.001. The trade-off parameter λ in the loss function is set to 0.4. The Adam method is used to optimize the parameters during training, with a batch size of 32.
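The hyperparameters above can be collected into a single configuration; the key names are illustrative and not tied to any specific framework.

```python
# Experiment configuration from the parameter settings above;
# key names are illustrative, values are those stated in the text.
CONFIG = {
    "encoder": "BERT-base (Chinese)",
    "lstm_units": 100,
    "cnn_filters": 100,
    "conv_window": 3,
    "dense_hidden": 100,
    "dropout": 0.6,
    "learning_rate": 0.001,
    "loss_lambda": 0.4,     # trade-off parameter in the joint loss
    "optimizer": "Adam",
    "batch_size": 32,
}
```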

Example 1

Referring to fig. 1, a flow chart of the operation of the method of the present invention is illustrated.

The method for extracting the knowledge graph entity and the relation based on the pre-training model comprises the following steps:

(1) obtaining original data, labeling it, dividing it into a training set and a test set, and establishing a preliminary small-scale insurance-labeled data set U and a candidate relation set V, specifically comprising the following steps:

(1.1) crawling insurance-domain text from insurance company websites: a crawler captures the product introductions and comparative analyses of specific insurance websites, which are finally retained uniformly in text form;

(1.2) data cleaning: screening key paragraphs out of the acquired text and removing useless information such as header/footer information and pictures; small-scale labeling: selecting representative sentence segments, labeling them manually, dividing them into a training set and a test set at a 7:3 ratio, and establishing the small-scale insurance-labeled data set U;

(1.3) reserving common relations from the semi-structured text according to rules to form a candidate relation set V;

(2) constructing a joint learning framework based on a pre-training model, and training and testing the model, specifically comprising the following steps:

and (2.1) taking the sentence as the input of a pre-training model to obtain the coding vector of the sentence sequence, wherein the adopted pre-training model is BERT Chinese.

(2.2) discriminating relations with binary classifiers constructed with convolutional neural networks (CNNs), used to recognize the relations present in the sentence and provide the basis for the subsequent entity pair recognition;

(2.3) extracting all possible entity pairs according to the relations obtained in the previous step, the core being an attention module and a long short-term memory network (LSTM);

(2.4) performing joint training, calculating the loss, and iterating the model;

(3) screening new extracted data generated in the test process, and amplifying an updated training set;

(4) repeating the iteration with the new data set until the model is stable;

(5) and exporting and cleaning the triple data, and constructing a domain knowledge graph.

Example 2

Referring to fig. 2, the model architecture used for extracting graph relations and entity pairs is divided into three modules:

(1) the pre-training model coding module:

The pre-training model encoding module can effectively capture context semantic information. Let the sentence S = [w_1, …, w_n], where n denotes the length of the sentence; it is used as input to the pre-training model to obtain the feature vector representation of the sentence sequence. To obtain the context representation x_i of each token w_i, different Transformer-based networks can be used; in the present invention the pre-training model BERT (though not limited to BERT) is used as the basic encoder, and the BERT output is as follows:

{x1, …, xn} = BERT({w1, …, wn})

Here, as is common, the feature encoding xi of each word in the sentence is the sum of the corresponding token, segment, and position embeddings.
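The embedding summation described above can be sketched as follows. The table sizes and random initialisation are toy stand-ins for BERT's learned embedding tables (Chinese BERT actually uses a ~21128-token vocabulary, 512 positions, and 768 hidden dimensions), not the real model weights:

```python
import numpy as np

np.random.seed(0)
VOCAB, MAX_LEN, HIDDEN = 100, 16, 8   # toy sizes for illustration only

# Randomly initialised embedding tables standing in for BERT's learned ones.
tok_emb = np.random.randn(VOCAB, HIDDEN)
seg_emb = np.random.randn(2, HIDDEN)      # segment A / segment B
pos_emb = np.random.randn(MAX_LEN, HIDDEN)

def embed(token_ids, segment_ids):
    """Input representation x_i = token + segment + position embedding."""
    n = len(token_ids)
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[np.arange(n)]

x = embed([2, 15, 7, 42], [0, 0, 0, 0])
print(x.shape)  # one HIDDEN-dimensional vector per input token
```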

(2) A relationship discrimination module:

The relationship discrimination module identifies the types of relations contained in the text. Because a text may contain multiple relations, relation discrimination is performed, inspired by the idea of multi-label classification, with a binary classifier constructed from a convolutional neural network (CNN). Given a text representation X ∈ R^(n×d), a binary classifier is built with the CNN, whose output is the probability distribution over whether the corresponding relation is present:

P=Softmax(MaxPool(Conv(X)))

where P is the output probability distribution, Softmax(·) is the activation function, MaxPool(·) is the max-pooling operation, Conv(·) is the convolution operation, and X = [x1, …, xn] is the encoded representation of the sentence;
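A minimal numpy sketch of the formula P = Softmax(MaxPool(Conv(X))) above. The weights are random toy values; the filter width and the number of output classes are illustrative assumptions, not the trained module's actual configuration:

```python
import numpy as np

np.random.seed(1)
n, d, k, filters = 10, 8, 3, 2   # sentence length, hidden dim, kernel width, 2 classes

X = np.random.randn(n, d)          # encoded sentence [x_1, ..., x_n]
W = np.random.randn(filters, k, d) # one convolution filter per output class
b = np.zeros(filters)

# Conv: slide each filter over windows of k consecutive token vectors.
conv = np.array([[np.sum(W[f] * X[i:i + k]) + b[f] for i in range(n - k + 1)]
                 for f in range(filters)])           # shape (filters, n-k+1)
pooled = conv.max(axis=1)                            # MaxPool over positions
P = np.exp(pooled) / np.exp(pooled).sum()            # Softmax: relation present or not

print(P.shape)  # a probability distribution over the two classes
```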

(3) An entity identification module:

Given a text and the target relation type output by the binary classifier, the goal of this module is to extract all possible entity pairs for that relation, i.e., entity-pair prediction. As in most recognition methods, an entity is determined by identifying the start and end position indices of its words in the text.

wherein the attention weight of the current word in the sentence is obtained by the attention mechanism, and d_t is the hidden state of the LSTM decoder; the model can explore all possible relations at once, predicting all possible entity pairs for a given relation;

Given a text and a target relation type output by the relation classifier, the variable-length entity recognition module aims to extract all possible entity pairs in a sequential manner. Inspired by the way pointer networks locate positions, the model determines an entity by identifying the start and end position indices of its words in the text, so entity pairs are generated from a series of indices: every two indices identify an entity, and every two entities in order form an entity pair. In this paradigm the model can explore all possible relations at once, unlike previous work that must predict target relations in multiple passes.

The model first predicts all possible relations; then, for each target relation, it proceeds like a pointer network, sequentially generating the boundaries of all head and tail entities (i.e., the positions where entities start and end), and finally produces all possible entity pairs as the extraction result. Therefore, for each candidate relation type, only one pass of relation detection is needed to extract all possible entity pairs, avoiding repeated relation recognition. Entity boundaries are generated sequentially at arbitrary positions in the text, which allows entities to participate freely in different triples.

In summary:

During LSTM iteration, the hidden state h_(t-1) of the previous step feeds the attention network to compute the attention weight of each position in the input sentence sequence; the position with the maximum weight is output as the entity pointer position at the current step t, so that the boundaries of the entity pair are found in sequence.

In this module, the representation obtained from the BERT encoding block first passes through the attention layer to obtain a new representation; at each position of the text, the attention mechanism yields a weight representing the degree of match between the current feature vector and the target relation type, which assists in judging whether the position is the start or end of an entity in the entity pair.
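A minimal sketch of one pointer step as described above, with random toy weights. The bilinear scoring form is an assumption for illustration; the patent states only that the decoder hidden state attends over the encoded sentence and the argmax position becomes the pointed-to boundary:

```python
import numpy as np

np.random.seed(2)
n, d = 10, 8
H = np.random.randn(n, d)   # encoder outputs for the sentence (from BERT)
W_a = np.random.randn(d, d) # toy attention parameters (assumed bilinear form)

def point(h_dec):
    """One decoding step: score every position against the decoder state,
    normalise to attention weights, and point at the argmax as a boundary."""
    scores = H @ W_a @ h_dec
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return int(alpha.argmax()), alpha

h_t = np.random.randn(d)    # stand-in for the LSTM decoder hidden state d_t
idx, alpha = point(h_t)
# Four successive pointer steps would yield
# (head_start, head_end, tail_start, tail_end) for one entity pair.
```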

Example 3

Referring to fig. 3, the proportion of each relation in the three sets of finally extracted data is shown.

The original text is based on related products in the insurance field and is therefore highly targeted; the common relation types in insurance-product descriptions are limited, so the model can achieve better results in actual extraction, eventually taking the proportions shown in fig. 3.

The most common relations are generally the first ten; the frequency of later relations drops sharply, and all low-frequency relations are grouped as "other", whose combined proportion is almost equal to that of the most frequent relation. It can be seen that when a graph is constructed in a specific field, the relation types are likely to be concentrated, which helps researchers use the data for subsequent research and analysis.

Example 4

At the very beginning of the construction of the original data set, only a small-scale data set is built in order to control human effort. Data cleaning performs regularized processing on the captured page information and retains the effective paragraphs. During staged training, the model is required to predict text data outside the data-set range and extract triple information. To improve the extraction capability of the model, representative data are screened out after manual processing, labeled, and added to the training set; this increases the diversity of data attributes while amplifying the data volume, and through repeated iteration the model learns better representations.
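The staged procedure in this example can be sketched as a loop. Here `train`, `extract`, and `screen` are trivial stubs standing in for model training, triple extraction, and manual screening-plus-labeling respectively; none of them is part of the actual implementation:

```python
def train(data):
    """Stub trainer: the 'model' is just the training-set size."""
    return len(data)

def extract(model, unlabeled):
    """Stub extractor: propose every unlabeled item as a candidate triple."""
    return list(unlabeled)

def screen(candidates):
    """Stub screening: keep every other candidate (stands in for manual review)."""
    return candidates[::2]

def iterative_training(train_set, unlabeled, max_rounds=5):
    """Train, extract from out-of-set text, screen and label new data,
    augment the training set, and repeat until the set stops growing."""
    prev_size, model = -1, None
    for _ in range(max_rounds):
        model = train(train_set)
        accepted = screen(extract(model, unlabeled))
        unlabeled = [u for u in unlabeled if u not in accepted]
        train_set = train_set + accepted
        if len(train_set) == prev_size:  # no new data: model considered stable
            break
        prev_size = len(train_set)
    return model, train_set

model, final_set = iterative_training(["s1", "s2"], ["u1", "u2", "u3"])
```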

Finally, on the independently constructed insurance-field data set, the proposed method is compared with a Match-LSTM baseline model and two well-performing relation-entity extraction models that can be used after adjustment. As can be seen from Table 1, the proposed method achieves an obvious improvement in effect.

TABLE 1

Example 5

Referring to fig. 4, an exemplary illustration of an insurance map constructed from the extracted triple data of the insurance domain is shown.

In actual determination, because domain knowledge is not unconstrained as in an open domain, the experiment shows that although the extracted insurance relations are varied, most are concentrated: the common insurance relations number a few dozen, the less common ones another few dozen, while the remaining relations occur rarely or are noise data.

The final experimental results show that, when constructing a domain knowledge graph, the invention can extract triple data (relations and entity pairs) from incompletely processed text data while consuming little manpower. The model does not split named entity recognition and relation prediction into two independent subtasks; instead, it treats them as one complete extraction problem and builds a joint model for joint training. The method not only controls manpower consumption and simplifies the process, but also obtains a more obvious effect, with clearer results and more definite relation judgments, and is better suited to field data.

Example 6

Referring to FIG. 5, a system of the present invention is shown.

The system comprises a data acquisition module, which mainly collects data from public insurance websites and forms a small-scale labeled insurance data set through screening and labeling. The data of the data set are encoded by the pre-trained model BERT and enter the relation discrimination module, which outputs the relations present in each input sentence; they then enter the entity-pair identification module, which identifies all corresponding entity pairs in the sentence according to the relations judged by the previous module. If the expected effect is achieved, the process stops and the triple data are output to construct the knowledge graph; if not, the new data screened and labeled by the data amplification module are added to the training data, the model is retrained, and the process repeats until termination. The whole system covers the entire pipeline from raw data to graph, has a light structure and an efficient, concise model, and handles semi-structured field data well.

The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art are included in the invention without departing from the spirit and scope of the inventive concept, and are protected by the appended claims.
