Active learning framework method for entity alignment

Document No. 1964430 | Publication date: 2021-12-14

Note: this technique, "An active learning framework method for entity alignment," was designed and created by Liu Yu, Zhang Xin, Zhao Zhehuan, Liu Xuezhuang, and Chen Peng on 2021-09-17. Abstract: The invention provides an active learning framework method for entity alignment. Starting from narrowing the scope of entity alignment and considering the lack of labeled data in production environments, it trains a model by having two learning engines with different perspectives confront and reinforce each other, so as to complete the entity alignment task. The method mainly comprises: entity blocking, training set generation, an active learning process, and candidate set generation and prediction. The active learning process contains models with different emphases on entities, which can consider entity attributes and relationships respectively. For the problem that metrics on unlabeled data are difficult to measure, a continued-training mechanism is proposed as a complement, so that the entity alignment model can be applied to unlabeled data without losing performance.

1. An active learning framework method for entity alignment, comprising the steps of:

s1, entity blocking:

firstly, on the input entity set, coarsely matching entities according to their associated information; screening out potentially matched entity pairs from all entity pairs as candidates; setting blocking functions, wherein each blocking function determines one block, entity pairs are stored in the blocks, and the same entity pair may exist in different blocks simultaneously;

s2, training set generation:

selecting some entity pairs from each block according to the result of the entity blocking, generating further entity pairs by random matching, and then having these two parts of entity pairs jointly form the training set, namely an unlabeled data set, to be input into the active learning process;

s3, active learning process:

setting different learning engines, namely classifier models with different emphases, according to the scenario; when the relationship information of the entities is missing or sparse, setting two attribute-based models as the learning engines; when both the relationship and attribute information of the entities are complete, setting an attribute-based model and a relationship-based model as the learning engines;

then, submitting the entity pairs on which the learning engines' predictions over the training set conflict most to an expert, who judges whether to add them to the labeled sample set; following the idea of co-training, among the predictions of the attribute-based model and the relationship-based model, directly taking the entity pairs on which the predictions agree as labeled data and adding them to the labeled sample set; finally, training the learning engines on the labeled sample set and updating the training set;

s4, candidate set generation and prediction:

according to the blocking of the entities in step S1 and the labeled sample set obtained in step S3, determining which blocks form the final candidate set to be predicted by using a branch-and-bound algorithm with maximum positive-example coverage and minimum data size as objectives; then having both learning engines predict on the candidate set and merging the two results into the final prediction result;

s5, the expert reviews the entity alignment result; if the result is unsatisfactory, the process returns to the active learning process of step S3, steps S3 to S4 are repeated, and training continues: the learning engines load the parameters saved when the previous training stopped, and the training set is the same as when training stopped; after some more data are labeled, the learning engines obtain new training data and predict a new candidate set, which the expert reviews again; this cycle repeats until the result is satisfactory.

2. The active learning framework method for entity alignment according to claim 1, wherein the blocking function of step S1 is implemented by a hash function, Canopy clustering, TF-IDF, the Levenshtein edit distance, or the Red-Blue Set Cover algorithm.

Technical Field

The invention belongs to the technical field of knowledge graphs, and particularly relates to an entity alignment method based on the principle of active learning.

Background

In recent years, knowledge graphs have been applied in more and more fields, and their construction and refinement require the integration of multi-source knowledge. Entity alignment is an important process in multi-source data fusion. When data come from different knowledge base systems, it is necessary to determine whether they describe the same entity, fuse the relevant information, and finally generate a single entity in the target knowledge graph. This is generally treated as a binary classification problem: finding the most similar entity, or judging whether two entities are the same. The entity name, the attributes carried by the entity, and its topological relationship information can all serve as useful features. Meanwhile, the number of entities to compare is limited through rules or other methods, narrowing the range of entities to be matched.

Active learning is a subfield of machine learning, also called query learning or optimal experimental design. The training process includes a human-in-the-loop step: a query strategy selects suitable data and hands it to a human for labeling. Active learning selects some samples from the unlabeled sample set; after labeling, these samples are added to the labeled sample set and the model continues training, reducing the cost of manual labeling. By labeling only a small amount of data, the model's performance can match or even exceed that of a model trained on fully labeled data; thus active learning reduces the cost of data labeling while preserving the learning capability of the model.

In existing entity alignment research, many methods use only the attributes of entities or only their topological relations. Researchers have also noticed that information from a single aspect cannot fully express the meaning of an entity, and therefore use several kinds of entity information together. However, these methods depend on a large amount of labeled data to train the model, so they have significant limitations and drawbacks in practical application.

Disclosure of Invention

To overcome the defects of the prior art, the invention starts from the problem that entity alignment models are difficult to deploy in real environments due to the lack of labeled data, and provides an active learning framework for entity alignment with two independent models. The two models are binary classifiers that judge whether two entities are aligned, and each can be based on any machine learning model or heuristic algorithm. The two models interact with and reinforce each other through the active learning process, while co-training is used to increase the amount of training data.

In order to achieve the above object, the present invention provides an active learning framework method for entity alignment, comprising the following steps:

s1, entity blocking:

firstly, on the input entity set, coarsely matching entities according to their associated information; screening out potentially matched entity pairs from all entity pairs as candidates; setting blocking functions, wherein each blocking function determines one block, entity pairs are stored in the blocks, and the same entity pair may exist in different blocks simultaneously;

s2, training set generation:

selecting some entity pairs from each block according to the result of the entity blocking, generating further entity pairs by random matching, and then having these two parts of entity pairs jointly form the training set, namely an unlabeled data set, to be input into the active learning process;

s3, active learning process:

setting different learning engines, namely classifier models with different emphases, according to the scenario; when the relationship information of the entities is missing or sparse, setting two attribute-based models as the learning engines; when both the relationship and attribute information of the entities are complete, setting an attribute-based model and a relationship-based model as the learning engines;

Then, the entity pairs on which the learning engines' predictions over the training set conflict most are submitted to an expert, who judges whether to add them to the labeled sample set. Because the learning engines need more labeled data, manual labeling alone would yield too few labels; therefore, following the idea of co-training, among the predictions of the attribute-based model and the relationship-based model, the entity pairs on which the predictions agree are taken directly as labeled data and added to the labeled sample set. Finally, the learning engines are trained on the labeled sample set and the training set is updated.

S4, candidate set generation and prediction:

according to the blocking of the entities in step S1 and the labeled sample set obtained in step S3, determining which blocks form the final candidate set to be predicted by using a branch-and-bound algorithm with maximum positive-example coverage and minimum data size as objectives; then having both learning engines predict on the candidate set and merging the two results into the final prediction result;

s5, the expert reviews the entity alignment result; if the result is unsatisfactory, the process returns to the active learning process of step S3, steps S3 to S4 are repeated, and training continues: the learning engines load the parameters saved when the previous training stopped, and the training set is the same as when training stopped; after some more data are labeled, the learning engines obtain new training data and predict a new candidate set, which the expert reviews again; this cycle repeats until the result is satisfactory.

Preferably, the blocking function in step S1 is implemented by a hash function, Canopy clustering, TF-IDF, the Levenshtein edit distance, or the Red-Blue Set Cover algorithm.

The invention has the beneficial effects that:

compared with the prior art, the method can perform entity alignment without labeled data, and the learning engines are replaceable, so different scenarios can use different models. The method of the invention can be applied to the various fields that involve entity alignment.

Drawings

FIG. 1 is a general flow diagram of the active learning framework for entity alignment according to the present invention.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

As shown in FIG. 1, the present invention provides an active learning framework for entity alignment, which comprises the following steps:

(1) Entity blocking:

First, entities in the input entity set are coarsely matched according to their associated information, and potentially matched entity pairs are screened out of all entity pairs as candidates. Blocking effectively reduces the amount of computation: without blocking, every pair of entities would have to be compared to decide whether they can be aligned, so the time complexity reaches O(n²). Setting up blocks reduces the number of entity pairs to be computed. Blocking functions are set for this purpose and can be implemented by a hash function, Canopy clustering, the Red-Blue Set Cover algorithm, and the like. Each blocking function defines a block within which entity pairs are stored, and the same entity pair may exist in different blocks simultaneously.
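As an illustrative sketch (the key function and the `(id, name)` data shape below are assumptions for the example, not part of the patent text), hash-key blocking that only compares entities within a shared block might look like:

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(entities, key_fn):
    """Group entities by a blocking key and emit candidate pairs per block.

    entities: list of (entity_id, name) tuples.
    key_fn:   maps a name to a blocking key (e.g. its first 3 characters).
    """
    blocks = defaultdict(list)
    for eid, name in entities:
        blocks[key_fn(name)].append(eid)
    # Candidate pairs are compared only within a block, avoiding O(n^2)
    # comparisons over the whole entity set.
    return {k: list(combinations(ids, 2))
            for k, ids in blocks.items() if len(ids) > 1}

entities = [(1, "Beijing"), (2, "Beijin"), (3, "Shanghai"), (4, "Shangai")]
pairs = block_by_key(entities, key_fn=lambda s: s[:3].lower())
# Entities sharing the same first three characters land in the same block.
```

The same skeleton accepts any of the blocking functions named above (TF-IDF keywords, edit-distance clusters, etc.) by swapping `key_fn`.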

(2) Generating a training set:

According to the result of the entity blocking, some entity pairs are selected from each block and further entity pairs are generated by random matching; the two parts together form the training set, namely an unlabeled data set, to be input into the active learning process. Random matching supplements entity pairs missing from the blocks, making the training data more diverse. At the same time, it is ensured that the training set contains no duplicate entity pairs.
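As a sketch of this step (the sampling sizes `per_block` and `n_random` are illustrative assumptions), the two sources of pairs can be merged into a duplicate-free training set:

```python
import random

def build_training_set(blocks, entity_ids, per_block=2, n_random=5, seed=0):
    """Unlabeled training set = pairs sampled from each block plus randomly
    matched pairs; storing pairs as sorted tuples in a set guarantees the
    training set contains no duplicates."""
    rng = random.Random(seed)
    pairs = set()
    for block_pairs in blocks.values():
        for p in rng.sample(block_pairs, min(per_block, len(block_pairs))):
            pairs.add(tuple(sorted(p)))
    # Random matching supplements pairs the blocking may have missed,
    # making the training data more diverse.
    for _ in range(n_random):
        a, b = rng.sample(entity_ids, 2)
        pairs.add(tuple(sorted((a, b))))
    return pairs
```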

(3) An active learning process:

setting different learning engines, namely classifier models with different emphases, according to the scenario; when the relationship information of the entities is missing or sparse, setting two attribute-based models as the learning engines; when both the relationship and attribute information of the entities are complete, setting an attribute-based model and a relationship-based model as the learning engines;

Then, the entity pairs on which the learning engines' predictions over the training set conflict most are submitted to an expert, who judges whether to add them to the labeled sample set. Because the learning engines need more labeled data, manual labeling alone would yield too few labels; therefore, following the idea of co-training, among the predictions of the attribute-based model and the relationship-based model, the entity pairs on which the predictions agree are taken directly as labeled data and added to the labeled sample set. Finally, the learning engines are trained on the labeled sample set and the training set is updated.
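One round of this process can be sketched as follows (the confidence thresholds `hi`/`lo` and the `pair -> probability` scorer interface are assumptions for illustration; the patent only specifies agreement-based auto-labeling and expert queries on the most conflicting pairs):

```python
def active_learning_step(score_attr, score_rel, unlabeled, oracle,
                         n_query=1, hi=0.9, lo=0.1):
    """One active learning round: co-training agreement yields free labels;
    the most conflicting pairs go to the expert (oracle).

    score_attr / score_rel: pair -> alignment probability in [0, 1].
    oracle: pair -> True/False label from the human expert.
    """
    labeled = {}
    remaining = []
    for pair in unlabeled:
        a, r = score_attr(pair), score_rel(pair)
        if a >= hi and r >= hi:        # both engines confidently positive
            labeled[pair] = True
        elif a <= lo and r <= lo:      # both engines confidently negative
            labeled[pair] = False
        else:
            remaining.append((abs(a - r), pair))
    # "Most conflicting" = largest disagreement between the two engines.
    remaining.sort(reverse=True)
    for _, pair in remaining[:n_query]:
        labeled[pair] = oracle(pair)
    return labeled
```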

(4) Candidate set generation and prediction:

According to the blocking of the entities in step (1) and the labeled sample set obtained in step (3), a branch-and-bound algorithm determines which blocks form the final candidate set to be predicted, with maximum positive-example coverage and minimum data size as objectives. Then both learning engines predict on the candidate set, and the two results are combined into the final prediction result.
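The patent does not specify how the two objectives are combined; as one hedged sketch, a scalarized score (positives covered minus a size penalty with an illustrative weight `alpha`) can be searched by branch and bound over include/exclude decisions per block:

```python
def select_blocks(blocks, positives, alpha=0.01):
    """Branch-and-bound block selection.

    blocks:    {name: set of candidate pairs}
    positives: set of labeled positive pairs
    Score = positives covered - alpha * candidate-set size; the optimistic
    bound assumes every remaining positive could be covered at no size cost,
    so branches that cannot beat the incumbent are pruned.
    """
    names = list(blocks)
    best = {"score": float("-inf"), "subset": ()}

    def recurse(i, chosen, covered):
        score = len(covered & positives) - alpha * len(covered)
        if chosen and score > best["score"]:
            best["score"], best["subset"] = score, tuple(chosen)
        if i == len(names):
            return
        bound = score + len(positives - covered)  # optimistic upper bound
        if bound <= best["score"]:
            return  # prune: this branch cannot improve on the incumbent
        recurse(i + 1, chosen + [names[i]], covered | blocks[names[i]])
        recurse(i + 1, chosen, covered)

    recurse(0, [], set())
    return best["subset"]
```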

Because labeled data are lacking in real production environments, the quality of the prediction result cannot be judged through the usual metrics, so a continued-training function is provided to keep training the learning engines and ensure better results. Specifically, an expert reviews the entity alignment result; if the result is unsatisfactory, the process returns to the active learning stage to continue training, with the learning engines loading the parameters saved when the previous training stopped and the training set kept consistent with that at the stopping point. After some more data are labeled, the learning engines obtain new training data and predict a new candidate set, which the expert reviews again. This cycle repeats until the result is satisfactory.

The invention provides an active learning framework for entity alignment. Starting from narrowing the scope of entity alignment and considering the lack of labeled data in production environments, it trains a model by having two learning engines with different perspectives confront and reinforce each other, so as to complete the entity alignment task. The method mainly comprises: entity blocking, training set generation, an active learning process, and candidate set generation and prediction. The active learning process contains models with different emphases on entities, which can consider entity attributes and relationships respectively. For the problem that metrics on unlabeled data are difficult to measure, continued training is proposed as a complement, so that the entity alignment model can be applied to unlabeled data while maintaining its performance.

Fig. 1 shows an overall flow chart of the present invention, and the main processes are described as follows:

First, the entities are blocked according to their attributes; the blocking functions use a hash function, TF-IDF, the Levenshtein edit distance, and Canopy clustering. The hash function maps by extracting a key, for example the first n characters of a string, its n-grams, or the digits it contains, so that entities with the same key are placed in one block. TF-IDF is likewise used to extract keywords: the importance of a character or word to a document is evaluated, and once the keywords are obtained, block clustering is completed via the Canopy distance. The Levenshtein edit distance is the minimum number of edits needed to convert one string into another; after the edit distance between two strings is computed, block aggregation is completed via the Canopy distance. Canopy clustering is a fast clustering method: in a given set of objects, an object is selected at random and a canopy is created centered on it; the remaining objects in the set are then traversed, and if the distance between the current object and the center is less than T1, the object is added to that center's canopy, while if the distance is less than T2, the object is deleted from the set. Finally a set of canopies is obtained, each containing at least one object, and each object may belong to multiple canopies; blocks are obtained by specifying the thresholds T1 and T2. These blocking functions specify the data types they can handle, i.e., the data types of the entities' attributes need to be specified.
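The Canopy procedure described above can be sketched as follows (the 1-D distance function and deterministic center choice are simplifying assumptions; the text picks centers at random):

```python
def canopy(points, t1, t2, dist=lambda a, b: abs(a - b)):
    """Canopy clustering: T1 is the loose ("join this canopy") threshold
    and T2 < T1 the tight ("remove from the pool") threshold, so an object
    may end up in several canopies."""
    assert t2 < t1
    pool = list(points)
    canopies = []
    while pool:
        center = pool.pop(0)  # deterministic center choice for this sketch
        members = [center]
        survivors = []
        for p in pool:
            d = dist(center, p)
            if d < t1:
                members.append(p)      # close enough to join this canopy
            if d >= t2:
                survivors.append(p)    # far enough to stay in the pool
        pool = survivors
        canopies.append(members)
    return canopies
```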

Here, a hash function transforms an input (also called the pre-image) of arbitrary length into an output of fixed length, the hash value, through a hashing algorithm. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. The Levenshtein distance, also known as edit distance, is a similarity measure between two strings: the minimum number of editing operations required to convert one string into the other, where the permitted operations are replacing one character with another, inserting a character, and deleting a character; it was proposed by the Russian scientist Levenshtein. Canopy clustering is a mainstream clustering algorithm. An n-gram is an algorithm based on statistical language models; its basic idea is to slide a window of size n over the text byte by byte, producing a sequence of byte fragments of length n.
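For instance, the Levenshtein distance defined above can be computed with a standard two-row dynamic program (a generic textbook sketch, not code from the patent):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete from s
                            curr[j - 1] + 1,             # insert into s
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]
```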

If the learning engine is a deep learning model, an initialization vector must be generated for each entity; the initial representation of an entity can be obtained from BERT or from pre-trained word vectors. A training set is then obtained through blocking and random matching, and active learning begins.

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model; see, e.g., Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.

The learning engine may use any binary classification algorithm; here, a logistic regression classifier and RDGCN are taken as the attribute-based model and the relationship-based model, respectively. When the relationship-based model is used, the relationship information of the entities is loaded and a graph structure is constructed from it. The learning engines are initialized by copying some entities to form initialization data. Then each learning engine predicts on the training set, high-scoring positive and negative examples are selected from each engine's predictions, and their intersection is added to the labeled sample set. Next, the entity pairs on which the two engines' prediction results differ most are selected and submitted to experts for judgment; after expert labeling, they are added to the labeled sample set.

RDGCN (Relation-aware Dual-Graph Convolutional Network) is described in: Wu Y, Liu X, Feng Y, et al. Relation-Aware Entity Alignment for Heterogeneous Knowledge Graphs. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019.

The learning engines are trained on the labeled sample set and the training set is updated; the engines continue predicting on the training set, and new labeled samples are obtained through consistency judgment and manual labeling. This process repeats until the expert subjectively decides to stop training.

After labeling stops, one or more blocks are selected as the candidate set by the branch-and-bound algorithm and the labeled sample set, with maximum positive-example coverage and minimum data volume as objectives, and the learning engines predict on it. The intersection of the two prediction results is taken to ensure precision, yielding the final entity alignment result.

If the expert is unsatisfied when reviewing the entity alignment result, the process returns to the active learning stage to continue training; the learning engines load the parameters saved when the previous training stopped, and the training set is consistent with that at the stopping point. After some more data are labeled, the learning engines obtain new training data and predict a new candidate set, which the expert reviews again. This cycle repeats until the result is satisfactory.

Examples

In this embodiment, the deep learning framework TensorFlow is used on a GTX 1080 8G graphics card.

Data set: experimental evaluation was performed on the processed public data set DBP15K. The data set includes two entity sets, containing 19388 and 19572 entities respectively; 15000 entity pairs in total are aligned entities.

To demonstrate the effectiveness of the two learning engines, the Precision metric was tested on DBP15K; the results are shown in Table 1.

TABLE 1

Method                     Precision
LR                         35.02%
RDGCN                      35.51%
Method of the invention    71.86%

The above description covers only preferred embodiments of the present invention, but the protection scope of the invention is not limited thereto; any equivalent technical solution that a person skilled in the art can conceive within the technical scope and inventive concept of the present invention falls within that scope.
