Method, device, storage medium and terminal for disambiguating scholar names

Document No.: 1271729. Publication date: 2020-08-25.

Reading note: This technology, 学者人名的消歧方法、装置、存储介质及终端 (Method, device, storage medium and terminal for disambiguating scholar names), was designed and created by 田欣孙虎孙沛基殷玥耿树文朱悦王茜王杨 on 2020-05-12. The invention provides a method, a device, a storage medium and a terminal for disambiguating scholar names. The method comprises: acquiring a paper data set for a person name to be disambiguated; obtaining a paper relationship feature vector and a paper semantic feature vector of the paper data set by using a word vector model; calculating a similarity matrix for each of the paper relationship feature vectors and the paper semantic feature vectors, and performing feature fusion to obtain a feature fusion matrix; and clustering based on the feature fusion matrix to obtain a clustered paper set and an outlier paper set. The invention makes full use of paper information and applies feature learning, feature fusion and cluster analysis to disambiguate scholar names in scientific and technical literature. It improves the related evaluation scores and the accuracy of author retrieval in scientific literature databases, and helps to build a literature knowledge base centered on scholar entities.

1. A method for disambiguating scholar names, comprising:

acquiring a paper data set for a person name to be disambiguated;

obtaining a paper relationship feature vector and a paper semantic feature vector of the paper data set by using a word vector model;

calculating a similarity matrix for each of the paper relationship feature vectors and the paper semantic feature vectors, and performing feature fusion to obtain a feature fusion matrix;

and clustering based on the feature fusion matrix to obtain a clustered paper set and an outlier paper set.

2. The method of claim 1, further comprising: clustering the outlier paper set, and integrating the clustering result with the clustered paper set to obtain a disambiguation result for the scholar names.

3. The method according to claim 1, wherein obtaining the paper relationship feature vector and the paper semantic feature vector of the paper data set by using the word vector model specifically comprises:

constructing a paper heterogeneous network of the paper data set to obtain paper relationship features;

preprocessing the paper text of the paper data set to obtain paper semantic features;

and training a word vector model separately on the paper relationship features and on the paper semantic features to obtain the paper relationship feature vector and the paper semantic feature vector.

4. The method of claim 3, wherein the paper heterogeneous network is configured as follows:

each paper is taken as a node;

association relationships between nodes are established through information the papers have in common; the common information comprises common authors and/or common words in the names of the institutions to which the person name to be disambiguated belongs;

an association relationship established between nodes through common authors is a first association relationship, and the association degree of the first association relationship is positively correlated with the number of common authors; an association relationship established between nodes through institutions whose names share common words is a second association relationship, and the association degree of the second association relationship is positively correlated with the number of common words in the institution names.

5. The method according to claim 4, wherein obtaining the paper relationship features comprises:

selecting a node in the paper heterogeneous network as an initial node;

walking from the initial node to a second node based on the association relationships among the nodes, to obtain a meta-path;

iterating step by step, based on the type of the meta-path, until a preset number of nodes is reached, to obtain a long path;

and repeatedly obtaining long paths until a preset number of long paths is reached, and forming a path set as the paper relationship features.

6. The method of claim 1, further comprising: when none of the words of a paper in the paper data set exists in the word vector model, saving the paper in the outlier paper set for secondary clustering.

7. The method of claim 1, wherein obtaining the paper semantic feature vector comprises: performing a weighted calculation using the inverse document frequency to obtain the paper semantic feature vector.

8. A device for disambiguating scholar names, comprising:

a paper data set acquisition module, configured to acquire a paper data set for a person name to be disambiguated;

a feature vector acquisition module, configured to obtain a paper relationship feature vector and a paper semantic feature vector of the paper data set by using a word vector model;

a feature fusion module, configured to calculate a similarity matrix for each of the paper relationship feature vectors and the paper semantic feature vectors, and to perform feature fusion to obtain a feature fusion matrix;

and a clustering module, configured to cluster based on the feature fusion matrix to obtain a clustered paper set and an outlier paper set.

9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for disambiguating scholar names as claimed in any one of claims 1 to 7.

10. An electronic terminal, comprising: a processor and a memory;

the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory to cause the terminal to execute the method for disambiguating scholar names as claimed in any one of claims 1 to 7.

Technical Field

The invention relates to the field of entity disambiguation, and in particular to a method, a device, a storage medium and a terminal for disambiguating scholar names.

Background

In recent years, with the development of the internet, it has become increasingly convenient to acquire all kinds of information. However, given the huge amount of information available, effectively screening for useful content has become a major problem. A large proportion of search results that fall short of expectations are caused by ambiguous person names. Therefore, distinguishing person entities quickly and accurately is of great significance in fields such as information retrieval, information extraction and semantic knowledge base construction.

For researchers, the massive online knowledge bases of scientific and technical literature provide convenient services for literature retrieval and study. However, the large number of scholars who share the same name reduces retrieval accuracy, so the disambiguation of scholar names has become a problem that urgently needs to be solved in this field. Name disambiguation for scholars in scientific literature has long been regarded as a challenging problem that affects scientific literature management, people search, social network analysis and other areas. As the volume of scientific literature grows, the problem becomes increasingly difficult and urgent. Exploring better solutions to the entity disambiguation problem in knowledge bases therefore has important application value in scientific research, especially in information retrieval, machine reading, knowledge question answering, knowledge graphs and similar fields. Disambiguating scholar names to address the duplicate-name problem in existing academic literature knowledge bases is an important step in building a literature knowledge base centered on scholar entities.

Disclosure of Invention

In view of the above disadvantages of the prior art, an object of the present invention is to provide a method, a device, a storage medium and a terminal for disambiguating scholar names, in order to solve the problems that prior-art scholar name disambiguation methods achieve low evaluation scores, are complex to implement, run inefficiently, and cannot run efficiently on big data.

To achieve the above and other related objects, a first aspect of the present invention provides a method for disambiguating scholar names, comprising: acquiring a paper data set for a person name to be disambiguated; obtaining a paper relationship feature vector and a paper semantic feature vector of the paper data set by using a word vector model; calculating a similarity matrix for each of the paper relationship feature vectors and the paper semantic feature vectors, and performing feature fusion to obtain a feature fusion matrix; and clustering based on the feature fusion matrix to obtain a clustered paper set and an outlier paper set.
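The similarity, fusion and clustering steps of the first aspect can be illustrated with a minimal sketch. The text does not specify the similarity measure, the fusion rule or the clustering algorithm, so the choices here are assumptions made only for illustration: cosine similarity, an equal-weight average of the two matrices, and a connected-components grouping over a similarity threshold in which singleton papers become outliers. All function names, the threshold and the toy vectors are hypothetical.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the row vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def fuse(sim_relation, sim_semantic, weight=0.5):
    """Fuse two similarity matrices by a weighted average (assumed rule)."""
    n = len(sim_relation)
    return [
        [weight * sim_relation[i][j] + (1 - weight) * sim_semantic[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def cluster(fused, threshold=0.8):
    """Link papers whose fused similarity is >= threshold.

    Connected groups of two or more papers become clusters;
    papers linked to nothing are returned as outliers.
    """
    n = len(fused)
    seen = set()
    clusters, outliers = [], []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and fused[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        if len(component) > 1:
            clusters.append(sorted(component))
        else:
            outliers.append(component[0])
    return clusters, outliers


# Toy feature vectors for five papers (made-up values, not from the patent).
relation_vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1]]
semantic_vectors = [[1, 0, 0], [1, 0.2, 0], [0, 1, 0], [0.2, 1, 0], [0, 0, 1]]

fused = fuse(cosine_similarity_matrix(relation_vectors),
             cosine_similarity_matrix(semantic_vectors))
clusters, outliers = cluster(fused)
print(clusters, outliers)
```

With these toy vectors, papers 0 and 1 form one cluster, papers 2 and 3 form another, and paper 4 is left in the outlier set. A density-based algorithm could be substituted for the threshold grouping without changing the overall flow.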

In some embodiments of the first aspect of the present invention, the method further comprises: clustering the outlier paper set, and integrating the clustering result with the clustered paper set to obtain a disambiguation result for the scholar names.
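A minimal sketch of this secondary step follows. How the outlier papers are compared is not stated in the text, so this example assumes a Jaccard overlap of keyword sets and a greedy single-pass grouping; the names `jaccard`, `cluster_outliers`, `integrate`, the threshold and the sample data are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two keyword sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_outliers(outliers, keywords, threshold=0.5):
    """Greedily place each outlier in the first group it resembles.

    A paper joins a group if it is similar enough to any member;
    otherwise it starts a new group of its own.
    """
    groups = []
    for paper in outliers:
        for group in groups:
            if any(jaccard(keywords[paper], keywords[other]) >= threshold
                   for other in group):
                group.append(paper)
                break
        else:
            groups.append([paper])
    return groups


def integrate(clustered, outlier_groups):
    """Combine first-pass clusters and outlier groups into one result."""
    return [list(c) for c in clustered] + [list(g) for g in outlier_groups]


# Hypothetical first-pass output and outlier keywords.
clustered = [["p1", "p2"], ["p3", "p4"]]
outlier_papers = ["p5", "p6", "p7"]
keywords = {
    "p5": {"graph", "embedding"},
    "p6": {"graph", "embedding", "network"},
    "p7": {"protein"},
}

result = integrate(clustered, cluster_outliers(outlier_papers, keywords))
print(result)
```

In this example p5 and p6 are grouped together and p7 stays on its own, so the final result contains four groups, each standing for one candidate scholar.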

In some embodiments of the first aspect of the present invention, obtaining the paper relationship feature vector and the paper semantic feature vector of the paper data set by using the word vector model specifically includes: constructing a paper heterogeneous network of the paper data set to obtain paper relationship features; preprocessing the paper text of the paper data set to obtain paper semantic features; and training a word vector model separately on the paper relationship features and on the paper semantic features to obtain the paper relationship feature vector and the paper semantic feature vector.

In some embodiments of the first aspect of the present invention, the paper heterogeneous network is configured as follows: each paper is taken as a node; association relationships between nodes are established through information the papers have in common; the common information comprises common authors and/or common words in the names of the institutions to which the person name to be disambiguated belongs; an association relationship established between nodes through common authors is a first association relationship, and the association degree of the first association relationship is positively correlated with the number of common authors; an association relationship established between nodes through institutions whose names share common words is a second association relationship, and the association degree of the second association relationship is positively correlated with the number of common words in the institution names.
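The network construction can be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: the input layout (`authors`, `org`), the function name `build_paper_network`, the whitespace tokenization of institution names, and the decision to exclude the ambiguous name itself from the common-author count are all hypothetical. The edge weights are simply the counts, which satisfies the positive correlation described above.

```python
from itertools import combinations


def build_paper_network(papers, target_name):
    """Build typed, weighted edges between papers.

    papers maps a paper id to {"authors": [...], "org": "..."}, where
    "org" is the institution listed for the ambiguous author.
    Returns {(paper_a, paper_b): {"coauthor": n, "org": m}} with only
    the relation types that actually link the two papers.
    """
    edges = {}
    for a, b in combinations(sorted(papers), 2):
        # Every paper contains the ambiguous name, so it is excluded
        # from the common-author count (an assumption of this sketch).
        coauthors = (set(papers[a]["authors"]) & set(papers[b]["authors"])) - {target_name}
        org_words = (set(papers[a]["org"].lower().split())
                     & set(papers[b]["org"].lower().split()))
        weights = {}
        if coauthors:
            weights["coauthor"] = len(coauthors)  # first association relationship
        if org_words:
            weights["org"] = len(org_words)       # second association relationship
        if weights:
            edges[(a, b)] = weights
    return edges


# Hypothetical papers that all list the ambiguous name "Li Wei".
papers = {
    "p1": {"authors": ["Li Wei", "Zhang San", "Wang Wu"], "org": "Tsinghua University"},
    "p2": {"authors": ["Li Wei", "Zhang San"], "org": "Tsinghua University Beijing"},
    "p3": {"authors": ["Li Wei", "Zhao Liu"], "org": "Peking University"},
}

edges = build_paper_network(papers, "Li Wei")
print(edges)
```

Here p1 and p2 are linked by one common co-author and two common institution words, while p3 is linked to the others only through the generic word "university". In practice such generic words would probably be filtered out during preprocessing.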

In some embodiments of the first aspect of the present invention, obtaining the paper relationship features includes: selecting a node in the paper heterogeneous network as an initial node; walking from the initial node to a second node based on the association relationships among the nodes, to obtain a meta-path; iterating step by step, based on the type of the meta-path, until a preset number of nodes is reached, to obtain a long path; and repeatedly obtaining long paths until a preset number of long paths is reached, and forming a path set as the paper relationship features.
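One possible reading of this walk procedure is sketched below. The text leaves the selection rules open, so this example assumes uniform random choices, assumes that the relation type of the first hop fixes the meta-path type for the rest of the walk, and stops early when no neighbour of that type exists. The adjacency layout and the name `random_walks` are hypothetical.

```python
import random


def random_walks(graph, num_walks, walk_length, seed=0):
    """Generate meta-path-guided walks over a typed paper graph.

    graph maps node -> {relation_type: [neighbour, ...]}.
    Each walk picks a start node, picks a relation type for the first
    hop, and keeps following that type until walk_length nodes are
    collected or no neighbour of that type is available.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(num_walks):
        current = rng.choice(nodes)
        path = [current]
        relations = [r for r, nbrs in graph[current].items() if nbrs]
        if relations:
            relation = rng.choice(relations)  # fixes the meta-path type
            while len(path) < walk_length:
                neighbours = graph[current].get(relation, [])
                if not neighbours:
                    break
                current = rng.choice(neighbours)
                path.append(current)
        walks.append(path)
    return walks


# Hypothetical typed adjacency for three papers.
graph = {
    "p1": {"coauthor": ["p2"], "org": ["p2", "p3"]},
    "p2": {"coauthor": ["p1"], "org": ["p1", "p3"]},
    "p3": {"coauthor": [], "org": ["p1", "p2"]},
}

walks = random_walks(graph, num_walks=4, walk_length=5)
for walk in walks:
    print(" ".join(walk))
```

Each walk is a sequence of paper ids that can be treated like a sentence, so the resulting path set can be fed to a word vector model in the same way as tokenized text.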

In some embodiments of the first aspect of the present invention, the method further comprises: when none of the words of a paper in the paper data set exists in the word vector model, saving the paper in the outlier paper set for secondary clustering.
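This out-of-vocabulary check can be sketched in a few lines. The function name `split_by_vocabulary` and the data layout are assumptions for illustration; the vocabulary stands in for the set of words known to a trained word vector model.

```python
def split_by_vocabulary(paper_tokens, vocabulary):
    """Separate papers that have at least one known word from those with none.

    paper_tokens maps a paper id to its list of preprocessed words.
    Papers without any word in the vocabulary (including papers with
    no words at all) are returned as outliers for secondary clustering.
    """
    in_vocabulary, outliers = {}, []
    for paper_id, tokens in paper_tokens.items():
        known = [t for t in tokens if t in vocabulary]
        if known:
            in_vocabulary[paper_id] = known
        else:
            outliers.append(paper_id)
    return in_vocabulary, outliers


# Hypothetical vocabulary and tokenized papers.
vocabulary = {"graph", "embedding", "cluster"}
paper_tokens = {
    "p1": ["graph", "embedding", "foo"],
    "p2": ["zzz", "qqq"],
    "p3": [],
}

in_vocabulary, outlier_ids = split_by_vocabulary(paper_tokens, vocabulary)
print(in_vocabulary, outlier_ids)
```

Papers p2 and p3 cannot be represented by the model, so they are deferred to the outlier set instead of receiving an arbitrary vector.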

In some embodiments of the first aspect of the present invention, obtaining the paper semantic feature vector includes: performing a weighted calculation using the inverse document frequency to obtain the paper semantic feature vector.
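A minimal sketch of inverse-document-frequency weighting follows. The exact formula is not given in the text; this example assumes a smoothed form, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and takes the paper vector to be the IDF-weighted average of its word vectors. The function names and the two-dimensional toy word vectors are hypothetical.

```python
import math


def inverse_document_frequency(docs):
    """Smoothed IDF per word: ln((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n = len(docs)
    df = {}
    for tokens in docs:
        for token in set(tokens):
            df[token] = df.get(token, 0) + 1
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def paper_vector(tokens, word_vectors, idf):
    """IDF-weighted average of the word vectors of one paper.

    Words missing from word_vectors or idf are skipped; returns None
    if no word of the paper can be represented.
    """
    total_weight = 0.0
    vector = None
    for token in tokens:
        if token not in word_vectors or token not in idf:
            continue
        weight = idf[token]
        if vector is None:
            vector = [0.0] * len(word_vectors[token])
        for k, value in enumerate(word_vectors[token]):
            vector[k] += weight * value
        total_weight += weight
    if vector is None:
        return None
    return [v / total_weight for v in vector]


# Hypothetical tokenized papers and toy word vectors.
docs = [["graph", "embedding"], ["graph", "clustering"]]
word_vectors = {"graph": [1.0, 0.0], "embedding": [0.0, 1.0], "clustering": [1.0, 1.0]}

idf = inverse_document_frequency(docs)
vec = paper_vector(docs[0], word_vectors, idf)
print(vec)
```

Because "graph" appears in both papers, it receives a lower weight than "embedding", so the vector of the first paper leans toward the rarer and more distinctive word.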

To achieve the above and other related objects, a second aspect of the present invention provides a device for disambiguating scholar names, comprising: a paper data set acquisition module, configured to acquire a paper data set for a person name to be disambiguated; a feature vector acquisition module, configured to obtain a paper relationship feature vector and a paper semantic feature vector of the paper data set by using a word vector model; a feature fusion module, configured to calculate a similarity matrix for each of the paper relationship feature vectors and the paper semantic feature vectors, and to perform feature fusion to obtain a feature fusion matrix; and a clustering module, configured to cluster based on the feature fusion matrix to obtain a clustered paper set and an outlier paper set.

To achieve the above and other related objects, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for disambiguating scholar names.

To achieve the above and other related objects, a fourth aspect of the present invention provides an electronic terminal comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory so as to cause the terminal to execute the method for disambiguating scholar names.

As described above, the method, the device, the storage medium and the terminal for disambiguating scholar names according to the present invention have the following advantageous effects: the invention makes full use of paper information and applies feature learning, feature fusion and cluster analysis to solve the problems that prior-art scholar name disambiguation methods achieve low evaluation scores, are complex to implement, run inefficiently, and cannot run efficiently on big data.

Drawings

Fig. 1 is a schematic flow chart of a method for disambiguating scholar names according to an embodiment of the present invention.

Fig. 2 is a schematic flow chart of a method for disambiguating scholar names using multiple clustering passes according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a disambiguation apparatus for names of scholars according to an embodiment of the present invention.

Fig. 4 is a schematic structural diagram of an electronic terminal according to an embodiment of the invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It is noted that in the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "above," and the like, may be used herein to facilitate describing one element or feature's relationship to another element or feature as illustrated in the figures.

In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," "retained," and the like are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions or operations is inherently mutually exclusive in some way.

The invention provides a method, a device, a storage medium and a terminal for disambiguating scholar names, and solves the problems that prior-art scholar name disambiguation methods achieve low evaluation scores, are complex to implement, run inefficiently, and cannot run efficiently on big data.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are further described in detail by the following embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
