Method for determining synonym and method for establishing synonym knowledge base

Document No.: 1271730 · Published: 2020-08-25

Abstract: This technology, "Method for determining synonymous name words and method for establishing a knowledge base of synonymous name words" (同义名称词的确定方法和同义名称词的知识库的建立方法), was created on 2020-04-30 by 孙清清邹泊滔吴潇丽张天翼赵云王嘉浩沈淑钱堃王爱凌. The specification provides a method for determining synonymous name words and a method for establishing a knowledge base of synonymous name words. In one embodiment, the determination method first obtains first corpus data that contains a first name word of a target object, together with second corpus data associated with the first corpus data. It then applies multiple groups of preset processing based on natural language understanding to the corpus data, according to preset processing rules, to obtain processed corpus data. From the processed corpus data it determines two kinds of parameter data along different dimensions, namely a regular expression of the text data and a relation parameter between the text data and the first name word of the target object, and uses both to mine the synonymous name words of the target object. Omissions can thereby be effectively avoided, and the synonymous name words of the target object can be mined more accurately and comprehensively.

1. A method of synonym determination, comprising:

acquiring a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data;

according to a preset processing rule, conducting multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data;

determining a regular expression of the text data and a relation parameter between the text data and a first name word of the target object according to the processed first corpus data and the processed second corpus data;

and determining the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

2. The method of claim 1, wherein obtaining the first corpus data and the second corpus data comprises:

retrieving a preset network data source, and determining webpage data containing a first name word of a target object as the first corpus data, wherein the preset network data source comprises a plurality of sub-data sources based on different languages;

and determining the webpage data associated with the first corpus data as the second corpus data according to hyperlink data carried in the webpage data of the first corpus data.
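The corpus selection described in claim 2 can be illustrated with a minimal Python sketch. It assumes the preset network data source has already been fetched into an in-memory mapping from URL to HTML; the names `LinkCollector`, `select_corpora` and the example URLs are hypothetical and do not come from the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute hyperlink targets from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def select_corpora(pages, first_name):
    """pages: dict mapping URL -> HTML (assumed already retrieved).

    Returns (first_corpus_urls, second_corpus_urls).
    """
    # First corpus: pages whose content contains the first name word.
    first = [url for url, html in pages.items() if first_name in html]
    # Second corpus: pages reached through hyperlinks of first-corpus pages.
    second = []
    for url in first:
        collector = LinkCollector(url)
        collector.feed(pages[url])
        for target in collector.links:
            if target in pages and target not in first and target not in second:
                second.append(target)
    return first, second
```

This sketch matches the first name word against raw HTML for brevity; a fuller version would match against extracted text only.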

3. The method of claim 2, after obtaining the first corpus data and the second corpus data, the method further comprising:

and performing data filtering on the first corpus data and the second corpus data to remove data of non-text data classes, so as to obtain filtered first corpus data and filtered second corpus data.

4. The method of claim 2, after obtaining the first corpus data and the second corpus data, the method further comprising:

determining a link type of hyperlink data in webpage data of the first corpus data, wherein the link type comprises at least one of the following: links among different languages, links among classes and subclasses, links among classes and explanation pages, and links among redirection pages and explanation pages;

and determining the association type between the second corpus data and the first corpus data pointed by the hyperlink data according to the link type of the hyperlink data.

5. The method according to claim 4, wherein conducting, according to the preset processing rule, multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain the processed first corpus data and the processed second corpus data comprises:

respectively performing part-of-speech recognition on the first corpus data and the second corpus data, and setting corresponding part-of-speech tags for text data in the first corpus data and the second corpus data according to part-of-speech recognition results to obtain first corpus data after first preset processing and second corpus data after first preset processing;

respectively carrying out named entity object detection on the first corpus data after the first preset processing and the second corpus data after the first preset processing, and setting a named entity object tag for text data of a named entity object according to a detection result to obtain first corpus data after the second preset processing and second corpus data after the second preset processing;

and performing syntactic dependency analysis on the first corpus data after the second preset processing and the second corpus data after the second preset processing respectively, and marking, according to the analysis results, the syntactic dependency relations between text data in the first corpus data after the second preset processing and in the second corpus data after the second preset processing, to obtain first corpus data after the third preset processing and second corpus data after the third preset processing, which serve as the processed first corpus data and the processed second corpus data.
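The three processing stages of claim 5 run in a fixed order, each adding tags on top of the previous stage's output. The sketch below shows only that staging. The container `AnnotatedText`, the function `run_preset_processing` and the three `toy_*` annotators are hypothetical stand-ins written for illustration; the specification does not name the taggers or parsers it uses, and a real system would plug trained models into the same three slots.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedText:
    """One piece of text data plus the tags added by each preset processing stage."""
    tokens: list
    pos_tags: list = field(default_factory=list)      # stage 1: one tag per token
    entities: list = field(default_factory=list)      # stage 2: (start, end, label) spans
    dependencies: list = field(default_factory=list)  # stage 3: (child, head, relation)


def run_preset_processing(tokens, pos_tagger, entity_detector, dependency_parser):
    """Apply the three stages in order; each stage sees the earlier stages' output."""
    doc = AnnotatedText(tokens=list(tokens))
    doc.pos_tags = pos_tagger(doc.tokens)
    doc.entities = entity_detector(doc.tokens, doc.pos_tags)
    doc.dependencies = dependency_parser(doc.tokens, doc.pos_tags, doc.entities)
    return doc


# Toy annotators (assumptions, not real NLP) so that the sketch runs end to end.
def toy_pos_tagger(tokens):
    return ["PROPN" if t[:1].isupper() else "OTHER" for t in tokens]


def toy_entity_detector(tokens, pos_tags):
    # Treat each maximal run of PROPN tokens as one named entity span.
    spans, start = [], None
    for i, tag in enumerate(pos_tags + ["END"]):
        if tag == "PROPN" and start is None:
            start = i
        elif tag != "PROPN" and start is not None:
            spans.append((start, i, "ENTITY"))
            start = None
    return spans


def toy_dependency_parser(tokens, pos_tags, entities):
    # Placeholder: attach every token to the first non-PROPN token (or token 0).
    head = next((i for i, tag in enumerate(pos_tags) if tag != "PROPN"), 0)
    return [(i, head, "dep") for i in range(len(tokens)) if i != head]
```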

6. The method of claim 5, the relation parameter comprising a correlation degree and/or a synonymy relation parameter.

7. The method according to claim 6, wherein determining the correlation degree between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data comprises:

and determining the correlation degree between the text data in the second corpus data and the first name word of the target object according to the association type between the second corpus data and the first corpus data.
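One way to read claims 4 and 7 together is as a lookup from link type to correlation degree. The Python sketch below uses the four link types listed in claim 4; the numeric weights in `ASSUMED_CORRELATION` are purely hypothetical, since the specification does not state how the correlation degree is computed.

```python
from enum import Enum


class LinkType(Enum):
    """The four hyperlink types listed in claim 4."""
    INTER_LANGUAGE = "links among different languages"
    CLASS_SUBCLASS = "links among classes and subclasses"
    CLASS_EXPLANATION = "links among classes and explanation pages"
    REDIRECT_EXPLANATION = "links among redirection pages and explanation pages"


# Hypothetical weights chosen for illustration only; not taken from the specification.
ASSUMED_CORRELATION = {
    LinkType.REDIRECT_EXPLANATION: 0.9,
    LinkType.INTER_LANGUAGE: 0.8,
    LinkType.CLASS_EXPLANATION: 0.5,
    LinkType.CLASS_SUBCLASS: 0.3,
}


def correlation_degree(link_type):
    """Map the association type of a linked page to a correlation degree."""
    return ASSUMED_CORRELATION.get(link_type, 0.0)


def score_candidates(candidates):
    """candidates: iterable of (text, LinkType) pairs.

    Returns a dict mapping each text to the highest correlation degree seen for it.
    """
    scores = {}
    for text, link_type in candidates:
        scores[text] = max(scores.get(text, 0.0), correlation_degree(link_type))
    return scores
```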

8. The method according to claim 6, wherein determining the synonymy relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data comprises:

splitting the processed first corpus data and the processed second corpus data into a plurality of sentence data;

predicting the sentence data by using a preset relation prediction model to obtain a relation prediction result between text data in the sentence data;

and determining a synonymy relation parameter between the text data and the first name word of the target object according to the relation prediction result.
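The flow of claim 8 (split into sentences, predict per sentence, aggregate) can be sketched as follows. The preset relation prediction model is represented by an arbitrary callable; `stub_model` is a hypothetical keyword check that stands in for a trained model, and taking the maximum score over sentences is an assumed aggregation rule, not one stated in the specification.

```python
import re


def split_sentences(text):
    """Split on Western and Chinese sentence-ending punctuation.

    Naive by design: abbreviations such as "Inc." would also trigger a split.
    """
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def synonymy_parameter(text, first_name, candidate, model):
    """Return the highest model score over sentences mentioning both names.

    model: callable(sentence, first_name, candidate) -> float in [0, 1].
    """
    best = 0.0
    for sentence in split_sentences(text):
        if first_name in sentence and candidate in sentence:
            best = max(best, model(sentence, first_name, candidate))
    return best


def stub_model(sentence, first_name, candidate):
    # Hypothetical stand-in for the preset relation prediction model.
    return 1.0 if "also known as" in sentence else 0.0
```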

9. The method of claim 8, wherein the preset relation prediction model is obtained by:

obtaining sample sentence data, and carrying out syntactic dependency analysis on the sample sentence data to obtain an analysis result;

establishing a sample syntactic dependency relationship tree for the sample sentence data according to the analysis result;

and performing model training according to the sample syntactic dependency relationship tree to obtain the preset relation prediction model.
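To make the dependency tree of claim 9 concrete, the sketch below builds a tree from dependency arcs and extracts the path that connects two tokens through their lowest common ancestor. Using this path as a training feature is a common practice in relation extraction and is an assumption here; the specification does not say which features are derived from the tree. The arc format and the function names are hypothetical.

```python
def build_tree(arcs):
    """arcs: list of (index, word, head_index, relation); head_index -1 marks the root.

    Returns (root_index, children) where children maps a head to its dependents.
    """
    children = {}
    root = None
    for index, _word, head, _relation in arcs:
        if head == -1:
            root = index
        else:
            children.setdefault(head, []).append(index)
    return root, children


def path_to_root(arcs, index):
    """Follow head pointers from a token up to the root."""
    heads = {i: h for i, _w, h, _r in arcs}
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_path(arcs, a, b):
    """Words on the tree path from token a to token b via their common ancestor."""
    up_a = path_to_root(arcs, a)
    up_b = path_to_root(arcs, b)
    common = next(i for i in up_a if i in up_b)
    left = up_a[: up_a.index(common) + 1]
    right = up_b[: up_b.index(common)]
    words = {i: w for i, w, _h, _r in arcs}
    return [words[i] for i in left + right[::-1]]
```

For a hand-annotated sentence such as "Acme is called ACME", with "called" as the root, the path between the two names passes through "called", which is the kind of signal a relation model could learn from.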

10. The method according to claim 1, wherein determining the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object comprises:

determining, according to the regular expression of the text data, text data in the first corpus data and the second corpus data that matches a regular template of the target object as first-type synonymous name words;

determining, according to the relation parameter between the text data and the first name word of the target object, text data in the first corpus data and the second corpus data that is in a synonymy relation as second-type synonymous name words;

and determining the first-type synonymous name words and the second-type synonymous name words as the synonymous name words of the target object.
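Claim 10 combines two independent signals and takes their union. A minimal Python sketch follows; the "also known as" template and the 0.5 threshold are assumptions made for illustration, as the specification does not disclose its regular templates or decision thresholds.

```python
import re


def regex_synonyms(first_name, texts):
    """First-type synonyms: text matching an assumed alias template.

    The template 'NAME (also known as X)' is a hypothetical example.
    """
    pattern = re.compile(
        re.escape(first_name)
        + r"\s*\(\s*(?:also known as|a\.k\.a\.|aka)\s+([^()]+?)\s*\)",
        re.IGNORECASE,
    )
    found = set()
    for text in texts:
        found.update(match.group(1) for match in pattern.finditer(text))
    return found


def relation_synonyms(relation_scores, threshold=0.5):
    """Second-type synonyms: candidates whose relation parameter meets the threshold."""
    return {name for name, score in relation_scores.items() if score >= threshold}


def determine_synonyms(first_name, texts, relation_scores, threshold=0.5):
    """Union of both types, so a name found by either signal is kept."""
    return regex_synonyms(first_name, texts) | relation_synonyms(relation_scores, threshold)
```

Taking the union rather than the intersection reflects the stated goal of avoiding omissions: a synonym missed by the template can still be recovered through the relation parameter, and vice versa.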

11. The method of claim 1, the target object comprising a target risk object,

correspondingly, the obtaining of the first name word of the target object includes:

searching a risk list, and determining a name word recorded in the risk list and used for indicating the target risk object as a first name word of the target object, wherein the risk list comprises a plurality of risk objects.

12. The method of claim 11, further comprising:

determining the synonymous name words of each risk object in the plurality of risk objects contained in the risk list;

and establishing, for the risk list, a knowledge base of risk object synonymous name words according to the synonymous name words of each risk object in the plurality of risk objects.

13. The method of claim 12, after establishing, for the risk list, the knowledge base of risk object synonymous name words according to the synonymous name words of each risk object in the plurality of risk objects, the method further comprising:

and determining whether a data object to be detected is a risk object according to the risk list and the knowledge base of risk object synonymous name words for the risk list.
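Claims 12 and 13 can be pictured as building an index from every known name to its risk-list entry and then looking up the name under test. The sketch below assumes the synonyms have already been mined; the case and whitespace normalization in `normalize` is an added assumption, and all function names are hypothetical.

```python
def normalize(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.casefold().split())


def build_knowledge_base(synonyms_by_object):
    """synonyms_by_object: dict mapping risk-list name -> set of mined synonyms.

    Returns an index from every normalized name to its risk-list name.
    """
    index = {}
    for canonical, synonyms in synonyms_by_object.items():
        for name in {canonical, *synonyms}:
            index[normalize(name)] = canonical
    return index


def detect_risk(name, knowledge_base):
    """Return the matching risk-list name, or None if the name is not known."""
    return knowledge_base.get(normalize(name))
```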

14. The method of claim 1, wherein the first corpus data further comprises a news story comprising a first name word of a target object; correspondingly, the second corpus data further includes a news report referencing the first corpus data, and/or a news report referenced by the first corpus data.

15. A method for establishing a knowledge base of synonymous name words, comprising:

acquiring a first name word, first corpus data and second corpus data of each data object in a plurality of data objects, wherein the first corpus data is data containing the first name word of the data object, and the second corpus data is data associated with the first corpus data;

determining a regular expression of the text data and a relation parameter between the text data and the first name word of each data object according to the processed first corpus data and the processed second corpus data;

according to the regular expression of the text data and the relation parameters between the text data and the first name words of each data object, mining the synonymous name words of each data object from the first corpus data and the second corpus data;

and establishing a knowledge base of the synonymous name words according to the synonymous name words of each data object.

16. The method of claim 15, the knowledge base of synonymous name words comprising: a synonymous name word knowledge base of transaction risk objects, a synonymous name word knowledge base of objects of public-opinion concern, and a synonymous name word knowledge base of discredited (dishonest) objects.

17. The method of claim 15, after establishing a knowledge base of synonymous names, the method further comprising:

and detecting the data object to be detected according to the knowledge base of the synonymous name words.

18. A synonym determination device comprising:

an acquisition module, configured to acquire a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data;

a preprocessing module, configured to conduct, according to a preset processing rule, multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data;

a first determining module, configured to determine a regular expression of the text data and a relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data;

and a second determining module, configured to determine the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

19. A server comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 14.

20. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 14.

Technical Field

This specification relates to the field of Internet technology, and in particular to a method for determining synonymous name words and a method for establishing a knowledge base of synonymous name words.

Background

When performing risk detection on a data object, it is often necessary to take the name currently used by the data object, search a risk list, and match that name against the names of the risk objects recorded in the list in order to determine whether the data object poses a risk. However, a data object often has or uses several different names at the same time, so matching against only the recorded name may miss it.

Therefore, a method for mining the synonymous name of the target object more accurately and comprehensively is needed.

Disclosure of Invention

The present specification provides a method for determining synonymous name words and a method for establishing a knowledge base of synonymous name words, so as to effectively avoid omissions and to mine the synonymous name words of a target object accurately and comprehensively.

The method for determining the synonymous name and the method for establishing the knowledge base of the synonymous name provided by the specification are realized as follows:

A method of synonym determination, comprising: acquiring a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data; according to a preset processing rule, conducting multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data; determining a regular expression of the text data and a relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data; and determining the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

A method for establishing a knowledge base of synonymous name words comprises the following steps: acquiring a first name word, first corpus data and second corpus data of each data object in a plurality of data objects, wherein the first corpus data is data containing the first name word of the data object, and the second corpus data is data associated with the first corpus data; according to a preset processing rule, conducting multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data; determining a regular expression of the text data and a relation parameter between the text data and the first name word of each data object according to the processed first corpus data and the processed second corpus data; according to the regular expression of the text data and the relation parameters between the text data and the first name words of each data object, mining the synonymous name words of each data object from the first corpus data and the second corpus data; and establishing a knowledge base of the synonymous name words according to the synonymous name words of each data object.

A synonym determination device, comprising: an acquisition module, configured to acquire a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data; a preprocessing module, configured to conduct, according to a preset processing rule, multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data; a first determining module, configured to determine a regular expression of the text data and a relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data; and a second determining module, configured to determine the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements: acquiring a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data; according to a preset processing rule, conducting multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data; determining a regular expression of the text data and a relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data; and determining the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

A computer readable storage medium having stored thereon computer instructions that, when executed, implement: acquiring a first name word, first corpus data and second corpus data of a target object, wherein the first corpus data is data containing the first name word of the target object, and the second corpus data is data associated with the first corpus data; according to a preset processing rule, conducting multiple groups of preset processing based on natural language understanding on the first corpus data and the second corpus data respectively to obtain processed first corpus data and processed second corpus data; determining a regular expression of the text data and a relation parameter between the text data and the first name word of the target object according to the processed first corpus data and the processed second corpus data; and determining the synonymous name word of the target object from the first corpus data and the second corpus data according to the regular expression of the text data and the relation parameter between the text data and the first name word of the target object.

In the method for determining synonymous name words and the method for establishing a knowledge base of synonymous name words provided by this specification, first corpus data containing a first name word of a target object and second corpus data associated with the first corpus data are obtained first. Multiple groups of preset processing based on natural language understanding are then performed on the corpus data according to preset processing rules to obtain processed corpus data. The processed corpus data is then used to determine two kinds of parameter data along different dimensions, namely the regular expression of the text data and the relation parameter between the text data and the first name word of the target object, and both are used together to mine the synonymous name words of the target object. Omissions can thereby be effectively avoided, and the synonymous name words of the target object can be mined more accurately and comprehensively.

Drawings

To illustrate the embodiments of this specification more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some of the embodiments of this specification; a person skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a diagram illustrating an embodiment of a system architecture to which a method for determining synonymous names provided in an embodiment of the present specification is applied;

FIG. 2 is a diagram illustrating an example scenario in which an embodiment of a method for determining a synonymous name provided by an embodiment of the present specification is applied;

FIG. 3 is a diagram illustrating an example scenario in which an embodiment of a method for determining a synonymous name provided by an embodiment of the present specification is applied;

FIG. 4 is a diagram illustrating an example scenario in which an embodiment of a method for determining a synonymous name provided by an embodiment of the present specification is applied;

FIG. 5 is a flowchart illustrating a method for determining synonymous terms according to an embodiment of the present disclosure;

FIG. 6 is a flow chart illustrating a method for building a knowledge base of synonymous names provided in an embodiment of the present description;

FIG. 7 is a schematic diagram of a server according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a synonym determination device provided in one embodiment of the present specification.

Detailed Description

To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art on the basis of these embodiments, without inventive effort, fall within the scope of protection of this specification.

An embodiment of this specification provides a method for determining synonymous name words, which can be applied to a system architecture that includes a first server and a second server, as shown in FIG. 1. The first server and the second server may be connected by wire or wirelessly.
