Idiom synonym list generation method and device

Document No.: 1628458  Publication date: 2020-01-14

Reading note: this technology, Idiom synonym list generation method and device (一种成语同义词列表的生成方法及装置), was designed by 刘晓楠, 李长亮, 汪美玲 and 郭昱 on 2019-10-08. Its main content is as follows. The application provides a method and a device for generating an idiom synonym list, wherein the method comprises the following steps: acquiring a question sentence input by a user, and identifying a target idiom from the question sentence; acquiring, from a preset idiom knowledge graph, at least one candidate idiom with the same feature label as the target idiom, and generating an idiom recommendation list corresponding to the at least one candidate idiom; performing similarity calculation between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list, to obtain a similarity value between each candidate idiom and the target idiom; and screening the candidate idioms in the idiom recommendation list according to these similarity values, to obtain an idiom recommendation list containing only candidate idioms that are synonyms of the target idiom.

1. A method for generating an idiom synonym list, characterized by comprising the following steps:

acquiring question sentences input by a user, and identifying target idioms from the question sentences input by the user;

acquiring at least one candidate idiom with the same feature label as the target idiom from a preset idiom knowledge graph, and generating an idiom recommendation list corresponding to the at least one candidate idiom;

performing similarity calculation on the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list respectively to obtain a similarity value corresponding to each candidate idiom and the target idiom;

and screening the candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom, to obtain an idiom recommendation list containing only the candidate idioms that are synonyms of the target idiom.

2. The method of claim 1, wherein after obtaining the idiom recommendation list containing only candidate idioms that are synonyms of the target idiom, further comprising:

and returning the idiom recommendation list containing the candidate idioms which are synonyms with the target idiom to the user.

3. The method of claim 1, wherein before acquiring the question sentence input by the user, the method further comprises:

acquiring structured data from a preset corpus database, wherein the structured data comprises a plurality of idiom entities, a plurality of feature tags, idiom attribute information, semantic relation information among the idiom entities and tag relation information between the idiom entities and the feature tags;

and constructing an idiom knowledge graph according to the structured data so that the idiom knowledge graph comprises idiom entities with semantic relations, attributes corresponding to each idiom entity and at least one feature tag.

4. The method of claim 3, wherein after constructing the idiom knowledge graph from the structured data, the method further comprises:

and acquiring word embedding vectors corresponding to each idiom entity in the idiom knowledge graph from a preset Chinese character word and sentence embedding corpus.

5. The method of claim 3, wherein the obtaining of the user-entered question sentence and the identifying of the target idiom from the user-entered question sentence comprises:

acquiring a question sentence input by a user, performing Chinese word segmentation on the question sentence, and acquiring text data corresponding to a target idiom in the question sentence;

and acquiring idiom entities matched with the text data corresponding to the target idiom from the corpus database based on the text data corresponding to the target idiom and a pattern matching algorithm so as to identify the target idiom.

6. The method of claim 5, wherein said obtaining at least one candidate idiom having the same feature label as the target idiom in a predetermined idiom knowledge graph comprises:

determining at least one feature tag corresponding to the target idiom in the idiom knowledge graph;

and acquiring, as a candidate idiom, at least one idiom entity in the idiom knowledge graph whose feature tags are completely identical to those of the target idiom, based on the at least one feature tag corresponding to the target idiom.

7. The method of claim 4, wherein the calculating the similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list to obtain the similarity value between each candidate idiom and the target idiom comprises:

determining word embedding vectors corresponding to the target idioms and word embedding vectors corresponding to each candidate idiom in the idiom recommendation list based on the Chinese character word and sentence embedding corpus;

and respectively calculating the cosine similarity of the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom based on a similarity algorithm.

8. The method of claim 7, wherein the screening of candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom comprises:

comparing the cosine similarity of the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom with a similarity threshold value, and judging whether the cosine similarity of the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to the candidate idiom is larger than or equal to the similarity threshold value or not;

under the condition that the cosine similarity of the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to the candidate idiom is larger than or equal to the similarity threshold value, reserving the candidate idiom in the idiom recommendation list;

and removing the candidate idioms from the idiom recommendation list under the condition that the cosine similarity of the word embedding vector corresponding to the target idioms and the word embedding vector corresponding to the candidate idioms is smaller than the similarity threshold value.

9. The method of claim 8, wherein the similarity threshold is 0.9.

10. An apparatus for generating a list of idiom synonyms, comprising:

the idiom recognition module is configured to acquire question sentences input by a user and recognize target idioms from the question sentences input by the user;

the list generation module is configured to acquire at least one candidate idiom with the same feature label as the target idiom in a preset idiom knowledge graph and generate an idiom recommendation list corresponding to the at least one candidate idiom;

the similarity calculation module is configured to perform similarity calculation on the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list respectively to obtain a similarity value corresponding to each candidate idiom and the target idiom;

and the list screening module is configured to screen the candidate idioms in the idiom recommendation list according to the similarity degree value corresponding to each candidate idiom and the target idiom to obtain an idiom recommendation list only containing the candidate idioms which are synonyms with the target idiom.

11. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-9 when executing the instructions.

12. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 9.

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a synonym list, a computing device, and a computer-readable storage medium.

Background

Existing online idiom dictionaries mainly provide information such as the pronunciation, definition, origin, near-synonyms and antonyms of idioms, and are usually organized and stored in relational databases. The way they let a user find related synonyms is as follows: the user searches for a specific idiom and views its related information, then opens the near-synonym links provided in the returned information and compares the definition of each related idiom with that of the specific idiom to judge whether the two are synonyms. Meanwhile, traditional Chinese synonym technology is mainly applied in fields such as information retrieval, teaching Chinese as a foreign language, and professional vocabulary, and most of the idiom relations involved are manually labeled near-synonym relations that contain only some synonym relations.

Generally, when a user needs to look up synonyms of a specific idiom while writing, the user has to switch to a third-party tool such as a search engine or dictionary. At present, however, such tools mainly return related idiom information for the input idiom and provide only links to near-synonyms with similar meanings; they do not provide synonyms with the same semantics. The user has to open the near-synonym links in the idiom information, compare the definitions of the original idiom and each near-synonym, and judge whether the two are synonyms. The user therefore has to do considerable discrimination and screening on the idioms returned by the tool, which greatly disrupts the user's train of thought while writing, makes the required information harder to obtain, and reduces the accuracy of the information obtained.

Disclosure of Invention

In view of this, embodiments of the present specification provide a method, an apparatus, a computing device, and a computer-readable storage medium for generating a synonym list, so as to solve technical defects in the prior art.

According to a first aspect of embodiments of the present specification, there is provided a method for generating an idiom synonym list, including:

acquiring question sentences input by a user, and identifying target idioms from the question sentences input by the user;

acquiring at least one candidate idiom with the same feature label as the target idiom from a preset idiom knowledge graph, and generating an idiom recommendation list corresponding to the at least one candidate idiom;

performing similarity calculation on the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list respectively to obtain a similarity value corresponding to each candidate idiom and the target idiom;

screening the candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom, to obtain an idiom recommendation list containing only the candidate idioms that are synonyms of the target idiom.

According to a second aspect of embodiments of the present specification, there is provided an apparatus for generating a list of idiom synonyms, including:

the idiom recognition module is configured to acquire question sentences input by a user and recognize target idioms from the question sentences input by the user;

the list generation module is configured to acquire at least one candidate idiom with the same feature label as the target idiom in a preset idiom knowledge graph and generate an idiom recommendation list corresponding to the at least one candidate idiom;

the similarity calculation module is configured to perform similarity calculation on the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list respectively to obtain a similarity value corresponding to each candidate idiom and the target idiom;

the list screening module is configured to screen the candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom, to obtain an idiom recommendation list containing only the candidate idioms that are synonyms of the target idiom.

According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the method for generating the idiom synonym list when executing the instructions.

According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the idiom synonym list generating method.

To address the pain point that, during writing, a user finds it hard to tell synonyms from near-synonyms because the differences between them are slight, the present application uses the feature labels in the idiom knowledge graph to ensure generalization over idiom morphemes, and calculates the similarity between idiom word embedding vectors, so as to provide the user with accurate synonyms. It can return an idiom recommendation list made up of idioms that are interchangeable with the target idiom in any context, so that the user can ask a question directly in the writing tool without switching to a third-party tool. The user does not need to distinguish synonyms from near-synonyms in the generated idiom recommendation list or judge whether the idioms can replace one another, which shortens the path to selecting an idiom and ensures the accuracy of idiom selection.

Drawings

FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;

FIG. 2 is a flowchart of a method for generating a synonym list according to an embodiment of the present disclosure;

FIG. 3 is another flowchart of a method for generating a synonym list according to an embodiment of the present disclosure;

FIG. 4 is a diagram illustrating a method for generating a synonym list according to an embodiment of the present disclosure;

FIG. 5 is another flowchart of a method for generating a synonym list according to an embodiment of the present disclosure;

FIG. 6 is another flowchart of a method for generating a synonym list according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a device for generating an idiom synonym list according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; the application is therefore not limited to the specific implementations disclosed below.

The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

First, the terms involved in one or more embodiments of the present invention are explained.

Knowledge graph: the knowledge graph aims to describe various entities or concepts existing in the real world and relations thereof, and forms a huge semantic network graph, wherein nodes represent the entities or concepts, and edges are formed by attributes or relations.

Entity: an entity refers to something that is distinguishable and exists independently, such as a person's name, a city name, a plant name, a commodity name, and the like, and is the most basic element in a knowledge graph, and different relationships exist among different entities.

Attribute: an attribute is an edge pointing from an entity to its attribute value, and different attribute types correspond to different types of edges. An attribute mainly refers to characteristic information of an object; for example, "area", "population" and "capital" are several different attributes. An attribute value is the value of an attribute, such as 9.6 million square kilometers.

Relation: on a knowledge graph, a relation is a function that maps several graph nodes (entities, semantic classes, attribute values) to Boolean values.

Triple: triples are a general representation of a knowledge graph. The basic forms of a triple are (head entity, relation, tail entity) and (entity, attribute, attribute value).
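The two triple forms can be illustrated with a minimal Python sketch. The class name `Triple`, the helper `facts_about`, the entity names and the attribute shown here are hypothetical placeholders chosen for illustration; they are not taken from the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge-graph fact written as (head, predicate, tail)."""
    head: str       # the entity the fact is about
    predicate: str  # a relation name or an attribute name
    tail: str       # a tail entity or an attribute value


# (head entity, relation, tail entity), with hypothetical idiom entities
relation_fact = Triple("idiom_A", "synonym_of", "idiom_B")
# (entity, attribute, attribute value), with a hypothetical attribute
attribute_fact = Triple("idiom_A", "origin", "placeholder source text")


def facts_about(triples, head):
    """Return every triple whose head is the given entity."""
    return [t for t in triples if t.head == head]
```

Storing both forms in one shape keeps lookups uniform: the same query returns an entity's relations and its attributes.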

Pattern matching algorithm: pattern matching is a basic string operation in data structures: given a substring, find all substrings identical to it in a certain string. If P is the given substring and T is the string to be searched, the pattern matching problem is to find all substrings of T that are identical to P; P is called the pattern and T is called the target. If T contains one or more substrings equal to the pattern P, their positions in T are reported and the match succeeds; otherwise the match fails. There are many pattern matching algorithms, among which the better-known ones are the KMP algorithm, the BM algorithm, the Sunday algorithm and the Horspool algorithm.
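As an illustration of one of the algorithms named above, the following is a minimal sketch of KMP matching in Python. The function names `build_failure` and `kmp_find_all` are chosen here for illustration; the application does not prescribe an implementation or say which of the listed algorithms is used.

```python
def build_failure(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find_all(target: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern P in target T.
    An empty list means the match failed."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    positions = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(target):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return positions
```

For example, `kmp_find_all("abababa", "aba")` returns `[0, 2, 4]`, including the overlapping occurrences.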

Morpheme: a morpheme is the smallest combination of sound and meaning and the smallest meaningful unit of language. A morpheme is not a language unit that is used independently; its primary function is to serve as the material from which words are built. It is described as a combination of sound and meaning, that is, a meaningful language unit, in order to distinguish it from a syllable: some syllables have sound but no meaning and cannot be regarded as morphemes, for example "fog" and "wonders". It is described as the smallest meaningful language unit, rather than an independently used language unit, in order to distinguish it from a word.

In the present application, a method, an apparatus, a computing device, and a computer-readable storage medium for generating a list of idiom synonyms are provided, which are described in detail in the following embodiments one by one.

FIG. 1 shows a block diagram of a computing device 100, according to an embodiment of the present description. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.

Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-described components of computing device 100 and other components not shown in FIG. 1 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.

Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.

The processor 120 may perform the steps of the method shown in FIG. 2. FIG. 2 is a schematic flowchart of a method for generating an idiom synonym list according to an embodiment of the present application, including steps 202 to 208.

Step 202: acquire a question sentence input by a user, and identify a target idiom from the question sentence input by the user.

In one or more embodiments of the present application, when a user entering text on a terminal device needs to look up a synonym of a specific target idiom, the user can ask the system directly in the writing tool. The system acquires the question sentence input by the user and identifies from it the target idiom whose synonyms the user wants. For example, when the user needs to find a synonymous idiom to replace the target idiom "darkness display bin", the user can input the question sentence "synonym of the darkness display bin"; the system acquires this question sentence and identifies the target idiom "darkness display bin" from it.

Step 204: acquire at least one candidate idiom with the same feature label as the target idiom from a preset idiom knowledge graph, and generate an idiom recommendation list corresponding to the at least one candidate idiom.

In one or more embodiments of the present application, the system constructs the idiom knowledge graph using a construction method based on feature labels. After obtaining the target idiom the user asked about, the system uses the feature labels already assigned to the target idiom in the idiom knowledge graph to match at least one candidate idiom whose feature labels are identical to those of the target idiom, and generates an idiom recommendation list corresponding to the at least one candidate idiom. This ensures that the target idiom and the candidate idioms share the same main morphemes, and the links between feature labels distinguish synonyms from near-synonyms. For example, the target idiom "darkness display bin" in the question sentence input by the user has synonyms such as "light darkness" or "dark autumn wave", near-synonyms such as "steal day change" or "graft wood", and the antonyms "open-reading" and "fire-holding". Only synonyms have completely identical feature labels; near-synonyms share one or more feature labels, but their feature labels are not completely identical.
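The matching rule described in step 204 can be sketched as a set comparison. This is a minimal sketch that assumes the feature labels of each idiom are available as a Python set; the function name, the entity names and the label names are hypothetical and used only for illustration.

```python
def candidates_with_identical_tags(tags_by_idiom, target):
    """Return idioms whose feature-label set equals that of the target idiom."""
    target_tags = tags_by_idiom[target]
    return [
        idiom
        for idiom, tags in tags_by_idiom.items()
        if idiom != target and tags == target_tags
    ]


# Hypothetical labels: B is a synonym of A (identical label sets),
# C is a near-synonym (shares only one label), D is an antonym.
tags_by_idiom = {
    "idiom_A": {"concealment", "strategy"},
    "idiom_B": {"concealment", "strategy"},
    "idiom_C": {"concealment", "substitution"},
    "idiom_D": {"openness"},
}

recommendation_list = candidates_with_identical_tags(tags_by_idiom, "idiom_A")
```

Because the comparison is on whole sets, an idiom that shares only some labels with the target, such as `idiom_C`, is left out of the recommendation list.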

Step 206: perform similarity calculation between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list, to obtain a similarity value between each candidate idiom and the target idiom.

In one or more embodiments of the present application, the system performs similarity calculation on the target idiom and each candidate idiom in the idiom recommendation list by using the word embedding vector, so as to measure the similarity between each candidate idiom and the target idiom.
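Claim 7 names cosine similarity as the measure. The following minimal sketch computes it in plain Python; the function names and the handling of zero vectors are assumptions made for illustration, not details from the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def score_candidates(target_vector, vectors_by_candidate):
    """Map each candidate idiom to its similarity with the target idiom."""
    return {
        candidate: cosine_similarity(target_vector, vector)
        for candidate, vector in vectors_by_candidate.items()
    }
```

Cosine similarity depends only on the direction of the vectors, so two embeddings that differ only in length score close to 1.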

Step 208: screen the candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom, to obtain an idiom recommendation list containing only the candidate idioms that are synonyms of the target idiom.

In one or more embodiments of the present application, the system filters candidate idioms in the idiom recommendation list according to the degree of similarity between each candidate idiom and the target idiom, and removes suspected synonyms with the degree of similarity not meeting the requirement from the idiom recommendation list, thereby obtaining an idiom recommendation list only containing candidate idioms that are synonymous with the target idiom.
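The screening rule of claims 8 and 9 (keep a candidate when its cosine similarity is greater than or equal to the threshold of 0.9, otherwise remove it) can be sketched as follows. The function name and the sample scores are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.9  # threshold value stated in claim 9


def screen_candidates(similarity_by_candidate, threshold=SIMILARITY_THRESHOLD):
    """Keep candidates whose similarity is >= threshold; drop the rest."""
    return [
        candidate
        for candidate, similarity in similarity_by_candidate.items()
        if similarity >= threshold
    ]


# Hypothetical similarity values for three candidate idioms
similarity_by_candidate = {"idiom_B": 0.95, "idiom_C": 0.90, "idiom_E": 0.72}
final_list = screen_candidates(similarity_by_candidate)
```

A candidate that scores exactly 0.9 is kept, because the comparison in claim 8 is "larger than or equal to".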

In the above embodiment, after obtaining the idiom recommendation list including only candidate idioms that are synonyms of the target idiom, the method further includes:

and returning the idiom recommendation list containing the candidate idioms which are synonyms with the target idiom to the user.

In one or more embodiments of the present application, after generating a idiom recommendation list only including candidate idioms that are synonymous with the target idiom, the system returns the idiom recommendation list to the user, so that the user can obtain candidate idiom information that is synonymous with the target idiom.

To address the pain point that, during writing, a user finds it hard to tell synonyms from near-synonyms because the differences between them are slight, the present application uses the feature labels in the idiom knowledge graph to ensure generalization over idiom morphemes, and calculates the similarity between idiom word embedding vectors, so as to provide the user with accurate synonyms. It can return an idiom recommendation list made up of idioms that are interchangeable with the target idiom in any context, so that the user can ask a question directly in the writing tool without switching to a third-party tool. The user does not need to distinguish synonyms from near-synonyms in the generated idiom recommendation list or judge whether the idioms can replace one another, which shortens the path to selecting an idiom and ensures the accuracy of idiom selection.

In the above embodiment, as shown in fig. 3, before acquiring the question statement input by the user, the method further includes steps 302 to 306:

step 302: structured data are obtained from a preset corpus database, and the structured data comprise a plurality of idiomatic entities, a plurality of feature tags, idiomatic attribute information, semantic relation information among the idiomatic entities and tag relation information among the idiomatic entities and the feature tags.

In one or more embodiments of the present application, the system may obtain structured data from an existing corpus database, such as a web encyclopedia, a web dictionary, or a specialized database, where the structured data includes a plurality of idiomatic entities, a plurality of feature tags, idiomatic attribute information, semantic relationship information between the plurality of idiomatic entities, and tag relationship information between the idiomatic entities and the feature tags, where the semantic relationship information includes synonym relationships, near-synonym relationships, and anti-synonym relationships, among others.

Step 304: and constructing a idiom knowledge graph according to the structured data so that the idiom knowledge graph comprises idiom entities with semantic relations, attributes corresponding to each idiom entity and at least one feature tag.

In one or more embodiments of the present application, as shown in FIG. 4, the constructed idiom knowledge graph contains synonym relations, near-synonym relations and antonym relations. Assume that idiom entity A, idiom entity B, idiom entity C and idiom entity D are idiom entities in the idiom knowledge graph, that idiom entity A and idiom entity B are synonyms, that idiom entity A and idiom entity C are near-synonyms, and that idiom entity A and idiom entity D are antonyms. Then idiom entity A and idiom entity B should have completely identical feature labels, as with "darkness display bin" and "plain darkness". Because near-synonyms are similar in meaning but often differ in what they describe, idiom entity A and idiom entity C have at least one identical feature label, as with "darkness display bin" and "flower-graft".
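The structure described for entities A to D can be sketched as a small in-memory graph. This is a sketch under stated assumptions: the class `IdiomKnowledgeGraph`, the function `build_graph`, the layout of `structured_data` and the tag names `t1` to `t4` are all hypothetical, since the application does not specify a storage format.

```python
from collections import defaultdict


class IdiomKnowledgeGraph:
    """Tiny in-memory graph: attributes, feature tags and semantic relations."""

    def __init__(self):
        self.attributes = defaultdict(dict)  # idiom -> {attribute: value}
        self.tags = defaultdict(set)         # idiom -> set of feature tags
        self.relations = defaultdict(set)    # (idiom, relation) -> idioms

    def add_relation(self, head, relation, tail):
        # Synonym, near-synonym and antonym relations hold in both directions.
        self.relations[(head, relation)].add(tail)
        self.relations[(tail, relation)].add(head)


def build_graph(structured_data):
    """Build the graph from entities, tag links and semantic links."""
    graph = IdiomKnowledgeGraph()
    for idiom, attrs in structured_data["entities"].items():
        graph.attributes[idiom].update(attrs)
    for idiom, tag in structured_data["tag_links"]:
        graph.tags[idiom].add(tag)
    for head, relation, tail in structured_data["semantic_links"]:
        graph.add_relation(head, relation, tail)
    return graph


structured_data = {
    "entities": {"A": {}, "B": {}, "C": {}, "D": {}},
    "tag_links": [("A", "t1"), ("A", "t2"), ("B", "t1"), ("B", "t2"),
                  ("C", "t1"), ("C", "t3"), ("D", "t4")],
    "semantic_links": [("A", "synonym", "B"),
                       ("A", "near_synonym", "C"),
                       ("A", "antonym", "D")],
}
graph = build_graph(structured_data)
```

In this sample, A and B carry identical tag sets, while A and C share only `t1`, which mirrors the synonym and near-synonym cases described above.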

Step 306: acquire the word embedding vector corresponding to each idiom entity in the idiom knowledge graph from a preset Chinese character, word and sentence embedding corpus.

In one or more embodiments of the present application, an existing Chinese character, word and sentence embedding corpus already stores word embedding vectors, trained in advance by a model, for Chinese words and phrases including idioms. The system can load the word embedding vectors corresponding to the finite set of idiom entities in the idiom knowledge graph for subsequent similarity calculation.
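Loading only the vectors needed for the idiom entities might look like the sketch below. It assumes a plain-text embedding file in which each line holds a token followed by its vector components separated by whitespace; this file layout, the function name `load_idiom_vectors` and the sample values are assumptions, as the application does not describe the corpus format.

```python
import io


def load_idiom_vectors(lines, idiom_entities):
    """Read 'token v1 v2 ...' lines and keep vectors for idiom entities only."""
    wanted = set(idiom_entities)
    vectors = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines, header lines and tokens that are not idiom entities.
        if len(parts) < 2 or parts[0] not in wanted:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Hypothetical embedding file content; a real file would be opened with open().
sample_file = io.StringIO(
    "画蛇添足 0.12 0.80 0.35\n"
    "苹果 0.90 0.10 0.05\n"
    "多此一举 0.10 0.82 0.33\n"
)
idiom_vectors = load_idiom_vectors(sample_file, ["画蛇添足", "多此一举"])
```

Filtering while reading keeps memory use proportional to the number of idiom entities in the graph rather than to the size of the whole embedding corpus.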

The idiom knowledge graph is constructed from structured data, synonyms and near-synonyms are distinguished based on the feature labels, and the user is supported in obtaining idiom information from multiple perspectives.

Fig. 5 illustrates a method for generating a idiom synonym list, which is described by taking the generation of the idiom synonym list as an example, and includes steps 502 to 516, according to an embodiment of the present specification.

Step 502: structured data are obtained from a preset corpus database, and the structured data comprise a plurality of idiomatic entities, a plurality of feature tags, idiomatic attribute information, semantic relation information among the idiomatic entities and tag relation information among the idiomatic entities and the feature tags.

In one or more embodiments of the present application, the system may obtain structured data from an existing corpus database, such as a web encyclopedia, a web dictionary, or a specialized database, where the structured data includes a plurality of idiomatic entities, a plurality of feature tags, idiomatic attribute information, semantic relationship information between the plurality of idiomatic entities, and tag relationship information between the idiomatic entities and the feature tags, where the semantic relationship information includes synonym relationships, near-synonym relationships, and anti-synonym relationships, among others.

Step 504: construct an idiom knowledge graph from the structured data, so that the idiom knowledge graph comprises idiom entities with semantic relations, together with the attributes and at least one feature tag corresponding to each idiom entity.

In one or more embodiments of the present application, as shown in Fig. 4, the constructed idiom knowledge graph contains synonym, near-synonym, and antonym relationships. Suppose idiom entities A, B, C, and D are entities in the idiom knowledge graph, where A and B are synonyms, A and C are near-synonyms, and A and D are antonyms. Then idiom entity A and idiom entity B should have identical feature tags.
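The graph described here can be pictured as a small data structure. The sketch below is one possible in-memory layout, not the application's actual storage format: the class names, the relation strings, and the tag names `t1` to `t3` are hypothetical, and the tag sets given to C and D are assumptions, since the text only states that synonyms A and B share identical tags.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass
class IdiomEntity:
    name: str
    tags: FrozenSet[str]
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class IdiomKnowledgeGraph:
    entities: Dict[str, IdiomEntity] = field(default_factory=dict)
    relations: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_entity(self, name: str, tags: Iterable[str], **attributes: str) -> None:
        self.entities[name] = IdiomEntity(name, frozenset(tags), dict(attributes))

    def relate(self, a: str, b: str, relation: str) -> None:
        # Semantic relations are symmetric, so store both directions.
        self.relations[(a, b)] = relation
        self.relations[(b, a)] = relation


graph = IdiomKnowledgeGraph()
graph.add_entity("A", ["t1", "t2"])
graph.add_entity("B", ["t1", "t2"])  # synonym of A: identical tag set
graph.add_entity("C", ["t1"])        # near-synonym of A: assumed to overlap only partly
graph.add_entity("D", ["t3"])        # antonym of A: assumed to differ
graph.relate("A", "B", "synonym")
graph.relate("A", "C", "near_synonym")
graph.relate("A", "D", "antonym")

print(graph.entities["A"].tags == graph.entities["B"].tags)
```

Storing tags as a `frozenset` makes the "identical feature tags" check a plain equality comparison that ignores tag order.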

Step 506: obtain a question sentence input by a user, perform Chinese word segmentation on the question sentence, and obtain the text data corresponding to the target idiom in the question sentence.

In one or more embodiments of the present application, after the system obtains the question sentence input by the user, it segments the sentence using a Chinese word segmentation technique from natural language processing, extracts the target idiom from the question sentence, and obtains the substring corresponding to the target idiom, that is, the text data.

Step 508: based on the text data corresponding to the target idiom and a pattern matching algorithm, obtain from the corpus database the idiom entity that matches the text data, so as to identify the target idiom.

In one or more embodiments of the present application, using a pattern matching algorithm, the system takes the substring corresponding to the target idiom as the pattern, matches it against the corpus database, and looks up the target idiom in the corpus database to identify it.
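Steps 506 and 508 can be approximated as follows. Note the substitution: a plain substring scan over a set of known idioms stands in for both the Chinese word segmentation and the pattern matching algorithm, neither of which the text specifies. The function name `find_target_idiom` is hypothetical, and the string `暗度陈仓` is assumed to be the Chinese original of the example idiom used in this document.

```python
from typing import Iterable, Optional


def find_target_idiom(question: str, known_idioms: Iterable[str]) -> Optional[str]:
    """Return the longest known idiom that occurs in the question, if any."""
    best: Optional[str] = None
    for idiom in known_idioms:
        if idiom in question and (best is None or len(idiom) > len(best)):
            best = idiom
    return best


known = {"暗度陈仓"}
# "暗度陈仓的同义词" is an assumed rendering of the example question
# ("synonyms of 暗度陈仓").
target = find_target_idiom("暗度陈仓的同义词", known)
print(target)
```

A production system would more likely run a real segmenter first and then look up each token, but the longest-match rule shown here captures the intent of picking out the idiom substring.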

Step 510: determine at least one feature tag corresponding to the target idiom in the idiom knowledge graph.

In one or more embodiments of the present application, after determining the target idiom, the system further determines at least one feature tag corresponding to the target idiom through the idiom knowledge graph. The feature tags are annotated and manually checked, and they mark attributes or descriptive information of the target idiom. For example, the idiom "暗度陈仓" (literally "secretly crossing at Chencang") means "to confuse the enemy from the front in order to conceal one's own line of attack, and to launch a surprise attack from the flank; it is a strategy of feinting in the east while attacking in the west and of winning by surprise. By extension it refers to a strategy of confusing the other side with conspicuous actions so as to catch them unprepared, and it is also a metaphor for acting in secret." Its feature tags may then include "military", "strategy", and "dark", among others.

Step 512: based on the at least one feature tag corresponding to the target idiom, obtain at least one idiom entity in the idiom knowledge graph that has the same feature tags as the target idiom as a candidate idiom, and generate an idiom recommendation list corresponding to the at least one candidate idiom.

In one or more embodiments of the present application, since each idiom entity in the idiom knowledge graph has already been assigned a plurality of feature tags, the system only needs to match on the feature tags within the idiom knowledge graph to obtain at least one idiom entity whose feature tags are identical to those of the target idiom as a candidate idiom. It then generates an idiom recommendation list corresponding to the at least one candidate idiom, which ensures that each selected candidate idiom shares the same main morphemes as the target idiom.
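The tag matching of step 512 reduces to a set-equality test. The sketch below assumes each idiom's tags are held as a `frozenset`; the function name, the placeholder idioms A to D, and the tag assignments are hypothetical illustrations.

```python
from typing import Dict, FrozenSet, List


def candidates_with_same_tags(target: str, tags: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return every other idiom whose tag set equals the target's tag set."""
    wanted = tags[target]
    return sorted(name for name, t in tags.items() if name != target and t == wanted)


# Hypothetical tag assignments for four placeholder idiom entities.
tags = {
    "A": frozenset({"military", "strategy", "dark"}),
    "B": frozenset({"military", "strategy", "dark"}),
    "C": frozenset({"military", "strategy"}),
    "D": frozenset({"honesty"}),
}
recommendation_list = candidates_with_same_tags("A", tags)
print(recommendation_list)
```

Because the comparison is exact equality rather than overlap, an idiom such as C that shares only some tags is excluded before any similarity is computed.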

Step 514: perform similarity calculation between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list, to obtain a similarity value between each candidate idiom and the target idiom.

In one or more embodiments of the present application, the system uses the word embedding vectors to calculate the similarity between the target idiom and each candidate idiom in the idiom recommendation list, so as to measure how similar each candidate idiom is to the target idiom.

Step 516: screen the candidate idioms in the idiom recommendation list according to the similarity value between each candidate idiom and the target idiom, to obtain an idiom recommendation list containing only candidate idioms that are synonyms of the target idiom.

In one or more embodiments of the present application, the system filters the candidate idioms in the idiom recommendation list according to the similarity between each candidate idiom and the target idiom, and removes from the list any suspected synonym whose similarity does not meet the requirement, thereby obtaining an idiom recommendation list containing only candidate idioms that are synonyms of the target idiom.

The present application distinguishes synonyms from near-synonyms by using the relation between idioms and their corresponding feature tags, so that the synonyms the user needs are identified and confusable near-synonyms are filtered out, and the candidate idioms in the idiom recommendation list can be interchanged with the target idiom in any context.

Fig. 6 illustrates a method for generating an idiom synonym list according to an embodiment of the present specification. The method is described by taking the generation of an idiom synonym list as an example and includes steps 602 to 620.

Step 602: obtain structured data from a preset corpus database, where the structured data comprises a plurality of idiom entities, a plurality of feature tags, idiom attribute information, semantic relation information among the idiom entities, and tag relation information between the idiom entities and the feature tags.

In one or more embodiments of the present application, the system may obtain the structured data from an existing corpus database, such as a web encyclopedia, a web dictionary, or a specialized database. The semantic relation information includes synonym relationships, near-synonym relationships, and antonym relationships, among others.

Step 604: construct an idiom knowledge graph from the structured data, so that the idiom knowledge graph comprises idiom entities with semantic relations, together with the attributes and at least one feature tag corresponding to each idiom entity.

In one or more embodiments of the present application, as shown in Fig. 4, the constructed idiom knowledge graph contains synonym, near-synonym, and antonym relationships. Suppose idiom entities A, B, C, and D are entities in the idiom knowledge graph, where A and B are synonyms, A and C are near-synonyms, and A and D are antonyms. Then idiom entity A and idiom entity B should have identical feature tags.

Step 606: acquire the word embedding vector corresponding to each idiom entity in the idiom knowledge graph from a preset Chinese character word and sentence embedding corpus.

In one or more embodiments of the present application, an existing Chinese character word and sentence embedding corpus already stores word embedding vectors, trained in advance by a model, for Chinese characters, words and phrases, including idioms. Because the idiom knowledge graph contains a finite set of idiom entities, the system can load the word embedding vectors of all of them for subsequent similarity calculation.

Step 608: obtain a question sentence input by a user, and identify a target idiom from the question sentence.

In one or more embodiments of the present application, when a user entering text on a terminal device needs to look up synonyms of a particular target idiom, the user can ask the system directly within the writing tool. The system obtains the question sentence input by the user and identifies from it the target idiom whose synonyms the user wants to find. For example, to find a synonymous idiom to replace the target idiom "暗度陈仓", the user can input the question sentence "synonyms of 暗度陈仓"; the system obtains this question sentence and identifies the target idiom "暗度陈仓" from it.

Step 610: obtain at least one candidate idiom having the same feature tags as the target idiom from a preset idiom knowledge graph, and generate an idiom recommendation list corresponding to the at least one candidate idiom.

In one or more embodiments of the present application, the system constructs the idiom knowledge graph using a feature-tag-based construction method. After the system obtains the target idiom the user asked about, it uses the feature tags annotated for the target idiom in the idiom knowledge graph to match at least one candidate idiom whose feature tags are identical to those of the target idiom, and generates an idiom recommendation list corresponding to the at least one candidate idiom.

Step 612: based on the Chinese character word and sentence embedding corpus, determine the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list.

In one or more embodiments of the present application, the system determines, from the loaded word embedding vectors of all the idiom entities in the idiom knowledge graph, the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list.

Step 614: based on a similarity algorithm, calculate the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom.

In one or more embodiments of the present application, based on a similarity algorithm, the system calculates the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom. Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. Even in high dimensions it retains the property of being 1 when the two vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. Compared with a distance measure, cosine similarity emphasizes the difference between two vectors in direction rather than in distance or length. The formula is as follows:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A and B are the two n-dimensional word embedding vectors being compared, and A_i and B_i are their i-th components.

In particular, because cosine similarity measures the angle between vectors in space and reflects a difference in direction rather than position, two idioms may have a high cosine similarity and yet be antonyms. The feature tags are therefore needed to ensure that the candidate idioms and the target idiom share the same main morphemes.
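The cosine computation of step 614 can be sketched in a few lines of standard-library Python. The function name `cosine_similarity` and the explicit error handling for mismatched or zero vectors are illustrative choices, not requirements stated in the application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The three sample calls exercise the same-direction, orthogonal, and opposite cases; the second vector in the first call is twice as long as the first, which illustrates that length does not affect the result.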

Step 616: compare the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom with a similarity threshold, and determine whether that cosine similarity is greater than or equal to the similarity threshold. If yes, go to step 618; otherwise, go to step 620.

Step 618: retain the candidate idiom in the idiom recommendation list.

In one or more embodiments of the present application, when the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to the candidate idiom is greater than or equal to the similarity threshold, the target idiom and the candidate idiom are considered highly similar and can be determined to be synonyms, so the candidate idiom is retained in the idiom recommendation list.

Step 620: remove the candidate idiom from the idiom recommendation list.

In one or more embodiments of the present application, when the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to the candidate idiom is smaller than the similarity threshold, the similarity between the target idiom and the candidate idiom is considered too weak to determine that they are synonyms, so the candidate idiom is removed from the idiom recommendation list.

Optionally, the similarity threshold may be 0.9.
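Steps 616 to 620 amount to a threshold filter over the similarity values. In the sketch below the function name, the candidate names, and the scores are hypothetical, and ordering the retained candidates by descending similarity is an added convenience that the application does not mention.

```python
from typing import Dict, List

SIMILARITY_THRESHOLD = 0.9  # the optional value given in the text


def screen_candidates(
    similarities: Dict[str, float], threshold: float = SIMILARITY_THRESHOLD
) -> List[str]:
    """Keep candidates whose similarity is >= threshold, most similar first."""
    kept = [(name, score) for name, score in similarities.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # ordering is an assumption
    return [name for name, _ in kept]


# Made-up similarity values for three placeholder candidate idioms.
scores = {"B": 0.95, "E": 0.90, "F": 0.72}
synonym_list = screen_candidates(scores)
print(synonym_list)
```

Using `>=` matches the "greater than or equal to" condition of step 616, so a candidate scoring exactly 0.9 is retained.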

The present application calculates the cosine similarity between the target idiom and each candidate idiom from their word embedding vectors, using the cosine of the angle between two vectors in the vector space to measure the difference between two idiom entities, so that whether two idioms are synonyms can be judged accurately and reliably.

Corresponding to the above method embodiments, the present specification further provides an embodiment of a device for generating a synonym list of idioms, and fig. 7 shows a schematic structural diagram of the device for generating a synonym list of idioms according to an embodiment of the present specification. As shown in fig. 7, the apparatus includes:

an idiom recognition module 701, configured to obtain a question sentence input by a user and identify a target idiom from the question sentence;

a list generating module 702, configured to obtain at least one candidate idiom having the same feature tags as the target idiom from a preset idiom knowledge graph, and generate an idiom recommendation list corresponding to the at least one candidate idiom;

a similarity calculation module 703, configured to perform similarity calculation between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list, so as to obtain a similarity value between each candidate idiom and the target idiom;

a list screening module 704, configured to screen the candidate idioms in the idiom recommendation list according to the similarity value between each candidate idiom and the target idiom, so as to obtain an idiom recommendation list containing only candidate idioms that are synonyms of the target idiom.

Optionally, the apparatus further comprises:

a list returning module, configured to return to the user the idiom recommendation list containing the candidate idioms that are synonyms of the target idiom.

Optionally, the apparatus further comprises:

a data acquisition module, configured to obtain structured data from a preset corpus database, where the structured data comprises a plurality of idiom entities, a plurality of feature tags, idiom attribute information, semantic relation information among the idiom entities, and tag relation information between the idiom entities and the feature tags;

a graph construction module, configured to construct an idiom knowledge graph from the structured data, so that the idiom knowledge graph comprises idiom entities with semantic relations, together with the attributes and at least one feature tag corresponding to each idiom entity.

Optionally, the apparatus further comprises:

a word vector loading module, configured to acquire the word embedding vector corresponding to each idiom entity in the idiom knowledge graph from a preset Chinese character word and sentence embedding corpus.

Optionally, the idiom recognition module includes:

a word segmentation unit, configured to obtain a question sentence input by a user, perform Chinese word segmentation on the question sentence, and obtain the text data corresponding to the target idiom in the question sentence;

a keyword searching unit, configured to obtain from the corpus database, based on the text data corresponding to the target idiom and a pattern matching algorithm, the idiom entity that matches the text data, so as to identify the target idiom.

Optionally, the list generating module includes:

a tag determining unit, configured to determine at least one feature tag corresponding to the target idiom in the idiom knowledge graph;

a tag matching unit, configured to obtain, based on the at least one feature tag corresponding to the target idiom, at least one idiom entity in the idiom knowledge graph whose feature tags are identical to those of the target idiom as a candidate idiom.

Optionally, the similarity calculation module includes:

a word vector determining unit configured to determine, based on the Chinese character word and sentence embedding corpus, a word embedding vector corresponding to the target idiom and a word embedding vector corresponding to each candidate idiom in the idiom recommendation list;

a cosine similarity calculation unit, configured to calculate, based on a similarity algorithm, the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom.

Optionally, the list screening module includes:

a threshold comparison unit, configured to compare the cosine similarity between the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom with a similarity threshold, and determine whether that cosine similarity is greater than or equal to the similarity threshold; if yes, the retaining unit is executed, and if not, the removing unit is executed;

a retaining unit configured to retain the candidate idioms in the idiom recommendation list;

a removing unit configured to remove the candidate idiom from the idiom recommendation list.

Optionally, the similarity threshold is 0.9.
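To show how the four modules fit together, the sketch below wires simplified stand-ins for modules 701 to 704 into one class. Everything here is an illustrative assumption: the class name `IdiomSynonymDevice`, the substring-based recognition, the placeholder idiom names, and the two-dimensional vectors are not taken from the application.

```python
import math
from typing import Dict, FrozenSet, List, Optional, Sequence


class IdiomSynonymDevice:
    """Simplified stand-ins for modules 701-704, chained by run()."""

    def __init__(self, tags: Dict[str, FrozenSet[str]],
                 vectors: Dict[str, Sequence[float]], threshold: float = 0.9):
        self.tags = tags            # idiom -> feature tags (knowledge graph stand-in)
        self.vectors = vectors      # idiom -> word embedding vector
        self.threshold = threshold

    def recognize(self, question: str) -> Optional[str]:          # module 701
        hits = [idiom for idiom in self.tags if idiom in question]
        return max(hits, key=len) if hits else None

    def generate_list(self, target: str) -> List[str]:            # module 702
        wanted = self.tags[target]
        return [i for i, t in self.tags.items() if i != target and t == wanted]

    def similarities(self, target: str, candidates: List[str]) -> Dict[str, float]:  # module 703
        return {c: self._cosine(self.vectors[target], self.vectors[c]) for c in candidates}

    def screen(self, scores: Dict[str, float]) -> List[str]:      # module 704
        return [c for c, s in scores.items() if s >= self.threshold]

    def run(self, question: str) -> List[str]:
        target = self.recognize(question)
        if target is None:
            return []
        return self.screen(self.similarities(target, self.generate_list(target)))

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical data: idiom_B and idiom_C share idiom_A's tags, idiom_D does not.
shared = frozenset({"t1", "t2"})
device = IdiomSynonymDevice(
    tags={"idiom_A": shared, "idiom_B": shared, "idiom_C": shared,
          "idiom_D": frozenset({"t1"})},
    vectors={"idiom_A": [1.0, 0.0], "idiom_B": [0.99, 0.1],
             "idiom_C": [0.5, 0.8], "idiom_D": [1.0, 0.0]},
)
result = device.run("synonym of idiom_A")
print(result)
```

In this made-up data, idiom_D has a vector identical to idiom_A's but is never scored because its tags differ, while idiom_C passes the tag match and is then dropped by the threshold; the two filters are complementary.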

To address the pain point that users find it hard to tell apart the subtle differences between synonyms and near-synonyms while writing, the present application uses the feature tags in the idiom knowledge graph to ensure the generalization of idiom morphemes, together with similarity calculation between idiom word embedding vectors, to provide the user with accurate synonyms. It can return an idiom recommendation list made up of idioms that are interchangeable with the target idiom in any situation. The user can therefore ask questions directly in the writing tool without switching to a third-party tool, does not need to separate synonyms from near-synonyms in the generated idiom recommendation list, and does not need to judge whether the idioms can replace one another, which shortens the path to selecting an idiom and ensures the accuracy of the selection.

An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor executes the instructions to implement the following steps:

acquiring question sentences input by a user, and identifying target idioms from the question sentences input by the user;

acquiring at least one candidate idiom with the same feature label as the target idiom from a preset idiom knowledge graph, and generating an idiom recommendation list corresponding to the at least one candidate idiom;

performing similarity calculation on the word embedding vector corresponding to the target idiom and the word embedding vector corresponding to each candidate idiom in the idiom recommendation list respectively to obtain a similarity value corresponding to each candidate idiom and the target idiom;

and screening the candidate idioms in the idiom recommendation list according to the similarity value corresponding to each candidate idiom and the target idiom to obtain an idiom recommendation list containing only candidate idioms that are synonyms of the target idiom.

An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions, and when the instructions are executed by a processor, the method for generating the idiom synonym list as described above is implemented.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the computer-readable storage medium and the technical solution of the above method for generating the idiom synonym list belong to the same concept, and details that are not described in detail in the technical solution of the computer-readable storage medium can be referred to the description of the technical solution of the above method for generating the idiom synonym list.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.
