Method, system, device and storage medium for matching objects for articles

Document No.: 1556712    Publication date: 2020-01-21

Reading note: this technology, Method, system, device and storage medium for matching objects for articles (为文章匹配对象的方法、系统、设备及存储介质), was designed by 张亮, 佘志东, 张震涛, 王刚, 饶正锋 and 缪世磊 on 2018-06-27. Its main content is as follows. The invention discloses a method, a system, a device and a storage medium for matching objects for articles, wherein the method comprises: acquiring all objects under a category; extracting at least one selling point word of each object under the category; extracting at least one keyword from an article to be matched; acquiring the word vector corresponding to each keyword, recorded as a keyword word vector; acquiring the word vector corresponding to each selling point word, recorded as a selling point word vector; performing similarity calculation using the keyword word vectors and the selling point word vectors of each object to obtain corresponding similarity coefficients; calculating the TF-IDF of each keyword; calculating the similarity score between the article to be matched and each object according to the similarity coefficients and the TF-IDF of the corresponding keywords; and taking the several objects with the highest similarity scores as the final matching objects of the article to be matched. The invention automatically and accurately matches a group of objects to the article to be matched, can significantly reduce the time spent manually selecting objects for the article, and improves the efficiency of automatic copywriting generation.

1. A method for matching objects to articles, each of the articles corresponding to a respective category, the method comprising:

acquiring all objects under the category;

extracting at least one selling point word of each object under the category;

extracting at least one keyword in an article to be matched;

obtaining word vectors corresponding to each keyword in the article to be matched, and recording the word vectors as keyword word vectors;

obtaining a word vector corresponding to each selling point word of each object, and recording the word vector as a selling point word vector;

performing similarity calculation by using the keyword word vectors and the selling point word vectors of each object to obtain corresponding similarity coefficients;

calculating TF-IDF of each keyword;

calculating the similarity score of the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword;

and taking the plurality of objects with the highest similarity scores as final matching objects of the article to be matched.

2. The method of matching objects for articles of claim 1,

each selling point word of the object and the keywords in the article to be matched comprise a main word-modifier word pair and/or an independent word, the main word-modifier word pair is a word pair consisting of a main word and a corresponding modifier which appear in pairs, and the independent word is a word which independently exists except the main word and the modifier;

the obtaining of the corresponding similarity coefficient by performing similarity calculation using the keyword word vector and the selling point word vector of each object includes:

searching the article to be matched and the main words shared by each object, setting the shared main words as the same main words, and setting the rest main words as different main words;

performing cosine similarity calculation by using the word vectors of the modifiers corresponding to the same main words of the article to be matched and the word vectors of the modifiers corresponding to the same main words of the corresponding object, to obtain a similarity coefficient A1 between the article to be matched and the corresponding object for the same main words;

selecting at least one word with the highest TF-IDF from all the different main words of the article to be matched as a similar main word, performing distance calculation by using word vectors of the similar main words of the article to be matched and word vectors of the different main words of the corresponding object to obtain a corresponding similar main word distance A2, and performing distance calculation by using word vectors of the modifiers corresponding to the similar main words and word vectors of the modifiers corresponding to the different main words of the corresponding object to obtain a corresponding similar main word modifier distance B;

the calculating the similarity score of the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword comprises the following steps:

the TF-IDF of each same main word of the article to be matched is represented as W1, and the score of the article to be matched and the corresponding object for the same main words is set as W1*A1;

searching the independent words shared by the article to be matched and each object, setting the shared independent words as the same independent words, and setting the rest independent words as different independent words;

the TF-IDF of each identical independent word of the article to be matched is represented as V1, and the score of the article to be matched and the identical independent word of the corresponding object is set as V1;

the TF-IDF of each similar main word of the article to be matched is represented as W2, and the score of the article to be matched and the corresponding object for the similar main words is set as W2*A2*B;

selecting at least one word with the highest TF-IDF from all the different independent words of the article to be matched as a similar independent word, performing distance calculation by using word vectors of the similar independent words of the article to be matched and word vectors of the different independent words of the corresponding object to obtain a similar independent word distance C, wherein the TF-IDF of the similar independent words of the article to be matched is represented as V2, and the score of the article to be matched and the corresponding object for the similar independent words is set as V2*C;

and calculating the similarity score of the article to be matched and each object according to the corresponding W1*A1, V1, W2*A2*B and V2*C.

3. The method of matching objects for articles of claim 2 further comprising the steps of:

counting the independent words of all the objects under the category, removing words with a co-occurrence rate in a preset interval, forming a category independent word set by the remaining independent words, wherein the co-occurrence rate represents the percentage of the independent words appearing in all the objects;

removing the main word-modifier word pairs with invalid modifiers from the main word-modifier word pairs of all the objects, and forming a category main word-modifier word pair set by the remaining main word-modifier word pairs;

before the step of performing similarity calculation using the keyword word vectors and the selling point word vectors of each object to obtain corresponding similarity coefficients, the method further comprises the following steps:

cleaning the main word-modifier word pairs of each object under the category so as to remove, from each object, the main word-modifier word pairs which are not included in the category main word-modifier word pair set;

cleaning the independent words of each object under the category to remove the independent words which are not included in the independent word set in each object;

cleaning the main word-modifier word pairs of the article to be matched so as to remove the main word-modifier word pairs which are not included in the category main word-modifier word pair set in the article to be matched;

and cleaning the independent words of the article to be matched so as to remove the independent words which are not included in the independent word set in the article to be matched.

4. The method as claimed in claim 2, wherein before the step of calculating the similarity score between the article to be matched and each of the objects according to the similarity coefficient and the TF-IDF of the corresponding keyword, the method further comprises the following steps:

and cleaning out each object for which the similarity coefficient A1 of a same main word is a negative number.

5. The method of matching objects for articles of claim 1,

the extracting at least one selling point word of each object under the category comprises the following steps:

taking the title, the attribute and the historical recommended article of the object as materials, performing word segmentation and dependency syntactic analysis on the materials to mark the relationship between each word so as to obtain at least one selling point word of each object under the category;

the extracting at least one keyword in the article to be matched comprises the following steps:

and performing word segmentation and dependency syntactic analysis on the article to be matched to mark the relationship between each word so as to obtain at least one keyword in the article to be matched.

6. The method of claim 3, wherein the predetermined interval is greater than 50% or less than 0.1%.

7. A system for matching objects to articles, each of said articles corresponding to a respective category, said system comprising:

the object acquisition module is used for acquiring all objects in the category;

a selling point word extracting module used for extracting at least one selling point word of each object under the category;

the keyword extraction module is used for extracting at least one keyword in the article to be matched;

a keyword word vector obtaining module, configured to obtain a word vector corresponding to each keyword in the article to be matched, and record the word vector as a keyword word vector;

a selling point word vector obtaining module, configured to obtain a word vector corresponding to each selling point word of each object, and record the word vector as a selling point word vector;

the similarity calculation module is used for performing similarity calculation by using the keyword word vectors and the selling point word vectors of each object to obtain corresponding similarity coefficients;

the term frequency-inverse document frequency (TF-IDF) calculation module is used for calculating TF-IDF of each keyword;

a similarity score calculation module for calculating the similarity score between the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword;

and the matching module is used for taking the objects with the highest similarity scores as final matching objects of the article to be matched.

8. The system for matching objects for articles of claim 7,

the similarity calculation module includes:

the searching module is used for searching the article to be matched and the main words shared by each object, the shared main words are set as the same main words, and the rest main words are different main words;

a cosine similarity calculation module, configured to perform cosine similarity calculation using word vectors of the modifiers corresponding to the same main words of the article to be matched and word vectors of the modifiers corresponding to the same main words of the corresponding object, so as to obtain a similarity coefficient A1 between the article to be matched and the corresponding object for the same main words;

a distance calculation module, configured to select at least one word with the highest TF-IDF from all the different main words of the article to be matched as a similar main word, perform distance calculation using a word vector of the similar main word of the article to be matched and a word vector of the different main word of the corresponding object to obtain a corresponding similar main word distance A2, and perform distance calculation using a word vector of the modifier corresponding to the similar main word and a word vector of the modifier corresponding to the different main word of the corresponding object to obtain a corresponding similar main word modifier distance B;

the similarity score calculation module comprises:

a first score calculating module, configured to represent TF-IDF of each same main word of the article to be matched as W1, and set the score of the article to be matched and the corresponding object for the same main words as W1*A1;

the independent word searching module is used for searching the independent words shared by the article to be matched and each object, the shared independent words are the same independent words, and the rest independent words are different independent words;

a second score calculating module, configured to represent TF-IDF of each identical independent word of the article to be matched as V1, and set a score of the article to be matched and the identical independent word of the corresponding object as V1;

a third score calculating module, configured to represent TF-IDF of each similar main word of the article to be matched as W2, and set the score of the article to be matched and the corresponding object for the similar main words as W2*A2*B;

a fourth score calculating module, configured to select at least one word with the highest TF-IDF from all the different independent words of the article to be matched as a similar independent word, perform distance calculation using a word vector of the similar independent word of the article to be matched and a word vector of the different independent word of the corresponding object to obtain a similar independent word distance C, where the TF-IDF of the similar independent word of the article to be matched is denoted as V2, and set the score of the article to be matched and the corresponding object for the similar independent words as V2*C;

and the total score calculating module is used for calculating the similarity score of the article to be matched and each object according to the corresponding W1*A1, V1, W2*A2*B and V2*C.

9. The system for matching objects for articles of claim 8 wherein said system further comprises:

the independent word removing module is used for counting the independent words of all the objects under the category, removing the words with a co-occurrence rate in a preset interval, and forming a category independent word set from the remaining independent words, wherein the co-occurrence rate represents the percentage of all the objects in which an independent word appears;

the word pair removing module is used for removing the main word-modifier word pairs with invalid modifiers from the main word-modifier word pairs of all the objects, and the rest main word-modifier word pairs form a category main word-modifier word pair set;

a word pair cleaning module, configured to, before the similarity calculation module is invoked, clean the main word-modifier word pairs of each object under the category so as to remove from each object the main word-modifier word pairs that are not included in the category main word-modifier word pair set, and clean the main word-modifier word pairs of the article to be matched so as to remove from the article the main word-modifier word pairs that are not included in the category main word-modifier word pair set;

an independent word cleaning module, configured to, before the similarity calculation module is invoked, clean the independent words of each object under the category so as to remove from each object the independent words that are not included in the category independent word set, and clean the independent words of the article to be matched so as to remove from the article the independent words that are not included in the category independent word set.

10. The system for matching objects for articles of claim 8 wherein said system further comprises:

and the object cleaning module is used for cleaning out, before the similarity score calculation module is invoked, each object for which the similarity coefficient A1 of a same main word is a negative number.

11. The system for matching objects for articles of claim 7,

the selling point word extraction module is used for taking the title, the attributes and the historical recommended articles of the object as materials, and performing word segmentation and dependency syntactic analysis on the materials to mark the relationships between words, so as to obtain at least one selling point word of each object under the category;

the keyword extraction module is used for performing word segmentation and dependency syntactic analysis on the article to be matched to mark the relationships between words, so as to obtain at least one keyword in the article to be matched.

12. The system for matching objects for articles of claim 9, wherein the predetermined interval is greater than 50% or less than 0.1%.

13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for matching objects for articles of any of claims 1 to 6 when executing the computer program.

14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of matching objects for articles of any one of claims 1 to 6.

Technical Field

The invention relates to the technical field of internet, in particular to a method, a system, equipment and a storage medium for matching an article with an object.

Background

On the internet, a large number of objects need recommendation articles and the like that introduce their advantages, disadvantages, cost performance and so on, for the reference of target groups when making a selection. However, editing a large number of recommendation articles takes considerable time and money. For this reason, the automatic generation of recommendation articles, and in particular the technology for automatically matching an automatically generated article with target objects, is important.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, in which manually matching corresponding objects for a large number of articles involves a heavy workload and low efficiency, and provides a method, a system, a device and a storage medium for matching objects for articles, which can automatically, quickly and accurately match a group of objects for recommended articles.

The invention solves the technical problems through the following technical scheme:

the invention provides a method for matching objects for articles, wherein each article corresponds to a respective category, and the method comprises the following steps:

acquiring all objects under the category;

extracting at least one selling point word of each object under the category;

extracting at least one keyword in an article to be matched;

obtaining word vectors corresponding to each keyword in the article to be matched, and recording the word vectors as keyword word vectors;

obtaining a word vector corresponding to each selling point word of each object, and recording the word vector as a selling point word vector;

performing similarity calculation by using the keyword word vectors and the selling point word vectors of each object to obtain corresponding similarity coefficients;

calculating TF-IDF (term frequency-inverse document frequency) of each keyword;

calculating the similarity score of the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword;

and taking the plurality of objects with the highest similarity scores as final matching objects of the article to be matched.

In this scheme, hot words (selling point words) are extracted from all objects under the category to which the article to be matched belongs, keywords are extracted from the article to be matched, similarity calculation is performed using the word vectors of the extracted words, and, in combination with the TF-IDF of the keywords, the group of objects with the highest similarity to the article to be matched is determined as the final matching objects.

The scheme provides a method for automatically and accurately matching a group of objects for the article to be matched, so that the time spent manually selecting objects for the article to be matched can be significantly reduced, and the efficiency of automatic copywriting generation is improved.
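By way of non-limiting illustration, the overall flow can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the claimed implementation: the function names, the toy two-dimensional word vectors, the smoothed IDF formula, and the rule "score = sum over keywords of TF-IDF times the best cosine similarity to any selling point word" are all assumptions introduced here for illustration.

```python
import math
from collections import Counter


def tf_idf(article_tokens, corpus):
    """Return {word: TF-IDF} for the article, using a smoothed IDF over `corpus`."""
    counts = Counter(article_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # assumed smoothing
        weights[word] = (count / len(article_tokens)) * idf
    return weights


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_objects(weights, vectors, objects, top_k):
    """Rank objects by the TF-IDF-weighted similarity of keywords to selling point words."""
    scored = []
    for name, selling_points in objects.items():
        score = 0.0
        for keyword, weight in weights.items():
            # Assumed rule: take the best-matching selling point word per keyword.
            best = max(
                (cosine(vectors[keyword], vectors[sp]) for sp in selling_points),
                default=0.0,
            )
            score += weight * best
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical toy data: 2-dimensional "word vectors" and three objects.
vectors = {
    "light": [1.0, 0.0],
    "portable": [0.9, 0.1],
    "loud": [0.0, 1.0],
    "heavy": [-1.0, 0.0],
}
objects = {"laptop_a": ["portable"], "speaker_b": ["loud"], "desktop_c": ["heavy"]}
article = ["light", "light", "loud"]
corpus = [["light", "loud"], ["loud"], ["heavy"]]

weights = tf_idf(article, corpus)
matches = match_objects(weights, vectors, objects, top_k=2)
```

In this toy example the keyword "light" carries the largest TF-IDF weight, so the object whose selling point word is closest to it ranks first.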

Preferably, each of the selling point words of the object and the keywords in the article to be matched includes a main word-modifier word pair and/or an independent word, the main word-modifier word pair is a word pair consisting of a main word and a corresponding modifier which appear in pairs, and the independent word is a word which exists independently except the main word and the modifier;

the obtaining of the corresponding similarity coefficient by performing similarity calculation using the keyword word vector and the selling point word vector of each object includes:

searching the article to be matched and the main words shared by each object, setting the shared main words as the same main words, and setting the rest main words as different main words;

the calculating the similarity score of the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword comprises the following steps:

and calculating the similarity score of the article to be matched and each object according to the corresponding W1*A1, V1, W2*A2*B and V2*C.
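As a non-limiting sketch, the four kinds of per-term scores (same main words, same independent words, similar main words and similar independent words) could be combined as shown below. The text states only that the similarity score is calculated "according to" these terms; the plain summation used here, and the names `TermScores` and `similarity_score`, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TermScores:
    """Per-object inputs to the similarity score (all names are hypothetical)."""
    same_main: List[Tuple[float, float]] = field(default_factory=list)  # (W1, A1)
    same_independent: List[float] = field(default_factory=list)  # V1
    similar_main: List[Tuple[float, float, float]] = field(default_factory=list)  # (W2, A2, B)
    similar_independent: List[Tuple[float, float]] = field(default_factory=list)  # (V2, C)


def similarity_score(terms: TermScores) -> float:
    """Assumed combination rule: sum of W1*A1, V1, W2*A2*B and V2*C."""
    score = sum(w1 * a1 for w1, a1 in terms.same_main)
    score += sum(terms.same_independent)
    score += sum(w2 * a2 * b for w2, a2, b in terms.similar_main)
    score += sum(v2 * c for v2, c in terms.similar_independent)
    return score


example = TermScores(
    same_main=[(0.5, 0.8)],
    same_independent=[0.2],
    similar_main=[(0.4, 0.5, 0.5)],
    similar_independent=[(0.1, 0.3)],
)
total = similarity_score(example)  # 0.4 + 0.2 + 0.1 + 0.03
```

A weighted sum or a normalised sum would fit the same wording equally well; the summation is chosen here only because it is the simplest combination.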

Preferably, the method further comprises the steps of:

cleaning the independent words of each object under the category to remove the independent words which are not included in the independent word set in each object;

and cleaning the independent words of the article to be matched so as to remove the independent words which are not included in the independent word set in the article to be matched.

In this scheme, the co-occurrence rate of an independent word is the number of objects in which the word appears divided by the number of all objects, expressed as a percentage. Before similarity calculation, words whose co-occurrence rate falls within a preset interval are removed from the independent words of all the objects under the category, and the remaining independent words form a category independent word set. That is, only words whose co-occurrence rate lies outside the preset interval are retained, which reduces the subsequent amount and complexity of calculation.
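A minimal sketch of this filtering step is given below, assuming the preferred interval stated elsewhere in this document (co-occurrence rate greater than 50% or less than 0.1% is removed). The function name and the input layout (a mapping from object identifier to its list of independent words) are hypothetical.

```python
from collections import Counter


def build_category_independent_words(object_words, low=0.001, high=0.5):
    """Keep words whose co-occurrence rate is neither below `low` nor above `high`."""
    n_objects = len(object_words)
    counts = Counter()
    for words in object_words.values():
        counts.update(set(words))  # count each word at most once per object
    return {w for w, c in counts.items() if low <= c / n_objects <= high}


# Hypothetical toy category with four objects.
object_words = {
    "obj1": ["new", "waterproof", "bluetooth"],
    "obj2": ["new", "waterproof"],
    "obj3": ["new"],
    "obj4": ["new", "new"],
}
category_words = build_category_independent_words(object_words)
```

Here "new" appears in every object (rate 100%) and is removed as uninformative, while "waterproof" (50%) and "bluetooth" (25%) are retained.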

In this scheme, the main words of all objects under the category are processed on the same principle: the main word-modifier word pairs whose modifiers are invalid are removed, so that a category main word-modifier word pair set is obtained. The scheme can further reduce the subsequent amount and complexity of calculation.

In this scheme, the category independent word set and the category main word-modifier word pair set are used to clean the independent words and main word-modifier word pairs of each object under the category and of the article to be matched, so as to remove the words that do not belong to the sets. Only the remaining words are used in the subsequent similarity calculation and similarity score calculation. The scheme can further reduce the subsequent amount and complexity of calculation.
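The cleaning step can be illustrated with the following sketch, which applies equally to an object and to the article to be matched. The function name and the representation of a main word-modifier word pair as a `(main_word, modifier)` tuple are assumptions made for illustration.

```python
def clean_terms(pairs, independents, category_pairs, category_independents):
    """Drop pairs and independent words that are absent from the category sets."""
    kept_pairs = [pair for pair in pairs if pair in category_pairs]
    kept_independents = [w for w in independents if w in category_independents]
    return kept_pairs, kept_independents


# Hypothetical category sets and raw terms extracted from one object or article.
category_pairs = {("screen", "large"), ("battery", "durable")}
category_independents = {"waterproof", "bluetooth"}

pairs, independents = clean_terms(
    [("screen", "large"), ("price", "nice")],
    ["waterproof", "new"],
    category_pairs,
    category_independents,
)
```

Using sets for the category collections keeps each membership test constant-time on average, which matters when the same filter is applied to every object under the category.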

Preferably, before the calculating the similarity score between the article to be matched and each object according to the similarity coefficient and the TF-IDF of the corresponding keyword, the method further comprises the following steps:

and cleaning out each object for which the similarity coefficient A1 of a same main word is a negative number.

In this scheme, before the similarity score is calculated, objects that share a main word with the article to be matched but whose modifier for that main word is opposite in meaning to the article's modifier are cleaned out. These objects are not used in the subsequent similarity score calculation, which further reduces the subsequent amount and complexity of calculation.
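One possible reading of this step is sketched below: for every main word shared by the article and an object, A1 is the cosine similarity of the two modifiers' word vectors, and an object is dropped as soon as any A1 is negative. The function names, the dictionary layout (main word mapped to modifier) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_contradicting_objects(article_pairs, object_pairs, vectors):
    """Return {object: {main_word: A1}} for objects with no negative A1."""
    kept = {}
    for name, pairs in object_pairs.items():
        shared = set(article_pairs) & set(pairs)  # the same main words
        a1 = {
            main: cosine(vectors[article_pairs[main]], vectors[pairs[main]])
            for main in shared
        }
        # Assumed rule: a single negative A1 is enough to remove the object.
        if all(value >= 0 for value in a1.values()):
            kept[name] = a1
    return kept


# Hypothetical toy data: main word -> modifier, plus 2-dimensional vectors.
vectors = {
    "large": [1.0, 0.2],
    "big": [0.9, 0.3],
    "small": [-1.0, 0.1],
    "durable": [0.0, 1.0],
}
article_pairs = {"screen": "large"}
object_pairs = {
    "phone_a": {"screen": "big"},
    "phone_b": {"screen": "small"},
    "phone_c": {"battery": "durable"},
}
kept = drop_contradicting_objects(article_pairs, object_pairs, vectors)
```

An object with no shared main word, such as `phone_c` above, has no A1 to test and is therefore kept for the later score calculation.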

Preferably, the extracting at least one selling point word of each object under the category comprises:

the extracting at least one keyword in the article to be matched comprises the following steps:

In this scheme, the titles, attributes and historical recommended articles of all objects under the category of the article to be matched are taken as materials. The materials are subjected to word segmentation and dependency syntactic analysis to mark the relationships between words, and main word-modifier word pairs and independent words are extracted from them, wherein the independent words are the other words that exist independently apart from the main words and the modifiers. A similar extraction operation is carried out on the article to be matched: the relationships between its words are marked by word segmentation and dependency syntactic analysis, and the main word-modifier word pairs and the corresponding independent words are extracted from it.
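The extraction can be sketched as follows, assuming that word segmentation and dependency parsing have already been performed by an external tool and that the parse is available as a list of tokens with head indices and relation labels. The `Token` type, the function name, the use of the Universal Dependencies label `amod` to identify modifiers, and the stop word filter are all assumptions; the text does not name a parser or a label set.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int
    word: str
    head: int       # index of the governing token, -1 for the root
    relation: str   # dependency relation to the head


def extract_terms(tokens, modifier_relation="amod", stopwords=frozenset()):
    """Split a parsed sentence into (main word, modifier) pairs and independent words."""
    by_index = {t.index: t for t in tokens}
    pairs = []
    used = set()
    for t in tokens:
        if t.relation == modifier_relation and t.head in by_index:
            pairs.append((by_index[t.head].word, t.word))
            used.update({t.index, t.head})
    # Independent words: everything outside the pairs that is not a stop word.
    independents = [
        t.word for t in tokens if t.index not in used and t.word not in stopwords
    ]
    return pairs, independents


# Hypothetical parse of "large screen supports bluetooth".
parsed = [
    Token(0, "large", 1, "amod"),
    Token(1, "screen", 2, "nsubj"),
    Token(2, "supports", -1, "root"),
    Token(3, "bluetooth", 2, "obj"),
]
pairs, independents = extract_terms(parsed, stopwords=frozenset({"supports"}))
```

The same function can be applied to the materials of each object and to the article to be matched, so that both sides are described with the same kinds of terms.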

Preferably, the predetermined interval is greater than 50% or less than 0.1%.

The invention also provides a system for matching objects for articles, each article corresponding to a respective category, the system comprising:

the object acquisition module is used for acquiring all objects in the category;

a selling point word extracting module used for extracting at least one selling point word of each object under the category;

the keyword extraction module is used for extracting at least one keyword in the article to be matched;

a keyword word vector obtaining module, configured to obtain a word vector corresponding to each keyword in the article to be matched, and record the word vector as a keyword word vector;

the term frequency-inverse document frequency (TF-IDF) calculation module is used for calculating TF-IDF of each keyword;

and the matching module is used for taking the objects with the highest similarity scores as final matching objects of the article to be matched.

the similarity calculation module includes:

the similarity score calculation module comprises:

and the total score calculating module is used for calculating the similarity score of the article to be matched and each object according to the corresponding W1*A1, V1, W2*A2*B and V2*C.

Preferably, the system further comprises:

Preferably, the selling point word extracting module is configured to use the title, the attribute, and the history recommended article of the object as a material, perform word segmentation and dependency parsing on the material, and mark a relationship between words to obtain at least one selling point word of each object in the category;

Preferably, the predetermined interval is greater than 50% or less than 0.1%.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the aforementioned method for matching objects for articles.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the aforementioned steps of the method of matching objects for an article.

The positive effects of the invention are as follows: the method, system, device and storage medium for matching objects for articles provided by the invention extract the hot words of all the objects under the category to which the article to be matched belongs, extract the keywords of the article to be matched, then perform similarity calculation using the word vectors of the extracted words, and, in combination with the TF-IDF of the keywords, determine the group of objects with the highest similarity to the article to be matched as the final matching objects. The invention automatically and accurately matches a group of objects to the article to be matched, can significantly reduce the time spent manually selecting objects for the article, and improves the efficiency of automatic copywriting generation.

Drawings

Fig. 1 is a flowchart of a method for matching objects for articles according to embodiment 1 of the present invention.

Fig. 2 is a schematic block diagram of a system for matching objects for articles according to embodiment 2 of the present invention.

Fig. 3 is a block diagram of the similarity calculation module in Fig. 2.

Fig. 4 is a block diagram of the similarity score calculation module in Fig. 2.

Fig. 5 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
