Method and device for selecting articles

Document No.: 749451. Publication date: 2021-04-23.

Reading note: this technology, 选择物品的方法和装置 (Method and device for selecting articles), was designed by Yue Junjie (岳俊杰) on 2019-10-23. Main content: The invention discloses a method and a device for selecting articles, and relates to the technical field of computers. One embodiment of the method comprises: obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text; determining the similarity between the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set; and, if the similarity is greater than a first preset value, selecting a target article from the article described by the first text and the article described by the second text according to a user attention attribute. The embodiment improves the user experience.

1. A method of selecting an item, comprising:

obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text;

determining the similarity of the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set;

and if the similarity is larger than a first preset value, selecting a target article from the articles described by the first text and the articles described by the second text according to the attention attribute of the user.

2. The method of claim 1, wherein the category to which the first text belongs is the same as the category to which the second text belongs;

before obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, the method comprises the following steps:

creating an item attribute library for the category;

and obtaining, by using the item attribute library of the category, a first attribute name set and attribute values of the first attribute name set according to a first text, and a second attribute name set and attribute values of the second attribute name set according to a second text.

3. The method of claim 2, wherein creating the library of item attributes for the category comprises:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring an attribute value of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute values of the key attribute name set.

4. The method according to claim 3, wherein obtaining a first attribute name set and attribute values of the first attribute name set according to a first text by using the item attribute library of the category comprises:

selecting a first attribute name set from the key attribute name set according to a first text;

acquiring pending attribute values of the first attribute name set from the first text according to the first attribute name set and the attribute values of the key attribute name set;

and performing stop-word removal, representation unification, and attribute value splitting on the pending attribute values of the first attribute name set to obtain the attribute values of the first attribute name set.

5. The method of claim 3, wherein obtaining the target attribute name set of the category comprises:

acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts in the category;

and obtaining a target attribute name set of the category according to the attribute name corresponding to the attribute with the occurrence frequency larger than a second preset value.

6. The method of claim 2, wherein determining the similarity of the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set comprises:

determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one same attribute name;

for each identical attribute name, acquiring an attribute value of the identical attribute name from the attribute value of the first attribute name set and the attribute value of the second attribute name set respectively, and performing similarity calculation on the acquired attribute values to obtain the similarity of the identical attribute names;

and fusing the obtained similarity, wherein the fusion result is used as the similarity of the article described by the first text and the article described by the second text.

7. The method according to claim 6, wherein performing similarity calculation on the obtained attribute values to obtain the similarity of the same attribute name comprises:

acquiring a plurality of positive examples of the category; wherein, the positive example comprises a plurality of articles, and the attribute value of each article is the same;

recombining the articles included in the positive examples to obtain a plurality of candidate negative examples, and deleting from the candidate negative examples those whose article attribute values have a mutual exclusion relationship, to obtain a plurality of negative examples of the category;

training an edit distance algorithm by adopting the plurality of positive examples of the category and the plurality of negative examples of the category to obtain the edit distance algorithm of the category;

and performing similarity calculation on the obtained attribute values by adopting the edit distance algorithm of the category to obtain the similarity of the same attribute name.

8. An apparatus for selecting an item, comprising:

the acquisition unit is used for acquiring a first attribute name set and attribute values of the first attribute name set according to a first text and acquiring a second attribute name set and attribute values of the second attribute name set according to a second text;

a processing unit, configured to determine, according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set, a similarity between the article described by the first text and the article described by the second text;

and the selecting unit is used for selecting a target article from the articles described by the first text and the articles described by the second text according to the attention attribute of the user if the similarity is greater than a first preset value.

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for selecting articles.

Background

Currently, the process of selecting an item includes: determining, by a neural network-based method, whether the articles described by different texts are similar, and, after determining that the articles are similar, selecting a target article from them.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

Neural network-based methods embed high-dimensional information into a low-dimensional space and represent it with a lower-dimensional vector. At present, no clear theory guarantees that the original high-dimensional information, or the details that distinguish it (such as the attribute values of articles), is preserved after the embedding. Consequently, the accuracy with which neural network-based methods determine whether articles are similar is low. Selecting a target article then becomes meaningless: even if one is selected, it does not meet the user's needs, and the user experience is poor.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for selecting an article, which can improve user experience.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of selecting an article.

The method for selecting the article comprises the following steps:

In one embodiment, the category to which the first text belongs is the same as the category to which the second text belongs;

before obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, the method comprises the following steps:

creating an item attribute library for the category;

In one embodiment, creating an item attribute library for the category comprises:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring an attribute value of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute values of the key attribute name set.

In one embodiment, obtaining a first attribute name set and attribute values of the first attribute name set according to a first text by using the item attribute library for the category includes:

selecting a first attribute name set from the key attribute name set according to a first text;

acquiring pending attribute values of the first attribute name set from the first text according to the first attribute name set and the attribute values of the key attribute name set;

In one embodiment, obtaining the set of target attribute names for the category includes:

acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts in the category;

and obtaining a target attribute name set of the category according to the attribute name corresponding to the attribute with the occurrence frequency larger than a second preset value.
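The frequency filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that an attribute "appears" in a title when any of its known values occurs there as a substring, and the function name `target_attribute_names`, the sample titles, and the threshold `min_count` (standing in for the second preset value) are all hypothetical.

```python
from collections import Counter


def target_attribute_names(values_by_name, titles, min_count):
    """Keep attribute names whose attribute appears in more than min_count titles.

    Assumption: an attribute counts as appearing in a title when any of its
    known values is a substring of that title.
    """
    counts = Counter()
    for name, values in values_by_name.items():
        for title in titles:
            if any(value in title for value in values):
                counts[name] += 1
    return {name for name, count in counts.items() if count > min_count}


# Hypothetical titles and candidate attribute names for one category.
titles = ["phone 64GB blue", "phone 128GB black", "phone red open edition"]
candidates = {
    "color": ["blue", "black", "red"],
    "version": ["64GB", "128GB"],
    "weight": ["150g"],
}
selected = target_attribute_names(candidates, titles, min_count=1)
```

Here `color` appears in three titles and `version` in two, so both pass the threshold of 1, while `weight` never appears and is dropped.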

In one embodiment, determining a similarity between the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set comprises:

determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one same attribute name;

and fusing the obtained similarity, wherein the fusion result is used as the similarity of the article described by the first text and the article described by the second text.
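A minimal sketch of the intersection-and-fusion step follows. The text does not state how the per-attribute similarities are fused, so the arithmetic mean used here is an assumption, as is returning 0.0 when the two items share no attribute name. The default `exact_match` comparison is only a placeholder for the per-category edit distance algorithm; all names are hypothetical.

```python
def exact_match(a, b):
    """Placeholder per-attribute similarity: 1.0 if the values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def item_similarity(first, second, value_similarity=exact_match):
    """Fuse the similarities of the attribute names shared by both items.

    Assumptions: fusion is the arithmetic mean, and items that share no
    attribute name get similarity 0.0.
    """
    shared = sorted(set(first) & set(second))  # intersection of attribute names
    if not shared:
        return 0.0
    scores = [value_similarity(first[name], second[name]) for name in shared]
    return sum(scores) / len(scores)


# Hypothetical attribute name -> attribute value mappings for two texts.
first = {"color": "blue", "version": "64GB", "type": "open edition"}
second = {"color": "blue", "version": "128GB"}
score = item_similarity(first, second)
```

The shared names are `color` and `version`; the colors match and the versions do not, so the mean is 0.5. A weighted mean would be an equally plausible fusion if some attributes matter more than others.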

In one embodiment, the performing similarity calculation on the obtained attribute values to obtain the similarity of the same attribute name includes:

acquiring a plurality of positive examples of the category; wherein, the positive example comprises a plurality of articles, and the attribute value of each article is the same;

and performing similarity calculation on the obtained attribute values by adopting the edit distance algorithm of the category to obtain the similarity of the same attribute name.

To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided an apparatus for selecting an article.

The device for selecting articles of the embodiment of the invention comprises:

In one embodiment, the category to which the first text belongs is the same as the category to which the second text belongs;

the acquisition unit is used for:

before obtaining a first attribute name set and an attribute value of the first attribute name set according to a first text, creating an article attribute library of the category;

In one embodiment, the acquisition unit is configured to:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring an attribute value of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute values of the key attribute name set.

In one embodiment, the acquisition unit is configured to:

selecting a first attribute name set from the key attribute name set according to a first text;

acquiring pending attribute values of the first attribute name set from the first text according to the first attribute name set and the attribute values of the key attribute name set;

In one embodiment, the acquisition unit is configured to:

acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts in the category;

and obtaining a target attribute name set of the category according to the attribute name corresponding to the attribute with the occurrence frequency larger than a second preset value.

In one embodiment, the processing unit is configured to:

determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one same attribute name;

and fusing the obtained similarity, wherein the fusion result is used as the similarity of the article described by the first text and the article described by the second text.

In one embodiment, the processing unit is configured to:

acquiring a plurality of positive examples of the category; wherein, the positive example comprises a plurality of articles, and the attribute value of each article is the same;

and performing similarity calculation on the obtained attribute values by adopting the edit distance algorithm of the category to obtain the similarity of the same attribute name.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.

An electronic device of an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for selecting an article provided by the embodiment of the invention.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a computer-readable medium.

A computer-readable medium of an embodiment of the present invention has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for selecting an article provided by an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits: a first attribute name set and attribute values of the first attribute name set are obtained according to a first text, and a second attribute name set and attribute values of the second attribute name set are obtained according to a second text; the similarity between the article described by the first text and the article described by the second text is determined according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set; and, if the similarity is greater than a first preset value, a target article is selected from the article described by the first text and the article described by the second text according to a user attention attribute. Because whether the articles described by different texts are similar is determined by whether their attribute values are similar, the accuracy of the similarity determination is improved, the selected target article better meets the user's needs, and the user experience is improved.

Further effects of the above optional implementations will be described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a main flow of a method of selecting an item according to an embodiment of the invention;

FIG. 2 is an application scenario of a method of selecting an item according to an embodiment of the present invention;

FIG. 3 is an example of attributes in a method of selecting an item according to an embodiment of the present invention;

FIG. 4 is an example of processing a first text in a method of selecting an item according to an embodiment of the present invention;

FIG. 5 is an example of a verification of a method of selecting an item according to an embodiment of the present invention;

FIG. 6 is another example of a verification of a method of selecting an item according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the main units of an apparatus for selecting items according to an embodiment of the present invention;

FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the era of information explosion, people are eager to obtain, from massive amounts of information, content that closely matches their own needs. To meet this need, various applications have appeared, such as search engines, automatic question-answering systems, document classification and clustering, document deduplication, and accurate document pushing; one of the key technologies in these application scenarios is text similarity calculation. Currently, text similarity calculation methods fall into four categories:

1 String-based methods

These methods start from the degree of string matching and use string co-occurrence and repetition as the measure of similarity. By calculation granularity, they can be divided into character-based and term-based methods. One class considers only the composition of the characters or words, for example edit distance, Hamming distance, cosine similarity, the Dice coefficient, and Euclidean distance. Another class additionally requires that both the character composition and the character order be the same or similar, for example Longest Common Substring (LCS) and Jaro-Winkler. A third class adopts a set-based view: a string is regarded as a set of words, and word co-occurrence is calculated from the intersection of the sets, for example N-gram, Jaccard, and the Overlap Coefficient.
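To make two of the measures named above concrete, the following sketch implements the standard Levenshtein edit distance (character granularity) and the Jaccard coefficient over word sets (term granularity). The normalization in `edit_similarity`, which divides by the longer string's length, is one common convention and is an assumption here; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map the distance to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def jaccard_words(a, b):
    """Jaccard coefficient of the whitespace-separated word sets of two strings."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0
```

For example, `edit_distance("kitten", "sitting")` is 3, and `jaccard_words("blue phone 64GB", "phone 64GB black")` is 0.5 because two of the four distinct words are shared.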

2 Corpus-based methods

Corpus-based methods compute text similarity using information obtained from a corpus. They can be divided into bag-of-words-based methods, neural network-based methods, and search engine-based methods. The first two use the set of documents to be compared as the corpus, while the last uses the Web as the corpus.

a) Bag-of-words-based methods

The Bag of Words (BOW) model is based on the distributional hypothesis, i.e., words that occur in similar contexts have similar meanings. The basic idea is to represent a document as a combination of terms, regardless of the order in which the terms appear in the document. In increasing order of semantic consideration, bag-of-words-based methods mainly include the Vector Space Model (VSM), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA).

b) Neural network-based methods

Computing text similarity from word vectors generated by a neural network model has been widely studied in natural language processing in recent years, and many models and tools for generating word vectors have been proposed, such as Word2Vec and GloVe. A word vector is essentially a low-dimensional real-valued vector trained from unlabeled, unstructured text. This representation places similar words closer together and also alleviates the curse of dimensionality and the lack of semantics that the bag-of-words model suffers from because it treats words as independent.

c) Search engine-based methods

Since Cilibrasi et al. proposed the Normalized Google Distance (NGD), methods that compute semantic similarity with a search engine have become popular. The basic principle is as follows: given search keywords x and y, the search engine returns the numbers of web pages f(x) and f(y) that contain x and y respectively, and the number of web pages f(x, y) that contain both x and y; the Google similarity distance is then calculated from these counts.
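For reference, the standard NGD definition can be sketched as below. The parameter `total_pages` (the total number of pages indexed by the search engine, usually written N) belongs to the standard definition but is not mentioned in the text above, and the hit counts in the example are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import math


def normalized_google_distance(f_x, f_y, f_xy, total_pages):
    """NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                   / (log N - min(log f(x), log f(y)))

    f_x, f_y: hit counts for each keyword; f_xy: hit count for both together;
    total_pages: N, the number of pages indexed (assumed known).
    """
    if min(f_x, f_y, f_xy) <= 0:
        return math.inf  # terms that never occur (together) are maximally distant
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total_pages) - min(log_x, log_y))


# Made-up hit counts for illustration only.
distance = normalized_google_distance(f_x=100, f_y=100, f_xy=10, total_pages=10000)
```

A distance of 0 means the two keywords always occur together; larger values mean they co-occur less often than their individual frequencies would suggest.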

3 World knowledge-based methods

World knowledge-based methods calculate text similarity using a knowledge base with a standardized organizational system. They generally fall into two types: ontology-based and network knowledge-based. The former generally uses the hypernym-hyponym and coordinate relationships between concepts in an ontology; if two concepts are semantically similar, there is one and only one path between them. In network knowledge, the entries are structured and the hypernym-hyponym relationships among entries are expressed through hyperlinks, so this way of organizing information is closer to what a computer can understand. The paths between concepts or the links between entries become the basis for text similarity calculation.

4 Other methods

Other methods include syntactic analysis and hybrid methods. Syntactic analysis parses the structure of a sentence and is a kind of semantic analysis; because it does not depend on a corpus or world knowledge, it is classified under other methods. A hybrid method is a combination of several of the above methods.

Determining whether the items described by different texts are similar covers similarity at the SKU level and similarity at the SPU level; the present invention is primarily directed to the SKU-level scenario. Applying the above text similarity calculation methods to this scenario raises the following problems:

1. The main problems of the string-based method are as follows:

The string-based method compares texts at the literal level, with the original text as the text representation. It is simple in principle, easy to implement, and serves as the computational basis of other methods. Its disadvantage is that characters or words are treated as independent units of knowledge, and neither the meaning of a word nor the relationships between words are considered. Take synonyms as an example: they have the same meaning despite being expressed differently, and a string-based method cannot accurately calculate the similarity of such words. Thus, the user experience is poor.

2. The corpus-based method has the following difficulties:

1) The bag-of-words-based method has two problems. First, it calculates similarity from the feature items in a text; when there are many feature items, the resulting high-dimensional sparse matrix makes the calculation inefficient. Second, the vector space model assumes that the feature items extracted from a text are unrelated to one another, which is inconsistent with how text expresses meaning. Thus, the accuracy of determining whether items are similar is low, and the user experience is poor.

2) The neural network based approach has been introduced in the background art and will not be described herein.

3) The biggest defect of the similarity calculation method based on the search engine is that the calculation result completely depends on the query effect of the search engine, and the similarity is different from search engine to search engine. Currently, common search engines do not support the same item search to return description information of the same item for all merchants. Thus, this method is not suitable for application to the above-described scenarios.

3. The disadvantages of the world knowledge based approach:

the text similarity calculation method based on network knowledge mostly utilizes page links or hierarchical structures, and can better reflect the semantic relation of entries. But the disadvantages are that: the information completeness difference between the entries is large, and the calculation accuracy cannot be guaranteed; the network knowledge is generated in a mode of mass participation, so that the text lacks certain expertise. Thus, this method is not suitable for application to the above-described scenarios.

4. The major disadvantages of the other methods:

The key to syntactic analysis is to find the dependency or semantic relationships among the parts of a sentence and to consider both word similarity and relationship similarity when calculating similarity. The method therefore captures richer semantics, but the complexity of sentences makes the analysis difficult and laborious. The obvious difference between an item description and conventional text is that an item description is relatively structured information without a strict syntactic structure. For example, a title is an almost arbitrary arrangement of attribute values; changing their order does not affect the meaning but greatly affects the matching result. Thus, this method is not suitable for application to the above-described scenario.

In order to solve the problems in the prior art, an embodiment of the present invention provides a method for selecting an article, as shown in fig. 1, the method including:

step S101, obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text.

The detailed description of the step is provided below and will not be repeated herein.

Step S102, determining the similarity of the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set.

The detailed description of the step is provided below and will not be repeated herein.

Step S103, if the similarity is greater than a first preset value, selecting a target article from the article described by the first text and the article described by the second text according to the user attention attribute.

In this embodiment, the first preset value may be set as required, for example, to 0.8. It should be noted that the user attention attribute may be value, appearance, taste, performance, quality, or the like. A specific example of selecting the target item: from the item described by the first text and the item described by the second text, the item with the lower value is selected as the target item.
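Step S103 can be sketched as follows, assuming each item is a dictionary that carries a numeric value for the user attention attribute. The function `select_target`, the `prefer_lower` flag, and the sample items are hypothetical; the default threshold of 0.8 follows the example value given above.

```python
def select_target(first_item, second_item, similarity, attention_attribute,
                  threshold=0.8, prefer_lower=True):
    """Return the preferred item when the two items are similar enough, else None.

    threshold plays the role of the first preset value; the similarity must be
    strictly greater than it. prefer_lower is an assumption covering the
    "lower value" example.
    """
    if similarity <= threshold:
        return None
    pick = min if prefer_lower else max
    return pick((first_item, second_item), key=lambda item: item[attention_attribute])


# Hypothetical items described by the first and second texts.
first_item = {"id": "A", "value": 5999}
second_item = {"id": "B", "value": 5799}
target = select_target(first_item, second_item,
                       similarity=0.92, attention_attribute="value")
```

With a similarity of 0.92 the items are treated as the same article, and item B is selected because its value is lower. A similarity of exactly 0.8 would return `None`, since the condition requires the similarity to exceed the preset value.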

It should be understood that the first text and the second text are two different texts, and each text may be an item detail page. The article can be a computer, a mobile phone, clothes, food, or the like. If the item is a food, the attribute names may be food type, delivery time, service attitude rating, food flavor, and the like.

In the embodiment of the invention, the category to which the first text belongs is the same as the category to which the second text belongs;

before step S101, the method includes:

creating an item attribute library for the category;

step S101 may include:

In this embodiment, an item attribute library is created and used to obtain the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set. Because whether the articles described by different texts are similar is determined by whether their attribute values are similar, the accuracy of the similarity determination is improved, the selected target article better meets the user's needs, and the user experience is further improved.

In the embodiment of the present invention, as shown in fig. 2, creating the item attribute library of the category includes:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring an attribute value of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute values of the key attribute name set.

In this embodiment, normalizing the target attribute name set includes: if the overlap ratio of the attribute values of any two target attribute names in the target attribute name set is greater than a preset ratio, the two target attribute names are regarded as the same attribute name, and one of them is deleted. The preset ratio may be set as required, for example, to 0.6.
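The normalization rule can be sketched as below. The text does not define how the overlap ratio is computed, so this sketch assumes it is the size of the intersection of the two value sets divided by the size of the smaller set; it also assumes that, of two names judged identical, the one seen first is kept. The function name and sample data are hypothetical.

```python
def normalize_attribute_names(values_by_name, preset_ratio=0.6):
    """Merge attribute names whose value sets overlap by more than preset_ratio.

    Assumptions: overlap = |A & B| / min(|A|, |B|), and the first-seen name
    of a duplicate pair is the one that is kept.
    """
    kept = {}
    for name, values in values_by_name.items():
        values = set(values)
        duplicate = False
        for kept_values in kept.values():
            smaller = min(len(values), len(kept_values))
            if smaller and len(values & kept_values) / smaller > preset_ratio:
                duplicate = True
                break
        if not duplicate:
            kept[name] = values
    return kept


# Hypothetical target attribute names with their observed values.
target_names = {
    "color": ["blue", "black", "white"],
    "colour": ["blue", "black", "red"],
    "version": ["64GB", "128GB"],
}
key_attributes = normalize_attribute_names(target_names)
```

Here `colour` shares two of its three values with `color` (a ratio of about 0.67, above 0.6), so it is treated as the same attribute name and dropped, leaving `color` and `version` as the key attribute names.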

The number of the target attribute names in the target attribute name set is generally 5, and of course, the number can be adjusted at any time according to requirements.

As shown in fig. 3, the key attribute name set includes color, version, and type. The attribute values of the key attribute name set include the attribute values of color, the attribute values of version, and the attribute values of type. The attribute values of color include blue, black, white, yellow, coral, and red; the attribute values of version include 64GB, 128GB, and 256GB; the attribute values of type include open edition, mobile 4G preferred edition, and annualized repair.

It should be understood that the attribute values of the key attribute name set may be obtained from all the texts under the category, or may be obtained by searching on a search engine.

In addition, the article attribute library for each category can be created in the manner provided by the embodiment of the invention.

In this embodiment of the present invention, as shown in fig. 4, obtaining the first attribute name set and the attribute values of the first attribute name set according to the first text by using the item attribute library of the category includes:

selecting a first attribute name set from the key attribute name set according to a first text;

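acquiring pending attribute values of the first attribute name set from the first text according to the first attribute name set and the attribute values of the key attribute name set;

and performing stop word removal, representation unification, and attribute value splitting on the pending attribute values of the first attribute name set to obtain the attribute values of the first attribute name set.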

In a specific implementation of this embodiment, if a key attribute name appears in the first text, that key attribute name is used as a first attribute name.
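By way of illustration only, selecting the first attribute name set and the pending attribute values from a text can be sketched as follows. The library content is taken from the example of fig. 3. The sketch assumes a key attribute name is selected when either the name itself or one of its known attribute values occurs in the text; the matching is plain substring matching, and the function name and sample text are hypothetical.

```python
# Key attribute names and their attribute values, following the example of fig. 3.
KEY_ATTRIBUTES = {
    "color": ["blue", "black", "white", "yellow", "coral", "red"],
    "version": ["64GB", "128GB", "256GB"],
}


def extract_attributes(text, key_attributes):
    """Return {attribute name: [pending attribute values found in the text]}."""
    lowered = text.lower()
    found = {}
    for name, values in key_attributes.items():
        # Naive substring matching; a real system would need tokenization.
        hits = [value for value in values if value.lower() in lowered]
        # Assumption: the name is selected if the name or one of its values occurs.
        if hits or name.lower() in lowered:
            found[name] = hits
    return found


first_text = "Phone X 128GB black open edition"  # hypothetical first text
first_attributes = extract_attributes(first_text, KEY_ATTRIBUTES)
print(first_attributes)
```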

As shown in fig. 2 and 4, the item attribute library of the category may include a stop word library, a replacement rule library, a hypernym-hyponym relationship library, a synonym library, a mutual exclusion library, and the like.

The stop word library is used for removing stop words; the stop words include, for example, "other" and "high-grade".

The hypernym-hyponym relationship library and the synonym library are used for unifying the representation. The processing of the synonym library is described with a specific example: the attribute value terylene is unified into polyester fiber. To ensure the effect of the embodiment of the invention, the synonym library should cover attribute values as completely as possible. The hypernym-hyponym relationship library unifies attribute values into the lowest-level expression. For example, in the air conditioner category, the attribute value home appliance is unified into air conditioner.

The replacement rule base is used for attribute value splitting. For example, the attribute value of the mobile phone network type, namely, the full-network 4G, is split into the mobile 4G, the unicom 4G and the telecom 4G.

The mutual exclusion library is used for deleting negative examples in which a mutual exclusion relationship exists between the attribute values of the articles. For example, the mutual exclusion library includes a first article and a second article whose attribute values are mutually exclusive; if a negative example includes both the first article and the second article, the negative example is deleted directly.

It should be noted that, through the item attribute library of the category, the complexity of determining whether items described in different texts are similar is reduced.

In this embodiment, the first attribute name set is selected from the key attribute name set according to the first text, the pending attribute values of the first attribute name set are obtained from the first text, and stop word removal, representation unification, and attribute value splitting are performed on the pending attribute values to obtain the attribute values of the first attribute name set. Whether the attribute values of the articles are similar is thereby determined, and from this whether the articles described by different texts are similar. This improves the accuracy of the similarity determination, makes the selected target article more consistent with the requirements of the user, and further improves the user experience. In addition, unifying the representation solves the problem of poor user experience when articles are selected by a character-string-based method.
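By way of illustration only, the three processing steps applied to a pending attribute value can be sketched as follows. The library entries are the examples given in the text; the dictionary layout, the names, and the order of the steps are assumptions of this sketch.

```python
STOP_WORDS = ["high-grade", "other"]                  # stop word library
SYNONYMS = {"terylene": "polyester fiber"}            # synonym library
HYPERNYM_TO_HYPONYM = {"home appliance": "air conditioner"}  # per-category mapping
REPLACEMENT_RULES = {                                 # replacement rule library
    "full-network 4G": ["mobile 4G", "unicom 4G", "telecom 4G"],
}


def clean_value(pending_value):
    """Turn one pending attribute value into a list of normalized attribute values."""
    value = pending_value
    for word in STOP_WORDS:
        # Naive substring removal; it would also hit words that contain a stop word.
        value = value.replace(word, "")
    value = " ".join(value.split())                   # tidy leftover whitespace
    value = SYNONYMS.get(value, value)                # unify synonyms
    value = HYPERNYM_TO_HYPONYM.get(value, value)     # unify to lowest-level term
    return REPLACEMENT_RULES.get(value, [value])      # split where a rule applies


print(clean_value("high-grade terylene"))
print(clean_value("full-network 4G"))
```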

In the embodiment of the present invention, obtaining the second attribute name set and the attribute values of the second attribute name set according to the second text by using the item attribute library of the category includes:

selecting a second attribute name set from the key attribute name set according to the second text;

acquiring pending attribute values of the second attribute name set from the second text according to the second attribute name set and the attribute values of the key attribute name set;

and performing stop word removal, representation unification, and attribute value splitting on the pending attribute values of the second attribute name set to obtain the attribute values of the second attribute name set.

It should be understood that obtaining the second attribute name set and its attribute values is similar to obtaining the first attribute name set and its attribute values; for the specific implementation, reference may be made to the previous embodiment, which is not repeated here.

In the embodiment of the present invention, obtaining the target attribute name set of the category includes:

acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts in the category;

and obtaining a target attribute name set of the category according to the attribute name corresponding to the attribute with the occurrence frequency larger than a second preset value.

In this embodiment, the second preset value is set as required, for example, 50. If necessary, the obtained target attribute name set of the category is manually rechecked, and the key attribute name set is obtained from the rechecked target attribute name set of the category.
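By way of illustration only, the frequency-based selection of target attribute names can be sketched as follows. The text does not specify how an attribute is detected in a title; this sketch assumes an attribute appears in a title when one of its candidate values occurs there as a substring. The function name and sample data are hypothetical.

```python
def target_attribute_names(candidates, titles, second_preset_value=50):
    """Keep attribute names whose attribute appears in more titles than the threshold."""
    selected = []
    for name, values in candidates.items():
        # Count the titles in which at least one value of this attribute occurs.
        count = sum(
            1
            for title in titles
            if any(value.lower() in title.lower() for value in values)
        )
        if count > second_preset_value:
            selected.append(name)
    return selected


# Hypothetical candidates and titles for one category.
candidates = {"color": ["black", "white"], "material": ["leather"]}
titles = ["Phone black 64GB"] * 60 + ["Phone white leather case"] * 10
selected = target_attribute_names(candidates, titles)
print(selected)
```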

In this embodiment, the target attribute name set of the category is obtained through the above process, and the number of the target attribute names in the target attribute name set is small and has relevance, so that the problem of poor user experience in selecting an article based on the vector space model is solved.

In this embodiment of the present invention, step S102 may include:

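determining an intersection of the first attribute name set and the second attribute name set, the intersection comprising at least one same attribute name;

for each same attribute name, acquiring the attribute value of the same attribute name from the attribute values of the first attribute name set and from the attribute values of the second attribute name set, and performing similarity calculation on the acquired attribute values to obtain the similarity of the same attribute name;

and fusing the obtained similarities, the fusion result being used as the similarity of the article described by the first text and the article described by the second text.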

In this embodiment, a specific example is described below:

the first attribute name set includes an attribute name one, an attribute name two, and an attribute name three.

The second attribute name set includes an attribute name one, an attribute name two, and an attribute name four.

Thus, the intersection includes: attribute name one and attribute name two.

For attribute name one, the attribute value of attribute name one, namely attribute value one, is acquired from the attribute values of the first attribute name set; and the attribute value of attribute name one, namely attribute value two, is acquired from the attribute values of the second attribute name set.

For attribute name two, the attribute value of attribute name two, namely attribute value three, is acquired from the attribute values of the first attribute name set; and the attribute value of attribute name two, namely attribute value four, is acquired from the attribute values of the second attribute name set.

Similarity calculation is performed on attribute value one and attribute value two to obtain the similarity of attribute name one;

similarity calculation is performed on attribute value three and attribute value four to obtain the similarity of attribute name two;

and the similarity of attribute name one and the similarity of attribute name two are fused, the fusion result being used as the similarity of the article described by the first text and the article described by the second text.

In a specific implementation, a logistic regression classification model or a vector space model may be adopted to fuse the obtained similarities.
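By way of illustration only, the intersection, per-attribute similarity, and fusion can be sketched as follows. The per-attribute similarity here is a stand-in (`difflib.SequenceMatcher` from the Python standard library) rather than the edit distance algorithm of the category, and the fusion is the logistic function applied to a weighted sum, which is the prediction form of a logistic regression model. The weights and bias are hypothetical placeholders for parameters that would be learned from positive and negative examples.

```python
import math
from difflib import SequenceMatcher


def attribute_similarity(value_a, value_b):
    """Stand-in string similarity in [0, 1]; not the category's trained algorithm."""
    return SequenceMatcher(None, value_a, value_b).ratio()


def item_similarity(first, second, weights, bias):
    """Fuse the similarities of the shared attribute names with a logistic function."""
    shared = sorted(set(first) & set(second))  # intersection of attribute names
    if not shared:
        return 0.0  # assumption: no shared attribute name means not similar
    z = bias + sum(
        weights.get(name, 1.0) * attribute_similarity(first[name], second[name])
        for name in shared
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical attribute values and hypothetical (not learned) parameters.
first = {"color": "black", "version": "128GB", "type": "open edition"}
second = {"color": "black", "version": "128GB", "network": "mobile 4G"}
other = {"color": "white", "version": "64GB"}
weights, bias = {"color": 2.0, "version": 3.0}, -2.5

score_same = item_similarity(first, second, weights, bias)
score_diff = item_similarity(first, other, weights, bias)
print(score_same, score_diff)
```

The fused score would then be compared with the first preset value to decide whether the two texts describe similar articles.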

When a logistic regression classification model is adopted for fusion, the accuracy and recall of the similarity are as shown in fig. 5; when a vector space model is adopted for fusion, the accuracy and recall of the similarity are as shown in fig. 6. In fig. 5 and 6, the abscissa represents the similarity and the ordinate represents the accuracy and recall. As can be seen from fig. 5 and 6, the accuracy and recall of the embodiment of the present invention are good; in addition, fusion with the vector space model gives the better effect.

In this embodiment, similarity calculation is performed on the attribute values to obtain similarity of the same attribute name, the obtained similarities are fused, and the fusion result is used as the similarity between the article described in the first text and the article described in the second text. Therefore, whether the attribute values of the articles are similar or not is determined, whether the articles described by different texts are similar or not is further determined, the accuracy of determining whether the articles are similar or not is improved, the selected target article is more consistent with the requirements of the user, and the user experience degree is further improved.

In the embodiment of the present invention, the calculating the similarity of the obtained attribute values to obtain the similarity of the same attribute name includes:

acquiring a plurality of positive examples of the category; wherein, the positive example comprises a plurality of articles, and the attribute value of each article is the same;

and performing similarity calculation on the obtained attribute values by adopting the edit distance algorithm of the category to obtain the similarity of the same attribute name.

In this embodiment, the multiple positive examples of the category are manually labeled, and the server applied in the embodiment of the present invention obtains the multiple positive examples of the category by a manual input method.

The following describes a plurality of positive examples of the category by a specific example:

(A₁₁, A₁₂, …, B₁₁, B₁₂, …);

(A₂₁, A₂₂, …, B₂₁, B₂₂, …);

…;

(Aₘ₁, Aₘ₂, …, Bₘ₁, Bₘ₂, …).

Each pair of brackets represents one positive example, and all articles in a positive example are the same article. A represents the Jingdong platform and B represents other platforms; the first subscript represents the article number, and the second subscript represents the merchant to which the article belongs. It should be noted that the number of articles in different positive examples may be the same or different.

The articles included in the positive examples are recombined by crossing. For example, A₁₁ in (A₁₁, A₁₂, …, B₁₁, B₁₂, …) is interchanged with an article in (A₂₁, A₂₂, …, B₂₁, B₂₂, …) to obtain two negative examples. In addition, the crossing may be random crossing.
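By way of illustration only, random crossing of two positive examples can be sketched as follows. The article identifiers mirror the notation above; the function name and the choice to swap exactly one article per crossing are assumptions of this sketch.

```python
import random


def cross_positive_examples(positives, rng):
    """Swap one article between two randomly chosen positive examples.

    Returns the two resulting negative examples.
    """
    i, j = rng.sample(range(len(positives)), 2)      # two distinct positive examples
    first, second = list(positives[i]), list(positives[j])
    a, b = rng.randrange(len(first)), rng.randrange(len(second))
    first[a], second[b] = second[b], first[a]        # interchange one article each
    return tuple(first), tuple(second)


positives = [
    ("A11", "A12", "B11", "B12"),  # article 1 on several platforms and merchants
    ("A21", "A22", "B21", "B22"),  # article 2
]
negatives = cross_positive_examples(positives, random.Random(0))
print(negatives)
```

Each returned tuple now mixes two different articles, which is why it serves as a negative example.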

The following describes a negative example of deleting an attribute value of an item with a mutual exclusion relationship by using a specific example:

Example 1: a negative example includes articles of different brands. The negative example is deleted by using the mutual exclusion library of the category.

Example 2: a negative example includes 3 articles, of which 1 article is for women and the other 2 articles are for men. The negative example is deleted by using the mutual exclusion library of the category.
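By way of illustration only, deleting negative examples with mutually exclusive attribute values can be sketched as follows. The representation of the mutual exclusion library as unordered pairs of attribute values per attribute name, the brand names, and the function names are assumptions of this sketch.

```python
from itertools import combinations

# Hypothetical mutual exclusion library: per attribute name, pairs of values
# that cannot belong to the same article.
MUTEX_LIBRARY = {
    "gender": {frozenset(("women", "men"))},
    "brand": {frozenset(("BrandA", "BrandB"))},
}


def has_mutex_conflict(example):
    """True if any two articles in the example carry mutually exclusive values."""
    for first, second in combinations(example, 2):
        for name, pairs in MUTEX_LIBRARY.items():
            if name in first and name in second:
                if frozenset((first[name], second[name])) in pairs:
                    return True
    return False


def filter_negative_examples(negative_examples):
    """Delete the negative examples that contain a mutual exclusion conflict."""
    return [ex for ex in negative_examples if not has_mutex_conflict(ex)]


negative_examples = [
    [{"gender": "women"}, {"gender": "men"}, {"gender": "men"}],  # as in Example 2
    [{"gender": "men", "color": "black"}, {"gender": "men", "color": "white"}],
]
kept = filter_negative_examples(negative_examples)
print(len(kept))
```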

In addition, this embodiment can improve the classification accuracy of the positive examples and the negative examples. For example, the positive-example accuracy rises from 33.07% to 93.85%, while the negative-example accuracy changes from 94.35% to 87.69%.

It should be noted that the edit distance algorithm may be replaced by term frequency-inverse document frequency (TF-IDF). Dedupe (an open source framework for entity matching) offers two text similarity methods: one is the edit distance algorithm, and the other is TF-IDF.

It should be appreciated that determining the similarity of the first text-described item to the second text-described item is essentially a matter of two text string similarity calculations.
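By way of illustration only, a string similarity based on edit distance can be sketched as follows. This is the plain Levenshtein distance, not the edit distance algorithm of the category obtained by training, and normalizing by the longer string length is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("polyester fiber", "polyester fibre"))
```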

Because the attribute names and attribute values of articles may be synonymous, missing, ambiguous, or even multi-valued, the cost of labeling positive examples is high.

The true number of positive examples, the predicted number of positive examples, the true number of negative examples, and the predicted number of negative examples are shown in table 1:

TABLE 1 quantitative relationship table

The accuracy of the positive examples is 0.98% (0.98% = 9900/(9900 + 1000000)), the recall of the positive examples is 99% (99% = 9900/10000), and the accuracy of the negative examples is 99%.
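By way of illustration only, the arithmetic behind these figures can be checked as follows. The three counts are the ones appearing in the formulas above; reading 1000000 as the number of negative examples predicted as positive is an interpretation of the formula, and the variable names are hypothetical.

```python
true_positives = 9_900        # positive examples predicted as positive
false_positives = 1_000_000   # negative examples predicted as positive (as read from the formula)
actual_positives = 10_000     # true number of positive examples

# Accuracy of the positive examples: correct share of everything predicted positive.
positive_accuracy = true_positives / (true_positives + false_positives)
# Recall of the positive examples: share of the true positives that were found.
positive_recall = true_positives / actual_positives

print(f"accuracy = {positive_accuracy:.2%}, recall = {positive_recall:.2%}")
```

The very low accuracy despite the high recall follows directly from the negative examples vastly outnumbering the positive examples.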

In summary, in the application scenario of the embodiment of the present invention, the number of negative examples is significantly higher than the number of positive examples, so that the positive-example predictions are of almost no use in practice. Thus, to ensure the user experience, the number of positive examples and the number of negative examples should be balanced. Deleting the negative examples in which the attribute values of the articles are mutually exclusive reduces the number of negative examples and ensures their quality.

In this embodiment, the negative examples of the category are obtained by deleting the negative examples of the item attribute value having the mutual exclusion relationship. Thereby reducing the number of negative examples and improving the quality. And training the edit distance algorithm by adopting a plurality of positive examples of the categories and a plurality of negative examples of the categories to obtain the edit distance algorithm of the categories. And similarity calculation is performed by adopting a category edit distance algorithm, so that the similarity calculation accuracy is higher, the selected target object is more consistent with the user requirement, and the user experience is further improved.

A method of selecting an article is described above in connection with fig. 1-6, and an apparatus for selecting an article is described below in connection with fig. 7.

In order to solve the problems in the prior art, an embodiment of the present invention provides an apparatus for selecting an article, as shown in fig. 7, the apparatus including:

the obtaining unit 701 is configured to obtain a first attribute name set and an attribute value of the first attribute name set according to a first text, and obtain a second attribute name set and an attribute value of the second attribute name set according to a second text.

A processing unit 702, configured to determine, according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set, the similarity between the article described by the first text and the article described by the second text.

A selecting unit 703 is configured to select, according to the user attention attribute, a target article from the articles described in the first text and the articles described in the second text if the similarity is greater than a first preset value.

In the embodiment of the invention, the category to which the first text belongs is the same as the category to which the second text belongs;

the obtaining unit 701 is configured to:

before obtaining a first attribute name set and an attribute value of the first attribute name set according to a first text, creating an article attribute library of the category;

In this embodiment of the present invention, the obtaining unit 701 is configured to:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring an attribute value of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute values of the key attribute name set.

In this embodiment of the present invention, the obtaining unit 701 is configured to:

selecting a first attribute name set from the key attribute name set according to a first text;

acquiring an undetermined attribute value of the first attribute name set from the first text according to the attribute values of the first attribute name set and the key attribute name set;

In this embodiment of the present invention, the obtaining unit 701 is configured to:

acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts in the category;

and obtaining a target attribute name set of the category according to the attribute name corresponding to the attribute with the occurrence frequency larger than a second preset value.

In this embodiment of the present invention, the processing unit 702 is configured to:

determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one same attribute name;

and fusing the obtained similarity, wherein the fusion result is used as the similarity of the article described by the first text and the article described by the second text.

In this embodiment of the present invention, the processing unit 702 is configured to:

acquiring a plurality of positive examples of the category; wherein, the positive example comprises a plurality of articles, and the attribute value of each article is the same;

and performing similarity calculation on the obtained attribute values by adopting the edit distance algorithm of the category to obtain the similarity of the same attribute name.

It should be understood that the functions performed by the components of the device for selecting an article according to the embodiments of the present invention have been described in detail in the method for selecting an article according to the above embodiments, and are not described in detail herein.

Fig. 8 illustrates an exemplary system architecture 800 of a method of selecting an item or an apparatus for selecting an item to which embodiments of the invention may be applied.

As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 805 may be a server that provides various services, for example, a back-end management server that supports shopping websites browsed by users using the terminal devices 801, 802, 803 (by way of example only). The back-end management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.

It should be noted that the method for selecting an item provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the apparatus for selecting an item is generally disposed in the server 805.

It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a processing unit, and a selection unit. For example, the selection unit may be further described as a unit that selects a target item from the items described in the first text and the items described in the second text according to the user attention attribute if the similarity is greater than a first preset value.

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text; determining the similarity of the article described by the first text and the article described by the second text according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set; and if the similarity is greater than a first preset value, selecting a target article from the article described by the first text and the article described by the second text according to the user attention attribute.

According to the technical solution of the embodiment of the invention, a first attribute name set and attribute values of the first attribute name set are obtained according to a first text, and a second attribute name set and attribute values of the second attribute name set are obtained according to a second text; the similarity of the article described by the first text and the article described by the second text is determined according to the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set; and if the similarity is greater than a first preset value, a target article is selected from the article described by the first text and the article described by the second text according to the user attention attribute. Whether the attribute values of the articles are similar is determined, and from this whether the articles described by different texts are similar, so that the accuracy of the similarity determination is improved, the selected target article is more consistent with the requirements of the user, and the user experience is improved.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
