Object keyword extraction method and device

Document No.: 1953441  Publication date: 2021-12-10  Views: 13  Language: Chinese

Reading note: this technology, "Object keyword extraction method and device" (对象的关键词提取方法及装置), was designed and created by 王艳花, 张晓辉, 李志鹏, 李瑶 and 张光宇 on 2020-09-03. Its main content is as follows: an embodiment of the present application provides a method and an apparatus for extracting keywords of an object, the method comprising: acquiring text information corresponding to a first object, wherein the text information is used for describing the first object; determining a plurality of candidate keywords corresponding to the first object according to the text information; and determining at least one keyword of the first object among the candidate keywords according to the similarity between the candidate keywords and the probability that each candidate keyword is a keyword. Determining the candidate keywords corresponding to the first object from the text information allows candidate keywords to be extracted from the text information automatically, quickly and efficiently, and filtering the candidate keywords according to their pairwise similarities and their probabilities of being keywords guarantees the accuracy of the finally determined keywords of the first object.

1. A method for extracting a keyword of an object, comprising:

acquiring text information corresponding to a first object, wherein the text information is used for describing the first object;

determining a plurality of candidate keywords corresponding to the first object according to the text information;

and determining at least one keyword of the first object in the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are keywords.

2. The method of claim 1, wherein determining a plurality of candidate keywords corresponding to the first object according to the text information comprises:

processing the text information through a first model to obtain a plurality of candidate keywords;

the first model is obtained by learning a plurality of groups of samples, each group of samples comprises sample text information and sample candidate keywords, and the plurality of groups of samples are generated by a second model.

3. The method of claim 2, wherein the process of generating the plurality of sets of samples by the second model comprises:

acquiring the sample text information;

performing word segmentation processing on the sample text information through the second model to obtain a plurality of sample words and the probability that each sample word is a keyword;

and determining sample candidate keywords in the plurality of sample words according to the probability that each sample word is a keyword, wherein the probability that the sample candidate keywords are keywords is greater than a first threshold value.

4. The method according to any of claims 1-3, wherein determining at least one keyword of the first object among the plurality of candidate keywords based on the similarity of the plurality of candidate keywords and the probability that the candidate keyword is a keyword comprises:

for each two candidate keywords in the plurality of candidate keywords, judging whether the similarity between the two candidate keywords is greater than a preset threshold value;

if so, combining the two candidate keywords into one keyword according to the probability that the two candidate keywords are the keywords respectively;

and if not, determining the two candidate keywords as the keywords of the first object.

5. The method of claim 4, wherein merging the two candidate keywords into one keyword according to the probability that the two candidate keywords are keywords, respectively, comprises:

and merging the two candidate keywords into a target keyword, wherein the target keyword is a keyword with higher probability of being a keyword in the two candidate keywords.

6. The method according to any one of claims 1-5, wherein determining a plurality of candidate keywords corresponding to the first object according to the text information comprises:

sentence division processing is carried out on the text information to obtain a plurality of short sentences;

determining whether each short sentence comprises a keyword or not through a binary classification model, and determining the short sentence comprising the keyword as a target short sentence to obtain at least one target short sentence;

performing word segmentation processing on each target short sentence to obtain a plurality of first words;

filtering stop words from the plurality of first words to obtain a plurality of second words;

and performing keyword prediction processing on the plurality of second words to obtain a plurality of candidate keywords.

7. The method of any of claims 1-6, wherein the first model is a pointer generation network;

the output layer of the pointer generation network comprises a generation probability, and the generation probability is used for indicating the probability that the next output word of the decoder at each time step is from a preset word list; and

the attention distribution function of the pointer generation network includes a coverage factor.

8. The method according to any of claims 1-6, wherein the text information comprises at least one of:

network data corresponding to the first object, wherein the network data comprises description information of the first object;

and data in a detail page, wherein the detail page is a web page introducing the first object.

9. An apparatus for extracting a keyword of an object, comprising:

the acquisition module is used for acquiring text information corresponding to a first object, and the text information is used for describing the first object;

the determining module is used for determining a plurality of candidate keywords corresponding to the first object according to the text information;

the determining module is further configured to determine at least one keyword of the first object among the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are keywords.

10. An apparatus for extracting a keyword of an object, comprising:

a memory for storing a program;

a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 8 when the program is executed.

11. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 8.

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for extracting keywords of an object.

Background

At present, online shopping becomes a very important shopping mode, and keywords of a commodity can be provided on a graphical user interface so that a user can quickly know characteristics of the commodity.

The extraction of the keywords of the goods is particularly important, and in the prior art, when the keywords of the goods are extracted, the keywords of the goods submitted by a seller are usually received, the submitted keywords are manually checked, and the keywords which pass the checking are used as the keywords to be displayed.

However, depending on the implementation of manual submission to obtain the keywords of the goods, the operation of obtaining the keywords may be inefficient.

Disclosure of Invention

The embodiment of the application provides a method and a device for extracting keywords of an object, so as to overcome the problem of low operation efficiency of obtaining the keywords.

In a first aspect, an embodiment of the present application provides a method for extracting a keyword of an object, including:

acquiring text information corresponding to a first object, wherein the text information is used for describing the first object;

determining a plurality of candidate keywords corresponding to the first object according to the text information;

and determining at least one keyword of the first object in the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are keywords.

In one possible design, determining a plurality of candidate keywords corresponding to the first object according to the text information includes:

processing the text information through a first model to obtain a plurality of candidate keywords;

the first model is obtained by learning a plurality of groups of samples, each group of samples comprises sample text information and sample candidate keywords, and the plurality of groups of samples are generated by a second model.

In one possible design, the process of generating the plurality of sets of samples by the second model includes:

acquiring the sample text information;

performing word segmentation processing on the sample text information through the second model to obtain a plurality of sample words and the probability that each sample word is a keyword;

and determining sample candidate keywords in the plurality of sample words according to the probability that each sample word is a keyword, wherein the probability that the sample candidate keywords are keywords is greater than a first threshold value.

In one possible design, determining at least one keyword of the first object among the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keyword is the keyword comprises:

for each two candidate keywords in the plurality of candidate keywords, judging whether the similarity between the two candidate keywords is greater than a preset threshold value;

if so, combining the two candidate keywords into one keyword according to the probability that the two candidate keywords are the keywords respectively;

and if not, determining the two candidate keywords as the keywords of the first object.

In one possible design, merging the two candidate keywords into one keyword according to the probability that the two candidate keywords are each a keyword includes:

and merging the two candidate keywords into a target keyword, wherein the target keyword is a keyword with higher probability of being a keyword in the two candidate keywords.

In one possible design, determining a plurality of candidate keywords corresponding to the first object according to the text information includes:

sentence division processing is carried out on the text information to obtain a plurality of short sentences;

determining whether each short sentence comprises a keyword or not through a binary classification model, and determining the short sentence comprising the keyword as a target short sentence to obtain at least one target short sentence;

performing word segmentation processing on each target short sentence to obtain a plurality of first words;

filtering stop words from the plurality of first words to obtain a plurality of second words;

and performing keyword prediction processing on the plurality of second words to obtain a plurality of candidate keywords.

In one possible design, the first model is a pointer generation network;

the output layer of the pointer generation network comprises a generation probability, and the generation probability is used for indicating the probability that the next output word of the decoder at each time step is from a preset word list; and

the attention distribution function of the pointer generation network includes a coverage factor.

In one possible design, the text information includes at least one of:

network data corresponding to the first object, wherein the network data comprises description information of the first object;

and data in a detail page, wherein the detail page is a web page introducing the first object.

In a second aspect, an embodiment of the present application provides an apparatus for extracting a keyword of an object, including:

the acquisition module is used for acquiring text information corresponding to a first object, and the text information is used for describing the first object;

the determining module is used for determining a plurality of candidate keywords corresponding to the first object according to the text information;

the determining module is further configured to determine at least one keyword of the first object among the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are keywords.

In one possible design, the determining module is specifically configured to:

processing the text information through a first model to obtain a plurality of candidate keywords;

the first model is obtained by learning a plurality of groups of samples, each group of samples comprises sample text information and sample candidate keywords, and the plurality of groups of samples are generated by a second model.

In one possible design, the process of generating the plurality of sets of samples by the second model includes:

acquiring the sample text information;

performing word segmentation processing on the sample text information through the second model to obtain a plurality of sample words and the probability that each sample word is a keyword;

and determining sample candidate keywords in the plurality of sample words according to the probability that each sample word is a keyword, wherein the probability that the sample candidate keywords are keywords is greater than a first threshold value.

In one possible design, the determining module is specifically configured to:

for each two candidate keywords in the plurality of candidate keywords, judging whether the similarity between the two candidate keywords is greater than a preset threshold value;

if so, combining the two candidate keywords into one keyword according to the probability that the two candidate keywords are the keywords respectively;

and if not, determining the two candidate keywords as the keywords of the first object.

In one possible design, the determining module is specifically configured to:

and merging the two candidate keywords into a target keyword, wherein the target keyword is a keyword with higher probability of being a keyword in the two candidate keywords.

In one possible design, the determining module is specifically configured to:

sentence division processing is carried out on the text information to obtain a plurality of short sentences;

determining whether each short sentence comprises a keyword or not through a binary classification model, and determining the short sentence comprising the keyword as a target short sentence to obtain at least one target short sentence;

performing word segmentation processing on each target short sentence to obtain a plurality of first words;

filtering stop words from the plurality of first words to obtain a plurality of second words;

and performing keyword prediction processing on the plurality of second words to obtain a plurality of candidate keywords.

In one possible design, the first model is a pointer generation network;

the output layer of the pointer generation network comprises a generation probability, and the generation probability is used for indicating the probability that the next output word of the decoder at each time step is from a preset word list; and

the attention distribution function of the pointer generation network includes a coverage factor.

In one possible design, the text information includes at least one of:

network data corresponding to the first object, wherein the network data comprises description information of the first object;

and data in a detail page, wherein the detail page is a web page introducing the first object.

In a third aspect, an embodiment of the present application provides an apparatus for extracting a keyword of an object, including:

a memory for storing a program;

a processor for executing the program stored by the memory, the processor being adapted to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect when the program is executed.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, comprising instructions which, when executed on a computer, cause the computer to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect.

The embodiment of the application provides a method and a device for extracting keywords of an object, wherein the method comprises the following steps: acquiring text information corresponding to the first object, wherein the text information is used for describing the first object; determining a plurality of candidate keywords corresponding to the first object according to the text information; and determining at least one keyword of the first object among the candidate keywords according to the similarity of the candidate keywords and the probability that each candidate keyword is a keyword. The candidate keywords corresponding to the first object are determined through the text information, so candidate keywords can be extracted from the text information automatically, quickly and efficiently; the candidate keywords are then filtered according to their pairwise similarities and their probabilities of being keywords, which guarantees the accuracy of the finally determined keywords of the first object.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of keywords provided herein;

fig. 2 is a flowchart of a keyword extraction method for an object according to an embodiment of the present application;

fig. 3 is a flowchart of a keyword extraction method for an object according to an embodiment of the present application;

fig. 4 is a schematic network structure diagram of a first model provided in an embodiment of the present application;

FIG. 5 is a flowchart of a keyword extraction method for an object according to yet another embodiment of the present application;

FIG. 6 is a schematic diagram of a flow chart of keyword extraction for an object according to an embodiment of the present application;

FIG. 7 is a diagram of extracted keywords provided by an embodiment of the present application;

fig. 8 is a schematic structural diagram of an apparatus for extracting keywords from an object according to an embodiment of the present disclosure;

fig. 9 is a schematic hardware structure diagram of an object keyword extraction device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For ease of understanding, the relevant concepts to which this application relates will be explained first:

seq2seq generation network: the Sequence-to-Sequence (seq2seq) generation network is an end-to-end generation method that makes minimal assumptions about sequence structure. A seq2seq generation network uses a multilayer Long Short-Term Memory network (LSTM) to encode the input sequence into an intermediate vector, and another deep LSTM to decode the intermediate vector into the target sequence, thereby solving the problem that conventional Deep Neural Networks (DNNs) cannot generate sequences.

oov: not at the word stock (Out-of-vocabulary, OOV) is that at the time of natural language processing or text processing, there is usually a word stock (vocabulary), which may be pre-loaded, or custom, or extracted from the current first data set, and if there is another second data set after it, and some words in this second data set are not in the existing vocabulary, these words may be called Out-of-vocabulary, abbreviated as OOV.

Pointer generation network: the Pointer-generator network (Pointer-generator network) is a generation model for abstract extraction, and is improved from two aspects aiming at the traditional sequence-to-sequence model, and the model has the capability of copying words from a source text by introducing a Pointer and simultaneously reserves the capability of generating new words by a generator; the same word is prevented from being repeatedly generated by recording the content that has been generated using the covarage mechanism.

LSTM: the Long Short-Term Memory network (LSTM) is a time recursive neural network with Long-Term Memory capability, and important components comprise Forget Gate, Input Gate and Output Gate which are respectively responsible for determining whether the current Input is adopted or not, whether the current Input is memorized for a Long time or not and whether the Input in the Memory is Output currently or not, and the model can be used for better capturing the dependency relationship between longer sentences.

Hyper-parameter: in the context of machine learning, a hyper-parameter is a parameter whose value is set before the learning process begins, rather than being obtained through training.

Levenshtein distance: also called edit distance, it measures the similarity of two strings. Simply put, it is the minimum number of edit operations (insertions, deletions and substitutions) required to change one string into the other.
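The definition above is the classic dynamic-programming recurrence, sketched here (this is the textbook algorithm, usable for comparing near-duplicate candidate keywords):

```python
# Edit distance between two strings via the standard row-by-row
# dynamic-programming table (insert / delete / substitute, each cost 1).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # classic example: 3 edits
```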

N-Gram language model: an algorithm based on a statistical language model, built on the assumption that the occurrence of the N-th word in a sentence depends only on the preceding N-1 words, so that the probability of the whole sentence is the product of the probabilities of the individual words, i.e.

P(w1, w2, ..., wm) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * ... * P(wm | w1, ..., wm-1),

Balancing effectiveness against time and space overhead, only the words nearest the current word are treated as related, yielding the Uni-Gram (unigram), Bi-Gram (bigram) and Tri-Gram (trigram) models, which are commonly applied to measure how natural a sentence is.
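The Bi-Gram case can be sketched with counts over a toy corpus (the corpus and sentences below are assumptions for illustration):

```python
from collections import Counter

# Bi-Gram model: P(sentence) is approximated by the product of conditional
# probabilities P(w_n | w_{n-1}), estimated from corpus counts.
corpus = [
    ["long", "battery", "life"],
    ["long", "battery", "life"],
    ["long", "standby", "time"],
]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))

def sentence_probability(sentence):
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(sentence_probability(["long", "battery", "life"]))  # (2/3) * (2/2)
```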

The technical background related to the present application is further described in detail as follows:

At present, with the continuous development of online shopping, the user's shopping experience also needs to improve correspondingly. When a user browses goods through a terminal device, directly presenting an overlong product introduction makes it difficult for the user to identify the product's characteristics in time, and space in the graphical user interface of the terminal device is limited. Keywords of a product can therefore be provided in the graphical user interface, so that the user can quickly determine the product's characteristics.

The display mode of the keywords of the product can be, for example, as shown in fig. 1, and fig. 1 is a schematic diagram of the keywords provided in the present application.

Referring to fig. 1, assuming that a current user browses a "mobile phone" commodity through a terminal device, information of the commodity a and the commodity B is provided on a graphical user interface of the current terminal device, and the information may include keywords of the commodity, for example, "wireless charging support" and "face recognition" illustrated as a keyword 101 of the commodity a, and "ultra-long endurance", "shocking volume" and "AI face recognition" illustrated as a keyword 102 of the commodity B.

As can be appreciated based on the description of FIG. 1, keywords for a good may be used to indicate characteristics of the good.

At present, in the prior art, when a selling point is extracted, in a possible implementation manner, a keyword of a commodity submitted by a seller may be received, manual review is performed on the submitted keyword, and the keyword that passes the review is used as a keyword to be displayed.

However, the manual-submission mode is inefficient at obtaining keywords. The number of product categories and Stock Keeping Units (SKUs) covered by merchant-submitted keywords is small and cannot support calls from multiple applications, which wastes platform resources; furthermore, when some applications receive keywords submitted by sellers in the form of e-mails, the platform cannot update the product keywords in time.

In another possible implementation manner, the generation of the keyword may be performed based on the seq2seq generation network.

For example, a seq2seq generation network may perform the following steps: data extraction, data labeling, data preprocessing, model training and keyword generation.

In data extraction, a data source containing keywords is obtained. In data labeling, the keywords in the data source are labeled manually. Data preprocessing performs operations such as word segmentation and stop-word filtering. In model training, the data set is divided into a training set and a validation set, and the generation network is trained on the data set to obtain a trained model. Keyword generation then feeds a data source to the trained model to obtain the extracted keywords.

However, the only connection between encoding and decoding in the seq2seq generation network is a fixed-length semantic vector, i.e. the encoder compresses the information of the whole sequence into a fixed-length vector.

Therefore, the seq2seq generation network has two disadvantages. First, the semantic vector cannot fully represent the information of the whole sequence, and the information carried by earlier inputs is diluted by later inputs; the longer the input sequence, the more severe this dilution, so the decoder does not receive sufficient information about the input sequence and decoding accuracy drops. Second, the seq2seq generation network depends on a target dictionary, so it cannot cope with out-of-vocabulary (OOV) words and easily generates repeated words.

That is, extracting keywords with a seq2seq generation network has the following two problems. One is that an extracted keyword may not be a word in the original text; for example, the keyword "dry energy saving" may be extracted even though "dry energy saving" does not appear in the original text. The other is that extracted keywords may contain repetition, for example the keyword "comfortable and comfortable".

To address these problems in the prior art, the present application provides a method for extracting object keywords: the original text is processed by a first model to extract the object's keywords quickly and efficiently, and because the first model incorporates a generation probability and a coverage factor, it effectively solves the prior-art problems that extracted keywords are not words in the original text and that extracted keywords are repeated.

First, referring to fig. 2, fig. 2 is a flowchart of a method for extracting a keyword from an object provided in an embodiment of the present application, where an execution subject in the embodiment of fig. 2 may be, for example, a server, or may also be a processor, and any device that can be used for data processing may be used as the execution subject in this embodiment.

As shown in fig. 2, the method includes:

s201, acquiring text information corresponding to the first object, wherein the text information is used for describing the first object.

In this embodiment, the first object may be an object from which a keyword is to be extracted, for example, the first object may be a commodity in a shopping platform, or in other possible implementation manners, the first object may also be an object of an entity, and the like.

In a possible implementation, the text information corresponding to the first object is used to describe the first object. Taking a commodity as the first object, the text information may include at least one of the following: network data corresponding to the first object and data in a detail page of the first object. The network data may include, for example, the title of the product, reach articles and high-quality reviews; the data in the detail page may include, for example, product detail information.

The product detail information may be, for example, text recognized from pictures of the product detail page using Optical Character Recognition (OCR).

In other possible implementation manners, the text information may further include any information for describing the first object, and the specific implementation manner of the text information is not limited in this embodiment.

S202, determining a plurality of candidate keywords corresponding to the first object according to the text information.

In this embodiment, the text information is used to describe the first object, and the text information may include at least one keyword corresponding to the first object, so that a plurality of candidate keywords corresponding to the first object may be determined according to the text information.

In a possible implementation, the text information may be processed by a first model to extract a plurality of candidate keywords from it. The first model is a model for extracting keywords from text information; it is learned from a plurality of groups of samples, each group comprising sample text information and sample candidate keywords.

Therefore, in the embodiment, the text information is processed based on the first model, and the multiple candidate keywords can be effectively extracted from the text information quickly and efficiently.

In another possible implementation, word segmentation processing may be performed on the text information, and whether each word after word segmentation is a keyword is determined, so as to extract a plurality of candidate keywords from the text information.

In this embodiment, when the first object is a commodity, the keyword of the first object may be understood as a selling point of the commodity, for example.

S203, determining at least one keyword of the first object in the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are the keywords.

It can be understood that the keywords obtained in step S202 are only candidate keywords and require further screening. For example, if highly similar words exist among the plurality of candidate keywords, such as "extra long endurance" and "extra long standby", one of the two similar candidate keywords may be selected as a keyword of the first object.

In a possible implementation manner, the similarity between any two candidate keywords among the plurality of candidate keywords may be obtained; if two candidate keywords have a similarity greater than a preset threshold, the two candidate keywords may be merged, for example by calculating the probability that each is a keyword and merging the two into the one with the higher probability.

The above operation is performed for every two candidate keywords, thereby determining at least one keyword of the first object among the plurality of candidate keywords.

In a possible implementation manner, after obtaining the at least one keyword of the first object, the embodiment may provide the at least one keyword at a position in the graphical user interface corresponding to the first object, so that a user may quickly obtain the keyword of the first object, and thus, characteristics of the first object are quickly and efficiently known.

The method for extracting the keywords of the object provided by the embodiment of the application comprises the following steps: acquiring text information corresponding to the first object, where the text information is used for describing the first object; determining a plurality of candidate keywords corresponding to the first object according to the text information; and determining at least one keyword of the first object among the candidate keywords according to the similarities of the candidate keywords and the probabilities that the candidate keywords are keywords. Determining the candidate keywords corresponding to the first object through the text information enables the candidate keywords to be automatically extracted from the text information quickly and efficiently, and filtering the candidate keywords according to the similarities of the candidate keywords and the probabilities that the candidate keywords are keywords ensures the accuracy of the finally determined keywords of the first object.

On the basis of the foregoing embodiment, the following describes in further detail the method for extracting a keyword from an object provided by the present application, where fig. 3 is a flowchart of the method for extracting a keyword from an object provided by the present application embodiment, and fig. 4 is a schematic diagram of a network structure of a first model provided by the present application embodiment.

As shown in fig. 3, the method includes:

S301, acquiring text information corresponding to the first object, wherein the text information is used for describing the first object.

The implementation manner of S301 is the same as that of S201, and is not described herein again.

S302, performing sentence division processing on the text information to obtain a plurality of short sentences.

The sentence dividing processing may be implemented by, for example, obtaining punctuation marks in the text information, and dividing the text information into sentences according to the punctuation marks to obtain a plurality of short sentences.

For example, the current text information is: "The wall-mounted air conditioner does not occupy the space of a lower-layer room and is popular in small-sized families; it supports 4 sleep modes, meets the sleep requirements of different people, and ensures that the family has sufficient rest; it skillfully utilizes the self-cleaning technology of the evaporator to solve the problem of dust in the machine and bring clean air."

Then, after the text information undergoes sentence division processing, for example, the following short sentences can be obtained: "the wall-mounted air conditioner does not occupy the lower-layer room space", "is popular in small-sized families", "supports 4 sleep modes", "meets the sleep requirements of different people", "ensures the full rest of the family", "skillfully utilizes the self-cleaning technology of the evaporator to solve the problem of dust in the machine", and "brings clean air".

In other possible implementation manners, any sentence segmentation algorithm may be adopted to perform sentence segmentation processing on the text information, and the specific implementation of the sentence segmentation processing is not particularly limited as long as the text information can be divided into a plurality of short sentences.
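The punctuation-based sentence division described above can be sketched as follows; the delimiter set and the function name are illustrative assumptions, not the embodiment's exact implementation:

```python
import re

def split_clauses(text):
    """Split text into short clauses at common Chinese/English punctuation;
    the delimiter set is an assumption and can be extended as needed."""
    clauses = re.split(r"[,.;!?、，。；！？]+", text)
    # Drop empty fragments left by trailing punctuation.
    return [c.strip() for c in clauses if c.strip()]

text = "Supports 4 sleep modes, meets different needs. Brings clean air!"
print(split_clauses(text))
# → ['Supports 4 sleep modes', 'meets different needs', 'Brings clean air']
```

Any dedicated sentence segmentation library could replace the regular expression; the key point is only that the text is cut into short clauses before classification.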

S303, determining whether each short sentence includes a keyword through a binary classification model, and determining the short sentences that include keywords as target short sentences, so as to obtain at least one target short sentence.

In this embodiment, in order to improve the operation efficiency, a plurality of phrases may be preliminarily screened through the binary classification model.

For example, the binary classification model may determine whether each short sentence includes a keyword; short sentences that do not include keywords are filtered out and receive no subsequent processing, while short sentences that include keywords are determined as target short sentences, yielding at least one target short sentence. Subsequent processing is performed only on the target short sentences, which effectively improves the operation efficiency of keyword extraction.

In this embodiment, the binary classification model is a model trained on sample keywords, so the binary classification model can determine whether a short sentence includes a keyword: its input is a short sentence, and its output indicates whether the short sentence includes a keyword.

In the actual implementation process, the detailed implementation of the binary classification model can be selected according to actual requirements, as long as the model can output whether a short sentence includes a keyword.
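The screening step S303 can be sketched as follows; the `contains_keyword` argument stands in for the trained binary classification model (short sentence in, yes/no out), and the stub below is a hypothetical placeholder, not the trained model itself:

```python
def filter_target_clauses(clauses, contains_keyword):
    """Keep only the short sentences the binary classifier marks as
    containing a keyword (step S303)."""
    return [c for c in clauses if contains_keyword(c)]

# Hypothetical stand-in for the trained binary classification model:
# treat clauses mentioning a feature term as positive.
FEATURE_TERMS = {"sleep", "self-cleaning", "endurance"}
stub_model = lambda clause: any(term in clause for term in FEATURE_TERMS)

clauses = ["supports 4 sleep modes", "is popular in small families"]
print(filter_target_clauses(clauses, stub_model))  # → ['supports 4 sleep modes']
```

In a deployment, `stub_model` would be replaced by the trained classifier's predict call.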

S304, performing word segmentation processing on the target short sentence to obtain a plurality of first words.

S305, stop word filtering processing is carried out on the first vocabulary, and a plurality of second vocabularies are obtained.

S304 and S305 are described below together:

In this embodiment, the target short sentence is a short sentence that the binary classification model has determined to include a keyword, so the target short sentence can be segmented to obtain a plurality of first words.

The specific implementation of the word segmentation processing may be selected according to actual requirements. For example, the target short sentence may be segmented according to a word segmentation algorithm to obtain a plurality of first words, where the word segmentation algorithm may be any one of the following: a word segmentation method based on character string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics. The present embodiment does not limit the specific implementation of the word segmentation processing.

After the plurality of first words are obtained, stop word filtering may be applied to them. Stop words may include words with no actual meaning, such as "what", "in", "and", and "then", and may also include very frequently used words, such as "me". Filtering the stop words out of the first words yields a plurality of second words.

In one possible implementation, the stop word list may be predetermined, such that the stop word filtering process is performed on the first vocabulary according to the stop word list.
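Steps S304 and S305 can be sketched as follows for already-segmented input; the stop word list here is a tiny illustrative sample, not a full list:

```python
# Tiny illustrative stop-word list; a deployment would load a full list.
STOP_WORDS = {"what", "in", "and", "then", "me", "the", "of"}

def remove_stop_words(first_words):
    """Filter stop words out of the segmented first words (step S305)."""
    return [w for w in first_words if w not in STOP_WORDS]

first_words = ["supports", "the", "sleep", "modes", "of", "users"]
print(remove_stop_words(first_words))  # → ['supports', 'sleep', 'modes', 'users']
```

A set lookup keeps the filter O(1) per word regardless of how large the stop word list grows.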

S306, performing keyword prediction processing on the plurality of second words to obtain a plurality of candidate keywords.

The keyword prediction processing may obtain the candidate keywords from the plurality of second words.

In a possible implementation manner, the candidate keywords may be generated by a pointer-generator network (Pointer-Generator Network). For example, the text information may be input into the pointer-generator network to obtain a plurality of candidate keywords. The network structure of the pointer-generator network in this embodiment is described below with reference to fig. 4.

As shown in fig. 4, the characters in the text information can be fed into an encoder one by one, thereby generating a series of encoder hidden states h_i, where h_i participates in the calculation of the attention coefficients; the encoder may be, for example, a single-layer bidirectional LSTM.

In each round of processing, the output state s_t at the decoder end also participates in the calculation of the attention coefficients; for example, the attention coefficients are calculated from the decoder output state s_t and the encoder hidden states h_i, where the decoder may be, for example, a single-layer unidirectional LSTM.

Based on the above description, it can be determined that current seq2seq generation networks have the following two problems:

1) The extracted keywords may not be words in the original text.

For example, the original sentence is segmented into ['dry', 'dust-removal', 'effective', 'reduced', 'your', 'clean', 'annoying'].

The keywords extracted by the seq2seq generation network may be: "drying and energy saving". This keyword is not a word in the original sentence but may be any word in the vocabulary, so the generated keyword lacks accuracy.

2) The extracted keywords may be repeated.

For example, the original sentence is segmented into: ['thickened', 'designed', 'feel', 'comfort'].

The selling points extracted by the seq2seq generation network may be: "comfort, comfort", so there is duplication in the generated keywords.

In response to the first problem, that the extracted keywords are not words in the original text, the pointer network in this embodiment adds a generation probability P_gen to the output layer.

The generation probability P_gen is used to determine, at each time step, the probability that the next word output by the decoder comes from the Source Text or from the Vocabulary, so that when an out-of-vocabulary (OOV) word is encountered, the word can be copied directly from the original sentence as output, avoiding the problem of extracted keywords not being words in the original text.

The generation probability P_gen may satisfy, for example, the following formula one:

P_gen = σ(w_h · h*_t + w_s · s_t + w_x · x_t + b_ptr)    (formula one)

where P_gen is the generation probability, a probability value calculated from the encoder context vector h*_t, the decoder state s_t, and the decoder input vector x_t at the current time step; it represents the probability that the next output is taken from the vocabulary rather than copied from the original sentence. w_h, w_s, w_x, and b_ptr are parameters to be learned, and σ is the sigmoid function.

The final vocabulary output is P(w), where P(w) may satisfy the following formula two:

P(w) = P_gen · P_vocab(w) + (1 - P_gen) · Σ_{i: w_i = w} a_i^t    (formula two)

where P_vocab(w) represents the probability that the current decoder output is the word w taken from the vocabulary, and Σ_{i: w_i = w} a_i^t represents the probability that the current decoder output is the word w copied from the original sentence, obtained by summing the attention weights a_i^t over the source positions where w appears.

Then, if the word at the current moment does not appear in the original text, Σ_{i: w_i = w} a_i^t is 0; and if the word at the current moment is not recorded in the predefined vocabulary, P_vocab(w) is 0.

By adding the generation probability P_gen to the output layer of the pointer-generator network, the problem that the extracted keywords are not words in the original text can be effectively avoided. For example, in the above example, the correct keyword can be extracted as "drying and dust removal", which consists of words in the original text.
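A minimal sketch of formula two, mixing the vocabulary distribution with the copy distribution, might look like this; the names, shapes, and toy numbers are illustrative assumptions, not the exact pointer-generator implementation:

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids):
    """Formula two: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (summed
    attention over the source positions where w appears)."""
    p_final = p_gen * p_vocab
    for pos, word_id in enumerate(src_ids):
        # Copy mass from the source text onto the id of the word at `pos`.
        p_final[word_id] += (1.0 - p_gen) * attention[pos]
    return p_final

p_vocab = np.array([0.5, 0.3, 0.2])  # decoder's vocabulary distribution
attention = np.array([0.6, 0.4])     # attention over a 2-token source text
src_ids = [2, 0]                     # vocabulary ids of the source tokens
p = final_distribution(0.7, p_vocab, attention, src_ids)
print(p.round(2))  # mixes to [0.47, 0.21, 0.32], still summing to 1
```

Because the copy term adds mass at the ids of source words, a word absent from the predefined vocabulary can still receive probability, which is how the OOV case is handled.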

In view of the second problem mentioned above, that the extracted keywords may be repeated, the pointer network in this embodiment introduces a coverage factor into the attention distribution function in order to avoid attending again to a word that has already been attended to. The coverage factor is formed by summing the attention distributions of all decoder time steps before the current time step t, and may satisfy, for example, the following formula three:

c^t = Σ_{t'=0}^{t-1} a^{t'}    (formula three)

where c^t is the coverage vector, which accumulates the attention paid to each word in steps 0 to t-1 before time step t; it prevents words that have already been attended to from being attended to again at time step t, thereby solving the problem of repeated words in the output result. The accumulated attention information is taken as input and added directly into the attention mechanism at the input end, so that it can guide the attention over the original text.

After the coverage factor is introduced, the calculation of the attention weights at the encoder end is adjusted accordingly, where the new encoder-end attention weight e_i^t may satisfy the following formula four:

e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)    (formula four)

where W_h, W_s, and w_c are parameters to be learned, tanh is the hyperbolic tangent function, h_i is the encoder hidden state at the current time point, s_t is the decoder hidden state output at the previous time point, c_i^t is the coverage value of the i-th word at the t-th time step, b_attn is a bias parameter to be learned, and v^T is a parameter to be learned.

By introducing the coverage factor, attending again to a word that has already been attended to can be effectively avoided, so as to avoid repetition in the extracted keywords. For example, in the above example, the correct keyword can be extracted as "comfortable feel", avoiding repeated extracted keywords.
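A simplified sketch of coverage-aware attention (formulas three and four) follows; the scalar penalty `w_c` replaces the learned coverage term and is an assumption of this sketch:

```python
import numpy as np

def attention_with_coverage(scores, coverage, w_c):
    """One decoding step of coverage-aware attention. Per formula three the
    coverage is the running sum of past attention distributions; here a
    scalar penalty w_c * coverage is subtracted from the raw scores as a
    simplification of the learned coverage term in formula four."""
    adjusted = scores - w_c * coverage
    weights = np.exp(adjusted - adjusted.max())  # numerically stable softmax
    attn = weights / weights.sum()
    return attn, coverage + attn  # attention and updated coverage c^{t+1}

scores = np.array([2.0, 1.0, 0.5])  # fixed encoder scores for illustration
coverage = np.zeros(3)
attn1, coverage = attention_with_coverage(scores, coverage, w_c=5.0)
attn2, coverage = attention_with_coverage(scores, coverage, w_c=5.0)
# The word attended at step 1 is down-weighted at step 2.
print(attn1.argmax(), attn2.argmax())  # → 0 1
```

The effect demonstrated here is exactly the anti-repetition behavior described above: heavy attention at one step suppresses attention to the same position at later steps.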

S307, for every two candidate keywords among the plurality of candidate keywords, judging whether the similarity between the two candidate keywords is greater than a preset threshold; if so, executing S308; if not, executing S309.

After the plurality of candidate keywords are obtained, some of them may be very similar, so the candidate keywords may be screened according to the similarities of the plurality of candidate keywords. In a possible implementation, for every two candidate keywords among the plurality of candidate keywords, the similarity between the two candidate keywords may be calculated and compared with the preset threshold.

One possible implementation of calculating the similarity between two candidate keywords is described below:

For example, the similarity may be calculated according to the following formula five:

similarity = (Levenshtein ratio + Jaro-Winkler distance + longest common substring score + edit distance score) / 4    (formula five)

The Levenshtein ratio may satisfy the following formula six:

similarity = (sum - ldist) / sum    (formula six)

where, assuming the similarity between character string a and character string b is calculated, similarity is the Levenshtein ratio, sum is the total length of character strings a and b, and ldist is the class edit distance, in which a deletion or an insertion still costs 1 but a substitution costs 2 instead.

Here is an example: suppose a and b are two different single-character strings. With the ordinary edit distance of 1, the similarity would be calculated as (2 - 1)/2 = 0.5, which is obviously not appropriate; with the class edit distance, the substitution costs 2, and the similarity is calculated as (2 - 2)/2 = 0.
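The Levenshtein ratio with the class edit distance of formula six can be sketched as follows:

```python
def levenshtein_ratio(a, b):
    """Formula six: similarity = (sum - ldist) / sum, where ldist is the
    class edit distance (deletion/insertion cost 1, substitution cost 2)."""
    prev = list(range(len(b) + 1))  # DP row: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            sub = prev[j - 1] + (0 if ca == cb else 2)  # substitution costs 2
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, sub))
        prev = cur
    total = len(a) + len(b)
    return (total - prev[-1]) / total

# Two different single characters: the ordinary edit distance would give 0.5,
# while the class edit distance gives the more sensible 0.
print(levenshtein_ratio("x", "y"), levenshtein_ratio("ab", "ab"))  # → 0.0 1.0
```

With substitution costing 2, a substitution is never cheaper than a delete plus an insert, which is what makes completely different strings score 0.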

The Jaro-Winkler distance may satisfy the following formula seven:

d_w = d_j + l · p · (1 - d_j)    (formula seven)

where p is a factor used to reward prefix matches, l is the length of the common prefix, and d_w is the final similarity. The Jaro-Winkler algorithm gives higher scores to strings that begin with the same characters, which is why p and l are defined. d_j satisfies the following formula eight:

d_j = (1/3) · (m/|s1| + m/|s2| + (m - t)/m)    (formula eight)

where s1 and s2 are the two character strings to be compared, m is the number of matching characters between s1 and s2, t is the number of transpositions, and d_j is the final score.

The number of transpositions is determined according to the matching window value: two characters are considered to match when the distance between their positions is smaller than the matching window value, and a transposition is counted when the matched characters appear in different positions.

The matching window value may satisfy the following formula nine:

MW = MAX(|s1|, |s2|) / 2 - 1    (formula nine)

where MW is the matching window value and MAX is the maximum function.

Here is an example: for the pair of strings AECFR and AMECFDR, MW = 2.5 and m = 5. The matching characters A-E-C-F-R appear in the same order in both strings, so no transposition is required and t = 0.
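The Jaro similarity of formula eight, with the Jaro-Winkler adjustment of formula seven, can be sketched as follows; the window is floored to an integer here, a common simplification of the MW value above:

```python
def jaro(s1, s2):
    """Jaro similarity per formula eight: d_j = (m/|s1| + m/|s2| + (m-t)/m) / 3."""
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)  # s2 positions already matched
    matches = []              # s1 characters matched, in s1 order
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matches.append(c)
                break
    m = len(matches)
    if m == 0:
        return 0.0
    # Transpositions: matched characters whose order differs between strings.
    s2_matches = [s2[j] for j in range(len(s2)) if used[j]]
    t = sum(a != b for a, b in zip(matches, s2_matches)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Formula seven: d_w = d_j + l * p * (1 - d_j), l = common prefix (<= 4)."""
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)

print(round(jaro("AECFR", "AMECFDR"), 4))          # → 0.9048
print(round(jaro_winkler("AECFR", "AMECFDR"), 4))  # → 0.9143
```

For the example pair, m = 5 and t = 0, giving d_j = (1 + 5/7 + 1)/3 ≈ 0.9048, and the shared prefix "A" lifts the Jaro-Winkler score slightly above it.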

The longest common substring score may satisfy the following formula ten:

score = len(LCS(s1, s2)) / avg(|s1|, |s2|)    (formula ten)

where LCS(s1, s2) is the longest common substring of s1 and s2, and avg is the averaging function.

The edit distance score may satisfy the following formula eleven:

score = 1 - ldist(s1, s2) / MAX(|s1|, |s2|)    (formula eleven)

where ldist is the edit distance, normalized by the length of the longer string so that the score falls in [0, 1].

in this embodiment, the preset threshold corresponding to the similarity may be set according to actual requirements, and this embodiment does not particularly limit this.

S308, combining the two candidate keywords into a keyword according to the probability that the two candidate keywords are the keywords respectively.

In one possible implementation, if the similarity between two candidate keywords is greater than the preset threshold, the probability that each of the two candidate keywords is a keyword may be determined. For example, suppose the two candidate keywords "extra long endurance" and "extra long standby" currently exist.

The probability that "extra long endurance" is a keyword and the probability that "extra long standby" is a keyword can be determined, and the two candidate keywords are merged into one keyword according to these probabilities.

In one possible implementation, the two candidate keywords may be merged into a target keyword, where the target keyword is the one of the two candidate keywords that has the higher probability of being a keyword.

For example, if the probability of "extra long endurance" being a keyword is 98% and the probability of "extra long standby" being a keyword is 87%, the two candidate keywords "extra long endurance" and "extra long standby" may be merged into "extra long endurance", thereby obtaining the keyword "extra long endurance".

The implementation manner of determining the probability of the candidate keyword may be, for example, processing the candidate keyword through an N-Gram language model to obtain the probability that the candidate keyword is the keyword.

In other possible implementations, for example, one of the two candidate keywords may be arbitrarily selected as the target keyword.

S309, determining the two candidate keywords as the keywords of the first object.

In another possible implementation, if the similarity between the two candidate keywords is not greater than the preset threshold, the two candidate keywords may be determined to be dissimilar; in this case no merge operation is required, and both candidate keywords may be determined as keywords of the first object.
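Steps S307 to S309 can be sketched as a greedy merge; `keyword_prob` and `similarity` stand in for the probability model and the combined similarity of formula five, and the stub values below are illustrative:

```python
def select_keywords(candidates, keyword_prob, similarity, threshold=0.8):
    """Greedy version of S307-S309: visit candidates from most to least
    probable; drop a candidate when it is too similar to one already kept,
    which merges each similar pair into its higher-probability member."""
    kept = []
    for cand in sorted(candidates, key=keyword_prob, reverse=True):
        if all(similarity(cand, k) <= threshold for k in kept):
            kept.append(cand)
    return kept

# Illustrative stand-ins for the probability model and formula five.
probs = {"extra long endurance": 0.98, "extra long standby": 0.87,
         "self-cleaning": 0.92}
sim = lambda a, b: 0.9 if a.startswith("extra") and b.startswith("extra") else 0.1

print(select_keywords(probs, probs.get, sim))
# → ['extra long endurance', 'self-cleaning']
```

Visiting candidates in descending probability order guarantees that whenever a similar pair is found, the kept member is the higher-probability one, matching the merge rule of S308.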

According to the method for extracting the keywords of the object provided by this embodiment, the candidate keywords corresponding to the first object are determined from the text information through the pointer-generator network; introducing the generation probability into the pointer-generator network effectively avoids extracting keywords that are not words in the original text, and introducing the coverage factor effectively avoids repeated extracted keywords, thereby effectively guaranteeing the correctness of the extracted keywords. In addition, by determining the similarity between every two candidate keywords, similar candidate keywords can be merged, so that the finally determined keywords of the first object are concise and accurate.

On the basis of the foregoing embodiments, before the text information is processed by the first model, the first model needs to be trained so that it can output candidate keywords according to text information. It can be understood that training the first model requires sample training data. In a possible implementation of this embodiment, a second model may be used to generate multiple groups of samples. The following describes, with reference to a specific embodiment, how multiple groups of samples are generated according to the second model in the present application:

fig. 5 is a flowchart of a keyword extraction method for an object according to yet another embodiment of the present application.

As shown in fig. 5, the method includes:

S501, acquiring sample text information.

In the present embodiment, the sample text information is similar to the text information described above, except that the text information is used for directly extracting keywords, whereas the sample text information is used for generating training data.

S502, performing word segmentation processing on the sample text information through the second model to obtain a plurality of sample words and the probability that each sample word is a keyword.

In this embodiment, the second model may be, for example, an N-Gram language model, where the N-Gram language model may output a plurality of sample words and probabilities that the sample words are keywords with sample text information as input.

Before the sample words and their probabilities are obtained from the N-Gram language model, the N-Gram language model is trained. For example, the N-Gram model may be trained on existing keyword data to obtain a trained N-Gram model that can output sample words and the probability that each sample word is a keyword.

After the training of the N-Gram model is completed, the sample text information may be first subjected to word segmentation processing and stop word processing, which are similar to those described above and will not be described herein again.

Then, keyword candidate phrases can be generated. Since the length of a keyword of the first object is usually 3 to 7 words, N in the N-Gram may be selected from 1, 2, and 3 to generate 1-gram, 2-gram, and 3-gram phrases. For example, if the words "one-key", "automatic", and "cleaning" currently exist, the generated 1-gram phrases may include "one-key", "automatic", and "cleaning", the generated 2-gram phrases may include "one-key automatic" and "automatic cleaning", and the generated 3-gram phrase may include "one-key automatic cleaning".

In this embodiment, the candidate phrases of the keywords may be used as sample words, and then the candidate phrases of the keywords may be scored based on the trained N-Gram model, so as to obtain the probability that each sample word is a keyword.
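The generation of 1-gram to 3-gram candidate phrases described above can be sketched as follows (tokens are joined with spaces here for readability; Chinese tokens would be concatenated directly):

```python
def candidate_phrases(tokens, max_n=3):
    """Generate the 1-gram to max_n-gram keyword candidate phrases."""
    phrases = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[i:i + n]))
    return phrases

print(candidate_phrases(["one-key", "automatic", "cleaning"]))
# → ['one-key', 'automatic', 'cleaning', 'one-key automatic',
#    'automatic cleaning', 'one-key automatic cleaning']
```

Each of these candidate phrases would then be scored by the trained N-Gram model to obtain its probability of being a keyword.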

The idea of scoring phrases with the N-Gram model to generate keywords is to infer, from the probabilities of phrase combinations counted in the existing keyword training data, how plausible it is that a new short phrase is a keyword: the higher the probability, the more likely the phrase is a reasonable keyword.

For example, the probability that the phrase "one-key automatic cleaning" is a keyword is calculated as follows, where <s> denotes the start and </s> denotes the end:

P('one-key automatic cleaning') = P(one-key | <s>) × P(automatic | one-key) × P(cleaning | automatic) × P(</s> | cleaning)

Generally, since the probability values are all less than 1, to avoid the successive multiplication making the score smaller and smaller, logarithms are taken on both sides so that the product becomes a sum, as follows:

log(P('one-key automatic cleaning')) = log(P(one-key | <s>)) + log(P(automatic | one-key)) + log(P(cleaning | automatic)) + log(P(</s> | cleaning))

Thus, the implementation of calculating the probability that a certain phrase is a keyword may satisfy the following formula twelve:

P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})    (formula twelve)

where C(w_{i-1}, w_i) denotes the number of times w_{i-1} and w_i appear together in the sample text information, C(w_{i-1}) denotes the total number of occurrences of w_{i-1}, and P(w_i | w_{i-1}) denotes the probability that w_i follows w_{i-1}, used to score whether the phrase is a keyword.
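Formula twelve and the log-probability scoring above can be sketched as follows; the tiny corpus and the `eps` smoothing for unseen bigrams are assumptions of this sketch:

```python
from collections import Counter
from math import log

def train_bigram(keyword_corpus):
    """Count unigrams and bigrams (with <s>/</s> markers) from keyword data."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in keyword_corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams, eps=1e-9):
    """Sum of log P(w_i | w_{i-1}) per formula twelve; unseen bigrams fall
    back to a tiny eps probability instead of log(0)."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for a, b in zip(padded[:-1], padded[1:]):
        p = bigrams[(a, b)] / unigrams[a] if bigrams[(a, b)] else eps
        total += log(p)
    return total

corpus = [["one-key", "automatic", "cleaning"], ["automatic", "cleaning"]]
uni, bi = train_bigram(corpus)
seen = log_prob(["one-key", "automatic", "cleaning"], uni, bi)
unseen = log_prob(["cleaning", "one-key"], uni, bi)
print(seen > unseen)  # a phrase seen in training scores higher → True
```

Candidate phrases whose score exceeds the first threshold would then be kept as sample keywords, as in step S503.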

S503, determining sample candidate keywords in the plurality of sample vocabularies according to the probability that each sample vocabulary is the keyword, wherein the probability that the sample candidate keywords are the keywords is larger than a first threshold value.

After the probability of each sample word is obtained, the sample words whose probability of being a keyword is greater than the first threshold may be determined as sample keywords, thereby obtaining a plurality of sample keywords.

The training data in this embodiment includes sample text information and sample keywords, and in the process of training the first model according to the training data, the sample text information may be processed by using the first model, and the first model learns according to the sample keywords corresponding to the sample text information, thereby implementing training of the first model.

In the embodiment of the application, the sample text data is processed through the second model to obtain the sample keywords, so that the training data can be automatically generated, wherein the training data comprises the sample text data and the sample keywords, and the efficiency of obtaining the training data can be effectively improved.

In another possible implementation, for example, if the first object is a commodity, the core industry attributes sorted out from the commodity's industry attributes may be associated with related articles to obtain a candidate set of keyword phrases, and the required sample keywords are screened out from the candidate set manually.

After the first model is trained with the obtained training data, it can be deployed as an interface for calling. This avoids the situation in which, because the first model is too large, distributed calling would require distributing the model to every node and occupying a large amount of node memory; it thus effectively saves the network resources consumed in distribution and shortens the prediction time.

On the basis of the foregoing embodiments, the following introduces, with reference to fig. 6, the overall process of model training, deployment, and implementation in the method for extracting a keyword of an object provided by the present application, where fig. 6 is a schematic diagram of the flow units of object keyword extraction provided by an embodiment of the present application:

referring to fig. 6, fig. 6 includes a data unit, a model training unit, a model deployment unit, and a keyword extraction unit.

The data unit may automatically generate the training data through the second model described above, and/or may obtain the training data manually.

The training data obtained by the data unit can be used for a model training unit, and the model training unit trains the first model according to the training data to obtain the trained first model.

And then the model deployment unit deploys the trained first model in an interface mode, so that resources are effectively saved, and the prediction time is shortened.

Finally, the keywords may be extracted according to the keyword extraction unit, for example, text information may be input into the trained first model, so that the first model outputs candidate keywords, and the candidate keywords may be screened according to the similarity between every two candidate keywords in the candidate keywords, so as to obtain the keywords of the first object.

In a possible implementation manner, as can be understood with reference to fig. 7, fig. 7 is a schematic diagram of extracted keywords provided in an embodiment of the present application.

The text information may be as shown in 701 in fig. 7, and the keywords extracted according to each text information may be as shown in 702 in fig. 7, and in one possible implementation, 702 in fig. 7 may be a candidate keyword in the present application.

The specific implementation of each unit is described in detail in the above embodiments. In summary, the object keyword extraction method provided by the present application can automatically extract the keywords of the first object from the text information, and can ensure the correctness and simplicity of the extracted keywords.

Fig. 8 is a schematic structural diagram of an object keyword extraction apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus 80 includes: an acquisition module 801 and a determination module 802.

An obtaining module 801, configured to obtain text information corresponding to a first object, where the text information is used to describe the first object;

a determining module 802, configured to determine, according to the text information, a plurality of candidate keywords corresponding to the first object;

the determining module 802 is further configured to determine at least one keyword of the first object in the candidate keywords according to the similarity of the candidate keywords and the probability that the candidate keywords are keywords.

In one possible design, the determining module 802 is specifically configured to:

processing the text information through a first model to obtain a plurality of candidate keywords;

the first model is obtained by learning from a plurality of groups of samples, each group of samples comprises sample text information and sample candidate keywords, and the plurality of groups of samples are generated by a second model.

In one possible design, the process of generating the plurality of sets of samples by the second model includes:

acquiring the sample text information;

performing word segmentation processing on the sample text information through the second model to obtain a plurality of sample words and the probability that each sample word is a keyword;

and determining sample candidate keywords in the plurality of sample words according to the probability that each sample word is a keyword, wherein the probability that the sample candidate keywords are keywords is greater than a first threshold value.

In one possible design, the determining module 802 is specifically configured to:

for each two candidate keywords in the plurality of candidate keywords, judging whether the similarity between the two candidate keywords is greater than a preset threshold value;

if so, combining the two candidate keywords into one keyword according to the probability that the two candidate keywords are the keywords respectively;

and if not, determining the two candidate keywords as the keywords of the first object.

In one possible design, the determining module 802 is specifically configured to:

and merging the two candidate keywords into a target keyword, wherein the target keyword is the one of the two candidate keywords that has the higher probability of being a keyword.

In one possible design, the determining module 802 is specifically configured to:

sentence division processing is carried out on the text information to obtain a plurality of short sentences;

determining whether each short sentence comprises a keyword or not through a binary classification model, and determining the short sentence comprising the keyword as a target short sentence to obtain at least one target short sentence;

performing word segmentation processing on each target short sentence to obtain a plurality of first words;

filtering stop words from the plurality of first words to obtain a plurality of second words;

and performing keyword prediction processing on the plurality of second words to obtain a plurality of candidate keywords.
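The five steps above form a pipeline, sketched below under the assumption that the binary sentence classifier, the tokenizer, and the keyword predictor are supplied externally (all three are placeholders, not components defined by this application).

```python
import re

def extract_candidates(text, is_keyword_sentence, tokenize, predict_keywords,
                       stop_words):
    # 1. Sentence segmentation of the text information into short sentences.
    short_sentences = [s for s in re.split(r"[.!?;,]", text) if s.strip()]
    # 2. Keep only sentences the binary classification model marks as
    #    containing a keyword (the target short sentences).
    targets = [s for s in short_sentences if is_keyword_sentence(s)]
    # 3. Word segmentation of each target short sentence -> first words.
    first_words = [w for s in targets for w in tokenize(s)]
    # 4. Stop-word filtering -> second words.
    second_words = [w for w in first_words if w not in stop_words]
    # 5. Keyword prediction over the second words -> candidate keywords.
    return predict_keywords(second_words)
```

Filtering sentences before tokenizing keeps the downstream prediction step focused on the short sentences most likely to carry keywords.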

In one possible design, the first model is a pointer generation network;

the output layer of the pointer generation network comprises a generation probability, and the generation probability is used for indicating the probability that the next output word of the decoder at each time step is from a preset word list; and

the attention distribution function of the pointer generation network includes a coverage factor.
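In a pointer generation network, the generation probability mixes the vocabulary distribution with a copy distribution taken from attention over the source, and the coverage factor accumulates past attention to discourage repetition. A minimal sketch of these two mechanisms (following the standard pointer-generator formulation, with NumPy arrays standing in for decoder tensors):

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """p_gen: probability the next word comes from the preset word list.
    vocab_dist: distribution over the preset word list (sums to 1).
    attention: attention weights over source positions (sums to 1).
    source_ids: word-list id of each source position."""
    final = p_gen * vocab_dist
    # Scatter-add the copy probabilities onto the ids of the source words.
    np.add.at(final, source_ids, (1.0 - p_gen) * attention)
    return final

def update_coverage(coverage, attention):
    # Coverage factor: running sum of past attention distributions,
    # fed back into the attention function to penalize re-attending.
    return coverage + attention
```

Because both input distributions sum to 1, the mixture also sums to 1, so copied source words and generated word-list words compete in a single distribution at each time step.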

In one possible design, the text information includes at least one of:

network data corresponding to the first object, wherein the network data comprises description information of the first object;

and a detail page corresponding to the first object, wherein the detail page is a web page introducing the first object.

The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.

Fig. 9 is a schematic diagram of a hardware structure of an object keyword extraction device according to an embodiment of the present application, and as shown in fig. 9, an object keyword extraction device 90 according to the present embodiment includes: a processor 901 and a memory 902; wherein

A memory 902 for storing computer-executable instructions;

the processor 901 is configured to execute computer-executable instructions stored in the memory to implement the steps performed by the keyword extraction method of the object in the foregoing embodiments. Reference may be made in particular to the description relating to the method embodiments described above.

Alternatively, the memory 902 may be separate or integrated with the processor 901.

When the memory 902 is separately provided, the keyword extracting apparatus of the object further includes a bus 903 for connecting the memory 902 and the processor 901.

An embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the method for extracting the keyword of the object, which is performed by the apparatus for extracting the keyword of the object, is implemented.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present application may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor.

The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; the memory may also be a USB disk, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
