Title text generation method and device, computer storage medium and electronic equipment

Document No.: 1043312  Publication date: 2020-10-09

Reading notes: This technique, "Title text generation method and device, computer storage medium and electronic equipment", was created by 郭昆陶通赫阳 on 2019-04-25. Main content: The disclosure relates to the field of computer technology, and in particular to a title text generation method and apparatus, a storage medium, and an electronic device. The method includes: acquiring the perplexity of primary candidate titles; filtering the primary candidate titles according to the perplexity to obtain secondary candidate titles; acquiring the click probability of the secondary candidate titles; and sorting the secondary candidate titles based on the click probability, and determining target candidate titles from the sorted secondary candidate titles. By combining the perplexity and the click probability of the candidate titles, the disclosure comprehensively sorts and filters the candidate titles to determine the target candidate titles, which improves the accuracy and logical coherence of the target candidate titles and also increases their appeal to users.

1. A title text generation method, comprising:

acquiring the perplexity of a primary candidate title;

filtering the primary candidate titles according to the perplexity to obtain secondary candidate titles;

acquiring the click probability of the secondary candidate title;

and sorting the secondary candidate titles based on the click probability, and determining target candidate titles from the sorted secondary candidate titles.

2. The title text generation method according to claim 1, wherein before the acquiring of the perplexity of the primary candidate title, the method further comprises:

extracting item tag words and target keywords from text information corresponding to a current item, and generating the primary candidate title according to the item tag words and the target keywords.

3. The title text generation method according to claim 1, wherein the primary candidate title comprises at least two target keywords;

the acquiring of the perplexity of the primary candidate title comprises:

calculating the co-occurrence probability of each target keyword pair in the primary candidate title;

and determining the perplexity according to the co-occurrence probability of the target keyword pair, wherein the target keyword pair consists of any two adjacent target keywords in the primary candidate title.

4. The title text generation method according to claim 3, wherein the target keyword pair includes a first keyword and a second keyword in order;

the calculating of the co-occurrence probability of the target keyword pair in the primary candidate title comprises:

acquiring a target item word corresponding to the primary candidate title;

acquiring a first probability that a first target title appears in a preset title library, wherein the first target title is a title containing the target item word and the target keyword pair;

acquiring a second probability that a second target title appears in the preset title library, wherein the second target title is a title containing the target item word and the first keyword;

and comparing the first probability with the second probability to obtain the co-occurrence probability.

5. The title text generation method according to claim 3, wherein the determining of the perplexity according to the co-occurrence probability of the target keyword pair comprises:

calculating the reciprocal of the geometric mean of the co-occurrence probabilities, and determining the reciprocal as the perplexity.

6. The title text generation method of claim 2, wherein said primary candidate titles are plural;

the filtering the primary candidate titles according to the perplexity to obtain secondary candidate titles includes:

sorting the primary candidate titles from low to high according to the perplexity to form a first sequence;

and taking, in order, a first preset number of primary candidate titles from the first sequence as the secondary candidate titles.

7. The title text generation method according to claim 4, wherein the target keywords include concept words and item words;

the obtaining of the click probability of the secondary candidate title includes:

acquiring a first click rate of a third target title, wherein the third target title is a title containing a target concept word and the target item word;

acquiring a second click rate corresponding to the title containing the target item word;

comparing the first click rate with the second click rate to obtain a third click rate corresponding to the target concept word;

and acquiring the click probability of the secondary candidate title according to the third click rate.

8. The title text generation method according to claim 7, wherein the acquiring of the click probability of the secondary candidate title according to the third click rate comprises:

calculating the geometric mean of the third click rates, and taking the geometric mean as the click probability.

9. The title text generation method according to claim 1, wherein the sorting of the secondary candidate titles based on the click probability and the determining of target candidate titles from the sorted secondary candidate titles comprise:

sorting the secondary candidate titles from high to low according to the click probability to form a second sequence;

performing de-duplication on the secondary candidate titles in the second sequence according to the perplexity and the target keywords corresponding to the secondary candidate titles, to obtain a third sequence;

acquiring the lowest perplexity among the secondary candidate titles in the third sequence, and filtering the third sequence according to the lowest perplexity to obtain a fourth sequence;

and taking, in order, a second preset number of the secondary candidate titles from the fourth sequence as the target candidate titles.

10. The title text generation method according to any one of claims 1 to 9, wherein the order of the primary candidate title corresponds to the number of concept words among the target keywords contained in the primary candidate title.

11. A title text generation apparatus, comprising:

a perplexity acquisition module, configured to acquire the perplexity of a primary candidate title;

a filtering module, configured to filter the primary candidate titles according to the perplexity to obtain secondary candidate titles;

a click probability acquisition module, configured to acquire the click probability of the secondary candidate titles;

and a determining module, configured to sort the secondary candidate titles based on the click probability and determine target candidate titles from the sorted secondary candidate titles.

12. A storage medium having stored thereon a computer program which, when executed by a processor, implements the title text generation method according to any one of claims 1 to 10.

13. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the title text generation method of any one of claims 1 to 10 via execution of the executable instructions.

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a title text generation method, a title text generation apparatus, a computer storage medium, and an electronic device.

Background

With the development of computer and internet technology, internet platforms generally seek to increase the number of visits to their content (for example, views of articles and news items). Given the massive amount of data, whether a platform can provide high-quality, attractive title text, so as to increase visits and help users find target content in the shortest time, has become a problem that cannot be ignored.

In the related art, titles are mainly generated based on preset rules or language models. However, these methods struggle to balance the accuracy and fluency of the title text against its appeal to users. On the one hand, the generated titles are not evaluated for quality and their relevance to the underlying content is hard to control, so their accuracy is low; on the other hand, efforts to improve accuracy often neglect the potential appeal of the title to users.

Therefore, it is necessary to provide a new title text generation method.

It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure aims to provide a title text generation method and apparatus, a computer storage medium, and an electronic device, so as to overcome, at least to some extent, the difficulty of balancing the accuracy, fluency, and user appeal of generated title text.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to an aspect of the present disclosure, there is provided a title text generation method, including: acquiring the perplexity of primary candidate titles; filtering the primary candidate titles according to the perplexity to obtain secondary candidate titles; acquiring the click probability of the secondary candidate titles; and sorting the secondary candidate titles based on the click probability, and determining a final title from the sorted secondary candidate titles.

In an exemplary embodiment of the present disclosure, before the acquiring of the perplexity of the primary candidate titles, the method further includes: extracting item tag words and target keywords from text information corresponding to the current item, and generating the primary candidate titles according to the item tag words and the target keywords.

In an exemplary embodiment of the present disclosure, each primary candidate title includes at least two target keywords; the acquiring of the perplexity of the primary candidate title includes: calculating the co-occurrence probability of each target keyword pair in the primary candidate title; and determining the perplexity according to the co-occurrence probabilities of the target keyword pairs, wherein a target keyword pair consists of any two adjacent target keywords in the primary candidate title.

In an exemplary embodiment of the present disclosure, the target keyword pair includes a first keyword and a second keyword in order; the calculating of the co-occurrence probability of the target keyword pair in the primary candidate title includes: acquiring the target item word corresponding to the primary candidate title; acquiring a first probability that a first target title appears in a preset title library, wherein the first target title is a title containing the target item word and the target keyword pair; acquiring a second probability that a second target title appears in the preset title library, wherein the second target title is a title containing the target item word and the first keyword; and comparing the first probability with the second probability to obtain the co-occurrence probability.
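The co-occurrence calculation can be illustrated with a minimal sketch. It makes several assumptions that the text does not state: the preset title library is modeled as a list of word sets, "containing" a keyword is read as plain set membership (word order inside library titles is ignored), and "comparing" the two probabilities is read as taking their ratio. The function names `adjacent_pairs` and `cooccurrence_probability` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def adjacent_pairs(keywords: Sequence[str]) -> List[Tuple[str, str]]:
    """Return the N-1 ordered pairs of adjacent target keywords."""
    return list(zip(keywords, keywords[1:]))


def cooccurrence_probability(
    title_library: Sequence[Set[str]],
    item_word: str,
    first_keyword: str,
    second_keyword: str,
) -> float:
    """Ratio of the first probability to the second probability.

    First probability: share of library titles containing the item word and
    both keywords. Second probability: share of library titles containing the
    item word and the first keyword. The library size cancels out, so the
    ratio reduces to a ratio of counts (assumed reading of "comparing").
    """
    with_first = [t for t in title_library if item_word in t and first_keyword in t]
    if not with_first:
        return 0.0  # assumption: an unseen first keyword gives probability 0
    with_pair = [t for t in with_first if second_keyword in t]
    return len(with_pair) / len(with_first)


library = [
    {"bluetooth", "wireless", "headset"},
    {"bluetooth", "headset"},
    {"wireless", "headset"},
    {"bluetooth", "watch"},
]
print(adjacent_pairs(["wireless", "bluetooth", "headset"]))
print(cooccurrence_probability(library, "headset", "bluetooth", "wireless"))
```

In this toy library, two titles contain both "headset" and "bluetooth", and one of them also contains "wireless", so the ratio is one half.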

In an exemplary embodiment of the present disclosure, the determining of the perplexity according to the co-occurrence probabilities of the target keyword pairs includes: calculating the reciprocal of the geometric mean of the co-occurrence probabilities, and determining the reciprocal as the perplexity.
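As a small numerical illustration, the reciprocal of the geometric mean can be computed in log space to avoid underflow when many small probabilities are multiplied. The function name `perplexity` and the treatment of a zero probability as infinite perplexity are assumptions of this sketch, not something the text specifies.

```python
import math
from typing import Sequence


def perplexity(cooccurrence_probs: Sequence[float]) -> float:
    """Reciprocal of the geometric mean of the pair co-occurrence probabilities."""
    if not cooccurrence_probs:
        raise ValueError("at least one target keyword pair is required")
    if any(p <= 0.0 for p in cooccurrence_probs):
        # Assumption: a pair never seen in the title library makes the title
        # maximally implausible.
        return math.inf
    mean_log = sum(math.log(p) for p in cooccurrence_probs) / len(cooccurrence_probs)
    return math.exp(-mean_log)


print(perplexity([0.5, 0.5]))
print(perplexity([0.25, 1.0]))
```

Both calls give a value of about 2, since the geometric mean is 0.5 in each case; lower co-occurrence probabilities yield a higher perplexity.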

In an exemplary embodiment of the present disclosure, the filtering of the primary candidate titles according to the perplexity to obtain secondary candidate titles includes: sorting the primary candidate titles from low to high according to the perplexity to form a first sequence; and taking, in order, a first preset number of primary candidate titles from the first sequence as the secondary candidate titles.

In an exemplary embodiment of the present disclosure, the target keywords include concept words and item words; the acquiring of the click probability of the secondary candidate title includes: acquiring a first click rate of a third target title, wherein the third target title is a title containing a target concept word and the target item word; acquiring a second click rate corresponding to titles containing the target item word; comparing the first click rate with the second click rate to obtain a third click rate corresponding to the target concept word; and acquiring the click probability of the secondary candidate title according to the third click rate.

In an exemplary embodiment of the present disclosure, the acquiring of the click probability of the secondary candidate title according to the third click rate includes: calculating the geometric mean of the third click rates, and taking the geometric mean as the click probability.
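The click probability calculation can be sketched as follows. Reading "comparing" the first and second click rates as a ratio is an assumption, and the function names `third_click_rate` and `click_probability` are hypothetical. Under this reading, the third click rate measures how much a concept word lifts the click rate of titles for a given item word.

```python
import math
from typing import Sequence


def third_click_rate(first_click_rate: float, second_click_rate: float) -> float:
    """Click rate of titles with concept word + item word, relative to the
    click rate of titles with the item word alone (assumed to be a ratio)."""
    if second_click_rate <= 0.0:
        raise ValueError("second click rate must be positive")
    return first_click_rate / second_click_rate


def click_probability(third_click_rates: Sequence[float]) -> float:
    """Geometric mean of the third click rates of a title's concept words."""
    if not third_click_rates:
        raise ValueError("at least one concept word is required")
    if any(r <= 0.0 for r in third_click_rates):
        return 0.0
    mean_log = sum(math.log(r) for r in third_click_rates) / len(third_click_rates)
    return math.exp(mean_log)


rates = [third_click_rate(0.06, 0.04), third_click_rate(0.03, 0.04)]
print(click_probability(rates))
```

Because each third click rate is a ratio, the resulting score can exceed 1; it is used here only for ranking titles against each other.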

In an exemplary embodiment of the present disclosure, the sorting of the secondary candidate titles based on the click probability and the determining of target candidate titles from the sorted secondary candidate titles include: sorting the secondary candidate titles from high to low according to the click probability to form a second sequence; performing de-duplication on the secondary candidate titles in the second sequence according to the perplexity and the target keywords corresponding to the secondary candidate titles, to obtain a third sequence; acquiring the lowest perplexity among the secondary candidate titles in the third sequence, and filtering the third sequence according to the lowest perplexity to obtain a fourth sequence; and taking, in order, a second preset number of the secondary candidate titles from the fourth sequence as the target candidate titles.
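The four sequences can be illustrated with a sketch, with the caveat that this passage does not spell out the de-duplication or filtering criteria. The sketch assumes that titles sharing the same set of target keywords are duplicates of which the one with the lowest perplexity is kept, and that the filter keeps titles whose perplexity is within a multiple `tolerance` of the lowest perplexity. Both rules, the `Candidate` type, and the `tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    keywords: FrozenSet[str]
    perplexity: float
    click_probability: float


def select_targets(
    secondary: Sequence[Candidate],
    second_preset_number: int,
    tolerance: float = 1.5,  # hypothetical threshold factor
) -> List[Candidate]:
    # Second sequence: click probability from high to low.
    second = sorted(secondary, key=lambda c: c.click_probability, reverse=True)

    # Third sequence (assumed rule): one title per keyword set, lowest perplexity wins.
    best: Dict[FrozenSet[str], Candidate] = {}
    for cand in second:
        kept = best.get(cand.keywords)
        if kept is None or cand.perplexity < kept.perplexity:
            best[cand.keywords] = cand
    third = [c for c in second if best[c.keywords] is c]
    if not third:
        return []

    # Fourth sequence (assumed rule): stay close to the lowest perplexity.
    lowest = min(c.perplexity for c in third)
    fourth = [c for c in third if c.perplexity <= lowest * tolerance]

    return fourth[:second_preset_number]


candidates = [
    Candidate("A", frozenset({"bluetooth", "headset"}), 2.0, 0.90),
    Candidate("B", frozenset({"bluetooth", "headset"}), 1.5, 0.80),
    Candidate("C", frozenset({"wireless", "headset"}), 1.0, 0.70),
    Candidate("D", frozenset({"sports", "headset"}), 5.0, 0.95),
]
print([c.text for c in select_targets(candidates, 2)])
```

In this toy input, "A" is dropped as a duplicate of "B", and "D" is dropped because its perplexity is far above the lowest value, even though it has the highest click probability.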

In an exemplary embodiment of the present disclosure, the order of a primary candidate title corresponds to the number of concept words among the target keywords contained in the primary candidate title.

According to an aspect of the present disclosure, there is provided a title text generation apparatus, including: a perplexity acquisition module, configured to acquire the perplexity of primary candidate titles; a filtering module, configured to filter the primary candidate titles according to the perplexity to obtain secondary candidate titles; a click probability acquisition module, configured to acquire the click probability of the secondary candidate titles; and a determining module, configured to sort the secondary candidate titles based on the click probability and determine a final title from the sorted secondary candidate titles.

According to an aspect of the present disclosure, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the title text generation method of any one of the above.

According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the above-described title text generation methods via execution of the executable instructions.

The title text generation method in the exemplary embodiment of the present disclosure comprehensively sorts and filters the candidate titles by combining their perplexity and click probability to determine the final title. On the one hand, primary candidate titles that are relatively disfluent or illogical are filtered out according to the perplexity, which improves the overall quality and readability of the final title. On the other hand, the click probability of a secondary candidate title characterizes, to a certain extent, the appeal of the title to users, so a final title determined after sorting by click probability takes the potential user appeal of the title into account. The method thereby balances the readability, logical coherence, and user appeal of the generated title.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 illustrates a flowchart of a title text generation method according to an exemplary embodiment of the present disclosure;

FIG. 2 illustrates a flowchart for forming a primary candidate title according to an exemplary embodiment of the present disclosure;

FIG. 3 illustrates a flowchart for calculating the co-occurrence probabilities of target keyword pairs in a primary candidate title according to an exemplary embodiment of the present disclosure;

FIG. 4 illustrates a flowchart for acquiring the click probability of a secondary candidate title according to an exemplary embodiment of the present disclosure;

FIG. 5 illustrates a flowchart for determining target candidate titles from the secondary candidate titles according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a schematic structural diagram of a title text generation apparatus according to an exemplary embodiment of the present disclosure;

FIG. 7 shows a schematic diagram of a storage medium according to an exemplary embodiment of the present disclosure; and

FIG. 8 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

Exemplary embodiments will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their detailed description will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.

In the related art in this field, title text is mainly generated in two ways: effective information is extracted from the related content based on preset rules and combined to generate a title; or a language model is trained to recognize the information the title text should express and thereby assist in generating it, for example by training an LSTM (Long Short-Term Memory) network to output the corresponding title text from extracted features.

Accordingly, the title text generation methods in the related art have the following defects. On the one hand, the title is generated in a prescribed manner without any evaluation of its quality, such as whether the generated title is fluent, whether it is highly relevant to the related content (such as item attribute information), and whether it carries complete semantic information. On the other hand, a title obtained through model training is more accurate to a certain extent, but it rarely hits users' points of interest and therefore lacks appeal to users.

To attract users, many platforms (e.g., enterprise recruitment information platforms, hospital information platforms, e-commerce platforms, online dining platforms, etc.) typically provide relevant content for various items or information objects, including images, title information, and other descriptive content. As one of the fastest ways for users to obtain information, the title text plays a role that cannot be ignored. For example, a title that accurately reflects the advantages of an item and captures users' points of interest can attract more users.

Based on this, in the exemplary embodiment of the present disclosure, a title text generation method is first provided. Referring to fig. 1, the title text generating method includes the steps of:

Step S110: acquiring the perplexity of primary candidate titles;

Step S120: filtering the primary candidate titles according to the perplexity to obtain secondary candidate titles;

Step S130: acquiring the click probability of the secondary candidate titles;

Step S140: sorting the secondary candidate titles based on the click probability, and determining a final title from the sorted secondary candidate titles.

According to the title text generation method in this embodiment, on the one hand, primary candidate titles that are relatively disfluent or illogical are filtered out according to the perplexity, which improves the overall quality and readability of the final title. On the other hand, the click probability of a secondary candidate title characterizes, to a certain extent, the appeal of the title to users, so a final title determined after sorting by click probability takes the potential user appeal of the title into account. The method thereby balances the readability, logical coherence, and user appeal of the generated title.
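Steps S110 to S140 can be summarized in a short sketch. The function `generate_target_titles` and its two scoring callables are hypothetical placeholders for whatever perplexity and click probability computations an implementation uses; the sketch only shows the order of filtering and ranking and omits any further de-duplication.

```python
from typing import Callable, List, Sequence


def generate_target_titles(
    primary_titles: Sequence[str],
    perplexity_of: Callable[[str], float],
    click_probability_of: Callable[[str], float],
    first_preset_number: int,
    second_preset_number: int = 1,
) -> List[str]:
    # S110 + S120: score by perplexity, keep the lowest-perplexity titles.
    by_perplexity = sorted(primary_titles, key=perplexity_of)
    secondary_titles = by_perplexity[:first_preset_number]

    # S130 + S140: score by click probability, rank from high to low.
    by_click = sorted(secondary_titles, key=click_probability_of, reverse=True)
    return by_click[:second_preset_number]


# Toy scores standing in for real perplexity and click statistics.
ppl = {"title a": 1.0, "title b": 3.0, "title c": 2.0}
ctr = {"title a": 0.1, "title b": 0.9, "title c": 0.5}
print(generate_target_titles(list(ppl), ppl.__getitem__, ctr.__getitem__, 2, 1))
```

Here "title b" has the highest click probability but is removed in the perplexity filter, so "title c" is selected.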

The title text generation method in the exemplary embodiment of the present disclosure will be further described below in conjunction with the title text generation process of the item object.

In step S110, the perplexity of the primary candidate titles is acquired.

In an exemplary embodiment of the present disclosure, the primary candidate titles are obtained by extracting target keywords from multiple information sources and preprocessing them, for example by filtering and combining. Perplexity is generally used in natural language processing as an index of the quality of a language model. In the present disclosure, it serves as an index of whether a primary candidate title is fluent, logically clear, and readable: the lower the perplexity, the more fluent and logical the corresponding primary candidate title. Take the titles "bluetooth wireless smart headset" and "bluetooth wireless full-automatic watch" as an example. The latter is clearly illogical and disfluent, so the perplexity of "bluetooth wireless smart headset" is lower than that of "bluetooth wireless full-automatic watch".

Specifically, before the perplexity of a primary candidate title is acquired, item tag words and target keywords are first extracted from the text information corresponding to the current item, and the primary candidate title is generated from them. A target keyword is a modifier or item attribute word that accurately reflects some characteristic of the item and whose wording is correct and rigorous; target keywords comprise concept words and item words. Examples of concept words are "winter", "waterproof", and "spicy"; examples of item words are "pencil", "earphone", and "computer". FIG. 2 shows a flowchart for forming a primary candidate title. As shown in FIG. 2, the process includes the following steps:

In step S210, target keywords are extracted from the text information corresponding to the current item.

In an exemplary embodiment of the present disclosure, the target keywords include concept words and item words. The concept words are key components of the title text and are extracted from multiple information sources by means of a sequence labeling model. The sequence labeling model may be a CRF (Conditional Random Field) model, an LSTM (Long Short-Term Memory network) model, a CRF-BiLSTM (Conditional Random Field with Bidirectional Long Short-Term Memory network) model, and so on. The multiple information sources include item supply information, circulation channel information, demand information, item technical information, and the like. Specifically, concept words of a plurality of preset types can be extracted; Table 1 shows concept words extracted from the multiple information sources of an item according to the preset types.

TABLE 1

Preset type	Concept words
Regional attribute	"USA", "Beijing", "Hai Tao"
Seasonal attribute	"spring and autumn", "winter", "early spring"
Crowd attribute	"boys", "lovers", "the aged"
Scene attribute	"bathroom", "kitchen", "office"

It should be noted that Table 1 is only a partial example of the concept words extracted from the multiple information sources. More types of concept words may also be extracted according to actual requirements, for example, the function attribute (such as "bluetooth"), the taste attribute (such as "spicy"), the style attribute (such as "vintage"), the material attribute (such as "plastic"), and the style attribute (such as "in-ear"). Correspondingly, item words such as "cell phone", "fan", "books", and "unmanned aerial vehicle" can be extracted from the multiple information sources; they are not enumerated here.

In step S220, the target keywords are filtered to obtain a set of compliant keywords.

In an exemplary embodiment of the present disclosure, the target keywords extracted in step S210 may contain prohibited words, words unsuitable for use in a title, or concept words that conflict with the item word. These words are therefore screened out to obtain a set of compliant keywords. For example, words such as "top", "joker", "famous", "exact", and "extreme" are filtered out, as are concept words that conflict with the item word, such as "full-automatic" with "watch".

In step S230, item tag words are extracted from the text information corresponding to the current item to form an item prefix.

In an exemplary embodiment of the present disclosure, an item tag is a word, symbol, sign, design, or the like that distinguishes an item from other items, and comprehensively reflects the position the item occupies in users' minds. Item tag words are words describing these features and may be, for example, brand words. The present disclosure therefore first extracts the item tag words and then forms the item prefix from them according to a preset text format, for example as follows. If either the Chinese item tag word or the English item tag word is empty, acquisition of the item tag word fails and no item prefix is set for the corresponding title. If the Chinese item tag word is identical to the English item tag word, the Chinese item tag word is used as the item prefix of the current item. If one of the two tag words contains the other, the tag word with more text content is used as the item prefix of the current item. Finally, if the Chinese and English item tag words differ, the item prefix takes the form "Chinese item tag word (English item tag word)"; if that prefix is too long, the Chinese item tag word alone is used as the item prefix. For example, if the extracted Chinese item tag word is "索尼" and the English item tag word is "SONY", the formed item prefix is "索尼 (SONY)". Of course, other ways of forming the item prefix may be selected according to the actual situation; the present disclosure includes, but is not limited to, the above ways of forming the item prefix from the item tag words.
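The screening in step S220 can be sketched as a simple filter. The function name `compliant_concept_words`, the prohibited-word set, and the conflict table of (concept word, item word) pairs are hypothetical; how such lists are built is not described in this passage.

```python
from typing import Iterable, List, Set, Tuple


def compliant_concept_words(
    concept_words: Iterable[str],
    item_word: str,
    prohibited: Set[str],
    conflicts: Set[Tuple[str, str]],
) -> List[str]:
    """Drop prohibited words and concept words that conflict with the item word."""
    return [
        word
        for word in concept_words
        if word not in prohibited and (word, item_word) not in conflicts
    ]


prohibited = {"top", "famous", "extreme"}
conflicts = {("full-automatic", "watch")}
words = ["bluetooth", "top", "full-automatic"]
print(compliant_concept_words(words, "watch", prohibited, conflicts))
print(compliant_concept_words(words, "washing machine", prohibited, conflicts))
```

The same concept word can be compliant for one item word and filtered for another, which is why the conflict check takes the item word as input.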
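The prefix rules above translate into a short function. This is a sketch under assumptions: the name `build_item_prefix` is hypothetical, the comparison is exact and case-sensitive, and the "too long" limit is modeled by a hypothetical `max_length` parameter whose default of 20 characters is arbitrary, since the text gives no threshold.

```python
def build_item_prefix(zh_tag: str, en_tag: str, max_length: int = 20) -> str:
    """Form an item prefix from the Chinese and English item tag words."""
    zh, en = zh_tag.strip(), en_tag.strip()

    # Either tag word missing: acquisition fails, no prefix is set.
    if not zh or not en:
        return ""
    # Identical tag words: use the Chinese one.
    if zh == en:
        return zh
    # One contains the other: use the one with more text content.
    if zh in en or en in zh:
        return zh if len(zh) >= len(en) else en
    # Otherwise combine them, falling back to the Chinese tag if too long.
    combined = f"{zh} ({en})"
    return combined if len(combined) <= max_length else zh


print(build_item_prefix("索尼", "SONY"))
print(build_item_prefix("索尼SONY", "SONY"))
```

The order of the checks matters: the containment test must run after the equality test, and the length fallback only applies to the combined form.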

In step S240, a primary candidate title is generated according to the item prefix and the target keyword.

In an exemplary embodiment of the present disclosure, based on the obtained item prefix and the target keywords (concept words and item word) in the keyword set, the primary candidate titles are formed in the format "item prefix + concept words + item word", for example "索尼 (SONY) bluetooth headset" and "索尼 (SONY) bluetooth wireless headset", which contain two and three target keywords, respectively. Correspondingly, the primary candidate titles can be divided into groups according to the number of target keywords they contain, and the groups are defined as first-order text, second-order text, third-order text, and so on; the order of a primary candidate title corresponds to the number of concept words among the target keywords it contains. The specific formats are, for example: the first-order text is item prefix + concept word 1 + item word, such as "索尼 (SONY) bluetooth headset"; the second-order text is item prefix + concept word 1 + concept word 2 + item word, such as "索尼 (SONY) bluetooth wireless headset"; the third-order text is item prefix + concept word 1 + concept word 2 + concept word 3 + item word, such as "索尼 (SONY) bluetooth wireless sports headset".
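Grouping candidates by order can be sketched with `itertools.permutations`. Enumerating every ordering of the concept words is an assumption of this sketch (the text only gives the title format), as are the function name `primary_candidates`, the `max_order` parameter, and joining words with spaces.

```python
from itertools import permutations
from typing import Dict, List, Sequence


def primary_candidates(
    item_prefix: str,
    concept_words: Sequence[str],
    item_word: str,
    max_order: int = 3,
) -> Dict[int, List[str]]:
    """Group candidate titles by order, i.e. by number of concept words."""
    groups: Dict[int, List[str]] = {}
    for order in range(1, max_order + 1):
        groups[order] = [
            # An empty prefix is simply left out of the title.
            " ".join(part for part in (item_prefix, *combo, item_word) if part)
            for combo in permutations(concept_words, order)
        ]
    return groups


groups = primary_candidates("SONY", ["bluetooth", "wireless"], "headset", max_order=2)
for order, titles in groups.items():
    print(order, titles)
```

The number of candidates grows quickly with the number of concept words, which is one reason later filtering steps are needed.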

In step S250, the obtained primary candidate titles are subjected to filtering processing.

In an exemplary embodiment of the disclosure, the concept words in a primary candidate title are ordered: concept words of different types have different position priorities in the title, and the higher the priority, the further forward the corresponding concept word is placed. The primary candidate titles can therefore be filtered according to the types of the concept words they contain. In addition, concept words of the same type may appear only once in the title text, so the primary candidate titles can also be filtered according to how often each type of concept word appears. The present disclosure includes, but is not limited to, the position priorities specified in Table 2 below:

TABLE 2

Priority level	Preset type
3	Regional attribute
2	Seasonal attribute, crowd attribute, scene attribute
1	Style attribute, material attribute, style attribute, function attribute, taste attribute

For example, if concept word 1 belongs to the crowd attribute and concept word 2 belongs to the style attribute, concept word 1 is located before concept word 2 in the formed primary candidate title, as in "lovers' Korean jacket"; primary candidate titles that violate this order, such as one in which "Korean" precedes "lovers'", are filtered out.
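The ordering and uniqueness checks can be illustrated with a short predicate. The names `TYPE_PRIORITY` and `keep_title` and the type labels are hypothetical; the priorities follow Table 2, and allowing either order between two types of equal priority is an assumption, since the text does not address that case.

```python
from typing import Dict, List

# Position priorities following Table 2 (a larger value is placed further forward).
TYPE_PRIORITY: Dict[str, int] = {
    "regional": 3,
    "seasonal": 2, "crowd": 2, "scene": 2,
    "style": 1, "material": 1, "function": 1, "taste": 1,
}


def keep_title(concept_words: List[str], word_type: Dict[str, str]) -> bool:
    """Return True if the concept words respect the type rules."""
    types = [word_type[w] for w in concept_words]
    # Concept words of the same type may appear only once in a title.
    if len(set(types)) != len(types):
        return False
    priorities = [TYPE_PRIORITY[t] for t in types]
    # Priorities must not increase from left to right.
    return all(a >= b for a, b in zip(priorities, priorities[1:]))
```

With "lovers" tagged as a crowd word and "korean" as a style word, the order lovers, korean passes and the reverse order is rejected.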

It should be noted that, if the primary candidate titles are divided into multiple groups according to the number of the target keywords included in the primary candidate titles, the filtering operation of the primary candidate titles may be performed on each group, so as to obtain multiple groups of primary candidate titles, which is not described in detail in this disclosure.

Further, after a plurality of primary candidate titles are formed through steps S210 to S250, the perplexity (confusion degree) of each primary candidate title is acquired. In the present disclosure, a primary candidate title contains at least two target keywords. The perplexity is acquired as follows: first, based on a preset title library, the co-occurrence probability of each target keyword pair in the primary candidate title is calculated; then the perplexity of the primary candidate title is determined from these co-occurrence probabilities. A target keyword pair consists of any two adjacent target keywords in the primary candidate title, that is, it comprises, in order, a first keyword and a second keyword; consequently, a primary candidate title with N target keywords contains N-1 target keyword pairs. For example, if the primary candidate title is "sony wireless bluetooth headset", its target keyword pairs are "wireless bluetooth" and "bluetooth headset", and the co-occurrence probability is calculated for each of the two pairs. Fig. 3 shows a flowchart for calculating the co-occurrence probability of the target keyword pairs in a primary candidate title. As shown in Fig. 3, the process includes the following steps:

In step S310, the target item word corresponding to the primary candidate title is acquired.

In an exemplary embodiment of the present disclosure, preset item words exist for different primary candidate titles, and therefore, a target item word corresponding to a primary candidate title may be obtained from a database, and a co-occurrence probability of a target keyword pair at the granularity of the target item word is calculated.

In step S320, a first probability of a first target title appearing in the preset title library is calculated, wherein the first target title is a title including the target item word and the target keyword pair.

In an exemplary embodiment of the present disclosure, for each target keyword pair in the primary candidate title in turn, a first probability is calculated that a first target title containing both the target keyword pair and the target item word appears. For example, at the granularity of the target item word "headset", the first probability is the probability that a title containing the target keyword pair "wireless bluetooth" and the item word "headset" appears, i.e., P₁(first keyword, second keyword, target item word).

In step S330, a second probability of a second target title appearing in the preset title library is calculated, wherein the second target title is a title including the target item word and the first keyword.

In an exemplary embodiment of the present disclosure, the target keyword pairs are determined by taking adjacent target keywords in order from the head to the tail of the candidate title. In a candidate title such as "Sony Bluetooth Wireless Headset", "Wireless" is therefore the second keyword in the target keyword pair "Bluetooth Wireless" and the first keyword in the target keyword pair "Wireless Headset". For any target keyword pair, the second probability that the second target title appears may then be denoted as P₂(first keyword, target item word).

In step S340, the first probability is compared with the second probability to obtain a co-occurrence probability.

In an exemplary embodiment of the present disclosure, a co-occurrence probability of a target keyword pair in a primary candidate heading may be evaluated as a degree of confusion of the primary candidate heading under a Unigram probability model (univariate model), and the co-occurrence probability may be obtained based on the obtained first probability and second probability by the following formula:

$$\mathrm{Uni}(Word_1, Word_2 \mid Product_*) = P(Word_2 \mid Word_1, Product_*) = \frac{P_1(Word_1, Word_2, Product_*)}{P_2(Word_1, Product_*)}$$

wherein $\mathrm{Uni}(Word_1, Word_2 \mid Product_*)$ represents the co-occurrence of the target keyword pair in the primary candidate title at the granularity of the target item word under the Unigram model, $P(Word_2 \mid Word_1, Product_*)$ is the co-occurrence probability, $P_1(Word_1, Word_2, Product_*)$ is the first probability that the first target title appears, $P_2(Word_1, Product_*)$ is the second probability that the second target title appears, $Word_1$ is the first keyword, $Word_2$ is the second keyword, and $Product_*$ is the target item word.
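Steps S310 to S340 can be sketched as simple counting over a title library. The representation of each library title as a set of words, the function name `cooccurrence_probability`, and the choice to return 0.0 when the second probability is zero are assumptions for illustration; the source does not say how titles are tokenised or how empty counts are handled.

```python
from typing import List, Set


def cooccurrence_probability(first: str, second: str, item_word: str,
                             title_library: List[Set[str]]) -> float:
    """Ratio of the first probability to the second probability."""
    total = len(title_library)
    if total == 0:
        return 0.0
    # First target titles: contain the keyword pair and the item word.
    n_first_target = sum(
        1 for title in title_library if {first, second, item_word} <= title
    )
    # Second target titles: contain the first keyword and the item word.
    n_second_target = sum(
        1 for title in title_library if {first, item_word} <= title
    )
    p1 = n_first_target / total
    p2 = n_second_target / total
    return p1 / p2 if p2 > 0 else 0.0
```

Because both probabilities share the same denominator, the ratio reduces to a ratio of counts; the two probabilities are kept separate here only to mirror the steps in the text.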

Finally, the perplexity of the primary candidate title is determined from the co-occurrence probabilities of its target keyword pairs. For a primary candidate title whose target keywords and target item word are, in order, $\{Word_1, Word_2, \ldots, Product_*\}$, the perplexity can be obtained by taking the reciprocal of the geometric mean of the co-occurrence probabilities, as shown in the following formula:

$$\mathrm{Perplexity}(Word \mid Product_*) = \left( \prod_{i=1}^{N-1} \mathrm{Uni}(Word_i, Word_{i+1} \mid Product_*) \right)^{-\frac{1}{N-1}}$$

wherein $\mathrm{Perplexity}(Word \mid Product_*)$ is the perplexity, $Word_i$ is the first keyword and $Word_{i+1}$ is the second keyword of the i-th target keyword pair, and N is the number of target keywords in the title, so that the product runs over its N-1 target keyword pairs.

It should be noted that the confusion of the primary candidate titles can also be determined by calculating the reciprocal of the arithmetic mean, the reciprocal of the squared mean, and the like of the co-occurrence probability, and the present disclosure includes, but is not limited to, the above-mentioned method of determining the confusion of the primary candidate titles according to the co-occurrence probability.
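The default variant, the reciprocal of the geometric mean, can be computed in log space as in the sketch below. The function name `perplexity` and the small floor `eps`, used to avoid taking the logarithm of zero, are assumptions and are not part of the source.

```python
import math
from typing import Sequence


def perplexity(pair_probs: Sequence[float], eps: float = 1e-12) -> float:
    """Reciprocal of the geometric mean of the co-occurrence probabilities."""
    if not pair_probs:
        raise ValueError("at least one target keyword pair is required")
    # Average the logs, flooring each probability at eps (an assumed safeguard).
    log_mean = sum(math.log(max(p, eps)) for p in pair_probs) / len(pair_probs)
    # exp(-mean log p) equals 1 / geometric mean.
    return math.exp(-log_mean)
```

A title whose keyword pairs co-occur often in the library gets a value near 1, while rare pairings push the value up, which is why lower values are preferred in the later filtering.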

In step S120, the primary candidate headline is filtered according to the confusion degree to obtain a secondary candidate headline.

In an exemplary embodiment of the present disclosure, titles with higher perplexity among the obtained primary candidate titles are filtered out, and the remaining titles are taken as secondary candidate titles. The process is as follows: first, the primary candidate titles are sorted by perplexity from low to high to form a first sequence; then, a first preset number of primary candidate titles are taken in order from the head of the first sequence as the secondary candidate titles. Sorting and filtering by perplexity retains the titles that are more fluent, more logical, and more readable, which amounts to a quality evaluation of the title text during title generation and improves the quality of the generated title text.
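A minimal sketch of this step, assuming the perplexities have already been computed and are held in a dictionary keyed by title (the name `select_secondary` and this data layout are illustrative only):

```python
from typing import Dict, List


def select_secondary(perplexities: Dict[str, float],
                     first_preset_number: int) -> List[str]:
    """Keep the first_preset_number titles with the lowest perplexity."""
    # First sequence: titles sorted by perplexity, low to high.
    first_sequence = sorted(perplexities, key=perplexities.get)
    # Take titles in order from the head of the first sequence.
    return first_sequence[:first_preset_number]
```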

In step S130, the click probability of the secondary candidate headline is obtained.

In an exemplary embodiment of the present disclosure, the click probability of a secondary candidate title is an evaluation of how likely the item is to be clicked: the higher the click probability, the more likely the secondary candidate title is to be clicked, which reflects to some extent how attractive the title is to users. FIG. 4 is a flowchart of the process of obtaining the click probability of a secondary candidate title. As shown in FIG. 4, the process includes the following steps: in step S410, a first click rate of a third target title is obtained, where the third target title is a title including a target concept word and the target item word, and the target concept word is any one of the concept words in the secondary candidate title; in step S420, a second click rate corresponding to titles including the target item word is obtained; in step S430, the first click rate is compared with the second click rate to obtain a third click rate corresponding to the target concept word; in step S440, the click probability of the secondary candidate title is obtained according to the third click rate.

For example, for the secondary candidate title "sony bluetooth wireless headset", a first click rate is first calculated for third target titles containing the target item word "headset" and the target concept word "bluetooth"; then a second click rate is calculated for titles containing the target item word "headset"; finally, the first click rate is compared with the second click rate to obtain the third click rate corresponding to the target concept word "bluetooth". The third click rate corresponding to the target concept word "wireless" can be obtained in the same way, which is not described in detail in this disclosure. The click probability of the secondary candidate title is then determined from the third click rates of all target concept words in the secondary candidate title; for instance, the geometric mean of the third click rates can be taken as the click probability. For a secondary candidate title whose target concept words are, in order, $\{Word_1, Word_2, \ldots\}$ and whose target item word is $Product_*$, the click probability can be obtained by the following formula:

$$P_{click}(Word \mid Product_*) = \left( \prod_{i=1}^{M} P_{click}(Word_i \mid Product_*) \right)^{\frac{1}{M}}$$

wherein $P_{click}(Word \mid Product_*)$ is the click probability of the secondary candidate title, $P_{click}(Word_i \mid Product_*)$ is the third click rate of the i-th target concept word, and M is the number of target concept words in the title. Note that the click probability of the secondary candidate title can also be determined by calculating the arithmetic mean, the square mean, or the like of the third click rates; the present disclosure includes, but is not limited to, the above method of obtaining the click probability of the secondary candidate title according to the third click rates.
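Steps S410 to S440 can be sketched as follows. Reading "comparing" the two click rates as taking their ratio is an assumption (it matches the ratio used for the co-occurrence probability), and the function name `click_probability`, the dictionary of per-concept-word click rates, and the zero fallback are hypothetical.

```python
import math
from typing import Dict, Sequence


def click_probability(concept_words: Sequence[str],
                      ctr_with_concept: Dict[str, float],
                      ctr_item: float) -> float:
    """Geometric mean of the third click rates of the concept words.

    ctr_with_concept[w]: first click rate, titles with w and the item word.
    ctr_item: second click rate, titles with the item word.
    """
    if not concept_words or ctr_item <= 0:
        return 0.0
    # Third click rate: first click rate relative to the second (assumed ratio).
    third_rates = [ctr_with_concept[w] / ctr_item for w in concept_words]
    if any(rate <= 0 for rate in third_rates):
        return 0.0
    log_mean = sum(math.log(rate) for rate in third_rates) / len(third_rates)
    return math.exp(log_mean)
```

Under this reading the result is a relative score that can exceed 1 when a concept word attracts more clicks than the item word on average; it is used only for ranking titles against one another.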

In step S140, the secondary candidate titles are ranked based on the click probability, and a target candidate title is determined from the ranked secondary candidate titles.

In an exemplary embodiment of the present disclosure, the secondary candidate titles are ranked based on the obtained click probability to determine a target candidate title from the ranked secondary candidate titles. FIG. 5 is a flow diagram illustrating the determination of target candidate headings based on secondary candidate headings, as shown in FIG. 5, the process comprising:

In step S510, the secondary candidate titles are sorted from high to low according to the click probability to form a second sequence.

In step S520, the secondary candidate titles in the second sequence are de-duplicated according to the perplexity and the target keywords corresponding to the secondary candidate titles, so as to obtain a third sequence.

In an exemplary embodiment of the present disclosure, the perplexity of a secondary candidate title is the same as the perplexity it had as a primary candidate title. The de-duplication keeps, among different secondary candidate titles composed of the same target keywords, only the one with the lowest perplexity. For example, "sony wireless bluetooth headset" and "sony bluetooth wireless headset" are composed of the same target keywords, so only the one with the lower perplexity is retained.

In step S530, the lowest perplexity among the secondary candidate titles in the third sequence is obtained, and the third sequence is filtered according to the lowest perplexity to obtain a fourth sequence.

In an exemplary embodiment of the present disclosure, the lowest perplexity among the secondary candidate titles is first obtained; then the secondary candidate titles whose perplexity exceeds the lowest perplexity by more than a preset percentage are deleted to obtain the fourth sequence. Deleting these titles keeps the perplexity of the secondary candidate titles in the fourth sequence balanced, that is, the differences in accuracy, logic, and readability among the secondary candidate titles do not exceed a preset threshold.

In step S540, a second preset number of secondary candidate titles are taken in order from the fourth sequence, and the second preset number of secondary candidate titles are taken as the target candidate titles.

In an exemplary embodiment of the disclosure, since the secondary candidate titles in the fourth sequence are obtained after being sorted according to the click probability, a second preset number of secondary candidate titles are directly and sequentially intercepted from the fourth sequence as the target candidate titles, where the second preset number is smaller than or equal to the first preset number. The title is sorted and filtered based on the click probability of the secondary candidate title, the potential attraction of the title text to the user is considered, the title text is made more attractive to a certain extent, in addition, the secondary candidate title is subjected to de-duplication and filtering processing according to the confusion degree and the target keyword composition corresponding to the secondary candidate title, and the overall smoothness, the logic and the readability of the target candidate title are improved.
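The four steps of FIG. 5 can be put together in one short sketch. The `Candidate` record, the function name `select_targets`, and the reading of "higher than the lowest perplexity by a preset percentage" as `perplexity > lowest * (1 + preset_pct)` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Candidate(NamedTuple):
    title: str
    keywords: FrozenSet[str]   # target keyword composition of the title
    perplexity: float
    click_prob: float


def select_targets(cands: List[Candidate], preset_pct: float,
                   second_preset_number: int) -> List[str]:
    if not cands:
        return []
    # Sort by click probability, high to low (second sequence).
    second = sorted(cands, key=lambda c: c.click_prob, reverse=True)
    # For each keyword composition keep only the lowest-perplexity title.
    best: Dict[FrozenSet[str], Candidate] = {}
    for c in second:
        kept = best.get(c.keywords)
        if kept is None or c.perplexity < kept.perplexity:
            best[c.keywords] = c
    third = [c for c in second if best[c.keywords] is c]
    # Drop titles whose perplexity exceeds the lowest by more than preset_pct.
    lowest = min(c.perplexity for c in third)
    limit = lowest * (1.0 + preset_pct)
    fourth = [c for c in third if c.perplexity <= limit]
    # Take the first second_preset_number titles as target candidate titles.
    return [c.title for c in fourth[:second_preset_number]]
```

The click-probability order established by the first sort is preserved through de-duplication and filtering, so the final slice returns the most clickable titles among those that remain.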

It should be noted that, with continued reference to step S240, if the primary candidate titles are initially divided into a plurality of groups according to the number of target keywords they contain, the subsequent sorting, filtering, and other operations based on the perplexity and the click probability are all performed independently within each group. This yields a plurality of groups of titles with different numbers of target keywords, so that title copy of different lengths can be provided to users with different behavioral habits, which improves the flexibility of generating and using the title text; this is not described in detail in this disclosure.

In addition, in an exemplary embodiment of the present disclosure, a title text generation apparatus is also provided. Referring to fig. 6, the title text generation apparatus 600 may include a confusion degree acquisition module 610, a filtering module 620, a click probability acquisition module 630, and a determination module 640. Specifically:

a confusion acquiring module 610 for acquiring a confusion of the first-level candidate title;

a filtering module 620, configured to filter the primary candidate titles according to the perplexity to obtain secondary candidate titles;

a click probability obtaining module 630, configured to obtain a click probability of the secondary candidate title;

a determining module 640, configured to rank the secondary candidate titles based on the click probability, and determine a target candidate title from the ranked secondary candidate titles.

In an exemplary embodiment of the disclosure, the heading text generating apparatus further includes a heading generating module, configured to extract an article tag word and a target keyword in text information corresponding to a current article, and generate the primary candidate heading according to the article tag word and the target keyword.

In an exemplary embodiment of the present disclosure, the primary candidate titles include at least two target keywords; the confusion degree obtaining module may include a co-occurrence probability calculating unit, configured to calculate a co-occurrence probability of a target keyword pair in the primary candidate titles, and a confusion degree determining unit, configured to determine the confusion degree according to the co-occurrence probability of the target keyword pair, wherein the target keyword pair consists of any two adjacent target keywords in the primary candidate titles.

In an exemplary embodiment of the present disclosure, the target keyword pair includes a first keyword and a second keyword in sequence; the confusion degree obtaining module may further include a target item word obtaining unit configured to obtain a target item word corresponding to the primary candidate title; a first probability obtaining unit, configured to obtain a first probability that a first target title appears in a preset title library, where the first target title is a title including the target item word and the target keyword pair; a second probability obtaining unit, configured to obtain a second probability that a second target title appears in the preset title library, where the second target title is a title including the target item word and the first keyword; a ratio obtaining unit, configured to compare the first probability with the second probability to obtain the co-occurrence probability.

In an exemplary embodiment of the present disclosure, an inverse of the geometric mean of the co-occurrence probabilities is found and the inverse is determined as the degree of confusion.

In an exemplary embodiment of the present disclosure, the first-level candidate titles are plural; the filtering module may include a sorting unit configured to sort the primary candidate titles into a first sequence according to the perplexity from low to high; and the data intercepting unit is used for sequentially intercepting a first preset number of first-level candidate titles from the first sequence and taking the first preset number of first-level candidate titles as the second-level candidate titles.

In an exemplary embodiment of the present disclosure, the target keywords include concept words and item words; the click probability obtaining module may include a first click rate obtaining unit, configured to obtain a first click rate of a third target title, where the third target title is a title including a target concept word and the target item word; a second click rate obtaining unit, configured to obtain a second click rate corresponding to titles including the target item word; a third click rate obtaining unit, configured to compare the first click rate with the second click rate to obtain a third click rate corresponding to the target concept word; and a click probability obtaining unit, configured to obtain the click probability of the secondary candidate title according to the third click rate.

In an exemplary embodiment of the disclosure, the click probability is determined by taking a geometric mean of each of the third click rates and taking the geometric mean as the click probability.

In an exemplary embodiment of the disclosure, the determining module may include a sorting unit configured to sort the secondary candidate titles into a second sequence according to the click probability from high to low; the duplicate removal processing unit is used for carrying out duplicate removal processing on the secondary candidate titles in the second sequence according to the confusion degree and the target keywords corresponding to the secondary candidate titles so as to obtain a third sequence; the filtering unit is used for acquiring the lowest confusion degree corresponding to the secondary candidate titles in the third sequence and filtering the third sequence according to the lowest confusion degree to acquire a fourth sequence; and the data intercepting unit is used for sequentially intercepting the secondary candidate titles with a second preset number from the fourth sequence and taking the secondary candidate titles with the second preset number as the target candidate titles.

Since each functional module of the title text generation apparatus according to the exemplary embodiment of the present disclosure corresponds to the steps of the title text generation method embodiment described above, the details are not repeated here.

It should be noted that although several modules or units of the title text generation apparatus are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

In addition, in the exemplary embodiments of the present disclosure, a computer storage medium is also provided, on which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification.

Referring to fig. 7, a program product 700 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided. As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."

An electronic device 800 according to such an embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.

As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, a bus 830 connecting different system components (including the memory unit 820 and the processing unit 810), and a display unit 840.

Wherein the storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification.

The storage unit 820 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.

The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.
