Text label extraction method and device and storage medium

Document No.: 1253141  Publication date: 2020-08-21

Note: This technique, "Text label extraction method and device and storage medium", was designed and created by 毛晶晶, 陈渊, and 淳刚 on 2020-04-01. Abstract: The disclosure relates to a text label extraction method and device and a storage medium. The method comprises the following steps: preprocessing a target text to obtain a candidate tag set of the target text; performing feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags, wherein the feature set comprises at least two features describing the candidate tag; and determining a target label matching the target text based on the feature set of the candidate labels. Through the embodiments of the disclosure, the extraction accuracy of text labels can be improved.

1. A method for extracting text labels, characterized by comprising the following steps:

preprocessing a target text to obtain a candidate tag set of the target text;

performing feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags; wherein the feature set comprises: at least two features describing the candidate tag;

determining a target label matching the target text based on the feature set of the candidate label.

2. The method of claim 1, wherein determining a target label matching the target text based on the feature set of the candidate labels comprises:

inputting the feature set of each candidate label into a ranking learning model to obtain a score value of each candidate label;

selecting, based on the score value of each of the candidate tags, one or more candidate tags from the candidate tags to be determined as target tags of the target text.

3. The method of claim 2, wherein the selecting, based on the score value of each of the candidate tags, one or more candidate tags to be determined as target tags of the target text comprises:

normalizing the score value of each candidate label to obtain a normalized score result;

selecting one or more candidate labels whose normalized scoring result is greater than a scoring threshold, and determining these candidate labels as the target labels of the target text.

4. The method of claim 3, wherein selecting one or more of the candidate tags for which the normalized scoring result is greater than a scoring threshold to determine as a target tag of the target text further comprises:

when the number of candidate labels whose normalized scoring result is greater than the scoring threshold is greater than a number threshold N, selecting, from the candidate labels whose normalized scoring result is greater than the scoring threshold, the N candidate labels with the highest normalized scores, and determining them as the target labels.

5. The method of claim 2, further comprising:

obtaining a first feature pair of correct labels of at least two sample texts;

obtaining a second feature pair of wrong labels of the at least two sample texts;

inputting the first feature pair and the second feature pair into a ranking training model, and training to obtain the ranking learning model; wherein the score given by the ranking learning model to a correct label is greater than the score given by the ranking learning model to a wrong label.

6. The method of claim 5, wherein the ranking training model is a model formed by optimizing a loss model with a gradient model.

7. The method of any of claims 1 to 6, wherein the features in the feature set comprise at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of a word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the candidate tag is contained in the keyword of the target text;

whether the candidate tag is contained in the expanded keyword of the target text or not;

the length of the word corresponding to the candidate tag;

and the inverse document frequency of the word corresponding to the candidate tag.

8. An apparatus for extracting text labels, the apparatus comprising:

the preprocessing module is configured to preprocess a target text to obtain a candidate tag set of the target text;

the extraction module is configured to perform feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags; wherein the feature set comprises: at least two features describing the candidate tag;

a determination module configured to determine a target tag matching the target text based on the feature set of the candidate tag.

9. The apparatus of claim 8, wherein the determining module comprises:

the input module is configured to input the feature set of each candidate label into a ranking learning model to obtain a score of each candidate label;

a first selection module configured to select, based on the score value of each of the candidate tags, one or more candidate tags to be determined as target tags of the target text.

10. The apparatus of claim 9, wherein the first selection module comprises:

the processing module is configured to perform normalization processing on the score value of each candidate label to obtain a normalized score result;

a second selection module configured to select one or more candidate tags whose normalized scoring result is greater than a scoring threshold, and determine these candidate tags as the target tags of the target text.

11. The apparatus of claim 10, wherein the second selection module is further configured to: when the number of candidate tags whose normalized scoring result is greater than the scoring threshold is greater than a number threshold N, select, from the candidate tags whose normalized scoring result is greater than the scoring threshold, the N candidate tags with the highest normalized scores as the target tags.

12. The apparatus of claim 9, further comprising:

the first acquisition module is configured to acquire a first feature pair of correct labels of at least two sample texts;

a second obtaining module configured to obtain a second feature pair of wrong labels of the at least two sample texts;

a training module configured to input the first feature pair and the second feature pair into a ranking training model, and train to obtain the ranking learning model; wherein the score given by the ranking learning model to a correct label is greater than the score given by the ranking learning model to a wrong label.

13. The apparatus of claim 12, wherein the ranking training model is a model formed by optimizing a loss model with a gradient model.

14. The apparatus of any of claims 8 to 13, wherein the features in the feature set comprise at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of a word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the candidate tag is contained in the keyword of the target text;

whether the candidate tag is contained in the expanded keyword of the target text or not;

the length of the word corresponding to the candidate tag;

and the inverse document frequency of the word corresponding to the candidate tag.

15. An extraction apparatus for text labels, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of extracting text labels of any one of claims 1 to 7.

16. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor, enable the processor to perform the method of extracting a text label according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of natural language processing, and in particular, to a method and an apparatus for extracting text labels, and a storage medium.

Background

With information in the network era, such as scientific and technological literature, social media content, and web pages, growing geometrically, analyzing and mining large-scale text data has become a field of current concern, and how to effectively represent text information has become a fundamental and hot research problem in the field of natural language processing.

In practice, text labels are words or phrases that are more condensed than a text abstract. Existing text labels are usually used to represent text information and the words or phrases a user is interested in, so they can help the user quickly understand the text content and support classifying and recommending texts through the labels. Therefore, the accuracy of text label extraction directly affects the final effect of recommendation or search for the user.

Disclosure of Invention

The disclosure provides a text label extraction method and device and a storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a method for extracting a text tag, including:

preprocessing a target text to obtain a candidate tag set of the target text;

performing feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags; wherein the feature set comprises: at least two features describing the candidate tag;

determining a target label matching the target text based on the feature set of the candidate label.

In some embodiments, the determining a target tag that matches the target text based on the feature set of the candidate tag includes:

inputting the feature set of each candidate label into a ranking learning model to obtain a score value of each candidate label;

selecting, based on the score value of each of the candidate tags, one or more candidate tags from the candidate tags to be determined as target tags of the target text.

In some embodiments, the selecting, based on the score value of each of the candidate tags, one or more candidate tags to be determined as target tags of the target text includes:

normalizing the score value of each candidate label to obtain a normalized score result;

selecting one or more candidate labels whose normalized scoring result is greater than a scoring threshold, and determining these candidate labels as the target labels of the target text.

In some embodiments, the selecting one or more candidate tags for which the normalized scoring result is greater than a scoring threshold to determine as the target tag of the target text further includes:

when the number of candidate labels whose normalized scoring result is greater than the scoring threshold is greater than a number threshold N, selecting, from the candidate labels whose normalized scoring result is greater than the scoring threshold, the N candidate labels with the highest normalized scores, and determining them as the target labels.

In some embodiments, the method further comprises:

obtaining a first feature pair of correct labels of at least two sample texts;

obtaining a second feature pair of wrong labels of the at least two sample texts;

inputting the first feature pair and the second feature pair into a ranking training model, and training to obtain the ranking learning model; wherein the score given by the ranking learning model to a correct label is greater than the score given by the ranking learning model to a wrong label.

In some embodiments, the ranking training model is a model formed by optimizing a loss model with a gradient model.

In some embodiments, the features in the feature set include at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of a word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the candidate tag is contained in the keyword of the target text;

whether the candidate tag is contained in the expanded keyword of the target text or not;

the length of the word corresponding to the candidate tag;

and the inverse document frequency of the word corresponding to the candidate tag.

According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for extracting a text label, the apparatus including:

the preprocessing module is configured to preprocess a target text to obtain a candidate tag set of the target text;

the extraction module is configured to perform feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags; wherein the feature set comprises: at least two features describing the candidate tag;

a determination module configured to determine a target tag matching the target text based on the feature set of the candidate tag.

In some embodiments, the determining module comprises:

the input module is configured to input the feature set of each candidate label into a ranking learning model to obtain a score of each candidate label;

a first selection module configured to select, based on the score value of each of the candidate tags, one or more candidate tags to be determined as target tags of the target text.

In some embodiments, the first selection module comprises:

the processing module is configured to perform normalization processing on the score value of each candidate label to obtain a normalized score result;

a second selection module configured to select one or more candidate tags whose normalized scoring result is greater than a scoring threshold, and determine these candidate tags as the target tags of the target text.

In some embodiments, the second selection module is further configured to: when the number of candidate tags whose normalized scoring result is greater than the scoring threshold is greater than a number threshold N, select, from the candidate tags whose normalized scoring result is greater than the scoring threshold, the N candidate tags with the highest normalized scores as the target tags.

In some embodiments, the apparatus further comprises:

the first acquisition module is configured to acquire a first feature pair of correct labels of at least two sample texts;

a second obtaining module configured to obtain a second feature pair of wrong labels of the at least two sample texts;

a training module configured to input the first feature pair and the second feature pair into a ranking training model, and train to obtain the ranking learning model; wherein the score given by the ranking learning model to a correct label is greater than the score given by the ranking learning model to a wrong label.

In some embodiments, the ranking training model is a model formed by optimizing a loss model with a gradient model.

In some embodiments, the features in the feature set include at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of a word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the candidate tag is contained in the keyword of the target text;

whether the candidate tag is contained in the expanded keyword of the target text or not;

the length of the word corresponding to the candidate tag;

and the inverse document frequency of the word corresponding to the candidate tag.

According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for extracting a text label, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of extracting text labels as described in the first aspect above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium including:

the instructions in said storage medium, when executed by a processor, enable the processor to perform the method of extracting text labels as described in the first aspect above.

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:

according to the embodiments of the present disclosure, the target tag is selected from the candidate tags based on at least two features describing the candidate tags in the feature set of the candidate tags. On one hand, the candidate tags can be evaluated through the features describing them, achieving the purpose of determining the target tag; on the other hand, the embodiments of the present disclosure do not judge by a single feature, but judge whether a candidate tag is the target tag based on at least two features jointly, which can improve the accuracy of determining the target tag.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart of a method for extracting text labels according to an embodiment of the present disclosure.

Fig. 2 is a flowchart of a method for extracting text labels according to an embodiment of the present disclosure.

Fig. 3 is a flowchart of a method for extracting text labels according to an embodiment of the present disclosure.

Fig. 4 is a flowchart of a method for extracting text labels according to an embodiment of the present disclosure.

Fig. 5 is a flowchart of a method for extracting text labels according to an embodiment of the present disclosure.

Fig. 6 is a first block diagram of an apparatus for extracting text labels according to an embodiment of the present disclosure.

Fig. 7 is a second block diagram of an apparatus for extracting text labels according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Fig. 1 is a flowchart of a method for extracting a text tag according to an embodiment of the present disclosure, where as shown in fig. 1, the method for extracting a text tag includes the following steps:

s11, preprocessing the target text to obtain a candidate label set of the target text;

S12, performing feature extraction on the candidate labels in the candidate label set to obtain a feature set of the candidate labels; wherein the feature set comprises: at least two features describing a candidate label;

and S13, determining the target label matched with the target text based on the characteristic set of the candidate label.

The target text includes news text, periodical text, academic text, prose, reports, and the like. Illustratively, when the target text is a news text, the target label of the news text can be determined by the text label extraction method provided by the embodiments of the present disclosure, and other news texts can then be automatically screened through the target label to recommend news of interest to the user.

In the embodiment of the present disclosure, the preprocessing the target text includes: performing word segmentation processing on the target text to obtain word groups after word segmentation; filtering the word group after word segmentation to obtain a filtered word group; and matching the filtered word group with the labels in the label library to obtain a candidate label set of the target text.

The word segmentation processing includes: dividing the language in the target text into individual words according to the grammatical structure of the text, or directly dividing the content of the target text into characters, words, or phrases at granularities ranging from the finest to the coarsest. For example, "biologists are doing biological experiments" is segmented into the phrase "biologists, are doing, biological, experiments"; "MYbank (网商银行) is the most important product of Ant Financial (蚂蚁金服)" is segmented into the phrase "MYbank, is, Ant Financial, most important, product".

The filtering of the segmented word group includes: removing words of a predetermined type from the target text, the predetermined type including but not limited to: particles without actual meaning and/or emoticons. For example, stop words in the segmented phrase are removed, where the stop words include, but are not limited to, modal particles, adverbs, prepositions, and conjunctions. For example, when the segmented phrase is "MYbank, is, Ant Financial, most important, product", the corresponding filtered phrase is "MYbank, Ant Financial, product". In this way, the noise of the target text can be reduced by filtering the segmented word group.

The tag library may be an existing, manually maintained tag library. It should be noted that the manually maintained tag library may contain a large number of tags, for example, on the order of one million. The embodiment of the present disclosure may select, from the tag library, the tags matching each word in the filtered phrase as the candidate tag set.
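The preprocessing pipeline described above (segmentation, stop-word filtering, tag-library matching) can be sketched as follows. This is a minimal illustration, not the disclosure's actual implementation: the whitespace tokenizer, the stop-word list, and the tag library are toy placeholders (a Chinese pipeline would use a real segmenter).

```python
# Toy preprocessing sketch: tokenize -> filter stop words -> match tag library.
STOP_WORDS = {"is", "the", "of", "are", "and", "a"}

def preprocess(text: str, tag_library: set) -> list:
    words = text.lower().replace(",", " ").split()        # naive segmentation
    filtered = [w for w in words if w not in STOP_WORDS]  # noise filtering
    return [w for w in filtered if w in tag_library]      # candidate tag set

tags = preprocess("MYbank is the most important product of Ant Financial",
                  tag_library={"mybank", "product", "bank"})
print(tags)  # ['mybank', 'product']
```

Filtering before matching keeps function words from ever reaching the tag library, which mirrors the noise-reduction motivation above.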

In the embodiment of the present disclosure, the feature set may include at least two features describing the candidate tag, including but not limited to the inverse document frequency (IDF), the similarity between the target text and the candidate tag, and the TextRank value.

When the feature is the similarity between the title of the target text and the candidate tag, performing feature extraction on the candidate tags in the candidate tag set includes: performing word segmentation processing on the title to obtain a first word group; performing weighting processing on the word vectors of the first word group to obtain a feature vector of the title of the target text; and determining the similarity between the title of the target text and the candidate tag based on the feature vector of the title, the candidate tag, and a cosine similarity model.

Illustratively, the feature vector V_title of the title of the target text may be obtained by formula (1):

V_title = (1/n) * Σ_{i=1..n} V_i    (1)

where V_i is the word vector of the i-th word in the first word group, and n is the total number of words in the first word group.

When the feature is the similarity between the body of the target text and the candidate tag, performing feature extraction on the candidate tags in the candidate tag set includes: performing word segmentation processing on the body to obtain a second word group; performing weighting processing on the word vectors of the second word group to obtain a feature vector of the body of the target text; and determining the similarity between the body of the target text and the candidate tag based on the feature vector of the body, the candidate tag, and a cosine similarity model.

Illustratively, the feature vector V_body of the body of the target text may be obtained by formula (2):

V_body = (1/m) * Σ_{j=1..m} V_j    (2)

where V_j is the word vector of the j-th word in the second word group, and m is the total number of words in the second word group.
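Formulas (1)/(2) and the cosine-similarity match can be sketched as below. The averaging of word vectors is one plausible form of the "weighting processing" described above (the disclosure does not fix the weights), and the random word vectors are illustrative stand-ins for a trained embedding.

```python
import numpy as np

def text_vector(word_vectors: np.ndarray) -> np.ndarray:
    # V = (1/n) * sum(V_i): average of the word vectors of the phrase
    return np.mean(word_vectors, axis=0)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity between a text vector and a candidate-tag vector
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
title_word_vecs = rng.normal(size=(4, 8))   # 4 title words, 8-dim embeddings
tag_vec = rng.normal(size=8)                # candidate-tag embedding
v_title = text_vector(title_word_vecs)
print(round(cosine_sim(v_title, tag_vec), 3))
```

The same two helpers serve both the title feature (formula (1)) and the body feature (formula (2)); only the input phrase differs.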

When the feature is the inverse document frequency, performing feature extraction on the candidate tags in the candidate tag set includes: acquiring the number of texts in a text library; acquiring the number of texts in the text library that contain the candidate tag; and determining the inverse document frequency based on the number of texts in the text library and the number of texts containing the candidate tag.

Illustratively, the inverse document frequency IDF_t may be obtained by formula (3):

IDF_t = log(N / N_t)    (3)

where the text set is Ω, N is the number of all texts in the text set, and N_t is the number of texts containing the candidate tag.
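Formula (3) in code, with each text represented as a set of its words; the natural-log base and the unsmoothed form are assumptions, since the disclosure does not specify them.

```python
import math

def inverse_document_frequency(corpus: list, tag: str) -> float:
    # IDF_t = log(N / N_t): N texts in the corpus, N_t containing the tag
    n = len(corpus)
    n_t = sum(1 for doc in corpus if tag in doc)
    return math.log(n / n_t) if n_t else float("inf")

corpus = [{"bank", "loan"}, {"bank", "rate"}, {"weather"}]
print(inverse_document_frequency(corpus, "bank"))  # log(3/2)
```

A tag appearing in every text gets IDF 0, while a rare tag gets a large IDF, which is why this feature favors discriminative candidates.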

When the feature is the TextRank value, feature extraction is performed on the candidate tags in the candidate tag set through formula (4) to obtain the TextRank value.

Here, it is assumed that the target text, composed of the words in the candidate tags whose parts of speech are specified, is denoted as Doc = {w1, w2, w3, …, wn}, and each of the words w1, w2, w3, …, wn can be regarded as a node. The window size is set to k, so that {w1, w2, …, wk}, {w2, w3, …, wk+1}, {w3, w4, …, wk+2}, and so on are all windows. There is an undirected, unweighted edge between the nodes corresponding to any two words in the same window. The TextRank value is computed iteratively by formula (4):

TR(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] * TR(V_j)    (4)

where TR(V_i) represents the TextRank value of node V_i, TR(V_j) represents the TextRank value of node V_j, d represents the damping coefficient, typically set to 0.85, In(V_i) is the set of predecessor nodes of V_i (the nodes linking to V_i), Out(V_j) is the set of successor nodes of V_j (the nodes V_j links to), w_ji and w_jk are edge weights, w_ji is the similarity between sentences, and w_jk can be taken as 1.
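A compact sketch of formula (4) over a word co-occurrence graph, with unit edge weights (w_jk = 1) and d = 0.85. The window construction and fixed iteration count are simplifications for illustration, not the disclosure's exact procedure.

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    # Build undirected edges between words co-occurring within the window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    neighbors = dict(neighbors)
    tr = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        # TR(Vi) = (1 - d) + d * sum over neighbors Vj of TR(Vj) / |Out(Vj)|
        tr = {w: (1 - d) + d * sum(tr[u] / len(neighbors[u])
                                   for u in neighbors[w])
              for w in neighbors}
    return tr

scores = textrank("bank offers loan bank sets rate bank".split())
print(max(scores, key=scores.get))  # 'bank' — the most connected node
```

The word linked to the most other windows accumulates the highest score, which is exactly the centrality notion the feature is meant to capture.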

In the embodiment of the disclosure, after the feature set of the candidate tag is obtained, the target tag matched with the target text can be determined based on the feature set. The features in the feature set are used for describing the candidate tags, and may be used for evaluating each index of each candidate tag, and further determining whether the candidate tag is the target tag by synthesizing each index.

It should be noted that the frequency with which a candidate tag appears in the target text, its position, its part of speech, its similarity to the target text, and the like can all describe the candidate tag, and can all affect whether the candidate tag can become the target tag. Therefore, in the embodiment of the present disclosure, the frequency of the candidate tag in the target text (for example, its word frequency and inverse document frequency) and its position (for example, whether the candidate tag is in the title of the article, the position where it first appears in the article, the position where it last appears in the article, the sentence position/number where it first appears in the body, the sentence position/number where it last appears in the body, and the like) may be used as features of the candidate tag, and whether the candidate tag is the target tag may be determined by the feature set composed of these features. In this way, the candidate tag is evaluated based on a plurality of features, and whether it is the target tag can be determined more accurately.

Existing text label extraction mainly includes: unsupervised tag extraction from text and supervised tag extraction from text. In unsupervised extraction, tags are usually extracted by statistical term frequency-inverse document frequency, by a word graph model, or based on a topic model. For example, the idea of extracting labels by term frequency-inverse document frequency is: if a word or phrase appears frequently in one text and rarely in other texts, the word or phrase is considered able to summarize the content of that text well. Although the existing unsupervised extraction process is simple, in practical applications it relies on only a single feature. For example, term frequency-inverse document frequency is a weighting scheme that tries to suppress noise and is biased toward words with low frequency across the corpus, and it determines tags only from the number of texts in the corpus; as a result, tag extraction through the existing unsupervised methods suffers from low accuracy.

In supervised extraction of labels from text, label extraction is generally treated as a binary classification problem, and whether a word or phrase in the text can serve as a label is judged by naive Bayes, decision trees, support vector machines, and the like. In this process, the candidate labels in the text are classified directly, rather than based on a plurality of features of the candidate labels, so the precision is low.

Based on this, the embodiment of the disclosure selects the target tag from the candidate tags based on at least two features describing the candidate tags in the feature set of the candidate tags, and can determine whether the candidate tag is the target tag based on more features, thereby enabling the determined target tag to be more accurate.

In some embodiments, as shown in fig. 2, step S13 of determining a target tag matching the target text based on the feature set of the candidate tags includes:

s13a, inputting the feature set of each candidate label into the arrangement learning model to obtain the score of each candidate label;

and S13b, selecting, based on the score values of the candidate labels, one or more candidate labels to be determined as target labels of the target text.
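The selection in S13b, together with the normalization, score threshold, and number threshold N described earlier, can be sketched as follows. The softmax normalization and the threshold values are illustrative assumptions; the disclosure does not fix a particular normalization method.

```python
import math

def select_tags(scores: dict, threshold: float = 0.2, n: int = 2) -> list:
    # Normalize the raw score values (softmax used here as one option).
    total = sum(math.exp(s) for s in scores.values())
    normalized = {t: math.exp(s) / total for t, s in scores.items()}
    # Keep tags above the scoring threshold, capped at the number threshold N.
    above = [t for t, s in normalized.items() if s > threshold]
    above.sort(key=lambda t: normalized[t], reverse=True)
    return above[:n]

print(select_tags({"bank": 2.0, "loan": 1.5, "weather": -1.0}))  # ['bank', 'loan']
```

Capping at N keeps the tag set compact even when many candidates clear the threshold, matching the behavior described for the number threshold.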

In the embodiment of the present disclosure, the ranking learning model is a model obtained by training a ranking training model on sample texts, where the ranking training model includes a LambdaMART model, a Gradient Boosting Decision Tree (GBDT) model, a Light Gradient Boosting Machine (LightGBM) model, a support vector machine model, and, among deep-learning-based classification models, a deep neural network model or a convolutional neural network model; the embodiment of the present disclosure is not limited in this regard.

Taking the LambdaMART model as an example, the LambdaMART model may be composed of two parts: one part uses Multiple Additive Regression Trees (MART), i.e., the Gradient Boosting Decision Tree (GBDT), as the underlying training model; the other part uses Lambda as the gradient in the GBDT solving process, where Lambda quantifies the direction and strength by which a candidate tag to be ranked should be adjusted in the next iteration.

It should be noted that, since the inputs to Lambda are candidate tag pairs, and the loss function involved in the calculation evaluates the difference between the predicted and true ranking of the candidate tag pairs, the goal is to minimize the incorrectly ordered candidate tag pairs. Therefore, scoring the candidate labels with the ranking learning model obtained by training the LambdaMART model can take into account the relative relationship between two candidate labels in the candidate label set, improving the extraction accuracy of the text labels.

In one embodiment, as shown in fig. 3, the method further comprises:

S15, acquiring first feature pairs of correct labels of at least two sample texts;

S16, acquiring second feature pairs of wrong labels of the at least two sample texts;

S17, inputting the first feature pairs and the second feature pairs into a ranking training model, and training to obtain the ranking learning model, where the ranking learning model scores a correct label higher than it scores a wrong label.

A wrong label may be any label other than a correct label; it may consist of one or more words that appear in the sample text but cannot characterize the sample text. The feature set of a wrong label may be composed of features such as the frequency of occurrence, the position, the similarity with the title and/or body, or the IDF of the corresponding word in the sample text.

In the embodiment of the disclosure, the correct labels and wrong labels of at least two sample texts are obtained; this can be done by manual label extraction, where labels that accurately reflect a sample text are taken as correct labels and labels that cannot are taken as wrong labels.
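The construction of training pairs from the manually labelled samples might look as follows; the dictionary keys and the function name are assumptions for illustration, not the disclosure's API.

```python
def build_training_pairs(samples):
    """Turn manually labelled sample texts into (correct, wrong) feature
    pairs for pairwise ranking training. Each sample is a dict with
    'correct' and 'wrong' lists of per-label feature vectors."""
    pairs = []
    for sample in samples:
        for good in sample["correct"]:
            for bad in sample["wrong"]:
                # The trained model should score `good` above `bad`.
                pairs.append((good, bad))
    return pairs
```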

In some embodiments, the ranking training model is a model formed by optimizing a loss model with a gradient model.

In the disclosed embodiment, the ranking training model may be a LambdaMART model. In the LambdaMART model, the above loss model can be expressed by equation (5a), where Pij is the probability that candidate i is ranked before candidate j in the set.

The above gradient model can be expressed by formula (5b), where λi is the gradient accumulated for candidate i, λij is the gradient of the index pair {i, j}, i is the first element of an index pair in the set, and j is the second element.

Illustratively, given the set I = {{1,2}, {2,3}, {1,3}}, then λ1 = λ12 + λ13, λ2 = λ23 − λ12, and λ3 = −λ23 − λ13.
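The aggregation rule in this example can be checked with a small sketch; the function below is an illustrative reading of formula (5b) (add λij when i is the first element of the pair, subtract it when i is the second), not the patent's code.

```python
def aggregate_lambdas(pair_set, pair_lambda):
    """Aggregate per-pair gradients lambda_ij into per-candidate
    gradients lambda_i: add lambda_ij when i is the first element of
    the pair, subtract it when i is the second element."""
    totals = {}
    for (i, j) in pair_set:
        lam = pair_lambda[(i, j)]
        totals[i] = totals.get(i, 0.0) + lam
        totals[j] = totals.get(j, 0.0) - lam
    return totals
```

Applied to the set I = {{1,2}, {2,3}, {1,3}}, this reproduces λ1 = λ12 + λ13, λ2 = λ23 − λ12, and λ3 = −λ23 − λ13 from the text.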

Compared with the existing word frequency-inverse word frequency approach, which scores a candidate tag by simply multiplying two of its features as in formula (6), the embodiment of the disclosure feeds the whole feature set into the ranking learning model, so that each feature in the feature set, and the relations between features, are analyzed comprehensively to obtain the score value; the extraction accuracy of text labels can thereby be improved.

St = TFt × IDFt    (6)

where St is the score value under the word frequency-inverse word frequency approach, TFt is the frequency of occurrence of the candidate word in the target text, and IDFt is its inverse text word frequency.
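For comparison, the baseline score of formula (6) can be sketched as follows; the tokenization and the corpus statistics (document-frequency table, corpus size) are illustrative assumptions.

```python
import math

def tfidf_score(term, text_tokens, doc_freq, num_docs):
    """Baseline score of formula (6): St = TFt * IDFt.
    TFt is the term's relative frequency in the target text, and IDFt
    its inverse document frequency over an assumed corpus."""
    tf = text_tokens.count(term) / max(len(text_tokens), 1)
    idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
    return tf * idf
```

A rare word that occurs often in the target text scores high, while a common stop word scores near zero; the formula considers only these two features, which is the limitation the ranking learning model addresses.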

In the embodiment of the disclosure, after the score value of each tag in the candidate tag set is obtained, one or more candidate tags may be directly selected from the candidate tag set as target tags according to the score values. In some embodiments, as shown in fig. 4, selecting one or more candidate tags to be determined as target tags of the target text based on the score values of the candidate tags, i.e., step S13b, includes:

S13b1, carrying out normalization processing on the score value of each candidate tag to obtain a normalized scoring result;

S13b2, selecting one or more candidate tags whose normalized scoring result is greater than a scoring threshold, and determining them as target tags of the target text.

In the embodiment of the present disclosure, the normalization process maps the score value of each candidate tag to a decimal between 0 and 1. Normalizing the score value of each candidate tag to obtain a normalized scoring result includes: acquiring the highest score value and the lowest score value among the candidate tags, and determining the normalized scoring result of each candidate tag based on the highest and lowest score values.

Illustratively, the normalized scoring result x' of each candidate tag may be obtained by formula (7) or formula (8), where x is the score value of the candidate tag, xmin is the lowest score value, and xmax is the highest score value.

In the embodiment of the present disclosure, after the normalized scoring result of each candidate tag is obtained, the candidate tag corresponding to the normalized scoring result higher than the scoring threshold may be used as the target tag of the target text.

Illustratively, the scoring threshold may be set according to the precise requirement of actually extracting the tag, for example, the scoring threshold may be set to 0.65 or 0.75, and the embodiments of the present disclosure are not limited.
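Steps S13b1 and S13b2 can be sketched as follows, assuming min-max normalization as one common choice for formula (7) (formula (8) is not reproduced here); the function names are illustrative.

```python
def normalize_scores(scores):
    """Min-max normalize score values to [0, 1]: x' = (x - xmin) / (xmax - xmin)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All tags scored equally; treat them all as maximally scored.
        return {tag: 1.0 for tag in scores}
    return {tag: (s - lo) / (hi - lo) for tag, s in scores.items()}

def select_by_threshold(scores, threshold=0.65):
    """Keep candidate tags whose normalized score exceeds the threshold."""
    normed = normalize_scores(scores)
    return {tag for tag, s in normed.items() if s > threshold}
```

With the example threshold of 0.65 mentioned above, only tags near the top of the score range survive the cut.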

In some embodiments, as shown in fig. 5, selecting one or more candidate tags whose normalized scoring result is greater than the scoring threshold and determining them as target tags of the target text further includes:

S13b3, when the normalized scoring result is greater than the scoring threshold and the number of such candidate tags is greater than a number threshold N, selecting, from the candidate tags whose normalized scoring result is greater than the scoring threshold, the N candidate tags with the highest normalized scores and determining them as target tags.

In the embodiment of the present disclosure, if every candidate tag whose normalized scoring result is greater than the scoring threshold were determined as a target tag of the target text, too many target tags might be obtained, leading to problems such as low push efficiency.

Illustratively, the number threshold N is a positive integer, and the N may be set according to actual requirements, for example, may be set to 5 or 8, and the embodiment of the present disclosure is not limited.

In this disclosure, the process of selecting the N candidate tags with the highest normalized scores may include sorting the scoring results greater than the scoring threshold in descending order and selecting the first N candidate tags in sequence.

It should be noted that the higher the normalized scoring result, the better the corresponding candidate tag reflects the target text. Therefore, the embodiment of the present disclosure determines the N candidate tags with the highest normalized scores as the target tags. On one hand, reducing the number of target tags improves the efficiency of classification or recommendation based on them; on the other hand, selecting the N highest-scoring candidate tags improves the selection precision of the target tags.
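The capping of step S13b3 can be sketched as a descending sort followed by truncation to N; the function name and signature are illustrative.

```python
def cap_top_n(normalized_scores, threshold=0.65, n=5):
    """Step S13b3 (illustrative): among tags above the scoring threshold,
    keep at most the N with the highest normalized scores, in
    descending score order."""
    passing = [(tag, s) for tag, s in normalized_scores.items() if s > threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in passing[:n]]
```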

Illustratively, assume 1000 texts are randomly extracted and labeled by the text label extraction provided by the embodiment of the disclosure, by unsupervised learning label extraction, and by manual label extraction. As shown in table 1, the text label extraction provided by the embodiment of the disclosure outperforms the existing unsupervised learning label extraction in recall rate, accuracy rate, and composite score.

TABLE 1

Method | Recall rate | Accuracy rate | Composite score
Unsupervised learning text label extraction | 0.61 | 0.55 | 0.58
Text label extraction of the embodiment of the present disclosure | 0.80 | 0.75 | 0.77

In some embodiments, the features in the feature set include at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of the word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the keywords of the target text contain the candidate tags or not;

whether the extended keywords of the target text contain the candidate tags or not;

the length of the word corresponding to the candidate tag;

and the word frequency of the inverse text of the word corresponding to the candidate label.

The similarity between the candidate tag and the target text comprises: similarity between the candidate tag and the title of the target text, similarity between the candidate tag and the body of the target text, similarity between the candidate tag and the primary classification of the target text, and similarity between the candidate tag and the secondary classification of the target text.

The position of the word corresponding to the candidate tag appearing in the target text comprises: the position where the candidate tag appears in the target text for the last time and the position where the candidate tag appears in the target text for the first time.

It should be noted that, in addition to the above features, the feature set of the embodiment of the present disclosure may include: whether the word corresponding to the candidate tag appears in the title of the target text, the word frequency of the word in the target text, the TextRank value, the number of the sentence in which the candidate tag first appears in the target text, and the number of the sentence in which it last appears. Illustratively, the features extracted from the candidate tags in the embodiments of the present disclosure are shown in table 2.

TABLE 2

Feature | Explanation
TFIDF | Word frequency-inverse text frequency
TEXTRANK | TextRank value
IN_TITLE | Whether the tag appears in the article title
FIRST_POS | Position of first appearance in the document
TERM_FREQ | Word frequency
TERM_LENGTH | Tag length
TITLE_SIMILAR | Similarity between tag word vector and title vector
IS_ENTITY | Whether the tag is an entity word
LAST_POS | Position of last appearance in the document
NORMAL_FIRST_POS | Position of first appearance / number of sentences in the document
NORMAL_LAST_POS | Position of last appearance / number of sentences in the document
BODY_SIM | Similarity between tag word vector and body vector
IDF | Inverse text word frequency
IN_KEYWORDS | Whether the tag is among the keywords of the document
IN_EXT_KEYWORDS | Whether the tag is among the extended keywords of the document
CAT_SIM | Similarity between tag word vector and document primary classification vector
SUB_CAT_SIM | Similarity between tag word vector and document secondary classification vector

According to the embodiment of the disclosure, whether a candidate tag is a target tag can be determined through the feature set formed by the above 17 features and the ranking learning model. This provides the ranking learning model with more features for scoring each candidate tag, and the accuracy of tag extraction can thereby be improved.
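A few of the positional and frequency features of Table 2 can be computed with a short sketch; the similarity and entity features would require word vectors and an entity lexicon, so they are omitted, and all names here are illustrative rather than the disclosure's implementation.

```python
def extract_features(tag, title_tokens, body_tokens):
    """Compute a subset of the Table 2 features for one candidate tag,
    assuming the title and body are already tokenized."""
    positions = [i for i, tok in enumerate(body_tokens) if tok == tag]
    return {
        "IN_TITLE": int(tag in title_tokens),        # appears in article title
        "TERM_FREQ": len(positions),                 # word frequency in body
        "TERM_LENGTH": len(tag),                     # tag length
        "FIRST_POS": positions[0] if positions else -1,  # first appearance
        "LAST_POS": positions[-1] if positions else -1,  # last appearance
    }
```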

Fig. 6 is a diagram illustrating an apparatus for extracting text labels according to an exemplary embodiment. Referring to fig. 6, the text label extracting apparatus includes a preprocessing module 1001, an extracting module 1002, and a determining module 1003, wherein,

a preprocessing module 1001 configured to preprocess a target text to obtain a candidate tag set of the target text;

the extracting module 1002 is configured to perform feature extraction on candidate tags in the candidate tag set to obtain a feature set of the candidate tags; wherein the feature set comprises: at least two features describing the candidate tag;

the determining module 1003 is configured to determine a target tag matched with the target text based on the feature set of the candidate tag.

In some embodiments, the determining module comprises:

the input module is configured to input the feature set of each candidate label into a ranking learning model to obtain a score of each candidate label;

a first selection module configured to select one or more target tags determined to be the target text from the candidate tags based on the score values of the candidate tags.

In some embodiments, the first selection module comprises:

the processing module is configured to perform normalization processing on the score value of each candidate label to obtain a normalized score result;

and the second selection module is configured to select one or more candidate tags of which the normalized scoring result is greater than a scoring threshold value, and determine the candidate tags as the target tags of the target text.

In some embodiments, the second selection module is further configured to select, when the normalized scoring result is greater than the scoring threshold and the number of candidate tags is greater than a number threshold N, the N candidate tags with the highest normalized scoring from among the candidate tags whose normalized scoring result is greater than the scoring threshold to be determined as the target tag.

In some embodiments, the apparatus further comprises:

the first acquisition module is configured to acquire a first feature pair of correct labels of at least two sample texts;

a second obtaining module configured to obtain a second feature pair of error labels of the at least two sample texts;

the training module is configured to input the first feature pair and the second feature pair into a ranking training model, and train to obtain the ranking learning model, where the ranking learning model scores a correct label higher than it scores a wrong label.

In some embodiments, the ranking training model is a model formed by optimizing a loss model with a gradient model.

In some embodiments, the features in the feature set include at least one of:

similarity between the candidate tag and the target text;

a part-of-speech indication of a word corresponding to the candidate tag;

the position of the word corresponding to the candidate label in the target text;

the occurrence frequency of the words corresponding to the candidate labels in the target text;

whether the candidate tag is contained in the keyword of the target text;

whether the candidate tag is contained in the expanded keyword of the target text or not;

the length of the word corresponding to the candidate tag;

and the word frequency of the inverse text of the word corresponding to the candidate label.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 7 is a diagram of a text label extraction device shown in the embodiment of the present disclosure. For example, the apparatus 1900 may be provided as a server. Referring to fig. 7, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the text label extraction method of one or more embodiments described above.

The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
