Text recognition method and device, electronic equipment and storage medium
Reading note: this patent, "Text recognition method and device, electronic equipment and storage medium" (文本识别方法、装置、电子设备及存储介质), was created by 刘春� on 2019-07-04. Its main content is as follows: the application discloses a text recognition method and device, electronic equipment and a storage medium. The text recognition method comprises: acquiring a basic feature set of a text to be recognized; generating a character text corresponding to the text to be recognized; extracting continuous repeated subsequence features from the text to be recognized and the character text respectively; and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting, based on the clustering result, whether the text to be recognized contains repeated sequences. Feature clustering based on the continuous repeated subsequence features and the basic feature set determines the type of the text to be recognized. Because the basic feature set reflects the large number of special symbols in screen-hogging and comment-flooding comments, and the continuous repeated subsequence features reflect the high repetition rate of such comments, the application can identify screen-hogging and comment-flooding spam comment texts more accurately.
1. A method of text recognition, the method comprising:
acquiring a basic feature set of a text to be recognized, wherein the basic feature set is a set of length and proportion features of the characters and of each preset type of symbol contained in the text to be recognized;
generating a character text corresponding to the text to be recognized, wherein the character text comprises the characters of the text to be recognized and does not comprise symbols of any preset type;
extracting continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features represent information on the repeated appearance of characters and preset-type symbols in the corresponding text;
and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting, based on the clustering result, whether the text to be recognized is a text containing repeated sequences.
2. The text recognition method of claim 1, wherein the step of obtaining the basic feature set of the text to be recognized comprises:
calculating the length of the character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of consecutive special symbols, and the special symbols are the symbols in the text to be recognized other than Chinese characters, letters and emoji;
calculating a first ratio of the character text contained in the text to be recognized, according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence, according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
determining the length of the character text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.
3. The text recognition method of claim 1, wherein the step of extracting the continuous repeated subsequence features from the text to be recognized and the character text respectively comprises:
respectively generating character sequences of the text to be recognized and of the character text;
when both the length of two consecutive target subsequences in a character sequence and the similarity between them meet preset conditions, determining the target subsequences to be consecutive repeated similar subsequences of the corresponding text;
and determining the repetition count, the length and the in-text ratio of the consecutive repeated similar subsequence having the largest repetition count in the corresponding text as the continuous repeated subsequence features of the corresponding text.
4. The text recognition method of claim 3, wherein when the length and the similarity between two consecutive target subsequences in the character sequence both satisfy a preset condition, the step of determining the target subsequences as consecutive repeated similar subsequences of the corresponding text comprises:
generating, from the character sequence and according to a preset rule, a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences;
determining a first target subsequence of a preset length starting from the head of a first suffix tree sequence, and a second target subsequence of the preset length starting from the head of a second suffix tree sequence;
determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
5. The method according to claim 4, wherein before the step of determining the first target subsequence having a preset length from the first position in the first suffix tree sequence and the second target subsequence having the preset length from the first position in the second suffix tree sequence, the method further comprises:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
6. The method according to any one of claims 1 to 5, wherein the step of clustering features based on the continuous repeated subsequence features and the basic feature set comprises:
and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
7. The method according to claim 6, wherein before the step of performing feature clustering on the continuous repeated subsequence features and the basic feature set by using the pre-trained clustering model, the method further comprises:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text and the type label by adopting a decision tree algorithm to obtain the clustering model.
8. A text recognition apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a basic feature set of a text to be recognized, wherein the basic feature set is a set of length and ratio features of the characters and of each preset type of symbol contained in the text to be recognized;
a generating module, configured to generate a character text corresponding to the text to be recognized, wherein the character text comprises the characters of the text to be recognized and does not comprise symbols of any preset type;
an extraction module, configured to extract continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features represent information on the repeated appearance of characters and preset-type symbols in the corresponding text;
and a clustering module, configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detect, based on the clustering result, whether the text to be recognized is a text containing a repeated sequence.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to perform the text recognition method of any one of claims 1-7.
10. A non-transitory computer readable storage medium, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of any one of claims 1-7.
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text recognition method and apparatus, an electronic device, and a storage medium.
Background
Comments freely published by users on social platforms greatly improve the viewing experience, connect users with authors, and support social interaction between users. However, spam comments posted by some users, such as screen-hogging and comment-flooding comments, seriously degrade the experience of normal users.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a text recognition method, a text recognition device, an electronic device and a storage medium.
According to a first aspect of the present application, there is provided a text recognition method, the method comprising:
acquiring a basic feature set of a text to be recognized, wherein the basic feature set is a set of length and proportion features of the characters and of each preset type of symbol contained in the text to be recognized;
generating a character text corresponding to the text to be recognized, wherein the character text comprises the characters of the text to be recognized and does not comprise symbols of any preset type;
extracting continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features represent information on the repeated appearance of characters and preset-type symbols in the corresponding text;
and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting, based on the clustering result, whether the text to be recognized is a text containing repeated sequences.
In an optional implementation manner, the step of obtaining the basic feature set of the text to be recognized includes:
calculating the length of the character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of consecutive special symbols, and the special symbols are the symbols in the text to be recognized other than Chinese characters, letters and emoji;
calculating a first ratio of the character text contained in the text to be recognized, according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence, according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
determining the length of the character text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as the elements of the basic feature set.
In an optional implementation manner, the step of extracting continuous repeated subsequence features from the text to be recognized and the character text respectively includes:
respectively generating character sequences of the text to be recognized and of the character text;
when both the length of two consecutive target subsequences in a character sequence and the similarity between them meet preset conditions, determining the target subsequences to be consecutive repeated similar subsequences of the corresponding text;
and determining the repetition count, the length and the in-text ratio of the consecutive repeated similar subsequence having the largest repetition count in the corresponding text as the continuous repeated subsequence features of the corresponding text.
In an optional implementation manner, when both the length of two consecutive target subsequences in the character sequence and the similarity between them satisfy preset conditions, the step of determining the target subsequences as consecutive repeated similar subsequences of the corresponding text includes:
generating, from the character sequence and according to a preset rule, a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences;
determining a first target subsequence of a preset length starting from the head of a first suffix tree sequence, and a second target subsequence of the preset length starting from the head of a second suffix tree sequence;
determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
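For illustration only, the code-difference check described above can be mimicked in Python by coding each suffix with its start offset, so that two codes whose absolute difference equals the preset length identify head subsequences that are adjacent in the original sequence. Jaccard set similarity stands in here for the unspecified similarity measure, and the 0.8 threshold is an assumption, not the patent's value:

```python
def jaccard_sim(a, b):
    # Set-based Jaccard similarity between two character subsequences.
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def find_consecutive_similar(seq, preset_len, threshold=0.8):
    # Each suffix of `seq` is coded by its start offset; two codes whose
    # absolute difference equals `preset_len` mark head subsequences that
    # are consecutive in the original sequence.
    pairs = []
    for code1 in range(len(seq)):
        code2 = code1 + preset_len          # |code2 - code1| == preset_len
        a = seq[code1:code1 + preset_len]
        b = seq[code2:code2 + preset_len]
        if len(a) == preset_len and len(b) == preset_len and jaccard_sim(a, b) >= threshold:
            pairs.append((a, b))            # consecutive repeated similar pair
    return pairs
```

On "abcabc" with preset length 3 this finds the single consecutive pair ("abc", "abc"), while "abcdef" yields none.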
In an optional implementation manner, before the step of determining a first target subsequence having a preset length from the first position in the first suffix tree sequence and a second target subsequence having the preset length from the first position in the second suffix tree sequence, the method further includes:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
In an optional implementation manner, the step of clustering features based on the continuous repeated subsequence features and the basic feature set includes:
and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
In an optional implementation manner, before the step of performing feature clustering on the continuous repeated subsequence feature and the basic feature set by using a clustering model obtained through pre-training, the method further includes:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text and the type label by adopting a decision tree algorithm to obtain the clustering model.
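The training step above can be sketched with a minimal depth-1 decision tree (a decision stump) selected by Gini impurity. The feature vectors and labels below are hypothetical; a production system would train a full decision tree on the sample texts' continuous repeated subsequence features and basic feature sets:

```python
def gini(labels):
    # Gini impurity of a list of 0/1 labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def majority(labels, default=0):
    # Most frequent label in the partition (default when empty).
    return max(set(labels), key=labels.count) if labels else default

def train_stump(X, y):
    # Exhaustively pick the (feature, threshold) split with the lowest
    # weighted Gini impurity; X is a list of feature vectors,
    # y a list of 0/1 labels (1 = spam comment).
    best = None   # (score, feature_index, threshold, left_label, right_label)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    return best[1:]

def predict(stump, x):
    j, t, left_label, right_label = stump
    return left_label if x[j] <= t else right_label
```

Trained on a toy set in which the spam samples have a high repeat ratio, the stump recovers a separating threshold on that feature.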
According to a second aspect of the present application, there is provided a text recognition apparatus, the apparatus comprising:
an acquisition module, configured to acquire a basic feature set of a text to be recognized, wherein the basic feature set is a set of length and ratio features of the characters and of each preset type of symbol contained in the text to be recognized;
a generating module, configured to generate a character text corresponding to the text to be recognized, wherein the character text comprises the characters of the text to be recognized and does not comprise symbols of any preset type;
an extraction module, configured to extract continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features represent information on the repeated appearance of characters and preset-type symbols in the corresponding text;
and a clustering module, configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detect, based on the clustering result, whether the text to be recognized is a text containing a repeated sequence.
In an optional implementation, the obtaining module is further configured to:
calculating the length of a character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of continuous special symbols, and the special symbols are symbols except Chinese characters, letters and expression symbols in the text to be recognized;
calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
determining the length of the character text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.
In one optional implementation, the extraction module includes:
a first unit configured to respectively generate character sequences of the text to be recognized and of the character text;
a second unit configured to determine, when both the length of two consecutive target subsequences in a character sequence and the similarity between them meet preset conditions, that the target subsequences are consecutive repeated similar subsequences of the corresponding text;
and a third unit configured to determine the repetition count, the length and the in-text ratio of the consecutive repeated similar subsequence having the largest repetition count in the corresponding text as the continuous repeated subsequence features of the corresponding text.
In an optional implementation, the second unit is further configured to:
generating, from the character sequence and according to a preset rule, a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences;
determining a first target subsequence of a preset length starting from the head of a first suffix tree sequence, and a second target subsequence of the preset length starting from the head of a second suffix tree sequence;
determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
In an optional implementation, the second unit is further configured to:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
In an optional implementation, the clustering module is further configured to:
and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
In an optional implementation, the apparatus further comprises a training module configured to:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text and the type label by adopting a decision tree algorithm to obtain the clustering model.
According to a third aspect of the present application, there is provided an electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to perform the text recognition method according to the first aspect.
According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of the first aspect.
According to a fifth aspect of the present application, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method according to the first aspect.
The technical solution provided by the application can have the following beneficial effects:
According to the above technical solution, feature clustering is performed based on the continuous repeated subsequence features and the basic feature set to determine the type of the text to be recognized. Because the basic feature set reflects the large number of special symbols in screen-hogging and comment-flooding comments, and the continuous repeated subsequence features reflect the high repetition rate of such comments, screen-hogging and comment-flooding spam comment texts can be recognized more accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 shows several kinds of screen-hogging comment texts in an embodiment of the present application.
Fig. 2 shows several kinds of comment-flooding comment texts in an embodiment of the present application.
Fig. 3 is a flowchart illustrating steps of a text recognition method according to an embodiment of the present application.
Fig. 4 is a flowchart illustrating a procedure for extracting a feature of a consecutive repeated sub-sequence according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating a step of determining a consecutive repeated similar sub-sequence according to an embodiment of the present application.
Fig. 6a is a graph of an autocorrelation function of a continuously repeated identical subsequence as shown in an embodiment of the present application.
Fig. 6b is a graph of an autocorrelation function of a continuously repeated similar subsequence as shown in an embodiment of the present application.
Fig. 7 is a flowchart illustrating a step of obtaining a clustering model according to an embodiment of the present application.
Fig. 8 is a distribution diagram of the overall ratio of the continuously repeated similar subsequences in the training sample shown in the embodiment of the present application.
Fig. 9 is a block diagram illustrating a structure of a text recognition apparatus according to an embodiment of the present application.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the present application.
Fig. 11 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Existing spam comment identification methods fall into three classes, according to how comment features are extracted: methods based on rules, on word frequency and spam vocabulary distribution features, and on comment semantic distribution features. Automatic spam comment classification based on word distribution and document characteristics counts the keyword frequency distribution and document characteristics of network comments and applies Bayesian classification. Comment-similarity-based spam identification trains a language model on comment data to obtain comment probability distributions, builds a library of such distributions, and detects spam by comparing a comment's probability distribution against those in the library. Another approach first obtains a multi-vector representation of the text and then classifies the vectors with a classifier. A spam comment detection model based on a hierarchical attention neural network (HANN) obtains semantic representation features of sentences and performs classification.
Spam comments include categories such as screen-hogging, comment-flooding, cheating, vulgar content and generic spam. Existing spam comment detection schemes are general-purpose methods intended to cover all of these categories: they extract comment word or sentence distribution features and classify on them.
FIG. 1 illustrates several kinds of screen-hogging comments, which occupy the comment space by posting large volumes of meaningless content. The inventor found that the key characteristic of screen-hogging comments is an extremely high proportion of special symbols such as spaces, tabs and carriage returns; some screen-hogging comments also contain long runs of identical special symbols and punctuation marks.
Fig. 2 shows several comment-flooding comments, which the inventor found to be typically composed of repetitions of identical or similar character strings. According to whether the repeated strings are identical or merely similar, comment-flooding comments can be divided into an identical-character category and a similar-character category.
Screen-hogging and comment-flooding comments are therefore characterized by many special symbols and a high repetition rate. Prior-art schemes do not capture these prominent features when detecting such comments and cannot achieve high-precision detection.
In order to solve the above technical problem, a text recognition method provided by an embodiment of the present application is shown with reference to fig. 3, and the method includes the following steps.
In step S301, a basic feature set of the text to be recognized is obtained, where the basic feature set is a set of length and ratio features of characters and symbols of each predetermined type included in the text to be recognized.
The basic feature set may include, for example, the length of each of 8 character categories (Chinese characters, letters, numbers, emoji characters, punctuation marks, spaces, tab/carriage-return symbols and other symbols), together with each category's proportion of the text to be recognized.
Specifically, the text to be recognized may be preprocessed by character category, that is, divided into the 8 categories of Chinese characters, letters, numbers, emoji characters, punctuation marks, spaces, tab/carriage-return symbols and other symbols; the length of each category and its ratio to the text to be recognized are then calculated to determine the basic feature set.
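The 8-category preprocessing can be sketched in Python as follows; the category tests (in particular the emoji code-point ranges and the choice of whitespace characters) are illustrative assumptions, not the patent's exact rules:

```python
import unicodedata

def char_category(ch):
    # Rough category assignment; the emoji ranges are a simplification.
    if '\u4e00' <= ch <= '\u9fff':
        return 'hanzi'        # Chinese characters
    if ch.isalpha():
        return 'letter'
    if ch.isdigit():
        return 'digit'
    if '\U0001F300' <= ch <= '\U0001FAFF' or '\u2600' <= ch <= '\u27bf':
        return 'emoji'
    if unicodedata.category(ch).startswith('P'):
        return 'punct'        # punctuation marks
    if ch in (' ', '\u3000'):
        return 'space'        # ASCII and full-width spaces
    if ch in '\t\r\n':
        return 'tab_cr'       # tab / carriage-return symbols
    return 'other'

def basic_feature_set(text):
    # Length and in-text ratio of each of the 8 character categories.
    cats = ['hanzi', 'letter', 'digit', 'emoji', 'punct', 'space', 'tab_cr', 'other']
    counts = dict.fromkeys(cats, 0)
    for ch in text:
        counts[char_category(ch)] += 1
    n = max(len(text), 1)
    return {c: (counts[c], counts[c] / n) for c in cats}
```

For example, `basic_feature_set("你好 ok")` yields length 2 and ratio 0.4 for both the Chinese-character and letter categories.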
A better recognition effect, with higher accuracy, is achieved when the basic feature set is the combination of the character-text length, the character-text proportion (or non-character proportion), the maximum length of a continuous special symbol sequence, and the special character proportion. The character-text length may be, for example, the length of the continuous or discontinuous Chinese character text in the text to be recognized. A continuous special symbol sequence is a sequence of consecutive special symbols; because letters and emoji are generally meaningful in a text, the special symbols are defined as all symbols in the text to be recognized other than Chinese characters, letters and emoji. In this case, the step of acquiring the basic feature set of the text to be recognized may include:
calculating the length of the character text contained in the text to be recognized (for example, the length of the Chinese character text) and the maximum length of a continuous special symbol sequence; calculating a first proportion of the character text in the text to be recognized, according to the length of the character text and the length of the text to be recognized; calculating a second proportion of the continuous special symbol sequence in the text to be recognized, according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized; and determining the character-text length, the maximum length of the continuous special symbol sequence, the first proportion and the second proportion as the elements of the basic feature set.
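As a minimal sketch of these four features, assuming for brevity that only Chinese characters and letters count as character text and everything else is special (the patent's rule also excludes emoji from the special symbols):

```python
import re

# Characters counted as "character text"; all others are treated as special
# symbols here (emoji handling is omitted for brevity).
CHAR_TEXT = re.compile(r'[\u4e00-\u9fffA-Za-z]')

def four_basic_features(text):
    n = max(len(text), 1)
    char_len = sum(1 for ch in text if CHAR_TEXT.match(ch))
    max_run = run = 0                 # longest run of consecutive specials
    for ch in text:
        if CHAR_TEXT.match(ch):
            run = 0
        else:
            run += 1
            max_run = max(max_run, run)
    first_ratio = char_len / n        # proportion of character text
    second_ratio = max_run / n        # proportion of the longest special run
    return char_len, max_run, first_ratio, second_ratio
```

For a screen-hogging-like string such as "好！！！！", the second ratio dominates (0.8), matching the "many special symbols" characteristic described above.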
In step S302, a character text corresponding to the text to be recognized is generated, where the character text includes the characters of the text to be recognized and excludes symbols of each predetermined type.
Specifically, the character text may be obtained by removing the special characters from the text to be recognized.
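Under the same simplified definition of special characters (everything other than Chinese characters and letters; the patent's rule also keeps emoji), the character text can be produced as:

```python
import re

# Simplified rule: special characters are everything other than Chinese
# characters and letters (the patent's rule additionally keeps emoji).
SPECIALS = re.compile(r'[^\u4e00-\u9fffA-Za-z]')

def character_text(text):
    # Remove the special characters, keeping only the character text.
    return SPECIALS.sub('', text)
```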
In step S303, continuous repeated subsequence features are extracted from the text to be recognized and from the character text respectively, where the continuous repeated subsequence features represent information on the repeated appearance of characters and predetermined-type symbols in the corresponding text.
The continuous repeated subsequence features may include the repetition count of the identical or similar subsequence, its length, and its ratio in the corresponding text.
In practice, a repeated-text search algorithm may be used to extract, from the character sequences generated from the text to be recognized and from the character text, target subsequences that appear consecutively and whose length and similarity meet preset conditions; the continuous repeated subsequence features of the corresponding text are then determined from the repetition count and length of the target subsequences and their ratio in the corresponding text. The repeated-text search algorithm may be a suffix tree algorithm, or a repeated-text search algorithm based on autocorrelation period estimation and Jaccard similarity. The extraction of continuous repeated subsequence features is described in detail in the following embodiments.
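The autocorrelation-plus-Jaccard route can be sketched as follows; the match-count autocorrelation, the period-stepped scan and the 0.8 similarity threshold are illustrative assumptions (the suffix tree route is not shown):

```python
def estimate_period(seq):
    # Match-count autocorrelation; the lag with the highest correlation
    # (excluding lag 0) is taken as the repetition period.
    n = len(seq)
    best_lag, best_score = 1, -1
    for lag in range(1, n // 2 + 1):
        score = sum(seq[i] == seq[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def longest_repeat_feature(seq, sim_threshold=0.8):
    # Returns (repeat_count, unit_length, in-text ratio) for the most
    # repeated consecutive similar subsequence at the estimated period.
    p = estimate_period(seq)
    best = (1, p, p / max(len(seq), 1))
    i = 0
    while i + p <= len(seq):
        count, j = 1, i
        while j + 2 * p <= len(seq) and jaccard(seq[j:j + p], seq[j + p:j + 2 * p]) >= sim_threshold:
            count += 1
            j += p
        if count > best[0]:
            best = (count, p, count * p / len(seq))
        i += p   # step by period; use i += 1 for an exhaustive scan
    return best
```

On "abcabcabc" the estimated period is 3 and the feature triple is (3, 3, 1.0), reflecting a comment composed entirely of one repeated unit.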
In step S304, feature clustering is performed based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and whether the text to be identified is a text containing a repeated sequence is detected based on the clustering result.
In practical applications, a clustering algorithm such as affinity propagation or Mean-shift can be used to perform feature clustering on the continuous repeated subsequence features and the basic feature set; alternatively, a clustering model obtained by pre-training can be used for this feature clustering.
Specifically, whether the text to be recognized is a screen-flooding or comment-brigading spam comment can be determined from the clustering result; further, which of the two categories it belongs to can be determined from whether it contains a repeated sequence. For example, when the text to be recognized contains a repeated sequence, it can be determined to be a screen-flooding comment, and otherwise a comment-brigading comment.
The text recognition method provided by the embodiment can be used for recognizing comments such as e-commerce user comments, microblog user comments, social media comments, short video comments and the like.
With the text recognition method provided by this embodiment, feature clustering is performed based on the continuous repeated subsequence features and the basic feature set, and the type of the text to be recognized is determined from the result. Because the basic feature set reflects the abundance of special symbols in screen-flooding and comment-brigading comments, and the continuous repeated subsequence features reflect their high repetition rate, such spam comments can be recognized more accurately.
Referring to fig. 4, step S303 may further include:
in step S401, character sequences of the text to be recognized and of the literal text are generated, respectively.
Specifically, this step may include: generating a character sequence of the text to be recognized, and generating a character sequence of the literal text contained in it, where the literal text is obtained by removing the special symbols from the text to be recognized. Generating character sequences for both texts can further improve the accuracy of spam-comment recognition.
In step S402, when two consecutive target subsequences whose length and similarity both satisfy a preset condition exist in a character sequence, the target subsequences are determined to be continuous repeated similar subsequences of the corresponding text.
In practical applications, this step may specifically include: when the length of, and the similarity between, two consecutive target subsequences in the character sequence of the text to be recognized satisfy the preset condition, determining the target subsequences to be continuous repeated similar subsequences of the text to be recognized; and likewise for the character sequence of the literal text.
The preset condition may be, for example, that the target subsequence length equals the repetition period of character strings in the character sequence (or a value near that period), and that the similarity is greater than or equal to a preset threshold, for example 70%. The preset condition may be set according to the actual situation and is not specifically limited in this embodiment.
In step S403, the repetition number and length of the continuous repeated similar subsequence having the largest repetition number in the corresponding text and the ratio in the corresponding text are determined as the continuous repeated subsequence feature of the corresponding text.
In practical applications, when several continuous repeated similar subsequences exist in the character sequence of the text to be recognized, the continuous repeated subsequence feature of the text to be recognized can be the combination of the repetition number, the length, and the proportion in the text of the most-repeated one among them.
Likewise, when several continuous repeated similar subsequences exist in the character sequence of the literal text, the continuous repeated subsequence feature of the literal text can be the combination of the repetition number, the length, and the proportion in the literal text of the most-repeated one.
The following describes the extraction of continuous repeated subsequence features, taking a text to be recognized as an example and assuming the preset condition is that the target subsequence length is 2 and the similarity is 100%, i.e. that the two target subsequences are identical.
When the text to be recognized is xy%z&xy@x#yz, its character sequence s′ = {s0′, s1′, s2′, …, s12′} is generated, where s0′ = x, s1′ = y, …, s12′ = z. No two consecutive target subsequences whose length and similarity satisfy the preset condition exist in this character sequence, so it contains no continuous repeated similar subsequence.
The literal text xyzxyxyz is extracted from the text to be recognized, and its character sequence s = {s0, s1, s2, …, s7} is generated, where s0 = x, s1 = y, s2 = z, …, s7 = z. Two consecutive target subsequences whose length and similarity satisfy the preset condition do exist in this character sequence, namely s3s4 (xy) and s5s6 (xy), so the character sequence of the literal text contains the continuous repeated similar subsequence xy, and xy is the continuous repeated subsequence with the largest repetition number in the sequence. This most-repeated subsequence xy has a repetition number of 2, a length of 2 characters, and a proportion of 25% in the character sequence.
When the similarity is 100%, extracting continuous repeated similar subsequences amounts to extracting continuous repeated identical subsequences. For the character sequence s of the literal text xyzxyxyz, the problem can be stated as: given a character sequence, find the 2-character string with the largest number of consecutive occurrences in it. In practical applications, a string suffix tree search algorithm may be used: first, the suffix tree sequences of the character sequence s are generated, as shown in Table 1 below.
TABLE 1 Suffix tree sequences of the character sequence s of the literal text xyzxyxyz

Suffix tree array    Suffix tree sequence
substrs[0]           xyzxyxyz
substrs[1]           yzxyxyz
substrs[2]           zxyxyz
substrs[3]           xyxyz
substrs[4]           yxyz
substrs[5]           xyz
substrs[6]           yz
substrs[7]           z
By comparing the first j − i characters of suffix tree sequence substrs[i] with those of suffix tree sequence substrs[j], if they are the same, those j − i characters can be determined to be a continuous repeated identical subsequence. As shown in Table 1, the first two characters of substrs[3] are the same as the first two characters of substrs[5], so xy is a continuous repeated identical subsequence. By traversing all suffix tree sequences, the continuous repeated identical subsequence with the largest repetition number is determined to be xy, with a repetition number of 2.
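The suffix-comparison search just described, specialized to 2-character blocks, can be sketched as follows; `max_consecutive_repeat` is an illustrative helper name, not from the source:

```python
def max_consecutive_repeat(s, L=2):
    # For each start position i, count how many times the L-character block
    # beginning there repeats back to back, by comparing the first L
    # characters of suffix s[i:] with those of suffix s[i+L:].
    suffixes = [s[i:] for i in range(len(s))]
    best_block, best_count = None, 0
    for i in range(len(s) - L + 1):
        count, j = 1, i
        while j + 2 * L <= len(s) and suffixes[j][:L] == suffixes[j + L][:L]:
            count += 1
            j += L
        if count > best_count:
            best_block, best_count = s[i:i + L], count
    return best_block, best_count

print(max_consecutive_repeat('xyzxyxyz'))  # -> ('xy', 2)
```

On the literal text xyzxyxyz this reproduces the result above: xy, repeated 2 times.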
The above process can be implemented algorithmically. The input character sequence s may be the character sequence of the text to be recognized or the character sequence of the literal text. When s is the character sequence of the literal text xyzxyxyz, N = 8 and s = {s0, s1, s2, …, s7}, where s0 = x, s1 = y, s2 = z, …, s7 = z.
In one implementation, referring to fig. 5, the step of determining the continuously repeated similar sub-sequence in step S402 may specifically include:
in step S501, a plurality of suffix tree sequences and codes corresponding to the respective suffix tree sequences are generated according to a preset rule based on the character sequence.
In practical applications, the character sequences of screen-flooding and comment-brigading comments are characterized by periodically distributed similar subsequences that differ from one another in only one or two characters and have an essentially fixed length. Since such similar subsequences are not identical (i.e. their similarity is not 100%), continuous repeated similar subsequences of this kind cannot be extracted by the exact-match suffix comparison described above.
Specifically, taking the character sequence s = {a, b, c, d, e, a, b, c, d, f, a, b, c, d, g} as an example, each suffix tree sequence s[i : N − 1] and its corresponding code i are generated according to a preset rule such as substrs[i] = s[i : N − 1] (N = 15), as shown in Table 2 below.
TABLE 2 Suffix tree sequences of the character sequence s

Suffix tree array    Suffix tree sequence
substrs[0]           abcdeabcdfabcdg
substrs[1]           bcdeabcdfabcdg
substrs[2]           cdeabcdfabcdg
substrs[3]           deabcdfabcdg
substrs[4]           eabcdfabcdg
substrs[5]           abcdfabcdg
substrs[6]           bcdfabcdg
…                    …
substrs[14]          g
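The preset rule of step S501 (each suffix paired with its code i) takes only a few lines to reproduce; `suffix_sequences` is an illustrative name:

```python
def suffix_sequences(s):
    # The preset rule: substrs[i] = s[i:] with code i, as in Table 2.
    return {i: s[i:] for i in range(len(s))}

subs = suffix_sequences('abcdeabcdfabcdg')
print(subs[0])   # abcdeabcdfabcdg
print(subs[5])   # abcdfabcdg
print(subs[14])  # g
```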
In step S502, a first target subsequence having a preset length from the beginning in the first suffix tree sequence and a second target subsequence having a preset length from the beginning in the second suffix tree sequence are determined.
Specifically, the first and second suffix tree sequences are two different sequences among the plurality of suffix tree sequences. Assuming the preset length is 3, the first suffix tree sequence is substrs[1], and the second suffix tree sequence is substrs[4], then the first target subsequence ss1 is bcd and the second target subsequence ss2 is eab.
To avoid a brute-force multi-dimensional search, it must be considered how to choose the preset length for subsequence extraction when the length of the continuous repeated similar subsequences is unknown. Because the autocorrelation function of a periodic sequence peaks at the sequence period, and the similar subsequences are periodically distributed, the repetition period of the similar subsequences can be determined by finding the maximum of the autocorrelation function of the character sequence, and the preset length can then be determined from that repetition period.
For example, the step of determining the preset length may specifically include: obtaining an autocorrelation function of the character sequence; determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function; and determining the preset length according to the repetition period.
Specifically, to calculate the autocorrelation function of the character sequence s, the character sequence may first be digitally encoded. If the character sequence s = {s0, s1, s2, …, s(N−1)} has the digital coding sequence x = {x0, x1, x2, …, x(N−1)}, its autocorrelation function is defined as:
R(k) = Σ(n = 0 … N−1−k) x(n) · x(n+k),
where k = 0, 1, …, N − 1.
If the repetition period of the character sequence is T, R(k) reaches a maximum at k = T, so the period T is estimated by:
T = argmax(1 ≤ k ≤ N−1) R(k).
for example, if the character sequence s is abcdeabcdeabcdeabcdeabccde, which comprises a consecutive repetition of the same subsequence, the numerical code sequence x is 1234512345123451234512345, as shown in fig. 6a for its graph of the autocorrelation function. If the character sequence s is abcdeabcdfacbcdgabebcy, comprising a continuously repeated similar subsequence, the numerical coding sequence x is 1234512346123471234812349, and its autocorrelation function is shown in fig. 6 b. As a result of observation, the autocorrelation function of each character sequence s reaches a maximum value at a position where the period T is 5, and it can be determined that the repetition period of the character string in the character sequence s is 5. The character sequence s may be a character sequence of a text to be recognized or a character sequence of a text of a word.
In practical applications, the preset length may be set to the repetition period, or to the repetition period ± 1 (that is, the absolute value of the difference between the preset length and the repetition period is 1), and so on; the specific value may be determined according to the actual situation and is not limited by the present application.
Setting the preset length reasonably improves the computation rate: continuous repeated similar subsequences need not be sought among target subsequences of arbitrary length, which avoids the computational complexity of a brute-force search.
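The period estimation above can be sketched as follows. Two choices here are assumptions not spelled out in the text: characters are encoded as consecutive small integers in order of first appearance, and the mean is removed before correlating, so that R(k) peaks at the repetition period rather than at small lags:

```python
def estimate_period(s):
    # Digitally encode each distinct character (assumption: consecutive
    # integers in order of first appearance).
    codes = {}
    x = [codes.setdefault(ch, len(codes) + 1) for ch in s]
    # Remove the mean before correlating (assumed refinement).
    mean = sum(x) / len(x)
    c = [v - mean for v in x]
    n = len(c)
    def r(k):  # autocorrelation R(k) = sum_i c[i] * c[i + k]
        return sum(c[i] * c[i + k] for i in range(n - k))
    # T = argmax over k > 0 of R(k)
    return max(range(1, n), key=r)

print(estimate_period('abcdeabcdfabcdg'))            # repetition period 5
print(estimate_period('abcdeabcdeabcdeabcdeabcde'))  # repetition period 5
```

Both example sequences from the description above yield the expected period T = 5.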
In step S503, when the absolute value of the difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and the similarity between the first target subsequence and the second target subsequence is greater than or equal to the preset threshold, the first target subsequence and the second target subsequence are determined to be consecutive repeated similar subsequences.
Specifically, when the absolute value |j − i| of the difference between the code i of the first suffix tree sequence substrs[i] and the code j of the second suffix tree sequence substrs[j] equals the preset length, the similarity between the first target subsequence ss1 and the second target subsequence ss2 is calculated; when the similarity is greater than or equal to a preset threshold, for example 70%, ss1 and ss2 are determined to be continuous repeated similar subsequences. The specific value of the preset threshold may be determined according to the actual situation, and the present application does not limit it.
Extracting continuous repeated similar subsequences requires a similarity metric for character subsequences. Considering that similar subsequences differ in only one or two characters, the Jaccard similarity can be used to measure subsequence similarity.
Therefore, the similarity between the first target subsequence ss1 and the second target subsequence ss2 can be obtained by calculating the Jaccard similarity between the two:
J(ss1, ss2) = |ss1 ∩ ss2| / |ss1 ∪ ss2|,
where ss1 ∩ ss2 is the intersection of the character sets of ss1 and ss2, ss1 ∪ ss2 is their union, and |·| denotes the number of elements of a set.
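The Jaccard similarity of two target subsequences, taken over their character sets, is a one-liner:

```python
def jaccard(ss1, ss2):
    # J(ss1, ss2) = |ss1 ∩ ss2| / |ss1 ∪ ss2| over the character sets
    # of the two target subsequences.
    a, b = set(ss1), set(ss2)
    return len(a & b) / len(a | b)

print(jaccard('abcde', 'abcdf'))  # 4 shared characters out of 6 -> 0.666...
```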
In this embodiment, based on the Jaccard similarity criterion for subsequences and the autocorrelation-based period estimation, continuous repeated similar subsequences are extracted from the character sequence s and the repetition number of the most-repeated one is determined. The input character sequence s may be the character sequence of the text to be recognized or of the literal text. When s = {a, b, c, d, e, a, b, c, d, f, a, b, c, d, g}, N = 15, where s0 = a, s1 = b, s2 = c, …, s14 = g; the autocorrelation function R(k) of the sequence is first calculated as defined above.
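Putting the period estimate and the Jaccard measure together, the run of consecutive similar blocks can be counted as sketched below. Note the threshold is set to 0.6 here for illustration: two 5-character blocks differing in one character have Jaccard similarity 4/6 ≈ 0.67, which would fall just below the 70% example threshold mentioned earlier. The function name and parameters are illustrative:

```python
def count_similar_repeats(s, period, thres=0.6):
    # Step through the sequence in blocks of `period` characters (suffix
    # codes i and i + period, so |j - i| equals the preset length) and
    # count the longest run of consecutive blocks whose Jaccard similarity
    # stays >= thres.
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)
    best = run = 1
    i = 0
    while i + 2 * period <= len(s):
        if jaccard(s[i:i + period], s[i + period:i + 2 * period]) >= thres:
            run += 1
        else:
            run = 1
        best = max(best, run)
        i += period
    return best

print(count_similar_repeats('abcdeabcdfabcdg', period=5))  # -> 3
```

On the example sequence abcdeabcdfabcdg this finds a run of 3 similar 5-character blocks (abcde, abcdf, abcdg).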
In one implementation, the step S304 may further include: and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
Referring to fig. 7, the step of obtaining a clustering model in advance may include:
in step S701, a sample text is obtained, and the sample text is labeled, so as to obtain a type tag of the sample text.
Specifically, the sample texts may be e-commerce user comments, microblog user comments, social media comments, short video comments, and the like. Each sample text is given a type label according to whether it is a spam comment such as a screen-flooding or comment-brigading comment.
In step S702, a decision tree algorithm is used to perform model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text, and the type label, so as to obtain a clustering model.
The extraction of the continuous repeated subsequence features of the sample text (which may include the continuous repeated subsequence features of the original sample text and the continuous repeated subsequence features of the text of the original sample text) may refer to the description in step S303, and the extraction of the basic feature set of the sample text may refer to the description in step S301, which is not described herein again.
The decision tree algorithm may be XGBoost, random forest, AdaBoost, or a gradient boosting decision tree.
The following description takes the gradient boosting decision tree as an example. A Gradient Boosting Decision Tree (GBDT) improves classification performance by fusing multiple weak classifiers along the gradient direction of the loss-function residual. If the feature vector of sample text i is x_i, its type label is y_i, and the weak classifier of the m-th iteration is T(x; θ_m), the final classifier is:
F_M(x) = Σ(m = 1 … M) T(x; θ_m),
where M is the maximum number of iterations.
If the loss function of the weak classifier is defined as the likelihood loss function
L(y, F(x)) = Σ_i [ y_i · log(F(x_i)) + (1 − y_i) · log(1 − F(x_i)) ],
the parameters of the m-th iteration weak classifier are estimated as
θ_m = argmax_θ Σ(i = 1 … K) [ y_i · log(F_m(x_i)) + (1 − y_i) · log(1 − F_m(x_i)) ],
where F_m(x_i) = F_(m−1)(x_i) + T(x_i; θ_m), K is the number of sample texts, and F_0(x_i) = 0 may be set.
Considering that the text features are correlated and that a threshold on any single text feature is difficult to set, the gradient boosting decision tree classification algorithm GBDT is used to cluster the feature set. The GBDT algorithm builds a high-precision classifier by iteratively learning and integrating multiple weak classifiers along the gradient direction of the classification residual.
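The boosting recursion above can be sketched from scratch with depth-1 regression trees (stumps) as the weak classifiers T(x; θ_m); the learning rate, round count, and toy feature values below are illustrative assumptions, not the patent's settings:

```python
import math

def fit_stump(X, residuals):
    # Depth-1 regression tree fitted to the current residuals by
    # exhaustively trying (feature, threshold) splits (least squares).
    best = None
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm

def train_gbdt(X, y, n_rounds=20, lr=0.5):
    # F_0(x) = 0; each round fits a stump to the negative gradient of the
    # log-likelihood loss, y_i - sigmoid(F(x_i)), and adds it to the model.
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - 1 / (1 + math.exp(-s)) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + lr * stump(row) for s, row in zip(scores, X)]
    return lambda row: 1 if sum(lr * st(row) for st in stumps) > 0 else 0

# Toy features per comment: [repetition ratio, max special-symbol run].
X = [[0.9, 5], [0.8, 4], [0.1, 0], [0.2, 1]]
y = [1, 1, 0, 0]   # 1 = spam (screen-flooding / brigading), 0 = normal
predict = train_gbdt(X, y)
print(predict([0.85, 4]), predict([0.15, 0]))  # -> 1 0
```

In practice a library implementation (e.g. XGBoost or scikit-learn's gradient boosting) would be used instead of this hand-rolled sketch.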
The text recognition method provided by the present application has been verified on massive comment data, and the results show that the technical solution achieves extremely high accuracy and an extremely low false-detection rate for screen-flooding and comment-brigading comments. The specific verification is as follows:
From the labeled samples, 50000 screen-flooding/brigading samples and 50000 non-spam samples were selected as the training set; from the remaining sample library, 20000 screen-flooding/brigading samples and 80000 non-spam samples were extracted for testing. Considering the length distribution of the similar subsequences and the distribution of their differences, the Jaccard similarity preset threshold thres was set accordingly.
TABLE 3 Detection accuracy and recall on the training and testing samples

Category    Garbage    Non-garbage    TP       FP     FN    Acc(%)    Rec(%)
Training    50000      50000          49912    115    88    99.97     99.82
Testing     20000      80000          19934    289    66    98.57     99.67
Here TP is the number of garbage samples correctly detected, FP is the number of non-garbage samples detected as garbage, FN is the number of garbage samples missed, Acc is the accuracy, and Rec is the recall.
With the text recognition method provided by this embodiment, the extracted sample features are model-trained with the gradient boosting decision tree (GBDT) algorithm to obtain a decision tree for texts. Because the sample features capture the abundance of special symbols and the high repetition rate of screen-flooding and comment-brigading comments, high-precision detection of such comments can be achieved.
Fig. 9 is a block diagram of a text recognition apparatus shown in the present application. Referring to fig. 9, the apparatus may include:
an obtaining module 901, configured to obtain a basic feature set of a text to be recognized, where the basic feature set is a set of length and proportion features of characters and symbols of each predetermined type included in the text to be recognized;
a generating module 902, configured to generate a text corresponding to the text to be recognized, where the text includes the text to be recognized and does not include symbols of each predetermined type;
an extracting module 903, configured to extract continuous repeated subsequence features from the text to be recognized and the text of the characters respectively, where the continuous repeated subsequence features are used to represent information of repeated occurrences of characters and symbols of each predetermined type in the corresponding text;
and a clustering module 904 configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detect whether the text to be identified is a text containing a repeated sequence based on the clustering result.
In an optional implementation manner, the obtaining module 901 is further configured to:
calculating the length of a character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of continuous special symbols, and the special symbols are symbols except Chinese characters, letters and expression symbols in the text to be recognized;
calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.
In an optional implementation, the extraction module 903 includes:
a first unit configured to generate character sequences of the text to be recognized and the text characters, respectively;
the second unit is configured to determine that the target subsequence is a continuous repeated similar subsequence of the corresponding text when the length and the similarity between two continuous target subsequences in the character sequence both meet preset conditions;
and a third unit configured to determine the repetition number, the length and the ratio in the corresponding text of the continuous repeated similar subsequence with the largest repetition number in the corresponding text as the continuous repeated subsequence feature of the corresponding text.
In an optional implementation, the second unit is further configured to:
generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;
determining a first target subsequence with the length being a preset length from the head in the first suffix tree sequence and a second target subsequence with the length being the preset length from the head in the second suffix tree sequence;
determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
In an optional implementation, the second unit is further configured to:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
In an optional implementation, the clustering module 904 is further configured to:
and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
In an optional implementation, the apparatus further comprises a training module, and the training module is configured to:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text and the type label by adopting a decision tree algorithm to obtain the clustering model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs operations and advantageous effects have been described in detail in the embodiment related to the method, and will not be elaborated upon here.
Fig. 10 is a block diagram of an electronic device shown in the present application. Referring to fig. 10, the electronic device may include components such as a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions is also provided; when the instructions are executed by a processor of the electronic device, the electronic device is enabled to perform the text recognition method described above.
Fig. 11 is a block diagram of another electronic device shown in the present application. Referring to fig. 11, this electronic device may, for example, be provided as a server.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
In the embodiments of the present application, user information (including, but not limited to, device information, personal information, and operation behavior information) is collected and subsequently processed or analyzed only with the user's authorization.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
A1, a text recognition method, the method comprising:
acquiring a basic feature set of a text to be recognized, wherein the basic feature set is a set of characters contained in the text to be recognized and length and proportion features of symbols of each preset type;
generating a text corresponding to the text to be recognized, wherein the text comprises the text to be recognized and does not comprise symbols of each preset type;
extracting continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features are used for representing information of repeated appearance of characters and symbols of each preset type in the corresponding text;
and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting whether the text to be identified is a text containing repeated sequences based on the clustering result.
A2, according to the text recognition method of A1, the step of obtaining the basic feature set of the text to be recognized includes:
calculating the length of a character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of continuous special symbols, and the special symbols are symbols except Chinese characters, letters and expression symbols in the text to be recognized;
calculating a first ratio of the character text contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.
A3, according to the text recognition method of A1, the step of extracting continuous repeated subsequence features from the text to be recognized and the text respectively comprises:
respectively generating character sequences of the text to be recognized and the literal text;
when the length and the similarity between two continuous target subsequences in the character sequence both meet preset conditions, determining the target subsequences as continuous repeated similar subsequences of the corresponding text;
and determining the repetition times and the lengths of the continuous repeated similar subsequences with the maximum repetition times in the corresponding texts and the ratios in the corresponding texts as the continuous repeated subsequence characteristics of the corresponding texts.
A4, according to the text recognition method of A3, when the length and the similarity between two continuous target subsequences in the character sequence both satisfy the preset conditions, the step of determining the target subsequences as continuous repeated similar subsequences of the corresponding text comprises:
generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;
determining a first target subsequence with the length being a preset length from the head in the first suffix tree sequence and a second target subsequence with the length being the preset length from the head in the second suffix tree sequence;
determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
A5, before the step of determining a first target subsequence of a preset length from the beginning in the first suffix tree sequence and a second target subsequence of the preset length from the beginning in the second suffix tree sequence, according to the method of a4, further comprising:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
A6, the method according to any one of A1 to A5, wherein the step of clustering features based on the continuously repeated sub-sequence features and the basic feature set comprises:
and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.
A7, the text recognition method of A6, before the step of performing feature clustering on the continuous repeated subsequence features and the basic feature set using the pre-trained clustering model, further comprising:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text, and the type label using a decision tree algorithm to obtain the clustering model.
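The training step of A7 can be sketched as below. This is a deliberate simplification, not the claimed implementation: a depth-1 decision tree (a stump) over the labeled feature vectors, with labels 1 (text containing repeated sequences) and 0 (normal text); the function name and feature layout are assumptions.

```python
def train_stump(samples, labels):
    """Train a depth-1 decision tree (a stump): pick the single feature
    and threshold that misclassify the fewest labeled samples, where a
    sample is predicted 1 (repeated-sequence text) when its feature
    value is at or above the threshold."""
    best = None  # (error count, feature index, threshold)
    for f in range(len(samples[0])):
        for candidate in samples:
            t = candidate[f]
            errs = sum((x[f] >= t) != bool(y) for x, y in zip(samples, labels))
            if best is None or errs < best[0]:
                best = (errs, f, t)
    _, f, t = best
    return lambda features: int(features[f] >= t)
```

A full decision tree would recurse on each side of the chosen split; the stump is enough to show how the labeled continuous repeated subsequence and basic features drive the learned model.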
A8, a text recognition device, the device comprising:
an acquisition module configured to acquire a basic feature set of a text to be recognized, wherein the basic feature set is a set of length and ratio features of the characters contained in the text to be recognized and of the symbols of each preset type;
a generating module configured to generate a character text corresponding to the text to be recognized, wherein the character text comprises the characters of the text to be recognized and does not comprise the symbols of each preset type;
an extraction module configured to extract continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features represent information on the repeated occurrence of characters and of symbols of each preset type in the corresponding text;
and a clustering module configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and to detect, based on the clustering result, whether the text to be recognized is a text containing repeated sequences.
A9, the text recognition apparatus of A8, the acquisition module further configured to:
calculating the length of the character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of consecutive special symbols, and the special symbols are the symbols in the text to be recognized other than Chinese characters, letters, and emoticons;
calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;
calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;
and determining the length of the character text, the maximum length of the continuous special symbol sequence, the first ratio, and the second ratio as elements of the basic feature set.
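A minimal sketch of the basic feature set computed in A9. The character classes are simplified assumptions: CJK ideographs, Latin letters, and digits count as characters, and everything else counts as a special symbol; the emoticon handling of the claim is not reproduced.

```python
import re

# "Characters" here are CJK ideographs, Latin letters and digits; every
# other code point is treated as a special symbol (a simplification).
CHAR_CLASS = r"[\u4e00-\u9fffA-Za-z0-9]"
SPECIAL_RUN = r"[^\u4e00-\u9fffA-Za-z0-9]+"

def basic_features(text):
    total = len(text)
    char_len = len(re.findall(CHAR_CLASS, text))          # length of the character text
    runs = re.findall(SPECIAL_RUN, text)
    max_special = max((len(run) for run in runs), default=0)  # longest special-symbol run
    return {
        "char_len": char_len,
        "max_special_run": max_special,
        "char_ratio": char_len / total if total else 0.0,       # first ratio
        "special_ratio": max_special / total if total else 0.0,  # second ratio
    }
```

For "abc!!!d" this yields a character-text length of 4, a maximum special-symbol run of 3, and ratios 4/7 and 3/7.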
A10, the text recognition apparatus of A8, the extraction module comprising:
a first unit configured to generate character sequences of the text to be recognized and of the character text, respectively;
a second unit configured to determine a target subsequence to be a continuous repeated similar subsequence of the corresponding text when the length of, and the similarity between, two consecutive target subsequences in the character sequence both meet preset conditions;
and a third unit configured to determine the repetition count, the length, and the ratio within the corresponding text of the continuous repeated similar subsequence having the largest repetition count in the corresponding text as the continuous repeated subsequence features of the corresponding text.
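The third unit's features can be sketched as follows for a known unit length; the brute-force scan and the restriction to exact repetitions are illustrative assumptions, not the claimed suffix-tree procedure.

```python
def repeat_features(seq, unit_len):
    """For a repeated unit of `unit_len` characters, find the longest run
    of consecutive exact repetitions anywhere in `seq` and report the
    repetition count, the covered length, and the covered length's ratio
    to the whole text."""
    best_count = 0
    for start in range(len(seq) - unit_len + 1):
        unit = seq[start:start + unit_len]
        count, pos = 1, start + unit_len
        while seq[pos:pos + unit_len] == unit:
            count += 1
            pos += unit_len
        best_count = max(best_count, count)
    covered = best_count * unit_len
    ratio = covered / len(seq) if seq else 0.0
    return best_count, covered, ratio
```

For "hahaha!" with a unit length of 2, the unit "ha" repeats 3 times, covering 6 of the 7 characters.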
A11, the text recognition apparatus of A10, the second unit further configured to:
generate, from the character sequence according to a preset rule, a plurality of suffix tree sequences and a code corresponding to each suffix tree sequence;
determine a first target subsequence of a preset length from the head of a first suffix tree sequence and a second target subsequence of the preset length from the head of a second suffix tree sequence;
and determine the first target subsequence and the second target subsequence to be continuous repeated similar subsequences when the absolute value of the difference between the code of the first suffix tree sequence and the code of the second suffix tree sequence is equal to the preset length and the similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.
A12, the text recognition apparatus of A11, the second unit further configured to:
obtaining an autocorrelation function of the character sequence;
determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;
and determining the preset length according to the repetition period.
A13, the text recognition apparatus of any one of A8 to A12, the clustering module further configured to:
perform feature clustering on the continuous repeated subsequence features and the basic feature set using a pre-trained clustering model.
A14, the text recognition apparatus of A13, the apparatus further comprising a training module configured to:
obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;
and perform model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text, and the type label using a decision tree algorithm to obtain the clustering model.