Text recognition method and device, electronic equipment and storage medium

文档序号：1556948 发布日期：2020-01-21 浏览：31次中文

阅读说明：本技术 文本识别方法、装置、电子设备及存储介质 (Text recognition method and device, electronic equipment and storage medium ) 是由刘春� 于 2019-07-04 设计创作，主要内容包括：本申请示出了一种文本识别方法、装置、电子设备及存储介质,所述文本识别方法包括：获取待识别文本的基本特征集合；生成与待识别文本对应的文字文本；分别从待识别文本和文字文本中提取连续重复子序列特征；基于连续重复子序列特征和基本特征集合进行特征聚类,得到聚类结果,并基于聚类结果检测待识别文本是否为包含重复序列的文本。基于连续重复子序列特征及基本特征集合进行特征聚类,确定待识别文本的类型,由于基本特征集合能够体现霸屏、刷队类评论特殊符号多的特点,连续重复子序列特征能够体现霸屏、刷队类评论重复率高的特点,因此,本申请能够更加准确地识别出霸屏、刷队类垃圾评论文本。(The application discloses a text recognition method, a text recognition device, electronic equipment and a storage medium, wherein the text recognition method comprises the following steps: acquiring a basic feature set of a text to be recognized; generating a text corresponding to the text to be recognized; respectively extracting continuous repeated subsequence features from the text to be recognized and the text; and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting whether the text to be identified is the text containing the repeated sequences based on the clustering result. The method comprises the steps of clustering features based on continuous repeated subsequence features and basic feature sets, determining the type of a text to be identified, and identifying spam comment texts of a overlook screen and a fleeting team due to the fact that the basic feature sets can reflect the characteristic that special symbols of the overlook screen and the fleeing team comment are many and the continuous repeated subsequence features can reflect the characteristic that the repetition rate of the overlook screen and the fleeing team comment is high.)

1. A method of text recognition, the method comprising:

acquiring a basic feature set of a text to be recognized, wherein the basic feature set is a set of characters contained in the text to be recognized and length and proportion features of symbols of each preset type;

generating a text corresponding to the text to be recognized, wherein the text comprises the text to be recognized and does not comprise symbols of each preset type;

extracting continuous repeated subsequence features from the text to be recognized and the character text respectively, wherein the continuous repeated subsequence features are used for representing information of repeated appearance of characters and symbols of each preset type in the corresponding text;

and performing feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detecting whether the text to be identified is a text containing repeated sequences based on the clustering result.

2. The text recognition method of claim 1, wherein the step of obtaining the basic feature set of the text to be recognized comprises:

calculating the length of a character text contained in the text to be recognized and the maximum length of a continuous special symbol sequence, wherein the continuous special symbol sequence is a sequence consisting of continuous special symbols, and the special symbols are symbols except Chinese characters, letters and expression symbols in the text to be recognized;

calculating a first ratio of the text to be recognized contained in the text to be recognized according to the length of the text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

3. The text recognition method of claim 1, wherein the step of extracting the continuous repeated subsequence features from the text to be recognized and the text respectively comprises:

respectively generating character sequences of the text to be recognized and the literal text;

when the length and the similarity between two continuous target subsequences in the character sequence both meet preset conditions, determining the target subsequences as continuous repeated similar subsequences of the corresponding text;

and determining the repetition times and the lengths of the continuous repeated similar subsequences with the maximum repetition times in the corresponding texts and the ratios in the corresponding texts as the continuous repeated subsequence characteristics of the corresponding texts.

4. The text recognition method of claim 3, wherein when the length and the similarity between two consecutive target subsequences in the character sequence both satisfy a preset condition, the step of determining the target subsequences as consecutive repeated similar subsequences of the corresponding text comprises:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

determining a first target subsequence with the length being a preset length from the head in the first suffix tree sequence and a second target subsequence with the length being the preset length from the head in the second suffix tree sequence;

determining the first target subsequence and the second target subsequence to be consecutive repeated similar subsequences when an absolute value of a difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and a similarity between the first target subsequence and the second target subsequence is greater than or equal to a preset threshold.

5. The method according to claim 4, wherein before the step of determining the first target subsequence having a preset length from the first position in the first suffix tree sequence and the second target subsequence having the preset length from the first position in the second suffix tree sequence, the method further comprises:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

6. The method according to any one of claims 1 to 5, wherein the step of clustering features based on the continuously repeated sub-sequence features and the basic feature set comprises:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

7. The method according to claim 6, wherein before the step of performing feature clustering on the continuous repeated subsequence features and the basic feature set by using the pre-trained clustering model, the method further comprises:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

and performing model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text and the type label by adopting a decision tree algorithm to obtain the clustering model.

8. A text recognition apparatus, characterized in that the apparatus comprises:

the system comprises an acquisition module, a comparison module and a recognition module, wherein the acquisition module is configured to acquire a basic feature set of a text to be recognized, and the basic feature set is a set of characters contained in the text to be recognized and length and ratio features of symbols of each preset type;

the generating module is configured to generate a text corresponding to the text to be recognized, wherein the text comprises characters of the text to be recognized and does not comprise symbols of each preset type;

the extraction module is configured to extract continuous repeated subsequence features from the text to be recognized and the text of the characters respectively, wherein the continuous repeated subsequence features are used for representing information of repeated appearance of characters and symbols of each preset type in the corresponding text;

and the clustering module is configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detect whether the text to be identified is a text containing a repeated sequence based on the clustering result.

9. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to perform the text recognition method of any one of claims 1-7.

10. A non-transitory computer readable storage medium, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of any one of claims 1-7.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text recognition method and apparatus, an electronic device, and a storage medium.

Background

In the prior art, the freely published comments of the users in the social platform greatly improve the viewing experience of the users, connect the users and the authors, and realize the social relationship between the users. However, spam comments such as screen overlooking, team brushing and the like issued by some users seriously affect the experience of normal users.

Disclosure of Invention

In order to overcome the problems in the related art, the application provides a text recognition method, a text recognition device, an electronic device and a storage medium.

According to a first aspect of the present application, there is provided a text recognition method, the method comprising:

generating a text corresponding to the text to be recognized, wherein the text comprises the text to be recognized and does not comprise symbols of each preset type;

In an optional implementation manner, the step of obtaining the basic feature set of the text to be recognized includes:

calculating a first ratio of the text to be recognized contained in the text to be recognized according to the length of the text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

In an optional implementation manner, the step of extracting continuous repeated subsequence features from the text to be recognized and the text respectively includes:

respectively generating character sequences of the text to be recognized and the literal text;

In an optional implementation manner, when there is a character sequence in which the length and the similarity between two consecutive target subsequences both satisfy a preset condition, the step of determining the target subsequences as consecutive repeated similar subsequences of the corresponding text includes:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

In an optional implementation manner, before the step of determining a first target subsequence having a preset length from the first position in the first suffix tree sequence and a second target subsequence having the preset length from the first position in the second suffix tree sequence, the method further includes:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

In an optional implementation manner, the step of clustering features based on the continuously repeated sub-sequence features and the basic feature set includes:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

In an optional implementation manner, before the step of performing feature clustering on the continuous repeated subsequence feature and the basic feature set by using a clustering model obtained through pre-training, the method further includes:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

According to a second aspect of the present application, there is provided a text recognition apparatus, the apparatus comprising:

In an optional implementation, the obtaining module is further configured to:

calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

In one optional implementation, the extraction module includes:

a first unit configured to generate character sequences of the text to be recognized and the text characters, respectively;

the second unit is configured to determine that the target subsequence is a continuous repeated similar subsequence of the corresponding text when the length and the similarity between two continuous target subsequences in the character sequence both meet preset conditions;

and a third unit configured to determine the repetition number, the length and the ratio in the corresponding text of the continuous repeated similar subsequence with the largest repetition number in the corresponding text as the continuous repeated subsequence feature of the corresponding text.

In an optional implementation, the second unit is further configured to:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

In an optional implementation, the second unit is further configured to:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

In an optional implementation, the clustering module is further configured to:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

In an optional implementation, the apparatus further comprises a training module configured to:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

According to a third aspect of the present application, there is provided an electronic device comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to perform the text recognition method according to the first aspect.

According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of the first aspect.

According to a fifth aspect of the present application, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method according to the first aspect.

The technical scheme provided by the application can comprise the following beneficial effects:

according to the technical scheme, feature clustering is carried out on the basis of the continuous repeated subsequence features and the basic feature set, the type of the text to be recognized is determined, the basic feature set can reflect the characteristic that the number of the special symbols of the overlook screen and the type of team-brushing comments is large, and the continuous repeated subsequence features can reflect the characteristic that the repetition rate of the overlook screen and the type of team-brushing comments is high, so that the overlook screen and the type of team-brushing spam comment text can be recognized more accurately.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1 shows several kinds of highlight comment texts in an embodiment of the present application.

Fig. 2 is several types of brushing comment texts shown in the embodiment of the present application.

Fig. 3 is a flowchart illustrating steps of a text recognition method according to an embodiment of the present application.

Fig. 4 is a flowchart illustrating a procedure for extracting a feature of a consecutive repeated sub-sequence according to an embodiment of the present application.

Fig. 5 is a flowchart illustrating a step of determining a consecutive repeated similar sub-sequence according to an embodiment of the present application.

Fig. 6a is a graph of an autocorrelation function of a continuously repeated identical subsequence as shown in an embodiment of the present application.

Fig. 6b is a graph of an autocorrelation function of a continuously repeated similar subsequence as shown in an embodiment of the present application.

Fig. 7 is a flowchart illustrating a step of obtaining a clustering model according to an embodiment of the present application.

Fig. 8 is a distribution diagram of the overall ratio of the continuously repeated similar subsequences in the training sample shown in the embodiment of the present application.

Fig. 9 is a block diagram illustrating a structure of a text recognition apparatus according to an embodiment of the present application.

Fig. 10 is a block diagram of an electronic device according to an embodiment of the present application.

Fig. 11 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

According to the difference of the comment feature extraction method, the existing spam comment identification method mainly comprises three types of methods based on rules, word frequency and spam vocabulary distribution features and comment semantic distribution features. The automatic spam comment classification method based on word distribution and document characteristics achieves automatic spam comment classification by counting keyword frequency distribution and document characteristics of network comments and conducting Bayesian classification. According to the comment similarity-based spam comment identification method and device, comment probability distribution is obtained through training a language model of comment data, a comment probability distribution library is built, and then comments are determined by comparing the comment probability distribution with the comment probability distribution similarity in the library to achieve spam comment detection. The method comprises the steps of firstly obtaining multi-vector representation of a text, and then classifying the multi-vector by using a classifier. A spam comment detection model based on a hierarchical attention mechanism neural network obtains semantic representation characteristics of sentences through a hierarchical attention based neural network HANN and realizes classification.

The Spam comments comprise categories such as overlook screens, fleeing, cheating, low customs and Spam, and the existing Spam comment detection technical scheme is a general detection method for realizing the Spam comments of different categories. In order to realize the general detection of the spam comments of different categories, the detection is realized by extracting comment words or sentence distribution expression characteristics and the like for classification.

FIG. 1 illustrates several kinds of empress-screen comments, which serve the purpose of holding the comment space by posting large space meaningless information. The inventor finds that the key characteristics of the overlook screen comment are that the special symbols such as spaces, tabulation carriage returns and the like have extremely high proportion, and part of the overlook screen comment can continuously generate a plurality of identical special symbols and punctuation mark sequences.

Fig. 2 shows several swizzle reviews, which the inventors found to be typically composed of identical or similar string repeats. According to the repeated composition of the same or similar character strings, the brushing comments can be classified into the same character category or the similar character category.

Therefore, the overlook screen and the team-brushing comment have the characteristics of many special symbols and high repetition rate, the prominent features of the overlook screen and the team-brushing comment cannot be reflected when the prior art scheme is used for detecting the overlook screen and the team-brushing comment, and high-precision detection cannot be realized.

In order to solve the above technical problem, a text recognition method provided by an embodiment of the present application is shown with reference to fig. 3, and the method includes the following steps.

In step S301, a basic feature set of the text to be recognized is obtained, where the basic feature set is a set of length and ratio features of characters and symbols of each predetermined type included in the text to be recognized.

The basic feature set may include, for example: chinese characters, letters, numbers, emoji characters, punctuation marks, spaces, tabulation carriage returns and other symbols, the length of each of 8 categories and the proportion in the text to be recognized.

Specifically, the text to be recognized may be preprocessed according to the character categories, that is, the text to be recognized is divided into 8 categories, that is, the categories include a Chinese character, a letter, a number, an emoji character, a punctuation mark, a space, a tabulation carriage return symbol, and other symbols, and then the length of each category and the ratio of the category to the text to be recognized are respectively calculated, so as to determine the basic feature set.

When the basic feature set is the combination of the text length, the text proportion (or non-text proportion), the maximum length of the continuous special symbol sequence and the special character proportion, the better recognition effect can be achieved, and the recognition accuracy is higher. The length of the text of the characters can be, for example, the length of continuous or discontinuous Chinese character text in the text to be recognized; the continuous special symbol sequence is a sequence consisting of continuous special symbols, and the special symbols are symbols in the text to be recognized except Chinese characters, letters and emoticons because the letters and emoji symbols generally have meanings in the text. In this case, the step of acquiring the basic feature set of the text to be recognized may include:

calculating the length of a character text (such as the length of a Chinese character text) contained in the text to be recognized and the maximum length of a continuous special symbol sequence; calculating a first proportion of the text contained in the text to be recognized according to the length of the text and the length of the text to be recognized; calculating a second proportion of the continuous special symbol sequence in the text to be recognized according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized; determining the text length, the maximum length of the continuous special symbol sequence, the first proportion and the second proportion as the elements of the basic feature set.

In step S302, a text corresponding to the text to be recognized is generated, where the text includes the text to be recognized and does not include symbols of each predetermined type.

Specifically, the text may be a text obtained by removing special characters from the text to be recognized.

In step S303, continuous repeated subsequence features are extracted from the text to be recognized and the text of the characters, respectively, where the continuous repeated subsequence features are used to represent information of repeated occurrences of characters and symbols of each predetermined type in the corresponding text.

The continuous repeated sub-sequence features may include the number of times of repeating the same sub-sequence or the similar sub-sequence, the length, and the ratio in the corresponding text.

In practical application, a repeated text search algorithm can be adopted to respectively extract target subsequences which continuously appear and have the length and the similarity meeting preset conditions from character sequences generated according to a text to be recognized and a text, and then continuous repeated subsequence characteristics of the corresponding text are determined according to the repetition times and the length of the target subsequences and the proportion of the target subsequences in the corresponding text. The repeated text search algorithm can be a suffix tree algorithm, and can also be a repeated text search algorithm based on autocorrelation period estimation and jaccard similarity measurement. The following embodiments will describe the extraction process of the continuously repeated sub-sequence features in detail.

In step S304, feature clustering is performed based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and whether the text to be identified is a text containing a repeated sequence is detected based on the clustering result.

In practical application, a clustering algorithm such as a neighbor propagation algorithm, Mean-shift and the like can be adopted to perform feature clustering on the continuous repeated subsequence features and the basic feature set, and a clustering model obtained by pre-training can be adopted to perform feature clustering on the continuous repeated subsequence features and the basic feature set.

Specifically, whether the text to be recognized is the rob-screen refreshing comment or the rob-screen comment can be determined according to the clustering result, and further, whether the text to be recognized belongs to the rob-screen refreshing comment or the rob-screen comment can be determined according to whether the text to be recognized contains the repeated sequence, for example, when the text to be recognized contains the repeated sequence, the text to be recognized can be determined as the rob-screen refreshing comment, otherwise, the text to be recognized is the rob-screen comment.

The text recognition method provided by the embodiment can be used for recognizing comments such as e-commerce user comments, microblog user comments, social media comments, short video comments and the like.

According to the text recognition method provided by the embodiment, feature clustering is carried out on the basis of the continuous repeated subsequence feature and the basic feature set, and the type of the text to be recognized is determined.

Referring to fig. 4, step 303 may further include:

in step S401, character sequences of the text to be recognized and the text of the word are generated, respectively.

Specifically, the step may include: generating a character sequence of the text to be recognized, and generating a character sequence of the literal text in the text to be recognized. The text is obtained by removing special characters in the text to be recognized. And meanwhile, the character sequences of the text to be recognized and the literal text are generated, so that the spam comment recognition accuracy can be further improved.

In step S402, when there are two consecutive target subsequences in the character sequence, the length and the similarity of which both satisfy the preset condition, the target subsequences are determined to be consecutive repeated similar subsequences of the corresponding text.

In practical applications, the steps may specifically include: when the length and the similarity between two continuous target subsequences in the character sequence of the text to be recognized meet preset conditions, the target subsequences are determined to be continuous repeated similar subsequences of the text to be recognized, and when the length and the similarity between two continuous target subsequences in the character sequence of the text meet the preset conditions, the target subsequences are determined to be continuous repeated similar subsequences of the text.

The preset condition may be, for example, a length of a repeating character string in the character sequence, that is, a repeating period, a length value near the repeating period, and the like, and the similarity is greater than or equal to a preset threshold, for example, 70%. The preset condition may be determined according to actual conditions, and this embodiment is not particularly limited.

In step S403, the repetition number and length of the continuous repeated similar subsequence having the largest repetition number in the corresponding text and the ratio in the corresponding text are determined as the continuous repeated subsequence feature of the corresponding text.

In practical application, when a plurality of continuous repeated similar subsequences exist in the character sequence of the text to be recognized, the continuous repeated subsequence feature of the text to be recognized can be a combined feature of the repeat times, the length, the percentage and the like of the continuous repeated subsequence with the largest repeat time in the text to be recognized.

When there are a plurality of continuous repeated similar subsequences in the character sequence of the text, the continuous repeated subsequence feature of the text can be a combined feature of the repeat times, the length, the ratio in the text and other features of the continuous repeated subsequence with the largest repeat time in the text.

The following describes a process of extracting the feature of the continuous repeated subsequence by taking a text to be recognized as an example. Assuming that the preset condition is that the length of the target subsequence is 2, and the similarity is 100%, i.e. the two target subsequences are identical.

When the text to be recognized is xy% z & x y @ x # yz, the character sequence s ' ═ { s0 ', s1 ', s2 ', … … s12 ' } of the text to be recognized is generated, wherein s0 ' ═ x, s1 ' ═ y, s2 ' ═ …, s12 ' ═ z. Two continuous target subsequences with the length and the similarity meeting the preset conditions do not exist in the character sequence, so that continuous repeated similar subsequences do not exist.

Extracting the text to be recognized as xyz xyyxyz from the text to be recognized, and generating a character sequence of the text, s { s0, s1, s2, … … s7}, wherein s0 ═ x, s1 ═ y, s2 ═ z, …, and s7 ═ z. Two target subsequences which are continuous and have the length and the similarity meeting the preset conditions are present in the character sequence, namely s3s4(xy) and s5s6(xy), so that a continuous repeated similar subsequence xy is present in the character sequence of the literal text, and xy is the continuous repeated subsequence with the largest repetition number in the character sequence. Furthermore, the continuous repeated similar subsequence xy with the largest repetition number is determined to have the repetition number of 2, the length of 2-bit characters and the proportion of 25 percent in the character sequence.

When the similarity is 100%, extracting the continuous repeated similar subsequence, that is, extracting the continuous repeated identical subsequence, which may be specifically described as "knowing a character sequence, solving for the 2-bit character string with the largest number of continuous occurrences in the character sequence", for the character sequence s of the literal text xyz xyyxyyz, in practical application, a character string suffix tree search algorithm may be used to perform solution, and a suffix tree sequence of the character sequence s is first generated, as shown in table 1 below.

TABLE 1 suffix tree sequence of character sequence s of literal text xyzxyyz

Suffix tree array	Suffix tree sequence
		substrs[0]	x y z x y x y z
substrs[1]	y z x y x y z
		substrs[2]	z x y x y z
substrs[3]	x y x y z
		substrs[4]	y x y z
substrs[5]	x y z
		substrs[6]	y z
substrs[7]	z

By comparing the first j-i characters of the suffix tree sequence substrs [ i ] and the suffix tree sequence substrs [ j ], if the j-i characters are the same, the j-i characters can be determined as a continuous repeated identical subsequence. As shown in Table 1, the first two characters of suffix tree sequence substr [3] are the same as the first two characters of suffix tree sequence substr [5], so xy is a consecutive repeat of the same subsequence. Traversing all suffix tree sequences can determine that the continuous repeated identical subsequence with the largest repetition number is xy, and the repetition number is 2.

The above process can be implemented by algorithm 1 based on successive repeated identical sub-sequence extractions of the suffix tree.

Algorithm 1 suffix tree based continuous repetition identical subsequence extraction algorithm

The input character sequence s may be a character sequence of a text to be recognized or a character sequence of a text. When the character sequence s is the character sequence of the above text xyz, N is 8, s is { s0, s1, s2, … … s7}, where s0 is x, s1 is y, s2 is z, …, s7 is z, and the suffix [ i ] of part 1 in algorithm 1 is expressed according to the suffix tree formula]＝s[i:N-1]Generating suffix tree sequence substrs [ i ]]The results are shown in Table 1, where i is 0 to 7. The 2 nd part of Algorithm 1 is then used to search for successively repeated identical subsequences, i.e., by comparing suffix tree sequences substrs [ i]And suffix tree sequence substrs [ j]The first j-i characters of (1) are substrs [ i ] respectively][0:j-i-1]And substrs [ j ]][0:j-i-1]If substrs [ i][0:j-i-1]And substrs [ j ]][0:j-i-1]If the same, the j-i characters can be determined as the continuous repeated identical subsequence, and all suffix tree sequences are traversed]The same subsequence msubstr (e.g., xy) repeated the most frequently, the length l (e.g., 2) of the same subsequence repeated continuously, and the number of times maxcount (e.g., 2) repeated continuously can be determined, and the ratio r of the same subsequence repeated continuously in the character sequence s of the literal text xyz xyyxyyz can be further determined_maxMaxcount L/L (e.g., 25%), where L is the total length of the character sequence s.

In one implementation, referring to fig. 5, the step of determining the continuously repeated similar sub-sequence in step S402 may specifically include:

in step S501, a plurality of suffix tree sequences and codes corresponding to the respective suffix tree sequences are generated according to a preset rule based on the character sequence.

In practical application, the character sequences corresponding to the overlook screen and the team-brushing comment are characterized in that similar subsequences are periodically distributed, only one or two characters are different between the similar subsequences, and the length is basically fixed. Since similar subsequences are not identical (i.e. similarity is not 100%), consecutive repeated similar subsequences in this case cannot be extracted using algorithm 1.

Specifically, taking the character sequence s ═ { a, b, c, d, e, a, b, c, d, f, a, b, c, d, g } as an example, each suffix tree sequence s [ i: N-1] and the code i corresponding to each suffix tree sequence are generated according to a predetermined rule such as substring [ i ]: s [ i: N-1], (N ═ 15), as shown in table 2 below.

TABLE 2 suffix tree sequence of character sequence s

Suffix tree array	Suffix tree sequence
		substrs[0]	abcdeabcdfabcdg
substrs[1]	bcdeabcdfabcdg
		substrs[2]	cdeabcdfabcdg
substrs[3]	deabcdfabcdg
		substrs[4]	eabcdfabcdg
substrs[5]	abcdfabcdg
		substrs[6]	bcdfabcdg
…	…
		substrs[14]	g

In step S502, a first target subsequence having a preset length from the beginning in the first suffix tree sequence and a second target subsequence having a preset length from the beginning in the second suffix tree sequence are determined.

Specifically, the first and second suffix tree sequences are two different suffix tree sequences of the plurality of suffix tree sequences. Assuming that the preset length is 3, the first suffix tree sequence is substr [1], the second suffix tree sequence is substr [4], then the first target subsequence ss1 is bcd, and the second target subsequence ss2 is eab.

In order to avoid the brute force type multi-dimensional search, under the condition that the length of the continuous repetition similar subsequence is unknown, how to determine the preset length to extract the subsequence needs to be considered. Because the autocorrelation function of the periodic sequence presents a peak value at the sequence period, and the characteristic of periodic distribution of the similar subsequences is considered, the repetition period of the similar subsequences can be determined by finding the maximum value of the autocorrelation function of the character sequence, and then the preset length is determined according to the repetition period.

For example, the step of determining the preset length may specifically include: obtaining an autocorrelation function of the character sequence; determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function; and determining the preset length according to the repetition period.

In particular, to calculate the autocorrelation function of a sequence of characters s, the sequence of characters may first be digitally encoded. If the character sequence s ═ s₀,s₁,s₂,……s_N-1The digital coding sequence of the code is x ═ x₀,x₁,x₂,……x_N-1Then its autocorrelation function is defined as:

wherein k is 0, 1

If the repetition period of the character sequence is T, r (k) reaches a maximum value at k ═ T, and therefore the period T is estimated by the following formula:

for example, if the character sequence s is abcdeabcdeabcdeabcdeabccde, which comprises a consecutive repetition of the same subsequence, the numerical code sequence x is 1234512345123451234512345, as shown in fig. 6a for its graph of the autocorrelation function. If the character sequence s is abcdeabcdfacbcdgabebcy, comprising a continuously repeated similar subsequence, the numerical coding sequence x is 1234512346123471234812349, and its autocorrelation function is shown in fig. 6 b. As a result of observation, the autocorrelation function of each character sequence s reaches a maximum value at a position where the period T is 5, and it can be determined that the repetition period of the character string in the character sequence s is 5. The character sequence s may be a character sequence of a text to be recognized or a character sequence of a text of a word.

In practical applications, the preset length may be set as a repetition period, a numerical value of ± 1 of the repetition period (that is, an absolute value of a difference between the preset length and the repetition period is 1), and the like, and a specific value of the preset length may be determined according to an actual situation, which is not limited in the present application.

The calculation rate can be improved by reasonably setting the preset length, continuous repeated similar subsequences do not need to be determined from target subsequences with any length, and the problem of complex calculation caused by violent search is avoided.

In step S503, when the absolute value of the difference between the encoding of the first suffix tree sequence and the encoding of the second suffix tree sequence is equal to the preset length and the similarity between the first target subsequence and the second target subsequence is greater than or equal to the preset threshold, the first target subsequence and the second target subsequence are determined to be consecutive repeated similar subsequences.

Specifically, when the absolute value | j-i | of the difference between the code i of the first suffix tree sequence substr [ i ] and the code j of the second suffix tree sequence substr [ j ] is equal to the preset length, the similarity between the first target subsequence ss1 and the second target subsequence ss2 may be calculated, and when the similarity is greater than or equal to a preset threshold, such as 70%, the first target subsequence ss1 and the second target subsequence ss2 are determined to be consecutive repeated similar subsequences. The specific value of the preset threshold may be determined according to actual conditions, and the application does not limit this.

To extract the consecutive repeated similar subsequences, a metric criterion for similarity of the character subsequences needs to be considered. The Jaccard similarity can be used for subsequence similarity measurements, considering that similar subsequences differ by only one or two characters.

Therefore, the similarity between the first target subsequence ss1 and the second target subsequence ss2 can be obtained by calculating the jaccard similarity J (ss1, ss2) between the two:

wherein ss1 ∩ ss2 is the intersection of the first target subsequence ss1 and the second target subsequence ss2, ss1 ∪ ss2 is the union of the first target subsequence ss1 and the second target subsequence ss2, and | is the number of sets.

In this embodiment, based on the Jaccard similarity metric criterion of the subsequence and the subsequence period estimation based on the autocorrelation function, a continuous repeated similar subsequence is extracted from the character sequence s, and the repetition number of the continuous repeated similar subsequence with the largest repetition number is determined, and the specific flow is shown in the following algorithm 2. In order to avoid the problem of complex calculation caused by violent search, whether the absolute value of the difference between the length | j-i | of the current similar subsequence and the estimation period T is less than or equal to 1 can be judged when searching for the similar subsequence based on the suffix tree.

Algorithm 2 continuous repeated similar subsequence extraction algorithm based on Jaccard similarity and autocorrelation function period estimation

The input character sequence s may be a character sequence of a text to be recognized or a character sequence of a text. When the character sequence s is { a, b, c, d, e, a, b, c, d, f, a, b, c, d, g }, N is 15, where s0 is a, s1 is b, s2 is c, …, s14 is g, the autocorrelation function r (k) of the character sequence is calculated according to part 1 of algorithm 2, and the repetition period T is determined to be 5. Suffix [ i ] formula from part 2 of Algorithm 2]＝s[i:N-1]Generating suffix tree sequence substrs [ i ]]The results are shown in Table 2, where i is 0 to 14. Then, the 3 rd part in the algorithm 2 is adopted to search continuous repeated similar subsequences, and when the absolute value of the difference between the length of the target subsequence (i.e. the preset length | j-i |) and the repetition period T, i.e. | j-i-T | is less than or equal to 1, the first suffix tree sequence substrs [ i |) is compared]A first target subsequence [ i ] with the length from the first position as a preset length][0:j-i-1]With a second suffix tree sequence substrs [ j]A second target subsequence [ j ] with the length from the first position to the preset length][0:j-i-1]Similarity between them jaccard (substrs [ i ]][0:j-i-1]，substrs[j][0:j-i-1]) If the similarity is larger than a preset threshold thres, determining that the first target subsequence and the second target subsequence are continuous repeated similar subsequences, and traversing all suffix tree sequences [ k ]]The continuous repeated similar subsequence msubstr with the largest repetition number, the length l of the continuous repeated similar subsequence and the repetition number maxcount can be determined, and the proportion r of the continuous repeated similar subsequence in the character sequence s can be further determined_maxMaxcount L/L, where L is the total length of the character sequence s. The continuous repeated similar subsequence extracted by the algorithm 2 also comprises the continuous repeated identical subsequence.

In one implementation, the step S304 may further include: and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

Referring to fig. 7, the step of obtaining a clustering model in advance may include:

in step S701, a sample text is obtained, and the sample text is labeled, so as to obtain a type tag of the sample text.

Specifically, the sample text may be e-commerce user comments, microblog user comments, social media comments, short video comments, and the like. And performing type label marking on the sample text according to whether the sample text is a spam comment such as a overlook screen and a team brushing comment.

In step S702, a decision tree algorithm is used to perform model training on the continuous repeated subsequence features of the sample text, the basic feature set of the sample text, and the type label, so as to obtain a clustering model.

The extraction of the continuous repeated subsequence features of the sample text (which may include the continuous repeated subsequence features of the original sample text and the continuous repeated subsequence features of the text of the original sample text) may refer to the description in step S303, and the extraction of the basic feature set of the sample text may refer to the description in step S301, which is not described herein again.

The decision tree algorithm may be Xgboost, random forest, Adaboost, or gradient boost decision tree.

The following description will take a gradient boosting decision tree as an example. A Gradient Boosting Decision Tree (GBDT) boosts classification Decision performance by fusing a plurality of weak classifiers along the loss function residual gradient direction. If sample text sequence x_iAnd the corresponding type label is y_iAnd if the m-th iteration weak classifier is T (x; theta m), the final classifier is:

wherein M is the maximum number of iterations.

If the loss function of the weak classifier is defined as the likelihood loss function:

L(y，F(x))＝∑y_ilog(F(x_i))+(1-y_i)log(1-F(x_i))

the m-th iteration weak classifier parameters are estimated as

Wherein, F_m(x_i)＝F_m-1(x_i)+T(x_i；θ_m) K is the number of sample texts, and F can be set₀(x_i)＝0。

Considering that the text features have correlation and the threshold of a single text feature is difficult to set, a gradient boosting decision tree classification algorithm GBDT is used for clustering the feature set. The GBDT algorithm is used for realizing a high-precision classifier by performing iterative learning integration on a plurality of weak classifiers along the gradient direction of classification residual errors.

By adopting the text recognition method provided by the application, mass comment data are verified, and the result shows that the technical scheme of the application has extremely high accuracy and extremely low false detection rate for the overlook screen and team-swiping comments. The specific verification conditions are as follows:

selecting 50000 super screen scrubbing from the labeled samples, taking 50000 non-super screen scrubbing samples as a training set, and performing the following steps: and (4) extracting 20000 overlord screen swipes from the residual sample library in an 8-proportion mode, and testing 80000 non-overlord screen swipes samples. Considering the similar subsequence length distribution and the difference distribution, the Jaccard similarity preset threshold thres in algorithm 2 is set to 0.7. FIG. 8 is a distribution diagram of the overall proportion rmax of the continuously repeated similar subsequences in the original text of the training sample, wherein 1 represents a super-screen flushing category, and 2 represents a non-super-screen flushing category. According to the observation result, the rmax values of the two types of samples show obvious bimodal distribution, the rmax of the overlord swizzle sample is distributed near 1.0, the rmax of the non-overlord swizzle sample is distributed near 0.2, and the two types of texts can be well distinguished through the rmax characteristic. Some non-blocking screen swizzling samples are found near rmax of 1, and the analysis reason is that the non-blocking screen swizzling text contains parts like '666' orAnd the like, which are distinguishable by text length and non-text proportion. Table 3 shows the results of the tests of the training samples and the test samples. The accuracy rate and the recall rate of the test sample are both higher than 98.5 percent, and the purpose of shielding the heater screen is realizedAnd high-precision and high-recall detection of the brushing type text.

TABLE 3 training and testing sample detection accuracy and recall

Categories	Garbage collection	Non-garbage	TP	FP	FN	Acc(％)	Rec(％)
								Training	50000	50000	49912	115	88	99.97	99.82
Testing	20000	80000	19934	289	66	98.57	99.67

Wherein TP is the number of correct detections of garbage samples, FP is the number of detected garbage samples of non-garbage samples, FN is the number of false detections of garbage samples, Acc is the accuracy rate, and Rec is the recall rate.

According to the text recognition method provided by the embodiment, the extracted sample characteristics are subjected to model training by using a gradient boosting decision tree GBDT algorithm to obtain the decision tree of the text, and the sample characteristics take the characteristics of more special symbols and high repetition rate of the overlook screen and the fleeting type comments into consideration, so that high-precision detection of the overlook screen and the fleeing type comments can be realized.

Fig. 9 is a block diagram of a text recognition apparatus shown in the present application. Referring to fig. 9, the apparatus may include:

an obtaining module 901, configured to obtain a basic feature set of a text to be recognized, where the basic feature set is a set of length and proportion features of characters and symbols of each predetermined type included in the text to be recognized;

a generating module 902, configured to generate a text corresponding to the text to be recognized, where the text includes the text to be recognized and does not include symbols of each predetermined type;

an extracting module 903, configured to extract continuous repeated subsequence features from the text to be recognized and the text of the characters respectively, where the continuous repeated subsequence features are used to represent information of repeated occurrences of characters and symbols of each predetermined type in the corresponding text;

and a clustering module 904 configured to perform feature clustering based on the continuous repeated subsequence features and the basic feature set to obtain a clustering result, and detect whether the text to be identified is a text containing a repeated sequence based on the clustering result.

In an optional implementation manner, the obtaining module 901 is further configured to:

calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

In an optional implementation, the extracting module 303 includes:

a first unit configured to generate character sequences of the text to be recognized and the text characters, respectively;

In an optional implementation, the second unit is further configured to:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

In an optional implementation, the second unit is further configured to:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

In an optional implementation, the clustering module 304 is further configured to:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

In an optional implementation, the apparatus further comprises a training module 305, the training module 305 being configured to:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs operations and advantageous effects have been described in detail in the embodiment related to the method, and will not be elaborated upon here.

Fig. 10 is a block diagram of an electronic device 800 shown in the present application. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 10, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, images, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 11 is a block diagram of an electronic device 1900 shown in the present application. For example, the electronic device 1900 may be provided as a server.

Referring to fig. 11, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

In this embodiment, the user information (including, but not limited to, device information, personal information, and operation behavior information) is collected and subjected to subsequent processing or analysis by the user authorization.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

A1, a text recognition method, the method comprising:

generating a text corresponding to the text to be recognized, wherein the text comprises the text to be recognized and does not comprise symbols of each preset type;

A2, according to the text recognition method of A1, the step of obtaining the basic feature set of the text to be recognized includes:

calculating a first ratio of the text to be recognized contained in the text to be recognized according to the length of the text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

A3, according to the text recognition method of A1, the step of extracting continuous repeated subsequence features from the text to be recognized and the text respectively comprises:

respectively generating character sequences of the text to be recognized and the literal text;

A4, according to the text recognition method of A3, when the length and the similarity between two continuous target subsequences in the character sequence both satisfy the preset conditions, the step of determining the target subsequences as continuous repeated similar subsequences of the corresponding text comprises:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

A5, before the step of determining a first target subsequence of a preset length from the beginning in the first suffix tree sequence and a second target subsequence of the preset length from the beginning in the second suffix tree sequence, according to the method of a4, further comprising:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

A6, the method according to any one of A1 to A5, wherein the step of clustering features based on the continuously repeated sub-sequence features and the basic feature set comprises:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

A7, according to the method in A6, before the step of clustering the features of the continuous repeated sub-sequence features and the basic feature set by using the pre-trained clustering model, the method further includes:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

A8, a text recognition device, the device comprising:

A9, the text recognition apparatus of A8, the obtaining module further configured to:

calculating a first ratio of characters contained in the text to be recognized according to the length of the character text and the length of the text to be recognized;

calculating a second ratio of the continuous special symbol sequence according to the maximum length of the continuous special symbol sequence and the length of the text to be recognized;

determining the length of the text, the maximum length of the continuous special symbol sequence, the first ratio and the second ratio as elements of the basic feature set.

A10, the text recognition apparatus of A8, the extraction module comprising:

a first unit configured to generate character sequences of the text to be recognized and the text characters, respectively;

A11, the text recognition apparatus of A10, the second element further configured to:

generating a plurality of suffix tree sequences and codes corresponding to the suffix tree sequences according to a preset rule according to the character sequences;

A12, the text recognition apparatus of A11, the second element further configured to:

obtaining an autocorrelation function of the character sequence;

determining the repetition period of the character strings in the character sequence according to the position of the maximum value of the autocorrelation function;

and determining the preset length according to the repetition period.

A13, the text recognition apparatus of any one of A8 to A12, the clustering module further configured to:

and performing feature clustering on the continuous repeated subsequence features and the basic feature set by adopting a clustering model obtained by pre-training.

A14, the text recognition apparatus of A13, the apparatus further comprising a training module configured to:

obtaining a sample text, labeling the sample text, and obtaining a type label of the sample text;

28页详细技术资料下载

上一篇：一种医用注射器针头装配设备

下一篇：基于词向量进行近似搜索快速提取广告文本主题的方法

Text recognition method and device, electronic equipment and storage medium

相关技术

网友询问留言