Block induction sequencing method and system for text suffix index

Document No.: 1391826  Publication date: 2020-02-28

Reading note: This technology, "Block induction sequencing method and system for text suffix index", was designed by 解静仪 and 农革 on 2019-10-18. Its main content is as follows. The embodiments of the present application are applicable to the technical field of data processing and provide a method and a system for blocked inductive sorting of a text suffix index. The method comprises: for any character string, determining a plurality of substrings or a plurality of suffixes of the character string and storing them in a plurality of preset data blocks; scanning the data blocks in a preset order; for each scanned current data block, sorting the substrings or suffixes in the current data block according to a preset stable sorting mode; scanning the substrings or suffixes in the current data block in a preset order; for each scanned target substring or target suffix, determining the target data block to which the preceding substring of a specific type of the target substring, or the preceding suffix of a specific type of the target suffix, belongs; and writing the preceding substring or preceding suffix into that target data block. The embodiments address the poor data locality of the inductive sorting process when a text suffix index is constructed.

1. A method for blocked inductive sorting of a text suffix index, characterized by comprising the following steps:

for any character string, determining a plurality of substrings or a plurality of suffixes of the character string, and storing the plurality of substrings or the plurality of suffixes in a plurality of preset data blocks;

scanning each data block in a preset order;

for any scanned current data block, sorting each substring or each suffix in the current data block according to a preset stable sorting mode;

scanning each substring or each suffix in the current data block in a preset order;

for any scanned target substring or target suffix, determining a target data block to which a preceding substring of a specific type of the target substring, or a preceding suffix of a specific type of the target suffix, belongs;

and writing the preceding substring or the preceding suffix into the target data block to which it belongs.

2. The method of claim 1, wherein the step of storing the plurality of substrings or the plurality of suffixes to a preset plurality of data blocks comprises:

allocating, for the character string, a storage unit for storing the suffix array of the character string;

dividing the storage unit into a plurality of data blocks according to a preset blocking mode, and collecting positioning information of each data block;

and storing the plurality of substrings or the plurality of suffixes to the plurality of data blocks according to the positioning information of the data blocks.

3. The method of claim 2, wherein the character string includes L-type characters and S-type characters, each L-type character corresponding to the L-type substring and the L-type suffix that start with it, and each S-type character corresponding to the S-type substring and the S-type suffix that start with it, the method further comprising:

through preprocessing, setting the value of each L-type character in the character string to the start position of the corresponding character bucket in the suffix array, and the value of each S-type character in the character string to the end position of the corresponding character bucket in the suffix array;

initializing a block counter of each data block according to the types of the plurality of substrings or the plurality of suffixes; if the substrings are L-type substrings or the suffixes are L-type suffixes, pointing the block counter to the block head position of each data block; if the substrings are S-type substrings or LMS substrings, or the suffixes are S-type suffixes or LMS suffixes, pointing the block counter to the block tail position of each data block;

resetting the block counter of each data block when the type of the currently executed inductive sorting task is determined; if the currently executed inductive sorting task is inductive sorting of L-type substrings or L-type suffixes, pointing the block counter to the block head position of each data block; and if the currently executed inductive sorting task is inductive sorting of S-type substrings or S-type suffixes, pointing the block counter to the block tail position of each data block.

4. The method according to claim 1, wherein the step of sorting, for any scanned current data block, each substring or each suffix in the current data block according to a preset stable sorting manner comprises:

scanning each substring or each suffix in the current data block one by one;

sorting the substring to the position indicated by the bucket counter corresponding to its starting character, according to the starting character of the substring and the bucket counters of the current data block; or sorting the suffix to the position indicated by the bucket counter corresponding to its starting character, according to the starting character of the suffix and the bucket counters of the current data block;

if the substring is an L-type substring or the suffix is an L-type suffix, moving the position indicated by the bucket counter corresponding to the starting character one position to the right;

and if the substring is an S-type substring or the suffix is an S-type suffix, moving the position indicated by the bucket counter corresponding to the starting character one position to the left.

5. The method according to claim 2, wherein the step of determining, for any scanned target substring or target suffix, a target data block to which a specific type of preceding substring of the target substring or a specific type of preceding suffix of the target suffix belongs comprises:

for any scanned target substring, determining the starting character of the preceding substring of a specific type of the target substring, or for any scanned target suffix, determining the starting character of the preceding suffix of a specific type of the target suffix;

if the currently executed inductive sorting task is inductive sorting of L-type substrings or L-type suffixes, processing only an L-type preceding substring of the target substring or an L-type preceding suffix of the target suffix;

if the currently executed inductive sorting task is inductive sorting of S-type substrings or S-type suffixes, processing only an S-type preceding substring of the target substring or an S-type preceding suffix of the target suffix;

and calculating the block number of the target data block according to the starting character and the positioning information of each data block.

6. The method according to claim 5, further comprising, after the step of calculating the block number of the target data block according to the starting character and the positioning information of each data block:

judging whether the starting character of the preceding substring of the specific type of the target substring, or the starting character of the preceding suffix of the specific type of the target suffix, is a boundary character;

if yes, reading the number of boundary characters of the target data block;

and relocating, according to the starting character, the fixed length of the data block, and the number of boundary characters of the target data block, a new target data block to which the preceding substring of the specific type of the currently scanned target substring, or the preceding suffix of the specific type of the currently scanned target suffix, belongs.

7. The method of claim 6, wherein the step of relocating a new target data block to which the preceding substring of the specific type of the currently scanned target substring, or the preceding suffix of the specific type of the currently scanned target suffix, belongs according to the starting character, the fixed length of the data block, and the number of boundary characters of the target data block comprises:

if the starting character is an L-type character, calculating the sum of the value corresponding to the starting character and the number of boundary characters of the target data block, dividing the sum by the fixed length of the data block, and taking the quotient rounded down as the block number of the relocated new target data block;

if the starting character is an S-type character, calculating the difference between the value corresponding to the starting character and the number of boundary characters of the target data block, dividing the difference by the fixed length of the data block, and taking the quotient rounded down as the block number of the relocated new target data block;

adding one to the number of boundary characters of the target data block before relocation;

and updating the block number of the target data block to be consistent with the block number of the new target data block.

8. The method of claim 1, wherein the step of writing the preceding substring or the preceding suffix into the target data block to which it belongs comprises:

determining the start position of the preceding substring or the preceding suffix in the character string;

judging whether the target data block is the current data block;

if so, writing the value corresponding to the start position of the preceding substring or the preceding suffix in the character string into the position indicated by the bucket counter corresponding to its starting character, according to the starting character of the preceding substring or the preceding suffix and the bucket counters of the current data block;

if the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, moving the position indicated by the bucket counter corresponding to the starting character one position to the right;

if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, moving the position indicated by the bucket counter corresponding to the starting character one position to the left;

otherwise, writing the value corresponding to the start position of the preceding substring or the preceding suffix in the character string into the position indicated by the block counter of the target data block;

if the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, moving the position indicated by the block counter one position to the right;

and if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, moving the position indicated by the block counter one position to the left.

9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method for blocked inductive sorting of a text suffix index according to any of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method for blocked inductive sorting of a text suffix index according to any of claims 1 to 8.

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a method and a system for blocked inductive sorting of a text suffix index.

Background

In data processing technology, the suffix array (SA) is an important and widely used data structure. It is a space-efficient alternative to the suffix tree and is used in fields such as data compression, genome comparison, and full-text search. With the development of internet technology, large amounts of multi-source heterogeneous data are generated around the world all the time. Managing and retrieving these data efficiently requires a full-text index that can be constructed and queried efficiently, and the suffix index, i.e., the suffix array, is a full-text index suited to this scenario.

Constructing a suffix index for multi-source heterogeneous data is in principle the same as constructing a suffix index (or suffix array) for text, and the latter has long been a key research topic. To date, methods and systems that construct suffix arrays based on the Inductive Sorting (IS) method have achieved optimal time and space performance both in theory and in practice. Accordingly, for many computational models of modern general-purpose computers, such as in-memory serial, in-memory parallel, and external-memory serial, researchers have proposed high-performance suffix array construction methods based on the IS method and developed matching programs or systems. However, because the IS method has poor data locality, these methods and systems have difficulty making effective use of the cache memory (Cache) of modern general-purpose computers, which limits their time performance.

In the field of full-text retrieval, the efficiency of full-text index construction directly determines the usability of the full-text index. Therefore, to further improve the construction efficiency of the text suffix index, the poor data locality of the IS method needs to be solved.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and a system for blocked inductive sorting of a text suffix index, so as to solve the prior-art problem of poor data locality in the inductive sorting process when a text suffix index is constructed.

A first aspect of the embodiments of the present application provides a method for blocked inductive sorting of a text suffix index, including:

for any character string, determining a plurality of substrings or a plurality of suffixes of the character string, and storing the plurality of substrings or the plurality of suffixes in a plurality of preset data blocks;

scanning each data block in a preset order;

for any scanned current data block, sorting each substring or each suffix in the current data block according to a preset stable sorting mode;

scanning each substring or each suffix in the current data block in a preset order;

for any scanned target substring or target suffix, determining a target data block to which a preceding substring of a specific type of the target substring, or a preceding suffix of a specific type of the target suffix, belongs;

and writing the preceding substring or the preceding suffix into the target data block to which it belongs.

A second aspect of the embodiments of the present application provides a system for blocked inductive sorting of a text suffix index, including:

a substring or suffix storage module, configured to determine, for any character string, a plurality of substrings or a plurality of suffixes of the character string and store them in a plurality of preset data blocks;

a data block scanning module, configured to scan each data block in a preset order;

a substring or suffix sorting module, configured to sort, for any scanned current data block, each substring or each suffix in the current data block according to a preset stable sorting mode;

a substring or suffix scanning module, configured to scan each substring or each suffix in the current data block in a preset order;

a target data block determination module, configured to determine, for any scanned target substring or target suffix, a target data block to which a preceding substring of a specific type of the target substring, or a preceding suffix of a specific type of the target suffix, belongs;

and a preceding substring or preceding suffix writing module, configured to write the preceding substring or the preceding suffix into the target data block to which it belongs.

A third aspect of the embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method for blocked inductive sorting of a text suffix index according to the first aspect.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the method for blocked inductive sorting of a text suffix index according to the first aspect.

Compared with the prior art, the embodiment of the application has the following advantages:

In the embodiments of the present application, for any character string, after a plurality of substrings or suffixes of the character string are determined and stored in a plurality of preset data blocks, the data blocks may be scanned in a preset order. For each scanned current data block, the substrings or suffixes in the current data block are sorted according to a preset stable sorting mode and then scanned in a preset order. For each scanned target substring or target suffix, the target data block to which the preceding substring of a specific type of the target substring, or the preceding suffix of a specific type of the target suffix, belongs is determined, and the preceding substring or preceding suffix is written into that target data block. In this embodiment, a preceding substring or preceding suffix to be inductively sorted is first located to the target data block to which it belongs in the storage unit, and is then stably sorted to its final position within the target data block according to its starting character, instead of being inductively sorted directly into a character bucket according to its starting character as in the IS method. The bucket counters of magnitude O(n) that require random access during inductive sorting are thereby compressed into block counters of magnitude O(number of blocks), which improves the data locality of the method and system for constructing the suffix index. In practical applications, on a modern general-purpose computer equipped with a cache memory, the method and system provided by this embodiment can improve the cache hit rate, so that suffix index construction is more time-efficient and resources are better utilized.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a flowchart of the steps of a blocked inductive sorting method for a text suffix index according to an embodiment of the present application;

FIG. 2 is a schematic block diagram of a multi-core computer according to an embodiment of the present application;

FIG. 3 is a data processing flow diagram of a blocked inductive sorting method for a text suffix index according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a blocked inductive sorting system for a text suffix index according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

The technical solution of the present application will be described below by way of specific examples.

First, technical terms that may be used in the present application are collectively described herein.

Character string: a character string X of length n is a character array X[0 … n-1] consisting of n characters from the character set Σ arranged in order, where X[n-1] is assumed to be the character $, which appears only once in X and is lexicographically the smallest.

Substring: X[i … j], 0 ≤ i ≤ j ≤ n-1, i.e., the character string consisting of all characters from X[i] to X[j] (including X[i] and X[j]).

Suffix: X[i … n-1], 0 ≤ i ≤ n-1, generally denoted suf(X, i).

Suffix Array (SA): the data structure is obtained by arranging all suffixes of X in a lexicographic order from small to large and storing the initial positions of all the suffixes in X in an integer array in the arranged order.
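The definition above can be illustrated with a minimal sketch that sorts the suffixes directly. This is only a reference construction for small inputs, not the inductive sorting method of the present application; the function name naive_suffix_array and the sample string are assumptions made for illustration.

```python
def naive_suffix_array(x):
    """Return the start positions of all suffixes of x in lexicographic order."""
    # Direct comparison-based sorting of suffixes; slow for long strings and
    # shown only to make the definition of the suffix array concrete.
    return sorted(range(len(x)), key=lambda i: x[i:])


# "$" is the unique, lexicographically smallest terminator of the string.
sa = naive_suffix_array("banana$")
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
```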

Type of character and suffix: each character X[i] in X is classified as either L-type or S-type. X[i] is an S-type character if (1) i = n-1; or (2) X[i] < X[i+1]; or (3) X[i] = X[i+1] and X[i+1] is an S-type character; otherwise X[i] is an L-type character. Further, if X[i] is an S-type character and X[i-1] is an L-type character, then X[i] is also an LMS character. The type of the suffix suf(X, i) of X is the same as that of its first character X[i].
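The type rules above can be evaluated in a single right-to-left scan, since the type of X[i] depends only on X[i], X[i+1], and the type of X[i+1]. The following is a minimal sketch under that reading; the function name classify_types and the "L"/"S" string labels are illustrative choices.

```python
def classify_types(x):
    """Return (types, lms) where types[i] is "L" or "S" and lms lists LMS positions."""
    n = len(x)
    types = ["S"] * n  # rule (1): the last character X[n-1] is S-type
    for i in range(n - 2, -1, -1):
        if x[i] < x[i + 1]:
            types[i] = "S"           # rule (2)
        elif x[i] == x[i + 1]:
            types[i] = types[i + 1]  # rule (3): equal characters share a type
        else:
            types[i] = "L"
    # An LMS character is an S-type character preceded by an L-type character.
    lms = [i for i in range(1, n) if types[i] == "S" and types[i - 1] == "L"]
    return types, lms


types, lms = classify_types("banana$")
print("".join(types), lms)  # LSLSLLS [1, 3, 6]
```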

L/S/LMS substring: for a substring X[i … j] (i ≠ j) of X, if X[i] and X[j] are both L/S/LMS characters and there is no other L/S/LMS character between them, then X[i … j] is an L/S/LMS substring. In addition, X[n-1] is also an LMS substring.

Preceding character, substring, and suffix: for a character X[i] (i > 0) of the character string X, its preceding character is X[i-1]; similarly, the preceding substring or preceding suffix of the substring or suffix starting with X[i] starts at X[i-1].

Block to which a substring or suffix belongs: for an SA that has been partitioned according to the preset partitioning mode, the block to which suf(X, i) belongs is the block that contains its position once all suffixes of X have been sorted; the block to which the substring X[i … j] belongs is defined in the same way.

Bucket: the suffixes starting with the same character occupy a contiguous region of the SA, called the bucket of that character.

Boundary character: once the SA has been partitioned, the bucket of a character may span multiple blocks; such a character is called a boundary character of those blocks.
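As a small illustration of the bucket and boundary character definitions, the sketch below computes each character's bucket range in the SA from character counts and then lists the characters whose bucket crosses a block border. It assumes fixed-length blocks; the names bucket_ranges, boundary_characters, and block_len are hypothetical.

```python
from collections import Counter


def bucket_ranges(x):
    """Map each character to the (head, tail) positions of its bucket in the SA."""
    counts = Counter(x)
    ranges, start = {}, 0
    for c in sorted(counts):
        ranges[c] = (start, start + counts[c] - 1)
        start += counts[c]
    return ranges


def boundary_characters(ranges, block_len):
    """Return the characters whose bucket spans more than one fixed-length block."""
    return [c for c, (head, tail) in ranges.items()
            if head // block_len != tail // block_len]


ranges = bucket_ranges("banana$")
print(ranges)                          # {'$': (0, 0), 'a': (1, 3), 'b': (4, 4), 'n': (5, 6)}
print(boundary_characters(ranges, 3))  # ['a', 'n']
```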

Referring to FIG. 1, a schematic flowchart of the steps of a blocked inductive sorting method for a text suffix index according to an embodiment of the present application is shown. The method may specifically include the following steps:

S101, for any character string, determining a plurality of substrings or a plurality of suffixes of the character string, and storing the plurality of substrings or the plurality of suffixes in a plurality of preset data blocks;

It should be noted that the method can be applied to a multi-core computer. By implementing the method, the data locality of the core flow can be improved when a text suffix index is constructed in the memory of a multi-core computer. That is, the data locality of inductively sorting preceding substrings or preceding suffixes when the IS method is used to construct the suffix index is improved, which raises the cache hit rate of the text suffix index construction method and its matching system on a modern general-purpose computer and thereby further improves their time performance.

In this embodiment, for any character string, at a particular stage of constructing its text suffix index, a plurality of substrings or suffixes of the character string may be determined, and the computer system may then allocate storage units for storing the character string and, subsequently, the suffix array of the character string.

The character string may include L-type characters and S-type characters; each L-type character corresponds to the L-type substring and L-type suffix that start with it, and each S-type character corresponds to the S-type substring and S-type suffix that start with it.

Initially, the character string and the allocated storage unit may be preprocessed first, and the initialization task is completed.

In this embodiment, the preprocessing and initialization tasks may include partitioning the storage unit that stores the suffix array into blocks, initializing the block counters and bucket counters, and so on.

Therefore, after the storage unit required for storing the suffix array is allocated, the storage unit can be divided into a plurality of data blocks, the positioning information of each data block is collected, and the plurality of substrings or suffixes are then stored in the data blocks according to the positioning information of each data block.

In this embodiment, the storage unit may be partitioned in different processing manners, such as uniform partitioning or partitioning according to buckets, which is not specifically limited in this embodiment.

For any character string, required storage units can be firstly allocated to the character string and a suffix array of the character string, and then the storage units for storing the suffix array of the character string are divided into a plurality of data blocks according to a preset dividing mode. The preset dividing manner may be determined by a user according to actual needs, and this embodiment does not limit this.

Taking uniform partitioning as an example, if the preset partitioning mode is uniform partitioning and the partitioning number is m, the storage unit of the suffix array of the character string may be partitioned into a plurality of blocks of the same size to obtain uniform blocks, where the symbol ⌈n/m⌉ means that the ratio between n and m is calculated and rounded up. In the case of uniform blocking, the block length of each data block is fixed except for the last block. Of course, if the data blocks are partitioned according to buckets, the block length of each obtained data block is generally not fixed.

After blocking, the positioning information of each data block can be collected according to the blocking mode; the positioning information is used to calculate the block number of the target data block to which a substring or suffix of a specific type belongs in the suffix array.
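A minimal sketch of uniform blocking follows. It assumes that m is the requested number of blocks and that the block length is ⌈n/m⌉, which is one possible reading of the description; under that assumption the block number of an SA position is the position divided by the block length, rounded down. The names uniform_blocks and block_number are hypothetical.

```python
def uniform_blocks(n, m):
    """Split SA positions 0..n-1 into blocks of length ceil(n/m).

    Returns (block_len, blocks), where blocks is a list of (head, tail) positions.
    Assumption: m is the requested number of blocks; the last block may be shorter,
    and fewer than m blocks may result when n is not much larger than m.
    """
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    block_len = -(-n // m)  # integer ceiling of n / m
    blocks = [(head, min(head + block_len, n) - 1)
              for head in range(0, n, block_len)]
    return block_len, blocks


def block_number(position, block_len):
    """Block number of an SA position under fixed-length blocking."""
    return position // block_len


block_len, blocks = uniform_blocks(7, 3)
print(block_len, blocks)           # 3 [(0, 2), (3, 5), (6, 6)]
print(block_number(4, block_len))  # 1
```

Here the (head, tail) pairs play the role of the positioning information; with bucket-based blocking the block lengths vary, and a lookup over the recorded block heads would be needed instead of a single division.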

Meanwhile, through preprocessing, the value of each L-type character in the character string can be set to the start position of the corresponding character bucket in the suffix array, and the value of each S-type character in the character string can be set to the end position of the corresponding character bucket in the suffix array.
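The preprocessing of character values can be sketched as follows: each character is replaced by the head of its bucket if it is L-type and by the tail of its bucket if it is S-type. This is an illustrative sketch only; the function name preprocess_values and the use of a separate output list (rather than rewriting the string in place) are assumptions.

```python
from collections import Counter


def preprocess_values(x):
    """Replace each character by its bucket head (L-type) or bucket tail (S-type)."""
    n = len(x)
    # Classify characters right to left; the last character is S-type.
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])
    # Compute the head and tail position of every character bucket in the SA.
    counts = Counter(x)
    head, tail, start = {}, {}, 0
    for c in sorted(counts):
        head[c] = start
        tail[c] = start + counts[c] - 1
        start += counts[c]
    return [tail[c] if s else head[c] for c, s in zip(x, is_s)]


print(preprocess_values("banana$"))  # [4, 3, 5, 3, 5, 1, 0]
```

With values of this form, a single division by the block length is enough to locate the block of a character's bucket head or tail.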

When the block counter of each data block is initialized according to the types of the plurality of substrings or suffixes, if the substrings are L-type substrings or the suffixes are L-type suffixes, the block counter may point to the block head position of each data block; if the substrings are S-type substrings or LMS substrings, or the suffixes are S-type suffixes or LMS suffixes, the block counter may point to the block tail position of each data block.

During subsequent processing, the block counters of the data blocks may be reset when the type of the currently executed inductive sorting task is determined. For example, if the currently executed inductive sorting task is inductive sorting of L-type substrings or L-type suffixes, the block counter may point to the block head position of each data block; if the currently executed inductive sorting task is inductive sorting of S-type substrings or S-type suffixes, the block counter may point to the block tail position of each data block.
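The reset rule for block counters can be sketched in a few lines. The sketch assumes each block is described by a hypothetical (head, tail) pair of SA positions and that the task is identified by the string "L" or "S"; neither representation comes from the application itself.

```python
def reset_block_counters(blocks, task):
    """Return one counter per block: block heads for an L task, block tails for an S task."""
    if task == "L":
        return [head for head, _tail in blocks]
    if task == "S":
        return [tail for _head, tail in blocks]
    raise ValueError("task must be 'L' or 'S'")


blocks = [(0, 2), (3, 5), (6, 6)]
print(reset_block_counters(blocks, "L"))  # [0, 3, 6]
print(reset_block_counters(blocks, "S"))  # [2, 5, 6]
```

Only one counter per block is kept, which is the O(number of blocks) state referred to in the advantages above.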

S102, scanning each data block according to a preset sequence;

In this embodiment, the preset order depends on the type of the currently executed inductive sorting task and may be from left to right or from right to left; this embodiment does not limit this.

In addition, the manner of scanning each data block of the suffix array may be varied. For example, the blocks of data may be scanned serially, block by block, pipelined in parallel, etc. This embodiment is not limited thereto.

S103, sequencing each substring or each suffix in any scanned current data block according to a preset stable sequencing mode;

for the current data block, in case of just storing multiple substrings or multiple suffixes in S101, or further in case of having written a preceding substring or a preceding suffix, it may be processed, i.e. all its substrings or all suffixes are sorted to their final positions in a preset stable sorting manner.

It should be noted that the way of processing the data block is affected by the way of scanning each data block of the suffix array, and after all the substrings or all the suffixes in the current data block are stably ordered in a single-thread serial manner, each substring or each suffix in the current data block is scanned and processed according to a preset sequence; or, the pipeline scans each data block of the suffix array in parallel, scans and processes the substrings or suffixes in the previous data block in a preset order while stably ordering all the substrings or suffixes in the current data block in multiple threads, and the like.

Take bucket sorting as an example. The substrings or suffixes in the current data block can be scanned one by one; then, according to the starting character of the substring and the bucket counter of the current data block, the substring is sorted to the position indicated by the bucket counter corresponding to the starting character; or, according to the starting character of the suffix and the bucket counter of the current data block, the suffix is sorted to the position indicated by the bucket counter corresponding to the starting character.

Each time a substring or suffix is sorted, the position indicated by the bucket counter corresponding to its starting character in the current data block is updated.

For example, if the substring is an L-type substring or the suffix is an L-type suffix, the position indicated by the bucket counter may be moved one position to the right; if the substring is an S-type substring or the suffix is an S-type suffix, the position indicated by the bucket counter may be moved one position to the left.
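The bucket-sorting step described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the function name `bucket_sort_block`, the representation of a data block as a list of suffix start positions, and the right-to-left scan used in the S-type case (chosen here so that the result stays stable) are assumptions made for the example.

```python
from collections import Counter

def bucket_sort_block(block, text, l_type=True):
    """Stably sort the suffix start positions in one block by first character.

    block  : list of suffix start positions held by the block (assumed layout)
    text   : the input string
    l_type : True  -> counters start at bucket heads and move right
             False -> counters start at bucket tails and move left
    """
    counts = Counter(text[p] for p in block)

    # Per-block bucket counters: one position per distinct first character.
    bkt, head = {}, 0
    for ch in sorted(counts):
        tail = head + counts[ch] - 1
        bkt[ch] = head if l_type else tail
        head = tail + 1

    out = [None] * len(block)
    # Assumption: the S-type case scans right to left to preserve stability.
    scan = block if l_type else reversed(block)
    for p in scan:
        ch = text[p]
        out[bkt[ch]] = p                 # place at the position the counter indicates
        bkt[ch] += 1 if l_type else -1   # move the counter one position
    return out
```

For instance, with `text = "banana"` and `block = [4, 1, 0, 3, 2, 5]`, both variants place the entries starting with `a` first, then `b`, then `n`, keeping the original relative order inside each bucket.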

S104, scanning each substring or each suffix in the current data block according to a preset sequence;

In this embodiment, the preset order depends on the type of the currently executed induction sorting task, and may be from left to right or from right to left, which is not limited in this embodiment.

S105, aiming at any scanned target substring or target suffix, determining a target data block to which a specific type of preceding substring of the target substring or a specific type of preceding suffix of the target suffix belongs;

A target substring or target suffix refers to the substring or suffix scanned in the current round.

For each scanned target substring, the data block in the storage unit to which its specific type of preceding substring belongs is located, and the preceding substring is written into the located data block; likewise, for each scanned target suffix, the data block in the storage unit to which its specific type of preceding suffix belongs is located, and the preceding suffix is written into the located data block.

In this embodiment, for any scanned target substring, the starting character of the specific type of preceding substring of the target substring may be determined first. Likewise, for any scanned target suffix, the starting character of the specific type of preceding suffix of the target suffix needs to be determined first.

If the currently executed induction sorting task is to induction-sort L-type substrings or L-type suffixes, only the L-type preceding substring of the target substring or the L-type preceding suffix of the target suffix is processed; if the currently executed induction sorting task is to induction-sort S-type substrings or S-type suffixes, only the S-type preceding substring of the target substring or the S-type preceding suffix of the target suffix is processed.

Then, the block number of the target data block is calculated according to the initial character and the positioning information of each data block.

It should be noted that, according to the different blocking manners, the collected positioning information of each data block may also be different. For example, in the case of uniform blocking, the location information of each data block includes boundary character information, and in the case of bucket blocking, the location information does not include the boundary character information.

In this embodiment, calculating the block number of the target data block may be implemented according to the starting character of the preceding substring or the preceding suffix, the block length of the data block, and the number of boundary characters of the data block.

For uniform chunking, the block length of the data block is fixed; for other blocking methods, the block length of each data block obtained after blocking may not be fixed.

Taking uniform blocking as an example, the block number of the target data block may be obtained by rounding down the ratio of the numerical value corresponding to the starting character to the fixed length of the data block.

In a concrete implementation, for a specific type of preceding substring of a scanned target substring, once the starting character c of the preceding substring and the fixed length m of the data block are determined, the block number of the target data block can be calculated as

$B = \lfloor c / m \rfloor$

where the symbol $\lfloor \cdot \rfloor$ denotes rounding down. Similarly, for a specific type of preceding suffix of a scanned target suffix, the block number of the target data block to which it belongs may also be calculated using the above formula.

In general, a special process is required for calculating a block number of a target data block for a preceding substring or a preceding suffix whose starting character is a boundary character.

Therefore, after the block number of the target data block to which a certain previous sub-string or previous suffix belongs is calculated according to the above formula, it can be determined whether the starting character of the previous sub-string of the specific type of the target sub-string or the starting character of the previous suffix of the specific type of the target suffix is a boundary character.

If the starting character is a boundary character, the number of boundary characters of the target data block is read, and the new target data block to which the specific type of preceding substring of the currently scanned target substring, or the specific type of preceding suffix of the currently scanned target suffix, belongs is then relocated according to the starting character, the fixed length of the data block and the number of boundary characters of the target data block.

That is, for a previous sub-string or a previous suffix whose starting character is a boundary character, the block number of a new target data block to which the previous sub-string or the previous suffix belongs needs to be newly determined according to the starting character, the block length of the data block, and the number of boundary characters of the data block.

In a specific implementation, if the starting character is an L-type character, the sum of the numerical value corresponding to the starting character and the number of boundary characters of the target data block is taken as the dividend and the fixed length of the data block as the divisor, and the quotient, rounded down, is used as the block number of the relocated new target data block.

If the starting character is an S-type character, the difference between the numerical value corresponding to the starting character and the number of boundary characters of the target data block is taken as the dividend and the fixed length of the data block as the divisor, and the quotient, rounded down, is used as the block number of the relocated new target data block.

For example, if the starting character c is a boundary character, and the block length m and the number $num_B$ of boundary characters of the target data block B have been determined, then if the starting character c is an L-type character, the block number of the relocated new target data block is

$B' = \lfloor (c + num_B) / m \rfloor$

Otherwise, if the starting character c is an S-type character, then

$B' = \lfloor (c - num_B) / m \rfloor$

After the above relocation is completed, the number of boundary characters $num_B$ can be updated, i.e. the number of boundary characters $num_B$ of the target data block before relocation is incremented by one.

Finally, B is set to B', i.e. the block number of the target data block is updated to be consistent with the block number of the new target data block.
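The block-number calculation for uniform blocking, including the relocation for boundary characters, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name `target_block`, the list `num` holding the per-block boundary-character counts, and the caller-supplied predicate `is_boundary` are hypothetical names introduced for the example; `c` stands for the numerical value corresponding to the starting character.

```python
def target_block(c, m, num, is_boundary, l_type):
    """Return the block number for a preceding substring/suffix.

    c           : numerical value corresponding to the starting character
    m           : fixed block length (uniform blocking)
    num         : list, num[B] = number of boundary characters of block B (mutated)
    is_boundary : predicate telling whether c is a boundary character (assumed given)
    l_type      : True if the starting character is L-type, False if S-type
    """
    b = c // m                           # B = floor(c / m)
    if is_boundary(c):
        n_b = num[b]
        if l_type:
            b_new = (c + n_b) // m       # B' = floor((c + num_B) / m)
        else:
            b_new = (c - n_b) // m       # B' = floor((c - num_B) / m)
        num[b] += 1                      # num_B of the block before relocation
        b = b_new                        # B = B'
    return b
```

With `m = 4` and a boundary character of value 3, the first L-type call stays in block 0 while the second, after `num[0]` has been incremented, moves to block 1.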

S106, writing the preceding substring or the preceding suffix into the target data block to which it belongs.

After the target data block number to which each preceding substring or each preceding suffix belongs is calculated according to the foregoing steps, it may be written into the determined data block.

In a specific implementation, the starting position of the preceding substring or preceding suffix in the character string may be determined first, and it is then determined whether the target data block is the current data block. If so, according to the starting character of the preceding substring or preceding suffix and the bucket counter of the current data block, the numerical value corresponding to that starting position is written into the position indicated by the bucket counter corresponding to the starting character, and the position indicated by that bucket counter is updated. Otherwise, the numerical value corresponding to the starting position is written into the position indicated by the block counter of the target data block, and the position indicated by that block counter is updated.

For example, if the target data block is the current data block and the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, the position indicated by the bucket counter corresponding to the starting character of the preceding substring or preceding suffix may be moved one position to the right; if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, the position indicated by the bucket counter may be moved one position to the left.

If the target data block is not the current data block and the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, the position indicated by the block counter of the target data block may be moved one position to the right; if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, the position indicated by the block counter may be moved one position to the left.
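The write step and the accompanying counter update can be sketched as follows. This is a minimal Python illustration, not the embodiment's implementation: the function name `write_entry`, the dictionary-of-dictionaries layout of the bucket counters `bkt`, and the list layout of the block counters `ctr` are assumptions made for the example.

```python
def write_entry(sa, pos, c, target, current, bkt, ctr, l_type):
    """Write the start position `pos` of a preceding substring/suffix into SA.

    sa      : suffix array (list), partitioned into blocks
    c       : starting character of the preceding substring/suffix
    target  : block number of the target data block
    current : block number of the current data block
    bkt     : bkt[b][ch] -> SA position indicated by the bucket counter (assumed layout)
    ctr     : ctr[b]     -> SA position indicated by the block counter (assumed layout)
    l_type  : True for an L-type entry (move right), False for S-type (move left)
    """
    step = 1 if l_type else -1
    if target == current:
        # Same block: place directly via the bucket counter of the current block.
        sa[bkt[current][c]] = pos
        bkt[current][c] += step
    else:
        # Other block: append via that block's block counter; it is sorted later.
        sa[ctr[target]] = pos
        ctr[target] += step
```

Entries written through the block counter only reach their final position once their block becomes the current data block and is stably sorted.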

The currently executed induction sorting task is complete once all the data blocks in the storage unit have gone through the above steps.

In this embodiment of the application, for any character string, a plurality of substrings or suffixes of the character string are determined and stored to a plurality of preset data blocks, and the data blocks are then scanned in a preset order. For any scanned current data block, each substring or suffix in the current data block is sorted in a preset stable sorting manner, and the substrings or suffixes in the current data block are then scanned in the preset order. For any scanned target substring or target suffix, the target data block to which the specific type of preceding substring of the target substring, or the specific type of preceding suffix of the target suffix, belongs is determined, and the preceding substring or preceding suffix is written into that target data block. In this embodiment, the preceding substrings or preceding suffixes to be induction-sorted are first located to the target data blocks to which they belong in the storage unit, and are then stably sorted to their final positions within the target data block according to their starting characters, instead of being induction-sorted directly into character buckets according to their starting characters as in the IS method. The bucket counters, which are of magnitude O(n) and require random access during induction sorting, are thereby compressed into block counters of magnitude O(number of blocks), which improves the data locality of the method and system for constructing the suffix index.
In practical applications, on a modern general-purpose computer equipped with a Cache memory, the method and the system provided by this embodiment can improve the Cache hit rate, so that the time efficiency of constructing the suffix index is higher and the resource utilization is better.

For ease of understanding, the following description will be made of a full example of the block inductive sorting method of the text suffix index of the present application.

FIG. 2 is a schematic diagram of a multi-core computer according to an embodiment of the present application. In the multi-core computer shown in fig. 2, the Cache hit rate can be improved by executing the data processing flow of the block induction sorting method of the text suffix index shown in fig. 3, so that the time efficiency of constructing the suffix index is higher and the resource utilization rate is better.

As shown in fig. 2, the multi-core computer may include a front-end unit, an analysis unit and a storage unit. The functions of the units are as follows.

A front-end unit: reading an input character string X from a storage unit, allocating space in the storage unit for a suffix array SA thereof, preprocessing the X and the SA according to actual conditions, completing an initialization task, collecting block positioning information according to a blocking mode, writing the block positioning information into the storage unit, and storing initial substrings or suffixes of a currently executed induction and sequencing task to a plurality of data blocks of the SA according to the block positioning information.

An analysis unit: reading the preprocessed X and SA, the block positioning information and other auxiliary information from the storage unit, where the SA stores the initial substrings or suffixes of the currently executed induction sorting task; on this basis, completing the currently executed induction sorting task by a specific method with good data locality, and writing the result of the task into the SA of the storage unit.

A storage unit: the system is used for storing various data to be read and written in the process of finishing the currently executed inductive sorting task, including the character string X of the input system, the suffix array SA of the output result of the currently executed inductive sorting task, a block counter, block positioning information and the like.

The front-end unit may include an X and SA preprocessing module, an initialization module, a block positioning information collection module and an initial substring or suffix storage module. The analysis unit may include a decision module, an intra-block stable sorting module and a preceding substring or suffix block positioning and writing module. Each module performs a different function in the block induction sorting process.

The specific flow of the inductive sorting of the blocks is described below with reference to fig. 3. According to the flow shown in fig. 3, the whole inductive sorting process may include the following steps:

S300, start.

At the beginning, a character string X is input on a computer with a multi-core processor, and the currently executed induction sorting task is completed on the basis of X. Throughout the process, the memory space of the computer serves as the storage unit: the input X, the output SA, and the various temporary data required in the process are all read from and written to the storage unit when the units and modules of the system process their tasks. Since the data in the storage unit have different life cycles, various engineering treatments, such as space reuse, are possible when implementing the system; they are not specifically illustrated or limited herein.

S301, preprocessing the X and SA according to the conditions of the X and the SA, and finishing an initialization task.

In this step, SA is a storage unit for storing a suffix array of the string X. Initially, SA is empty. This step requires pre-processing tasks to be done depending on the X and SA conditions, including but not limited to blocking the unblocked SA using a specific method. Wherein the specific method represents different preprocessing manners, such as chunking by fixed length, chunking by character bucket, and the like. Also, initialization tasks include, but are not limited to, initializing block counters, bucket counters, etc., for indicating where substrings and suffixes should be written within a block.

As an example, the preprocessing and initializing of X and SA may specifically include:

(1) If the value of an L-type character in X is not the start position of its corresponding character bucket in SA, or the value of an S-type character is not the end position of its corresponding character bucket in SA, X is renamed so that this rule is satisfied.

(2) If the SA is not yet partitioned by fixed length, the SA is partitioned into a plurality of blocks of the same size according to a value m set by the user or by system default, yielding $\lceil n / m \rceil$ blocks, where n denotes the length of SA.

(3) Initialize the bucket counter bkt and the block counter ctr of each block. Before the step of S303 is executed, if the initial substrings of the currently executed induction sorting task are L-type substrings or the initial suffixes are L-type suffixes, the block counter is initialized to the head position of each block; if the substrings are S-type substrings or LMS substrings or the suffixes are S-type suffixes or LMS suffixes, the block counter is initialized to the tail position of each block. After the step of S303 is completed and before the step of S304 is executed, if the currently executed induction sorting task is to induction-sort the L-type substrings or L-type suffixes of X, the block counter is reset to the head position of each block; otherwise, it is reset to the tail position of each block.
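Fixed-length blocking and the initialization of the block counters can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name `init_blocks` is hypothetical, SA is assumed to have n entries, and the last block is assumed to be shorter when m does not divide n.

```python
def init_blocks(n, m, at_head):
    """Partition an SA of n entries into blocks of length m and set block counters.

    at_head : True  -> counters point to the head position of each block
              False -> counters point to the tail position of each block
    Returns (number of blocks, list of block counters as SA positions).
    """
    num_blocks = (n + m - 1) // m                  # ceil(n / m) with integers
    heads = [b * m for b in range(num_blocks)]
    # Assumption: the final block is truncated at position n - 1.
    tails = [min((b + 1) * m, n) - 1 for b in range(num_blocks)]
    ctr = heads if at_head else tails
    return num_blocks, ctr
```

For `n = 10` and `m = 4` this gives three blocks, with head positions 0, 4, 8 and tail positions 3, 7, 9.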

S302, collecting block positioning information.

In this step, block location information may be collected according to a blocking manner, and used to calculate a block number of a block to which a substring or a suffix belongs within the SA.

(1) And counting the occurrence times of each character in the X, and calculating and storing the boundary character information of each block by using a prefix sum method.

(2) Let the initial number of boundary characters num of each block be 0.
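The counting and prefix-sum step can be sketched as follows. This is a minimal Python illustration with loud assumptions: the function names `bucket_bounds` and `boundary_chars` are hypothetical, and a boundary character is interpreted here as a character whose bucket spans more than one fixed-length block, which the text does not define explicitly.

```python
from collections import Counter
from itertools import accumulate

def bucket_bounds(x):
    """Map each character of x to the (start, end) positions of its bucket in SA."""
    counts = Counter(x)                                   # occurrences of each character
    chars = sorted(counts)
    ends = list(accumulate(counts[ch] for ch in chars))   # prefix sums (exclusive ends)
    return {ch: (end - counts[ch], end - 1) for ch, end in zip(chars, ends)}

def boundary_chars(bounds, m):
    """Assumed definition: characters whose bucket crosses a block boundary."""
    return {ch for ch, (start, end) in bounds.items() if start // m != end // m}

def init_num(num_blocks):
    """Initial number of boundary characters num of each block is 0."""
    return [0] * num_blocks
```

For `"banana"` the buckets are `a: (0, 2)`, `b: (3, 3)`, `n: (4, 5)`; with `m = 2` only the bucket of `a` crosses a block boundary under this interpretation.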

S303, storing the initial substrings or suffixes of the currently executed induction sorting task to the SA.

In this step, the currently executed induction sorting task refers to one of the following when the IS method is used to construct the suffix index: ① scanning SA once from left to right to induction-sort all L-type substrings or L-type suffixes of X, where the initial substrings or suffixes are generally LMS substrings or LMS suffixes, together with any other substrings or suffixes from which L-type substrings or L-type suffixes can be induced; or ② scanning SA once from right to left to induction-sort all S-type substrings or S-type suffixes of X, where the initial substrings or suffixes are generally L-type substrings or L-type suffixes, together with any other substrings or suffixes from which S-type substrings or S-type suffixes can be induced.
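The L-type, S-type and LMS notions used by the two tasks follow the usual induced-sorting definitions, which can be sketched as follows. This Python sketch is illustrative only: the function name `classify` is hypothetical, and the input is assumed to end with a unique sentinel character smaller than all others, a convention the text does not state.

```python
def classify(x):
    """Return (types, lms) for string x, assumed to end with a unique smallest sentinel.

    types[i] is 'S' if suffix i is smaller than suffix i + 1, otherwise 'L'.
    lms lists the positions i where types[i] == 'S' and types[i - 1] == 'L'.
    """
    n = len(x)
    types = ['S'] * n                      # the sentinel suffix is S-type
    for i in range(n - 2, -1, -1):         # one right-to-left scan
        if x[i] > x[i + 1] or (x[i] == x[i + 1] and types[i + 1] == 'L'):
            types[i] = 'L'
    lms = [i for i in range(1, n) if types[i] == 'S' and types[i - 1] == 'L']
    return types, lms
```

For `"banana$"` the types are `L S L S L L S` and the LMS positions are 1, 3 and 6, which would be the initial entries of a left-to-right pass that induces the L-type suffixes.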

The specific step of locating the block to which the initial substrings or suffixes belong in the SA is identical to the corresponding step in S306.

S304, determining whether all data blocks of the SA have been traversed in the preset manner and the preset order.

In this step, the preset manner means that the data blocks of the SA can be scanned and processed in several ways. For example, the two steps S305 and S306 can be executed serially on the current data block in a single thread, looping over all the data blocks; alternatively, the data blocks of the SA can be scanned in a parallel pipeline, with multiple threads executing S306 on the previous data block while S305 is executed on the current data block, and so on. The preset order means that, depending on the type of the currently executed induction sorting task, the data blocks of the SA may be scanned from left to right or from right to left. The step shown in fig. 3 adopts the first of the above preset manners.

If all the data blocks have been traversed, jump to S307; otherwise, jump to S305.
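The serial, single-threaded control flow of S304 to S307 can be sketched as follows. This is a skeleton in Python under stated assumptions: the function name `run_pass` and the callbacks `sort_block` and `induce_block`, which stand in for S305 and S306, are hypothetical; the pipelined multi-threaded variant is not shown.

```python
def run_pass(num_blocks, sort_block, induce_block, left_to_right=True):
    """Traverse all data blocks once in the preset order (serial variant).

    sort_block(b)   : stands in for S305, stable sort inside block b
    induce_block(b) : stands in for S306, scan block b and write preceding entries
    Returns the list of block numbers in the order they were visited.
    """
    if left_to_right:
        order = range(num_blocks)
    else:
        order = range(num_blocks - 1, -1, -1)

    visited = []
    for b in order:            # S304: continue while blocks remain
        sort_block(b)          # S305
        induce_block(b)        # S306
        visited.append(b)
    return visited             # S307: all blocks traversed
```

Each block is fully sorted before it is scanned, which matches the first preset manner described above.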

S305, sorting all the existing substrings or suffixes in the current data block to the final position.

In this step, a specific method is used to stably sort all the existing substrings or suffixes in the current data block of SA to their final positions. The specific method may be a stable sorting method such as bucket sorting or radix sorting.

Take bucket sorting as an example. This step may include:

(1) Scan the current block of SA, say block b, from left to right; for each scanned substring or suffix, use its starting character c and the bucket counter $bkt_b$ of the block to bucket-sort the substring or suffix to the position in SA indicated by $bkt_b[c]$.

(2) Update $bkt_b[c]$: if the substring or suffix is L-type, move the position indicated by $bkt_b[c]$ one position to the right; otherwise, move it one position to the left.

And after all the substrings or all the suffixes in the current block are stably sequenced, jumping to S306.

S306, scanning all substrings or all suffixes in the current block of SA according to a preset order; for each scanned substring or suffix, locating the block B in SA to which its specific type of preceding substring or preceding suffix belongs, and writing the preceding substring or preceding suffix into block B.

In this step, similarly to S304, the preset order means that, depending on the type of the currently executed induction sorting task, all substrings or all suffixes in the current block of SA may be scanned from left to right or from right to left.

In this step, the specific type means that if the currently executed induction sorting task is to induction-sort L-type substrings or L-type suffixes, only the L-type preceding substring or L-type preceding suffix of the scanned substring or suffix is processed; if the currently executed induction sorting task is to induction-sort S-type substrings or S-type suffixes, only the S-type preceding substring or S-type preceding suffix of the scanned substring or suffix is processed.
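The type filter on the preceding suffix can be sketched as follows. This is a minimal Python illustration: the function name `preceding_of_type` is hypothetical, the preceding suffix of the suffix starting at position p is taken to be the one starting at p - 1, and the list `types` of per-position 'L'/'S' labels is assumed to have been computed beforehand.

```python
def preceding_of_type(p, types, l_pass):
    """Return the start position of the preceding suffix if it has the wanted type.

    p      : start position in X of the scanned suffix
    types  : types[i] is 'L' or 'S' for the suffix starting at i (assumed precomputed)
    l_pass : True when the current task induces L-type entries, False for S-type
    Returns p - 1 when the preceding suffix is to be processed, otherwise None.
    """
    if p == 0:
        return None                         # no preceding suffix exists
    wanted = 'L' if l_pass else 'S'
    return p - 1 if types[p - 1] == wanted else None
```

With the labels `L S L S L L S`, scanning the suffix at position 1 during an L-type pass yields position 0, while scanning the suffix at position 2 yields nothing because its preceding suffix is S-type.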

In this step, the block B in SA to which the specific type of preceding substring or preceding suffix of the currently scanned substring or suffix belongs can be calculated by combining the blocking manner with the block positioning information collected in S302.

According to the type of the currently executed inductive sorting task, the SA current block may be sequentially scanned from left to right or from right to left, and for each scanned substring or suffix, the following three tasks are sequentially completed:

(1) Calculate B directly from the starting character c of the specific type of preceding substring or preceding suffix and the block length m. In this step, B is calculated as

$B = \lfloor c / m \rfloor$

(2) And checking whether the initial character c of the previous sub-string or the previous suffix is a boundary character, if so, relocating the block B of the previous sub-string or the previous suffix in the SA. This task includes:

(2.1) Using the starting character c, the block length m and the number $num_B$ of boundary characters of the block B to which the preceding substring or preceding suffix belongs, recalculate the block number B' of the block in SA to which the preceding substring or preceding suffix belongs. The calculation is as follows: if c is an L-type character, then $B' = \lfloor (c + num_B) / m \rfloor$; otherwise $B' = \lfloor (c - num_B) / m \rfloor$.

(2.2) Update $num_B$, the number of boundary characters of block B, by incrementing it by one.

(2.3) update the block number of the block B to which the previous substring or previous suffix belongs in the SA to be identical to the new block number B' after relocation.

(3) And writing the previous substrings or the previous suffixes to be subjected to induction sorting into the block B. This task includes:

(3.1) Determine whether block B is the current data block. If so, write the starting position in X of the preceding substring or preceding suffix to the position in SA indicated by $bkt_B[c]$; otherwise, write it to the position in SA indicated by $ctr_B$.

(3.2) Update the counter. If block B is the current data block, update $bkt_B[c]$; otherwise, update $ctr_B$. The update rule is as follows: if the preceding substring or preceding suffix is L-type, move the position indicated by the counter to be updated one position to the right; otherwise, move it one position to the left.

After all the substrings or all the suffixes of the current block are scanned and processed, the process goes to S304.

S307, end.

In this embodiment, the preceding substrings or preceding suffixes to be induction-sorted are first located to the blocks in SA to which they belong, and are then stably sorted to their final positions within those blocks according to their starting characters, instead of being sorted directly into character buckets according to their starting characters as in the IS method. The bucket counters, which are of magnitude O(n) and require random access during induction sorting, can thereby be compressed into block counters of magnitude O(number of blocks). This improves the data locality of the method and system for constructing the suffix index and raises the Cache hit rate, so that the time efficiency of constructing the suffix index is higher and the resource utilization is better.

It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Referring to fig. 4, a schematic diagram of a block induction sorting system for a text suffix index according to an embodiment of the present application is shown, which may specifically include the following modules:

a substring or suffix storage module 401, configured to determine, for any character string, multiple substrings or multiple suffixes of the character string, and store the multiple substrings or the multiple suffixes to multiple preset data blocks;

a data block scanning module 402, configured to scan each data block according to a preset sequence;

a substring or suffix sorting module 403, configured to, for any scanned current data block, sort each substring or each suffix in the current data block according to a preset stable sorting manner;

a substring or suffix scanning module 404, configured to scan each substring or each suffix in the current data block according to a preset sequence;

a target data block determination module 405, configured to determine, for any scanned target substring or target suffix, a target data block to which a specific type of preceding substring of the target substring or a specific type of preceding suffix of the target suffix belongs;

a previous substring or previous suffix writing module 406, configured to write the previous substring or the previous suffix into the target data block to which the previous substring or the previous suffix belongs;

In this embodiment, the substring or suffix storage module 401 may specifically include the following sub-modules:

the storage unit distribution submodule is used for distributing a storage unit for storing a suffix array of the character string for the character string;

the data block dividing submodule is used for dividing the storage unit into a plurality of data blocks according to a preset blocking mode and collecting positioning information of each data block;

and the substring or suffix storage submodule is used for storing the plurality of substrings or the plurality of suffixes to the plurality of data blocks according to the positioning information of the data blocks.

In an embodiment of the present application, the character string includes L-type characters and S-type characters, each L-type character corresponds to an L-type substring and an L-type suffix starting from the L-type character, and each S-type character corresponds to an S-type substring and an S-type suffix starting from the S-type character, and the system may further include the following modules:

the preprocessing module is used for making the value of each L-type character in the character string the start position of its corresponding character bucket in the suffix array, and making the value of each S-type character in the character string the end position of its corresponding character bucket in the suffix array; initializing the block counter of each data block according to the types of the plurality of substrings or the plurality of suffixes; if the substrings are L-type substrings or the suffixes are L-type suffixes, pointing the block counter to the head position of each data block; if the substrings are S-type substrings or LMS substrings or the suffixes are S-type suffixes or LMS suffixes, pointing the block counter to the tail position of each data block; resetting the block counter of each data block when the type of the currently executed induction sorting task is determined; if the currently executed induction sorting task is to induction-sort L-type substrings or L-type suffixes, pointing the block counter to the head position of each data block; and if the currently executed induction sorting task is to induction-sort S-type substrings or S-type suffixes, pointing the block counter to the tail position of each data block.

In this embodiment of the present application, the substring or suffix sorting module 403 may specifically include the following sub-modules:

a substring or suffix scanning submodule for scanning each substring or each suffix in the current data block one by one;

the sorting submodule is used for sorting the substring to the position indicated by the bucket counter corresponding to the starting character according to the starting character of the substring and the bucket counter of the current data block; or, according to the starting character of the suffix and the bucket counter of the current data block, sorting the suffix to the position indicated by the bucket counter corresponding to the starting character;

a bucket counter updating submodule, configured to move the position indicated by the bucket counter corresponding to the starting character one position to the right if the substring is an L-type substring or the suffix is an L-type suffix; and to move the position indicated by the bucket counter corresponding to the starting character one position to the left if the substring is an S-type substring or the suffix is an S-type suffix.

In this embodiment of the present application, the target data block determining module 405 may specifically include the following sub-modules:

the starting character determining submodule is used for determining, for any scanned target substring, the starting character of the specific type of preceding substring of the target substring, or determining, for any scanned target suffix, the starting character of the specific type of preceding suffix of the target suffix; if the currently executed induction sorting task is to induction-sort L-type substrings or L-type suffixes, only the L-type preceding substring of the target substring or the L-type preceding suffix of the target suffix is processed; if the currently executed induction sorting task is to induction-sort S-type substrings or S-type suffixes, only the S-type preceding substring of the target substring or the S-type preceding suffix of the target suffix is processed;

and the target data block calculation submodule is used for calculating the block number of the target data block according to the initial character and the positioning information of each data block.

In this embodiment of the application, the target data block determining module 405 may further include sub-modules:

a boundary character judgment sub-module, configured to judge whether a starting character of a specific type of preceding sub-string of the target sub-string or a starting character of a specific type of preceding suffix of the target suffix is a boundary character;

a boundary character number reading submodule, configured to read the number of boundary characters of the target data block if the starting character is a boundary character;

a target data block relocation submodule, configured to relocate, according to the starting character, the fixed length of the data block, and the number of boundary characters of the target data block, the new target data block to which the specific type of preceding substring of the currently scanned target substring, or the specific type of preceding suffix of the currently scanned target suffix, belongs, and to update the block number of the target data block to be consistent with the block number of the new target data block.

In this embodiment of the present application, the target data block relocation submodule may specifically include the following units:

a first relocation unit, configured to, if the starting character is an L-type character, calculate the sum of the numerical value corresponding to the starting character and the number of boundary characters of the target data block, divide that sum (the dividend) by the fixed length of the data block (the divisor), and take the quotient rounded down as the block number of the relocated new target data block;

a second relocation unit, configured to, if the starting character is an S-type character, calculate the difference between the numerical value corresponding to the starting character and the number of boundary characters of the target data block, divide that difference (the dividend) by the fixed length of the data block (the divisor), and take the quotient rounded down as the block number of the relocated new target data block;

a boundary character number updating unit, configured to add one to the number of boundary characters of the target data block before relocation;

and a target data block number updating unit, configured to update the block number of the target data block to be consistent with the block number of the new target data block.
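The arithmetic performed by the relocation units can be written out as a short Python sketch. The function name `relocate_block` and its parameters are hypothetical, and the "numerical value corresponding to the starting character" is taken as an integer supplied by the caller. The order of operations (read the boundary character count, compute the new block number, then increment the count of the block before relocation) is an assumption based on the order in which the units are listed.

```python
from typing import Dict


def relocate_block(char_value: int, is_l_type: bool, old_block: int,
                   boundary_counts: Dict[int, int], block_len: int) -> int:
    """Return the block number of the relocated target data block."""
    count = boundary_counts[old_block]
    # L-type: add the boundary character count; S-type: subtract it.
    offset = count if is_l_type else -count
    # Floor division rounds the quotient down, as the relocation units require.
    new_block = (char_value + offset) // block_len
    # Add one to the boundary character count of the block before relocation.
    boundary_counts[old_block] = count + 1
    return new_block


counts = {2: 3}
l_block = relocate_block(10, True, 2, counts, 4)   # (10 + 3) // 4
s_block = relocate_block(10, False, 2, counts, 4)  # (10 - 4) // 4
```

The caller would then overwrite its stored target block number with the returned value, which corresponds to the target data block number updating unit.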

In this embodiment of the present application, the preceding substring or preceding suffix writing module 406 may specifically include the following sub-modules:

a starting position determining submodule for determining a starting position of the preceding sub-string or the preceding suffix in the character string;

a current data block judgment submodule, configured to judge whether the target data block is the current data block;

a writing submodule, configured to, if the determination result of the current data block judgment submodule is yes, write the numerical value corresponding to the starting position of the preceding substring or the preceding suffix in the character string into the position indicated by the bucket counter corresponding to the starting character, according to the starting character of the preceding substring or the preceding suffix and the bucket counters of the current data block; if the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, the position indicated by the bucket counter corresponding to the starting character is moved one position to the right; if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, the position indicated by the bucket counter corresponding to the starting character is moved one position to the left;

if the determination result is no, the writing submodule writes the numerical value corresponding to the starting position of the preceding substring or the preceding suffix in the character string into the position indicated by the block counter of the target data block; if the preceding substring is an L-type substring or the preceding suffix is an L-type suffix, the position indicated by the block counter is moved one position to the right; if the preceding substring is an S-type substring or the preceding suffix is an S-type suffix, the position indicated by the block counter is moved one position to the left.
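The two branches of the writing submodule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Block` class, its fields, and `write_predecessor` are hypothetical names, and each block is modeled as a fixed-size list with one bucket counter per starting character plus a single block counter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Block:
    """Hypothetical data block: slots, per-character bucket counters, one block counter."""
    size: int
    bucket_counter: Dict[str, int]
    block_counter: int = 0
    slots: List[Optional[int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.slots = [None] * self.size


def write_predecessor(blocks: List[Block], current: int, target: int,
                      ch: str, is_l_type: bool, start_pos: int) -> None:
    """Write the start position of a preceding substring/suffix into its target block."""
    step = 1 if is_l_type else -1  # L-type moves right, S-type moves left
    blk = blocks[target]
    if target == current:
        # Target is the current block: use the bucket counter of the starting character.
        blk.slots[blk.bucket_counter[ch]] = start_pos
        blk.bucket_counter[ch] += step
    else:
        # Target is another block: use that block's block counter.
        blk.slots[blk.block_counter] = start_pos
        blk.block_counter += step


blocks = [Block(4, {"a": 0, "b": 2}), Block(4, {"c": 0})]
write_predecessor(blocks, current=0, target=0, ch="b", is_l_type=True, start_pos=9)
write_predecessor(blocks, current=0, target=1, ch="c", is_l_type=True, start_pos=6)
write_predecessor(blocks, current=0, target=1, ch="c", is_l_type=True, start_pos=8)
```

In this model, entries written into another block are only appended at its block counter, and they are put in final order later, when that block becomes the current block and is sorted.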

Because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the description of the method embodiment.

Referring to fig. 5, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 5, the terminal device 500 of the present embodiment includes: a processor 510, a memory 520, and a computer program 521 stored in the memory 520 and executable on the processor 510. The processor 510, when executing the computer program 521, implements the steps in the various embodiments of the above-described block induction sorting method for a text suffix index, such as steps S101 to S106 shown in fig. 1. Alternatively, the processor 510, when executing the computer program 521, implements the functions of each module/unit in each system embodiment described above, for example, the functions of the modules 401 to 406 shown in fig. 4.

Illustratively, the computer program 521 may be partitioned into one or more modules/units that are stored in the memory 520 and executed by the processor 510 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which may be used for describing the execution process of the computer program 521 in the terminal device 500. For example, the computer program 521 may be divided into a substring or suffix storage module, a data block scanning module, a substring or suffix sorting module, a substring or suffix scanning module, a target data block determination module, and a preceding substring or preceding suffix writing module, whose specific functions are as follows:

a substring or suffix storage module, configured to determine, for any character string, a plurality of substrings or a plurality of suffixes of the character string, and to store the plurality of substrings or the plurality of suffixes to a plurality of preset data blocks;

a data block scanning module, configured to scan each data block according to a preset sequence;

a substring or suffix sorting module, configured to sort, for any scanned current data block, each substring or each suffix in the current data block according to a preset stable sorting mode;

a substring or suffix scanning module, configured to scan each substring or each suffix in the current data block according to a preset sequence;

a target data block determination module, configured to determine, for any scanned target substring or target suffix, a target data block to which a specific type of preceding substring of the target substring or a specific type of preceding suffix of the target suffix belongs;

and a preceding substring or preceding suffix writing module, configured to write the preceding substring or the preceding suffix into the target data block to which it belongs.

The terminal device 500 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The terminal device 500 may include, but is not limited to, a processor 510 and a memory 520. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 500 and does not constitute a limitation on it; the terminal device 500 may include more or fewer components than those shown, may combine some components, or may use different components. For example, the terminal device 500 may further include an input-output device, a network access device, a bus, and the like.

The processor 510 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 520 may be an internal storage unit of the terminal device 500, such as a hard disk or a memory of the terminal device 500. The memory 520 may also be an external storage device of the terminal device 500, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 500. Further, the memory 520 may include both an internal storage unit and an external storage device of the terminal device 500. The memory 520 is used for storing the computer program 521 and other programs and data required by the terminal device 500. The memory 520 may also be used to temporarily store data that has been output or is to be output.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
