Biological gene sequence summary data generation method and system

Document No.: 96767. Publication date: 2021-10-12.

Reading note: This technology, "Biological gene sequence summary data generation method and system" (一种生物基因序列的概要数据生成方法及系统), was designed by 刘卫国 (Liu Weiguo), 林浩然 (Lin Haoran), 徐晓明 (Xu Xiaoming) and 殷泽坤 (Yin Zekun) on 2021-05-20. Main content: The disclosure provides a method and a system for generating summary data of a biological gene sequence. A gene sequence to be processed is acquired; K-mer decomposition is performed on it with a sliding window; M K-mers are compared with the K-mers of the corresponding M reverse complementary strands, and for each pair of forward and reverse K-mers the one with the smaller character value is selected, yielding M K-mers, on which a vectorized transposition is then performed; the vectors obtained by the transposition are input into a hash function improved on the basis of single instruction multiple data (SIMD) to obtain the hash value corresponding to each vector; the window continues to slide to obtain new K-mers and the above operations are repeated until hash values have been computed for all K-mers of the sequence, and a hash value list of the sequence is constructed from all the hash values; summary data of the sequence is generated from the hash value list. The disclosure adopts a vectorized implementation, computes faster, and enables more efficient processing of biological gene sequences.

1. A method for generating summary data of a biological gene sequence, comprising the following steps:

obtaining a gene sequence to be processed;

performing K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at a time in sequence; obtaining the reverse complementary strand of the gene sequence; packing M K-mers and the corresponding K-mers of the reverse complementary strand into vectors; comparing the forward and reverse K-mers in a vectorized manner using a binary mask and keeping the K-mers with the smaller character values; and performing a vectorized transposition on the M K-mers that are kept;

inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

continuing to slide the window to obtain new K-mers and repeating the above operations until hash values have been computed for all K-mers of the gene sequence to be processed, and constructing a hash value list of the gene sequence from all the hash values;

and generating summary data of the gene sequence to be processed according to the hash value list.

2. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

the input transposed M vectors are divided again in units of an integer multiple of 32 bits or 64 bits, and every full N bits of each vector forms one group for vectorized bit operations, wherein N is an integer multiple of 32 or 64;

after all the full N-bit parts have been processed, the remaining tail part of fewer than N bits is subjected to a separate vectorized bit operation;

and the operation results are combined to obtain a result vector, and the result vector is parsed by corresponding instructions to obtain the hash values of the M sequences.

3. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

M pieces of K-mer data are loaded into M vectors, and a transposition operation is performed in a vectorized manner;

or,

for the data of the tail part, the switch statement is replaced by a corresponding mask;

or,

in the hash function calculation process, filling is performed using multi-way unrolled parallel operations.

4. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

the summary data is generated using the MinHash sketch algorithm.

5. The method for generating summary data of a biological gene sequence according to claim 4, wherein:

before the K-mer is input into the hash function, bit operations replace the switch statement in the comparison of the forward K-mer and the reverse K-mer;

or,

before the K-mer is input into the hash function, a K-mer comparison algorithm improved on the basis of SIMD is used, in which the comparison is performed with a binary mask and vectorized instructions;

or,

after the hash value list is generated, robin-hood-hashing replaces unordered_map in the implementation of the priority queue.

6. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

the summary data is generated using the HyperLogLog sketch algorithm.

7. The method for generating summary data of a biological gene sequence according to claim 6, wherein:

OpenMP multithreading is adopted in the sketch generation stage and the distance stage;

or,

for set union, SIMD is used in the inner loop to perform arithmetic and bit operations on vectors of multiple adjacent operands.

8. A system for generating summary data of a biological gene sequence, comprising:

a data acquisition module configured to: obtaining a gene sequence to be processed;

a K-mer decomposition module configured to: perform K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at a time in sequence; obtain the reverse complementary strand of the gene sequence; pack M K-mers and the corresponding K-mers of the reverse complementary strand into vectors; compare the forward and reverse K-mers in a vectorized manner using a binary mask and keep the K-mers with the smaller character values; and perform a vectorized transposition on the M K-mers that are kept;

a hash calculation module configured to: inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

a hash value list generation module configured to: continue to slide the window to obtain new K-mers and repeat the above operations until hash values have been computed for all K-mers of the gene sequence to be processed, and construct a hash value list of the gene sequence from all the hash values;

a summary data generation module configured to: and generating summary data of the gene sequence to be processed according to the hash value list.

9. A computer-readable storage medium on which a program is stored, the program realizing the steps in the method for generating summary data of a biological gene sequence according to any one of claims 1 to 7 when executed by a processor.

10. An electronic device comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for generating the summary data of a biological gene sequence according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of biological data processing technologies, and in particular, to a method and a system for generating summary data of a biological gene sequence.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the development of sequencing technology, biological gene databases keep growing. Public gene repositories initially held fewer than fifty million nucleotide sequences in total, whereas a single sequencing instrument can now generate more than one trillion sequences in one run; the data scale is increasing rapidly, and the data generation capacity of new sequencing technologies outpaces Moore's law. To process gene data efficiently, tools such as Mash and Dashing have been developed in succession. In such tools, processing gene data is reduced to processing a series of subsequence fragments (K-mers, i.e., substrings of length K) of a gene sequence. Generally, a uniform and deterministic hash algorithm maps the subsequences to hash values, the hash values are then processed, and a corresponding sketch (i.e., a compact, approximate data summary that can represent or summarize the original data) is formed by a certain method. Once the sketches have been formed, the sketches of different sequences are compared to obtain Jaccard coefficients, distances and the like for subsequent research. MinHash and HyperLogLog are two of the more representative sketch algorithms.

MinHash was originally developed to detect near-duplicate web pages and images and was later applied in bioinformatics. It relies on the uniformity and determinism of the hash function: the inputs must be distributed uniformly at random over the output range after hashing, and the same input must always produce the same output. The MinHash algorithm processes sequence data by constructing a sketch set for each sequence and estimating the similarity of the sequences from the Jaccard similarity between the sets. In more detail, a sequence is cut into individual K-mers with a sliding window, and each K-mer is passed through the hash function to yield a hash value, which completes the mapping from strings to integers. All the hash values obtained form a set, and the similarity of the original sequences is represented by the similarity between the sets. However, the size of the set of hash values mapped from all K-mers grows linearly with the length of the sequence, so the data volume needs to be compressed. If the sets of two sequences are constructed with the same hash function, then by the uniformity and determinism of the hash function the hash values in the sets are randomly distributed, so the probability that the two sets have the same minimum hash value equals the ratio of the size of their intersection to the size of their union, and this ratio is exactly the definition of the Jaccard similarity. This is the basic idea of MinHash.
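The set-based view described above can be illustrated with a minimal Python sketch. It is not the implementation of Mash or of this disclosure: the function names (`kmers`, `hash64`, `hash_set`, `jaccard`) are hypothetical, and BLAKE2b from the standard library is assumed only as a convenient uniform, deterministic stand-in for the hash function.

```python
import hashlib

def kmers(seq, k):
    """Return every length-k substring of seq, in sliding-window order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hash64(kmer):
    """Deterministic 64-bit hash; BLAKE2b is only a stand-in (assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def hash_set(seq, k):
    """Map every K-mer of seq to an integer and collect the distinct values."""
    return {hash64(km) for km in kmers(seq, k)}

def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two short sequences that share four of their eight distinct 4-mers.
print(jaccard(hash_set("ACGTACGTGA", 4), hash_set("ACGTACGTCC", 4)))
```

This computes the exact Jaccard similarity over the full hash sets, whose size grows with sequence length; the sketch approach replaces the full sets with small summaries.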

The sketch algorithm based on HyperLogLog, which estimates the cardinality of a set (the number of distinct elements in the set) from a bit pattern of the data, has also been applied to biological sequence analysis in recent years. The idea of HyperLogLog is to hash the raw data with a uniform and deterministic hash function and then estimate how many distinct elements the set contains from the longest run of leading zeros in the hash values, plus 1. When the sketch is constructed, a hash value is computed for each K-mer in the sequence; the p-bit prefix of the hash value is taken as the index into the register array of the sketch, and the number of leading zeros in the remaining q bits is the value used to update the register at that index. When a longer run of leading zeros is obtained, the data in the current register is updated. Finally, the cardinality of the original set is estimated by harmonic averaging over the values stored in the registers.
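As a rough illustration of the register update and harmonic-mean estimate described above, the following Python sketch may help. It is written under several assumptions: the names `hll_sketch` and `hll_estimate` are hypothetical, BLAKE2b stands in for the hash function, p = 10 is an arbitrary choice, and the bias constant and small-range correction follow the commonly published HyperLogLog estimator rather than anything stated in this disclosure.

```python
import hashlib
import math

def hash64(item):
    """Deterministic 64-bit hash; BLAKE2b is only a stand-in (assumption)."""
    digest = hashlib.blake2b(item.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def hll_sketch(items, p=10):
    """Build 2**p registers; each keeps the largest (leading zeros + 1) seen."""
    q = 64 - p
    registers = [0] * (1 << p)
    for item in items:
        h = hash64(item)
        index = h >> q                       # prefix: the top p bits
        suffix = h & ((1 << q) - 1)          # the remaining q bits
        rank = q - suffix.bit_length() + 1   # leading zeros in the suffix, plus 1
        registers[index] = max(registers[index], rank)
    return registers

def hll_estimate(registers):
    """Estimate cardinality from the harmonic mean of 2**register values."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)         # published constant, for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:             # small-range correction
        return m * math.log(m / zeros)
    return raw

items = [f"kmer{i}" for i in range(5000)]
print(round(hll_estimate(hll_sketch(items))))
```

With 1024 registers the expected relative error is on the order of a few percent, so the printed estimate should be close to, but generally not exactly, 5000.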

The inventors found that, in applications of the MinHash sketch and the HyperLogLog sketch, the calculation of hash values is one of the compute-intensive regions; the performance of the whole program is limited by the hash computation, and a better-performing method for it is still lacking. Meanwhile, the preparation of the input K-mers before the hash function is computed and the processing of the hash value list after it is computed are the other hot spots of the program, and existing sketch algorithms lack optimization of the data processing before and after hashing.

Disclosure of Invention

To overcome the deficiencies of the prior art, the present disclosure provides a method and a system for generating summary data of a biological gene sequence.

To achieve this purpose, the present disclosure adopts the following technical solutions:

In a first aspect, the present disclosure provides a method for generating summary data of a biological gene sequence.

A method for generating summary data of a biological gene sequence, comprising the following steps:

obtaining a gene sequence to be processed;

performing K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at a time in sequence; obtaining the reverse complementary strand of the gene sequence; packing M K-mers and the corresponding K-mers of the reverse complementary strand into vectors; comparing the forward and reverse K-mers in a vectorized manner using a binary mask and keeping the K-mers with the smaller character values; and performing a vectorized transposition on the M K-mers that are kept;

inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

continuing to slide the window to obtain new K-mers and repeating the above operations until hash values have been computed for all K-mers of the gene sequence to be processed, and constructing a hash value list of the gene sequence from all the hash values;

and generating summary data of the gene sequence to be processed according to the hash value list.

Further, the input transposed M vectors are divided again in units of an integer multiple of 32 bits or 64 bits, and every full N bits of each vector forms one group for vectorized bit operations, wherein N is an integer multiple of 32 or 64;

after all the full N-bit parts have been processed, the remaining tail part of fewer than N bits is subjected to a separate vectorized bit operation;

and the operation results are combined to obtain a result vector, and the result vector is parsed by corresponding instructions to obtain the hash values of the M sequences.

Furthermore, M pieces of K-mer data are loaded into M vectors, and a transposition operation is performed in a vectorized manner.

Further, for the data of the tail part, the switch statement is replaced by a corresponding mask.

Furthermore, in the hash function calculation process, filling is performed using multi-way unrolled parallel operations.

Further, the summary data is generated using the MinHash sketch algorithm.

Furthermore, before the K-mer is input into the hash function, bit operations replace the switch statement in the comparison of the forward K-mer and the reverse K-mer.

Furthermore, before the K-mer is input into the hash function, a K-mer comparison algorithm improved on the basis of SIMD is used, in which the comparison is performed with a binary mask and vectorized instructions.

Furthermore, after the hash value list is generated, robin-hood-hashing replaces unordered_map in the implementation of the priority queue.

Further, the summary data is generated using the HyperLogLog sketch algorithm.

Furthermore, OpenMP multithreading is adopted in the sketch generation stage and the distance stage.

Further, for set union, SIMD is used in the inner loop to perform arithmetic and bit operations on vectors of multiple adjacent operands.

In a second aspect of the present disclosure, a system for generating summary data of a biological gene sequence is provided.

A system for generating summary data of a biological gene sequence, comprising:

a data acquisition module configured to: obtaining a gene sequence to be processed;

a K-mer decomposition module configured to: perform K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at a time in sequence; obtain the reverse complementary strand of the gene sequence; pack M K-mers and the corresponding K-mers of the reverse complementary strand into vectors; compare the forward and reverse K-mers in a vectorized manner using a binary mask and keep the K-mers with the smaller character values; and perform a vectorized transposition on the M K-mers that are kept;

a hash calculation module configured to: inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

a hash value list generation module configured to: continue to slide the window to obtain new K-mers and repeat the above operations until hash values have been computed for all K-mers of the gene sequence to be processed, and construct a hash value list of the gene sequence from all the hash values;

a summary data generation module configured to: and generating summary data of the gene sequence to be processed according to the hash value list.

A third aspect of the present disclosure provides a computer-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the steps in the summary data generation method for a biological gene sequence as described in the first aspect of the present disclosure.

A fourth aspect of the present disclosure provides an electronic device, comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for generating summary data of a biological gene sequence according to the first aspect of the present disclosure.

Compared with the prior art, the beneficial effects of the present disclosure are:

1. The method, system, medium or electronic device of the present disclosure constructs the hash value list of the gene sequence with several hash functions improved on the basis of single instruction multiple data (SIMD); the vectorized implementation computes faster and enables more efficient processing of biological gene sequences.

2. The method, system, medium or electronic device of the present disclosure loads M data into M vectors and then transposes them by vectorized means, which solves the problem that the M data cannot be loaded completely and that the 32/64 bits at the same position of the M data cannot be obtained.

3. In the method, system, medium or electronic device of the present disclosure, the switch statement is replaced by a corresponding mask, which avoids branch mispredictions, allows vectorized operations to be fully used, and greatly improves program performance.

4. The method, system, medium or electronic device of the present disclosure uses multi-way unrolled parallel operations for filling, which reduces dependencies and further improves the performance of the hash function.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

Fig. 1 is a flowchart of the prior-art MinHash sketch method provided in embodiment 1 of the present disclosure.

Fig. 2 is a flowchart of the improved MinHash sketch method provided in embodiment 1 of the present disclosure.

Fig. 3 is a flowchart of the prior-art HyperLogLog sketch method provided in embodiment 1 of the present disclosure.

Fig. 4 is a flowchart of the improved HyperLogLog sketch method provided in embodiment 1 of the present disclosure.

Detailed Description

The present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example 1:

Embodiment 1 of the present disclosure provides a method for generating summary data of a biological gene sequence. For the calculation of hash values, this embodiment provides several hash functions improved on the basis of single instruction multiple data (SIMD), including MurmurHash3, CityHash, xxHash and wangHash. These hash functions are used to construct the hash value list of the gene sequence; a different hash function can be selected for different conditions, which widens the applicability, and the vectorized implementation makes the computation faster.

The method for calculating the hash value by the original hash function is as follows:

For the sequence data to be processed, a sliding window is used to generate a K-mer, and the K-mer is then processed to obtain its reverse complementary strand. (DNA generally has a double-stranded structure formed by two coiled single strands. The two strands are complementary: their bases pair with each other, and the pairing is specific, namely A pairs only with T and C pairs only with G, which is called complementary base pairing. The two complementary single strands run in opposite directions, and each is called the reverse complementary strand of the other.) The two K-mers are then compared to obtain the one with the smaller character value, which is passed into the hash function. Inside the hash function, the K-mer is divided into a body part and a tail part that are processed separately. The original algorithm processes the body part in units of an integer multiple of 32/64 bits at a time; the remainder, which is not enough to make up such a unit, is processed separately as the tail part: it is first shifted and then subjected to the corresponding bit operations.

For example, a K-mer with K = 21 is 168 bits in binary (21 characters of 8 bits each). Assuming the hash function operates on 64 bits at a time, the body part is the first 128 bits, processed 64 bits at a time, and the tail part is the remaining 40 bits. The tail part is shifted so that it is zero-padded to 64 bits, and the corresponding bit operations are then performed to finally obtain the hash value.
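The arithmetic of this example can be checked with a short Python sketch. The helper name `split_body_tail` and the little-endian byte order are assumptions made for illustration; the actual hash functions define their own word loading.

```python
def split_body_tail(data, word_bytes=8):
    """Split data into full words (body) and one zero-padded tail word."""
    n_body = len(data) // word_bytes
    body = [int.from_bytes(data[i * word_bytes:(i + 1) * word_bytes], "little")
            for i in range(n_body)]
    tail_bytes = data[n_body * word_bytes:]
    # Converting fewer than word_bytes bytes leaves the high bits zero,
    # which has the same effect as zero-padding the tail to a full word.
    tail = int.from_bytes(tail_bytes, "little")
    return body, tail, len(tail_bytes) * 8

kmer = b"ACGTACGTACGTACGTACGTA"   # K = 21 characters = 168 bits
body, tail, tail_bits = split_body_tail(kmer)
print(len(body), tail_bits)       # 2 body words of 64 bits, 40 tail bits
```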

In this embodiment, existing hash algorithms are vectorized with single instruction multiple data (SIMD), specifically as follows:

obtaining a gene sequence to be processed;

performing K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at a time in sequence; obtaining the reverse complementary strand of the gene sequence; packing M K-mers and the corresponding K-mers of the reverse complementary strand into vectors; comparing the forward and reverse K-mers in a vectorized manner using a binary mask, that is, selecting the K-mer with the smaller character value from each pair of forward and reverse K-mers, so that M K-mers with the smaller character values are kept; and performing a vectorized transposition on the M K-mers that are kept;

selecting a proper hash function, and inputting a vector obtained by transposition operation into the improved hash function based on Single Instruction Multiple Data (SIMD);

dividing the input transposed M vectors again in units of an integer multiple of 32/64 bits according to the characteristics of the chosen hash function; since the M K-mers are equal in length, a uniform second-level division can be found for the M vectors: every full N bits of each vector (N being an integer multiple of 32/64) forms one group for vectorized bit operations;

after all the full N-bit parts have been processed, performing a separate vectorized bit operation on the remaining part of fewer than N bits (namely the tail part);

combining the operation results by vectorized means, parsing the resulting vector with corresponding instructions to obtain the hash values of the M sequences, and storing the hash values;

and continuing to slide the window to obtain new K-mers and repeating the above operations until hash values have been computed for all K-mers of the gene sequence to be processed, thereby constructing the hash value list of the gene sequence.

Generally, M is 8 and the default K-mer length is 21; a different K-mer length can be set for a sequence, in which case the number of K-mers cut from it changes accordingly.
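The overall loop of the steps above (slide the window, keep the smaller of each forward/reverse pair, hash M K-mers per batch, append to the list) can be sketched in scalar Python as follows. This is only an outline of the control flow under stated assumptions: `hash_list` and `canonical` are hypothetical names, FNV-1a is used as a simple stand-in and is not one of the hash functions named in this embodiment, and each lane is hashed in turn where a SIMD implementation would process the M lanes together.

```python
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def fnv1a64(data):
    """64-bit FNV-1a, used here only as a stand-in hash (assumption)."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h

def canonical(kmer):
    """Return the smaller of a K-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def hash_list(seq, k=21, m=8):
    """Hash all K-mers of seq, taking up to m window positions per batch."""
    starts = range(len(seq) - k + 1)
    hashes = []
    for base in range(0, len(starts), m):
        batch = [canonical(seq[i:i + k]) for i in starts[base:base + m]]
        # A SIMD version would hash these lanes with one instruction stream.
        hashes.extend(fnv1a64(km.encode("ascii")) for km in batch)
    return hashes

print(len(hash_list("ACGT" * 10)))   # 40 bases yield 20 K-mers of length 21
```

Because only the smaller K-mer of each pair is hashed, a sequence and its reverse complement produce the same hash values in reverse order.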

The following improvements are also adopted in this embodiment:

(1) The original hash function receives one datum at a time, whereas the SIMD-based hash function processes M data at a time, so the input data must be merged and loaded into vector form. In addition, the original hash function calculation is divided into a body part and a tail part: the tail part is processed separately, and the body part is processed in multiple loop iterations with 32/64 bits as the basic operation unit. The SIMD-based hash function must therefore also process M × 32/64 bits at a time in the body part, and these must be the 32/64 bits at the same position in each of the M data; that is, the first vector loaded should hold the first 32/64 bits of each of the M data. However, a plain vector load reads the first datum, then the second, and so on from the beginning; since at most 512 bits can be loaded at once and one datum generally exceeds 64 bits, the M data cannot be loaded completely and the 32/64 bits at the same position of the M data cannot be obtained. The loaded data must therefore be rearranged.

In this embodiment, the M data are loaded into M vectors and then transposed by vectorized means, which solves the above problem.
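The effect of the transposition on the data layout can be shown with plain Python lists. This is a layout illustration only, with hypothetical helper names (`to_words`, `transpose_lanes`) and an assumed 64-bit little-endian word; a real implementation would use vector shuffle instructions rather than `zip`.

```python
def to_words(kmer, word_bytes=8):
    """Split one K-mer into 64-bit words, zero-padding the last word."""
    padded = kmer + b"\x00" * (-len(kmer) % word_bytes)
    return [int.from_bytes(padded[i:i + word_bytes], "little")
            for i in range(0, len(padded), word_bytes)]

def transpose_lanes(kmers):
    """Turn M rows (one per K-mer) into rows holding the same word of all M."""
    rows = [to_words(km) for km in kmers]       # row i: all words of K-mer i
    return [list(col) for col in zip(*rows)]    # row j: word j of every K-mer

# Eight K-mers of length 21, each filled with a single distinct byte value.
kmers = [bytes([65 + i]) * 21 for i in range(8)]
transposed = transpose_lanes(kmers)
print(len(transposed), len(transposed[0]))      # 3 word positions, 8 lanes each
```

After the transposition, row j holds the 64 bits at the same position of all M data, which is the layout the vectorized body loop needs.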

(2) The original hash function processes the data of the tail part separately and generally uses switch-case branching, which can incur a large overhead.

In this embodiment, the switch statement is replaced by a corresponding mask, which avoids branch mispredictions, allows vectorized operations to be fully used, and greatly improves program performance.
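One way to picture the replacement of the switch statement by a mask is the following Python sketch, which assumes a tail of fewer than 8 bytes inside a 64-bit little-endian word. The functions `tail_switch` and `tail_masked` are hypothetical; the first imitates a fall-through switch on the tail length, the second obtains the same value with a single AND.

```python
def tail_switch(word_bytes, n):
    """Branchy reference: one case per tail length, falling through downwards."""
    t = 0
    for i in range(7, 0, -1):        # cases 7..1, as in a fall-through switch
        if n >= i:
            t ^= word_bytes[i - 1] << (8 * (i - 1))
    return t

def tail_masked(word, n):
    """Branch-free version: keep only the low n bytes of the loaded word."""
    mask = (1 << (8 * n)) - 1
    return word & mask

raw = bytes(range(1, 9))
word = int.from_bytes(raw, "little")
print(hex(tail_masked(word, 3)))     # low three bytes: 0x30201
```

Since the M K-mers have equal length, the tail length n is the same in every lane, so a single mask can be applied to all lanes of a vector at once.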

(3) The hash function calculation contains certain data dependencies, which slow down vectorized instructions. This embodiment uses multi-way unrolled parallel operations for filling, which reduces the dependencies and further improves the performance of the function.

In the process of forming a sketch, the MinHash sketch and the HyperLogLog sketch share the same hash value calculation and differ in the following parts.

As shown in fig. 1, the subsequent flow of the original MinHash sketch algorithm is as follows:

acquiring a hash value list of a gene sequence to be processed;

selecting a suitable sketch size s, processing the hash value list with a priority queue, and taking the s smallest hash values to form the sketch of the gene sequence; if the current sketch needs to be updated, the two lists of smallest hash values are merged and the s smallest values are reselected as the new sketch;

repeating the above steps until a corresponding sketch has been formed for every gene sequence to be processed;

After the sketches are formed, they can be used in many ways. For example, the sketches of two gene sequences can be compared to calculate the Jaccard coefficient of the two sequences, i.e., the ratio of |S(A∪B) ∩ S(A) ∩ S(B)| to |S(A∪B)|, which is an unbiased estimate of the Jaccard coefficient.

Here S(A) denotes the sketch of gene sequence A, S(B) denotes the sketch of gene sequence B, and S(A∪B) denotes the set formed by the s smallest hash values of the union of the two sketches.
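The sketch construction and the estimator above can be written down in a few lines of Python. The names `bottom_s`, `merge_sketches` and `jaccard_estimate` are hypothetical, and `heapq.nsmallest` is used in place of an explicit priority queue; this is a sketch of the idea, not the flow of fig. 1 itself.

```python
import heapq

def bottom_s(hashes, s):
    """Sketch of one sequence: its s smallest distinct hash values, ascending."""
    return heapq.nsmallest(s, set(hashes))

def merge_sketches(a, b, s):
    """S(A∪B): the s smallest distinct values of the union of two sketches."""
    return heapq.nsmallest(s, set(a) | set(b))

def jaccard_estimate(sketch_a, sketch_b, s):
    """|S(A∪B) ∩ S(A) ∩ S(B)| / |S(A∪B)|."""
    union_sketch = set(merge_sketches(sketch_a, sketch_b, s))
    shared = union_sketch & set(sketch_a) & set(sketch_b)
    return len(shared) / len(union_sketch) if union_sketch else 1.0

# Integers stand in for hash values: A holds 0..99, B holds 5..104.
a = bottom_s(range(0, 100), 10)
b = bottom_s(range(5, 105), 10)
print(jaccard_estimate(a, b, 10))
```

The same `merge_sketches` step also covers the update case, in which two lists of smallest hash values are merged and the s smallest values are kept.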

The Jaccard coefficient and similar measures are calculated from the MinHash sketches to judge similarity. The sketch method has been used effectively for genome assembly, clustering of metagenomic samples, real-time search of gene databases with sequencing data (assembled and unassembled) from different platforms, and so on.

In addition to the optimization of the hash function, the preparation of the input data of the hash function and the processing of the output hash value list include the following improvements, as shown in fig. 2:

(1) Before a K-mer is passed into the hash function, its reverse complementary strand must be computed. In the flow of the original algorithm (fig. 1), the reverse complementary strand is generated with switch-case branch statements; branch mispredictions in this part incur a penalty and reduce performance.

This embodiment improves on the switch-case branching of the original algorithm by replacing the switch statement with bit operations, which avoids branch mispredictions and greatly speeds up the generation of the reverse complementary strand.
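The disclosure does not spell out which bit operations are used. One well-known possibility, shown here purely as an assumption, derives a 2-bit code from the ASCII value of each base with `(ch >> 1) & 3` (giving A=0, C=1, T=2, G=3), so that complementing a base is an XOR with 2. The function names are hypothetical, and only the uppercase bases A, C, G and T are handled.

```python
def encode(kmer):
    """Pack a K-mer into an integer, 2 bits per base: A=0, C=1, T=2, G=3."""
    value = 0
    for ch in kmer.encode("ascii"):
        value = (value << 2) | ((ch >> 1) & 3)   # no branch, no lookup table
    return value

def reverse_complement(value, k):
    """Complement every base with one XOR, then reverse the 2-bit codes."""
    comp = value ^ int("10" * k, 2)   # flip the high bit of each code: A<->T, C<->G
    out = 0
    for _ in range(k):
        out = (out << 2) | (comp & 3)
        comp >>= 2
    return out

def decode(value, k):
    """Unpack the 2-bit codes back into a string of bases."""
    return "".join("ACTG"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))

print(decode(reverse_complement(encode("ACGTT"), 5), 5))   # AACGT
```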

(2) Before the K-mer is transmitted into the hash function, after the reverse complementary chain is obtained, the K-mer needs to be compared to obtain the K-mer with a smaller character value, and then the K-mer is transmitted into the hash function to obtain the hash value. The forward K-mer is compared with its reverse complement K-mer by calling the memcmp function in the original algorithm, thereby leaving the K-mer with a character value less than a predetermined value.

However, comparing K-mers one by one is slow. This embodiment uses a better-performing K-mer comparison algorithm based on SIMD: as described in the hash function optimization, a binary mask is used and the comparison is performed with vectorized instructions, which yields a better speed-up ratio.

(3) After the hash function generates the hash value list, robin-hood hashing replaces unordered_map in the implementation of the priority queue, which improves the performance of the data structure used during construction.

(4) For applications after sketch generation, such as computing the Mash distance, the original algorithm uses a merge-based algorithm. This embodiment adopts a block-based approach, which extends the original algorithm and is faster.

(5) The sketches of different sequences are constructed independently of one another, so construction can be multithreaded. On a multi-core platform, the number of threads should be set to the maximum number of available CPU cores to obtain the best performance. For thread affinity and thread binding, the compact or scatter mode performs better than the default mode.
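Improvements (1) and (2) can be illustrated with a scalar Python sketch. It is not the implementation of this embodiment: the 2-bit encoding (A=0, C=1, G=2, T=3), the ASCII shift trick and all function names are assumptions chosen for illustration, and a SIMD version would apply the same compare-and-select step to M packed K-mers at once instead of one pair at a time.

```python
def encode_base(base):
    """Map A, C, G, T to 0, 1, 2, 3 using shifts and XOR only (assumed encoding)."""
    x = (ord(base) >> 1) & 3      # ASCII trick: A->0, C->1, T->2, G->3
    return x ^ (x >> 1)           # reorder to A->0, C->1, G->2, T->3


def pack_kmer(kmer):
    """Pack a K-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | encode_base(base)
    return value


def reverse_complement(kmer):
    """Packed reverse complement: complement is code XOR 3, position is mirrored."""
    value = 0
    for i, base in enumerate(kmer):
        value |= (encode_base(base) ^ 3) << (2 * i)
    return value


def canonical(kmer):
    """Select the smaller of the forward and reverse-complement codes with a mask."""
    fwd, rev = pack_kmer(kmer), reverse_complement(kmer)
    mask = -int(fwd > rev)        # -1 (all ones) if the reverse strand is smaller, else 0
    return (rev & mask) | (fwd & ~mask)


def unpack_kmer(value, k):
    """Decode a packed K-mer back to a string (for inspection only)."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

Because the first base occupies the most significant bits and the codes follow alphabetical order, comparing two packed K-mers of equal length as integers gives the same result as a character-by-character comparison. For example, `unpack_kmer(canonical("TTGC"), 4)` yields "GCAA", the reverse complement, because it is the smaller of the two strands.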

As shown in fig. 2, the original algorithm flow of HyperLogLog sketch is as follows:

acquiring a hash value list of a gene sequence to be processed;

constructing an HLL array of size m with all initial values set to 0, and processing the hash value list: each hash value is divided into a prefix p and a suffix q; the prefix p indexes a position in the array, and the suffix q is used to calculate the LZC at that position (leading-zero count, i.e. the number of 0 bits in q counted from the left until the leftmost 1 is encountered); the LZC is then compared with the LZC already stored at that position of the array and the larger value is retained, so that each position of the array finally holds its largest LZC value (i.e. LZC_max). Processing the whole hash value list in this way yields the HyperLogLog sketch;

repeating the above steps until every gene sequence to be processed has a corresponding sketch;
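The register update described in these steps can be sketched as follows. The function name `build_hll`, the parameter names, and the choice of taking the top `p` bits of a fixed-width hash as the prefix are assumptions made for illustration; the disclosure does not state the hash width or how the prefix length is chosen.

```python
def build_hll(hash_values, p=14, hash_bits=64):
    """Build an HLL array of size m = 2**p from integer hash values.

    Each register keeps the largest leading-zero count (LZC) seen among
    the suffixes of the hash values whose prefix indexes that register.
    """
    m = 1 << p
    q_bits = hash_bits - p                   # width of the suffix q
    registers = [0] * m                      # initial value 0, as in the text
    for h in hash_values:
        index = h >> q_bits                  # prefix p selects the array position
        suffix = h & ((1 << q_bits) - 1)     # suffix q
        lzc = q_bits - suffix.bit_length()   # zeros before the leftmost 1 in q
        if lzc > registers[index]:           # keep the larger LZC
            registers[index] = lzc
    return registers
```

With toy 8-bit hashes and p = 2, the values 0b01000101 and 0b01010000 both fall into register 1 with LZC 3 and 1 respectively, so the register ends up holding 3.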

After the sketch is formed, this embodiment can use it to estimate the cardinality of different sets with the formula:

$$E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$

where $E$ is the estimated cardinality, $m$ is the size of the array, $\alpha_m$ is the correction factor, and $M_j$ is 1 + LZC_max at position $j$ of the array. The Jaccard index similarity of two gene sequences can also be obtained, which can be used for clustering and a series of other operations.
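A minimal sketch of this estimate is given below, assuming the array stores LZC_max per position as described above. The function name `hll_estimate` is hypothetical, the values of the correction factor are the constants commonly used for HyperLogLog rather than values stated in this disclosure, and the small-range and large-range corrections of the full algorithm are omitted.

```python
def hll_estimate(registers):
    """Raw HyperLogLog estimate E = alpha_m * m^2 / sum_j 2^(-M_j).

    registers: list holding LZC_max for each array position, so that
    M_j = 1 + registers[j] as defined in the text.
    """
    m = len(registers)
    # Commonly used correction factors (assumed, not from the disclosure).
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)     # intended for m >= 128
    harmonic = sum(2.0 ** -(1 + lzc) for lzc in registers)
    return alpha * m * m / harmonic
```

Larger stored LZC values make the sum smaller and therefore the estimate larger, which matches the intuition that long runs of leading zeros are only observed when many distinct hash values have been seen.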

For the performance optimization part, in addition to the hash function optimization part described above, the following improvements are included, as shown in fig. 4:

(1) This embodiment uses OpenMP multithreading. The phases of forming sketches and computing the distances between sketches scale easily to 100 threads and obtain a good speed-up ratio.

(2) In the application part, optimization based on single instruction, multiple data (SIMD) is implemented. In the algorithm, estimating the cardinalities of the union and the intersection is an important component of the similarity measurement, and this part can be accelerated well. For the union, SIMD instructions are used in the inner loop, so that fast arithmetic and bit operations are performed on vectors of multiple adjacent operands and the important loops are accelerated; the AVX512-based instruction set is about 20% faster than the SSE2-based instruction set. The counting of the set cardinality and the intersection cardinality is difficult to vectorize in this way, so manual loop unrolling is used to accelerate these inner loops.
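The union and intersection estimates mentioned in item (2) can be illustrated as follows. The sketch assumes the usual HyperLogLog conventions, which the disclosure does not spell out: the union of two sketches is the position-wise maximum of their arrays, and the intersection cardinality is obtained by inclusion-exclusion from three cardinality estimates. Both function names are hypothetical. The position-wise maximum is the kind of loop over adjacent operands that maps naturally onto SIMD instructions.

```python
def hll_union(registers_a, registers_b):
    """Union of two HLL sketches of equal size: position-wise maximum."""
    if len(registers_a) != len(registers_b):
        raise ValueError("sketches must have the same array size m")
    return [max(a, b) for a, b in zip(registers_a, registers_b)]


def jaccard_from_cardinalities(card_a, card_b, card_union):
    """Jaccard index from estimated |A|, |B| and |A ∪ B|.

    Uses |A ∩ B| = |A| + |B| - |A ∪ B|, clamped at 0 because the three
    inputs are estimates and their combination can be slightly negative.
    """
    if card_union <= 0:
        return 0.0
    intersection = max(0.0, card_a + card_b - card_union)
    return intersection / card_union
```

For example, two sets estimated at 100 elements each with a union estimated at 150 elements give an intersection of 50 and a Jaccard index of 1/3.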

Example 2:

Embodiment 2 of the present disclosure provides a system for generating summary data of a biological gene sequence, comprising the following modules:

a data acquisition module configured to: obtaining a gene sequence to be processed;

a K-mer decomposition module configured to: perform K-mer decomposition on the gene sequence to be processed by using a sliding window, cutting out one fixed-length K-mer at a time; obtain the reverse complementary strand of the gene sequence; pack M K-mers and the corresponding M K-mers of the reverse complementary strand into vectors; compare the forward and reverse K-mers in vectorized form by using a binary mask, i.e. select the K-mer with the smaller character value from each pair of forward and reverse K-mers, so that M K-mers with the smaller character values finally remain; and perform a vectorized transposition operation on the remaining M K-mers;

a hash calculation module configured to: inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

a hash value list generation module configured to: continuously sliding the window to obtain new subsequence K-mers, repeating the operation until all K-mers of the gene sequence to be processed calculate corresponding hash values, and constructing a hash value list of the gene sequence to be processed according to all the hash values;

a summary data generation module configured to: and generating summary data of the gene sequence to be processed according to the hash value list.

The working method of the system is the same as the method for generating the summary data of a biological gene sequence provided in embodiment 1, and is not repeated here.

Example 3:

Embodiment 3 of the present disclosure provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of the method for generating summary data of a biological gene sequence described in embodiment 1 of the present disclosure.

Example 4:

Embodiment 4 of the present disclosure provides an electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor; when the processor executes the program, it implements the steps of the method for generating summary data of a biological gene sequence described in embodiment 1 of the present disclosure.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
