Biological gene sequence summary data generation method and system

Document No.: 96767 Publication date: 2021-10-12

Reading note: This technology, "Biological gene sequence summary data generation method and system", was designed by Liu Weiguo (刘卫国), Lin Haoran (林浩然), Xu Xiaoming (徐晓明), and Yin Zekun (殷泽坤) on 2021-05-20. Abstract: The disclosure provides a method and a system for generating summary data of a biological gene sequence. A gene sequence to be processed is acquired and decomposed into K-mers using a sliding window. M K-mers are compared with the K-mers of the corresponding M reverse-complement strands; for each forward/reverse pair, the K-mer with the smaller character value is selected, yielding M K-mers, which then undergo a vectorized transposition. The vectors produced by the transposition are input into a hash function improved with Single Instruction Multiple Data (SIMD) instructions to obtain the hash value corresponding to each vector. The window continues to slide to obtain new subsequence K-mers, and the operation is repeated until a hash value has been computed for every K-mer of the gene sequence; a hash value list of the gene sequence is constructed from all the hash values, and the summary data of the gene sequence is generated from that list. Because the implementation is vectorized, computation is faster and biological gene sequences can be processed more efficiently.

1. A method for generating summary data of a biological gene sequence, comprising the following steps:

obtaining a gene sequence to be processed;

performing K-mer decomposition on the gene sequence to be processed by using a sliding window, cutting out one fixed-length K-mer at each step; obtaining the reverse-complement strand of the gene sequence; packing M K-mers and the corresponding K-mers of the reverse-complement strand into vectors; comparing the forward and reverse K-mers in a vectorized manner using a binary mask and retaining, for each pair, the K-mer with the smaller character value; and performing a vectorized transposition on the M retained K-mers;

inputting the vectors obtained by the transposition operation into a hash function improved based on Single Instruction Multiple Data (SIMD) to obtain hash values corresponding to the vectors;

continuing to slide the window to obtain new subsequence K-mers, and repeating the above operations until a hash value has been computed for every K-mer of the gene sequence to be processed, and constructing a hash value list of the gene sequence to be processed from all the hash values;

and generating summary data of the gene sequence to be processed according to the hash value list.
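The steps of claim 1 can be illustrated with a minimal scalar sketch in Python. It is not the vectorized implementation the claims describe: each window is processed one at a time, `hashlib.blake2b` stands in for the SIMD-improved hash function, and the summary data is assumed to be the `size` smallest distinct hash values (a bottom-s MinHash sketch), which the claims do not specify. The names `reverse_complement`, `canonical_kmers`, `hash64`, and `sketch` are hypothetical.

```python
import hashlib

# Complement table for the four DNA bases (input is assumed to contain only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Slide a window of width k over seq; for each window yield the smaller
    (by character value) of the forward K-mer and its reverse complement."""
    for i in range(len(seq) - k + 1):
        forward = seq[i:i + k]
        yield min(forward, reverse_complement(forward))


def hash64(kmer):
    """Stand-in 64-bit hash (assumption): BLAKE2b, not the SIMD hash of the claims."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def sketch(seq, k, size):
    """Build the hash value list, then keep the `size` smallest distinct values
    as summary data (bottom-s MinHash, an assumption about the summary format)."""
    hashes = {hash64(kmer) for kmer in canonical_kmers(seq, k)}
    return sorted(hashes)[:size]
```

Because each window contributes the smaller of its two orientations, a sequence and its reverse-complement strand produce the same set of K-mers and therefore the same summary, for example `sketch(s, 21, 1000) == sketch(reverse_complement(s), 21, 1000)`.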

2. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

partitioning the M transposed input vectors into units whose size is an integer multiple of 32 bits or 64 bits, and performing one group of vectorized bit operations on each full N-bit unit, wherein N is an integer multiple of 32 or 64;

after all full N-bit units have been processed, performing a separate vectorized bit operation on the remaining tail of fewer than N bits;

and combining the operation results into a result vector, and extracting the hash values of the M sequences from the result vector by means of corresponding instructions.
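The block-then-tail structure of claim 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: N is fixed at 32, each list comprehension stands in for one SIMD instruction applied to all M lanes, and the mixing steps follow the general pattern of a MurmurHash3-style 32-bit hash, which the claims do not name. The constants are written from memory and the output is not claimed to match any reference implementation bit for bit. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593  # mixing constants (assumed, MurmurHash3-style)


def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def mix_block(k):
    """Bit operations applied to one 32-bit unit (a full block or the padded tail)."""
    k = (k * C1) & MASK32
    k = rotl32(k, 15)
    return (k * C2) & MASK32


def fmix32(h):
    """Final avalanche step that turns a lane state into a hash value."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    return h ^ (h >> 16)


def hash_lanes(kmers, seed=0):
    """Hash M equal-length byte strings in lock step, one lane per K-mer."""
    if not kmers:
        return []
    length = len(kmers[0])
    assert all(len(km) == length for km in kmers), "K-mers must share one length"
    lanes = [seed & MASK32] * len(kmers)
    full = length // 4
    # Full N-bit units: the same operations run on every lane for each block index.
    for b in range(full):
        words = [int.from_bytes(km[4 * b:4 * b + 4], "little") for km in kmers]
        lanes = [rotl32(h ^ mix_block(w), 13) for h, w in zip(lanes, words)]
        lanes = [(h * 5 + 0xE6546B64) & MASK32 for h in lanes]
    # Tail of fewer than N bits: handled once, separately, after all full units.
    if length % 4:
        tails = [int.from_bytes(km[4 * full:], "little") for km in kmers]
        lanes = [h ^ mix_block(t) for h, t in zip(lanes, tails)]
    # Combine the results and read out one hash value per lane.
    return [fmix32(h ^ length) for h in lanes]
```

Lock-step processing works here because every K-mer has the same length K, so all lanes reach the tail at the same block index. Each lane's result depends only on its own K-mer, so `hash_lanes([a, b])` returns the same values as hashing `a` and `b` separately.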

3. The method for generating summary data of a biological gene sequence according to claim 1, wherein:

loading the data of M K-mers into M vectors, and transposing the vectors in a vectorized manner;