Multi-level clinical genomic variation data storage method based on bidirectional rapid indexing

Document No.: 1557897 Publication date: 2020-01-21

Reading note: This technology, "A multi-level clinical genomic variation data storage method based on bidirectional rapid indexing", was designed by 李昊旻, 段会龙, 舒强, 董聪 and 吴鼎文 on 2019-09-02. The invention discloses a multi-level clinical genomic variation data storage method based on bidirectional rapid indexing, comprising: primary storage, in which a file index data table is established and the original VCF file is stored according to it; secondary storage, in which the original VCF file is converted into and stored as database tables, according to the original VCF file structure and the context information in the file index data table; and tertiary storage, in which a patient-variation bidirectional index mechanism is established and data are stored according to it. In this mechanism the first index takes the patient as primary key, each patient primary key corresponding to one variation long binary number that indexes all defined variations, and the second index takes the variation as primary key, each variation primary key corresponding to one patient long binary number that indexes all patients. The invention allows the required patient information and variation information to be obtained quickly and significantly improves information retrieval efficiency.

1. A multi-level clinical genomic variation data storage method based on bidirectional rapid indexing, characterized by comprising the following steps:

primary storage: establishing a file index data table, and storing the original VCF file according to the file index data table;

secondary storage: converting the original VCF file into database tables and storing it, according to the original VCF file structure and the context information in the file index data table;

and tertiary storage: establishing a patient-variation bidirectional index mechanism, wherein a first index takes the patient as primary key, each patient primary key corresponding to one variation long binary number that indexes all defined variations, and a second index takes the variation as primary key, each variation primary key corresponding to one patient long binary number that indexes all patients, and storing data according to the patient-variation bidirectional index mechanism.

2. The bi-directional fast indexing-based multi-level clinical genomic variation data storage method of claim 1, wherein the secondary storage is set to a limited-time-period mode or a limited-case-number mode, and once the set time period or number of cases is exceeded, the expired or excess data are automatically cleared.

3. The bi-directional fast indexing-based multi-stage clinical genomic variation data storage method of claim 1, wherein the patient-variation bi-directional indexing mechanism employs two data tables, namely a patient data table with the patient as main index and a variation data table with the variation as main index;

the patient data table contains the patient's name, sex and date of birth, together with a variation index field holding one long binary number;

the variation data table contains the variation type and genomic location, together with a patient index field holding one long binary number.

4. The multistage clinical genomic variation data storage method based on bidirectional rapid indexing as claimed in claim 3, wherein the patient primary key and the variation primary key are both integers, and the patient-variation bidirectional indexing mechanism uses the bits of a binary number as identifiers;

each bit in the variation index field represents a defined variation;

each bit in the patient index field represents a patient.

5. The bi-directional fast indexing-based multi-stage clinical genomic variation data storage method of claim 4, wherein for each bit in the variation index field, 0 indicates that the patient does not carry the corresponding variation and 1 indicates that the patient carries it;

for each bit in the patient index field, 0 indicates that the variation does not occur in the corresponding patient and 1 indicates that it does.

6. The method for storing the multi-stage clinical genomic variation data based on the bidirectional rapid index as claimed in claim 4 or 5, wherein any bit in the variation index field is converted into integer data, and the information of the variation is obtained by looking up the variation primary key equal to that integer;

and any bit in the patient index field is converted into integer data, and the information of the patient is obtained by looking up the patient primary key equal to that integer.

7. The bi-directional fast indexing-based multi-stage clinical genomic variation data storage method according to claim 4 or 5, wherein the length of the variation index field is determined according to the number of defined variations;

the length of the patient index field is determined according to the number of patients.

8. The bi-directional fast indexing-based multi-level clinical genomic variation data storage method according to claim 1, wherein the primary storage adopts a synchronous mode, and the secondary storage and the tertiary storage adopt an asynchronous background storage mode.

Technical Field

The invention relates to the technical field of data storage, in particular to a multi-level clinical genomic variation data storage method based on bidirectional rapid indexing.

Background

Genome sequencing is a relatively new gene detection technology that can determine the complete gene sequence from blood or saliva and predict the likelihood of various diseases as well as an individual's behavioral characteristics. Gene sequencing can pinpoint an individual's disease-related genes so that prevention and treatment can begin early.

Today, the ever-increasing speed and falling cost of genome sequencing are driving the development of precision medicine. More and more gene sequencing data are being used for clinical diagnosis, treatment selection and similar services, and play an increasingly important role in areas such as genetic diseases and tumor treatment.

Genome sequencing technology has generated a variety of entirely new types of clinical data. Currently, genetic testing based on next-generation sequencing produces files in several formats, such as FASTQ (sequencing results), BAM (alignment to a reference) and VCF (variants); of these, the VCF file is the main one used in the clinic.

VCF is a text file format describing SNPs (single-base variations), INDELs (insertions and deletions) and SVs (structural variations); a file is typically 500-750 MB in size. These variant results currently serve clinical geneticists in producing clinical reports, but the data are usually stored as files, with no suitable means of cross-patient retrieval, statistics or reuse.

To solve these problems, VCF files are usually imported into a conventional database. However, storing them in a conventional relational database that mirrors the structure of the VCF format itself, in order to serve retrieval and statistics, faces a major problem: each VCF file may contain tens of thousands to hundreds of thousands of entries, so as the number of clinically sequenced subjects grows, a single table easily exceeds hundreds of millions of rows once the system holds tens of thousands of patients. Queries against a table of that size become increasingly inefficient or even unusable.
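
The scale problem can be checked with a back-of-envelope calculation. The two figures below are assumed values picked from within the ranges quoted above; they are not measurements from the source.

```python
# Illustrative figures only: the text gives ranges, not exact values.
entries_per_vcf = 20_000   # assumed, within "tens to hundreds of thousands"
patients = 10_000          # assumed, "tens of thousands of patients"

# One row per VCF entry per patient in a single VCF-shaped table.
rows_in_single_table = entries_per_vcf * patients
print(f"{rows_in_single_table:,} rows")
```

Even at the low end of both ranges the table already holds hundreds of millions of rows.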

Yet once clinical sequencing is widely adopted, the number of patients to be managed each year can typically exceed that figure, so storing data directly in tables that follow the VCF structure becomes infeasible.

Therefore, a different genomic variation data storage scheme is needed, one that remains usable over the long term, is efficient, and supports rapid cross-patient retrieval and statistics, in order to build a genome archiving system that can serve the clinic.

Multi-level storage uses different storage forms and index structures to achieve efficient retrieval and use in different application scenarios; applied to clinical genomic variation data, it can improve the efficiency of data services.

Disclosure of Invention

To address the above deficiencies in the field, the invention provides a multi-level clinical genomic variation data storage method based on bidirectional rapid indexing, which can rapidly obtain the required patient information and variation information and significantly improve information retrieval efficiency.

A multi-level clinical genomic variation data storage method based on bidirectional rapid indexing comprises the following steps:

primary storage: establishing a file index data table, and storing the original VCF file according to the file index data table;

secondary storage: converting the original VCF file into database tables and storing it, according to the original VCF file structure and the context information in the file index data table;

and tertiary storage: establishing a patient-variation bidirectional index mechanism, wherein a first index takes the patient as primary key, each patient primary key corresponding to one variation long binary number that indexes all defined variations, and a second index takes the variation as primary key, each variation primary key corresponding to one patient long binary number that indexes all patients, and storing data according to the patient-variation bidirectional index mechanism.
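
A minimal sketch of the first two levels, under stated assumptions: the table names `file_index` and `vcf_rows`, their columns, the archive path and the use of SQLite are all hypothetical choices made for illustration, since the source does not specify a schema or database. Only the VCF conventions (header lines starting with `#`, tab-separated CHROM, POS, ID, REF, ALT columns) come from the format itself.

```python
import sqlite3

VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "2\t67890\t.\tCT\tC\t60\tPASS\t.\n"
)

def parse_vcf_records(text):
    """Yield (chrom, pos, ref, alt) for each data line, skipping '#' headers."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), ref, alt

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_index "
    "(file_id INTEGER PRIMARY KEY, patient_id INTEGER, path TEXT)"
)
conn.execute(
    "CREATE TABLE vcf_rows (file_id INTEGER, patient_id INTEGER, "
    "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)

# Level 1: register the raw file in the file index (hypothetical path);
# the VCF file itself is kept unchanged.
cur = conn.execute(
    "INSERT INTO file_index (patient_id, path) VALUES (?, ?)",
    (1, "/archive/patient_1.vcf"),
)
file_id = cur.lastrowid

# Level 2: convert the VCF into table rows, tagged with the context
# (file and patient) taken from the file index entry.
rows = [(file_id, 1, *rec) for rec in parse_vcf_records(VCF_TEXT)]
conn.executemany("INSERT INTO vcf_rows VALUES (?, ?, ?, ?, ?, ?)", rows)
secondary_count = conn.execute("SELECT COUNT(*) FROM vcf_rows").fetchone()[0]
```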

Preferably, the secondary storage is set to a limited-time-period mode or a limited-case-number mode, and once the set time period or number of cases is exceeded, the expired or excess data are automatically cleared.
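
One way such a clearing rule might look is sketched below. The function name `purge_secondary`, the record shape and the "keep the newest cases" policy are assumptions for illustration; the source only states that expired or excess data are cleared automatically.

```python
from datetime import datetime, timedelta

def purge_secondary(records, now, max_age=None, max_cases=None):
    """Return the records kept under a time window and/or a case cap.

    records: list of (case_id, stored_at) tuples.
    Assumed policy: when the case cap is exceeded, the oldest cases go first.
    """
    kept = sorted(records, key=lambda r: r[1], reverse=True)  # newest first
    if max_age is not None:
        kept = [r for r in kept if now - r[1] <= max_age]
    if max_cases is not None:
        kept = kept[:max_cases]
    return kept

now = datetime(2020, 1, 10)
records = [
    (1, datetime(2020, 1, 1)),
    (2, datetime(2020, 1, 5)),
    (3, datetime(2020, 1, 9)),
]
kept_by_count = purge_secondary(records, now, max_cases=2)
kept_by_age = purge_secondary(records, now, max_age=timedelta(days=6))
```

Because the primary storage still holds the original VCF files, anything cleared from the secondary storage can in principle be regenerated from them.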

Preferably, the patient-variation bidirectional indexing mechanism uses two data tables, namely a patient data table with the patient as main index and a variation data table with the variation as main index;

the patient data table contains the patient's name, sex and date of birth, together with a variation index field holding one long binary number;

the variation data table contains the variation type and genomic location, together with a patient index field holding one long binary number.
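
The two tables could be laid out roughly as follows. This is a sketch under assumptions: the table and column names, the BLOB column type, the big-endian byte encoding and the use of SQLite are illustrative choices, not details given in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id    INTEGER PRIMARY KEY,
    name          TEXT,
    sex           TEXT,
    birth_date    TEXT,
    variant_index BLOB      -- long binary number: one bit per defined variation
);
CREATE TABLE variant (
    variant_id    INTEGER PRIMARY KEY,
    variant_type  TEXT,
    location      TEXT,     -- genomic location, e.g. "chr1:12345"
    patient_index BLOB      -- long binary number: one bit per patient
);
""")

def to_blob(bits: int) -> bytes:
    """Encode an arbitrarily long non-negative int as big-endian bytes."""
    return bits.to_bytes(max(1, (bits.bit_length() + 7) // 8), "big")

def from_blob(blob: bytes) -> int:
    """Decode big-endian bytes back into an int."""
    return int.from_bytes(blob, "big")

conn.execute(
    "INSERT INTO patient VALUES (?, ?, ?, ?, ?)",
    (1, "Patient A", "F", "2010-05-01", to_blob(0b101)),
)
stored = conn.execute(
    "SELECT variant_index FROM patient WHERE patient_id = 1"
).fetchone()[0]
restored = from_blob(stored)
```

Each table has one row per patient or per variation rather than one row per patient-variation pair.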

Preferably, the patient primary key and the variation primary key are both integers, and the patient-variation bidirectional index mechanism uses the bits of a binary number as identifiers;

each bit in the variation index field represents a defined variation;

each bit in the patient index field represents a patient.

Preferably, for each bit in the variation index field, 0 indicates that the patient does not carry the corresponding variation and 1 indicates that the patient carries it;

for each bit in the patient index field, 0 indicates that the variation does not occur in the corresponding patient and 1 indicates that it does.
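
The bit semantics above can be sketched with Python integers standing in for the long binary numbers. The mapping of primary key k to bit k-1 (counting from the least significant bit) is an assumption made here; the source does not fix a bit order. The names `record_observation` and `has_variant` are likewise hypothetical.

```python
variant_index = {}   # patient_id -> bitmask over defined variations
patient_index = {}   # variant_id -> bitmask over patients

def record_observation(patient_id, variant_id):
    """Set the matching bit in both directions of the index."""
    variant_index[patient_id] = (
        variant_index.get(patient_id, 0) | (1 << (variant_id - 1))
    )
    patient_index[variant_id] = (
        patient_index.get(variant_id, 0) | (1 << (patient_id - 1))
    )

def has_variant(patient_id, variant_id):
    """True if the bit for this variation is 1 in the patient's index."""
    return bool((variant_index.get(patient_id, 0) >> (variant_id - 1)) & 1)

record_observation(1, 3)   # patient 1 carries variation 3
record_observation(2, 3)   # patient 2 carries variation 3
record_observation(2, 1)   # patient 2 carries variation 1
```

After these three calls, patient 2's variation index is `0b101` and variation 3's patient index is `0b11`.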

Preferably, any bit in the variation index field is converted into integer data, and the information of the variation is obtained by looking up the variation primary key equal to that integer;

and any bit in the patient index field is converted into integer data, and the information of the patient is obtained by looking up the patient primary key equal to that integer.
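
Read as "the position of a set bit gives the primary key", the lookup might be sketched as below. That reading, the key-k-to-bit-(k-1) convention and the sample `variants` dictionary are assumptions for illustration only.

```python
def set_bit_keys(bits: int):
    """Return the primary keys whose bits are 1 (bit 0 maps to key 1)."""
    keys = []
    position = 0
    while bits:
        if bits & 1:
            keys.append(position + 1)
        bits >>= 1
        position += 1
    return keys

# Hypothetical variation table keyed by integer primary key.
variants = {
    1: ("SNP", "chr1:12345"),
    2: ("INDEL", "chr2:67890"),
    3: ("SNP", "chr7:55555"),
}

patient_variant_bits = 0b101                 # variations 1 and 3 are present
keys = set_bit_keys(patient_variant_bits)
details = [variants[k] for k in keys]
```

The same function applies unchanged to a patient index field, where the resulting keys are patient primary keys.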

Preferably, the length of the variation index field is determined according to the number of defined variations;

the length of the patient index field is determined according to the number of patients.
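
With one bit per item, the field length follows directly from the item count. The sketch below assumes the field is stored in whole bytes; the counts used are illustrative, not taken from the source.

```python
def index_field_bytes(item_count: int) -> int:
    """Bytes needed to hold one bit per item, rounded up to a whole byte."""
    return (item_count + 7) // 8

patient_field_bytes = index_field_bytes(10_000)    # bits for 10,000 patients
variant_field_bytes = index_field_bytes(100_000)   # bits for 100,000 variations
```

Under these assumed counts a patient index field takes 1,250 bytes and a variation index field 12,500 bytes per row.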

Preferably, the primary storage adopts a synchronous mode, and the secondary storage and the tertiary storage adopt an asynchronous background storage mode.
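
The synchronous/asynchronous split can be illustrated with a worker thread and a queue. This is a sketch only: the names `store_vcf`, `background_worker`, `raw_store` and `processed` are hypothetical, and the "processing" step merely counts data lines as a stand-in for the secondary and tertiary storage work.

```python
import queue
import threading

raw_store = {}          # level 1: file_id -> raw VCF text
processed = []          # stand-in for level 2 and level 3 results
tasks = queue.Queue()

def background_worker():
    """Process queued files until a None sentinel arrives."""
    while True:
        file_id = tasks.get()
        if file_id is None:
            tasks.task_done()
            break
        data_lines = [
            line for line in raw_store[file_id].splitlines()
            if line and not line.startswith("#")
        ]
        processed.append((file_id, len(data_lines)))
        tasks.task_done()

def store_vcf(file_id, text):
    """Store the raw file synchronously, then defer the rest."""
    raw_store[file_id] = text    # level 1 completes before returning
    tasks.put(file_id)           # levels 2 and 3 run in the background
    return "accepted"

worker = threading.Thread(target=background_worker, daemon=True)
worker.start()
status = store_vcf(1, "#CHROM\tPOS\n1\t100\n1\t200\n")

tasks.join()       # demo only: wait so the result can be inspected
tasks.put(None)    # stop the worker
worker.join()
```

The caller receives `"accepted"` as soon as the raw file is stored; the `tasks.join()` call is there only so the demo can inspect `processed` deterministically.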

Compared with the prior art, the invention has the main advantages that:

(1) The invention enables rapid cross-patient retrieval and statistics, and shrinks the data tables by a factor of thousands relative to the number of original VCF entries, which is broadly sufficient for widespread clinical use.

(2) A query against the patient-variation bidirectional index returns only a single long binary number, which markedly improves both query efficiency and the efficiency of transmitting query results, and so provides a better data service.

(3) When a storage request is received, only the first level is stored synchronously; the subsequent levels are stored asynchronously in the background, so the client does not have to stop and wait for a response, which preserves the user experience of the system.
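
Because each query result is a single long binary number, cross-patient statistics can be computed with plain bit operations. The example below is an illustration only: combining two patient index fields with AND/OR is an assumed use not spelled out in the source, and the variation keys and bit patterns are made up.

```python
# Hypothetical patient index fields for two variations (bit k-1 = patient k).
patient_bits_by_variant = {
    10: 0b01101,   # patients 1, 3, 4
    11: 0b00111,   # patients 1, 2, 3
}

both = patient_bits_by_variant[10] & patient_bits_by_variant[11]
either = patient_bits_by_variant[10] | patient_bits_by_variant[11]

# Counting set bits gives the number of matching patients.
count_both = bin(both).count("1")
count_either = bin(either).count("1")
```

Here `both` is `0b00101` (patients 1 and 3) and `either` is `0b01111` (patients 1 to 4), obtained without scanning any per-entry rows.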

Drawings

FIG. 1 is a schematic flow chart of a multi-stage clinical genomic variation data storage method based on bidirectional fast indexing according to the present invention;

FIG. 2 is a graph comparing the efficiency of the secondary storage mode with that of the patient-based variation index mode in the test case of the example.

Detailed Description

The invention is further described below with reference to the drawings and specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the invention. Where operating conditions are not specified in the following examples, conventional conditions or the conditions recommended by the manufacturer are generally used.
