Data analysis system and analysis method for genome base variation detection

文档序号:154862 发布日期:2021-10-26 浏览:35次 中文

阅读说明:本技术 一种用于基因组碱基变异检测的数据分析系统及分析方法 (Data analysis system and analysis method for genome base variation detection ) 是由 刘珍 刘志岩 王海宁 于 2021-07-23 设计创作,主要内容包括:本发明提出了一种用于基因组碱基变异检测的数据分析方法及分析系统,将原始基因组测序的结构数据进行分割为多个子数据序列,并存储到分层分布式数据库中;对存储到分层分布式数据库中的多个子数据序列进行质量过滤,并对过滤后的子数据序列与参考序列进行比对;评估子数据序列和参考序列的相似度,进而获得碱基型的概率;生成目标碱基位置的候选变体可能性的分类评分。(The invention provides a data analysis method and an analysis system for genome base variation detection, which divide structural data of original genome sequencing into a plurality of subdata sequences and store the subdata sequences into a hierarchical distributed database; performing quality filtering on a plurality of subdata sequences stored in a layered distributed database, and comparing the filtered subdata sequences with a reference sequence; evaluating the similarity of the subdata sequence and the reference sequence to further obtain the probability of the base type; generating a classification score of the likelihood of the candidate variant at the base position of interest.)

1. A data analysis method for detecting genomic base variation, comprising the steps of:

s1, dividing the original genome sequenced structure data into a plurality of subdata sequences, and storing the subdata sequences in a layered distributed database;

s2, filtering the quality of the subdata sequences stored in the hierarchical distributed database, and comparing the filtered subdata sequences with the reference sequence;

s3, evaluating the similarity of the subdata sequence and the reference sequence to further obtain the probability of the base type;

s4, generating a classification score of the likelihood of the candidate variant at the target base position.

2. The data analysis method of claim 1, wherein: in the step S1, in the above step,

the method comprises the steps of adopting a characteristic extraction and segmentation method to segment structural data, segmenting according to a set length, using IDs of reads of segmented sequences as indexes, using chain labels, base sequences and quality as values, assembling sequences with the same index into a sub-data sequence, calculating the length of each sub-data sequence, and storing the sub-data sequences with different indexes into a plurality of storage servers in a distributed database based on the length of the sub-data sequences.

3. The data analysis method of claim 1, wherein: the score was calculated using the following Smith-Waterman formula:

M=R*C;

R=a*L2+b;

wherein M represents the Smith-Waterman calculated score, R is the length of the reference sequence, C represents the length of each sub-data sequence, L represents the length of the data stored by the sub-storage server, and a and b represent constants.

4. The data analysis method of claim 3, wherein: the step S3 specifically includes the following steps:

s31, defining an active region, and identifying at least one active region according to the sequence alignment data of the sequencing data;

s32, constructing an assembly graph for each activity area by using the De-Bruijn graph and the hash table and taking the reference sequence as a template, reading the Reads of each segment in turn to match with one segment of the graph, generating all possibly formed paths in a mode of dynamically generating nodes, and screening out paths with good support to represent possible haplotypes;

s33, adding the generated basic data into a probability calculation queue;

s34, using Pair HMM algorithm to realign the sequences, and combining the calculated score M to obtain the probability of each haplotype;

and S35, calculating the probability of the base type by adopting Bayes theorem.

5. The data analysis method of claim 4, wherein: based on the Bayesian total probability formula, in the case of observing data D, the probability of genotype G can be obtained as follows:

wherein, for a common diploid, i is equal to 1, 2, 3, i.e. there is

Gi=H1H2∈{Aa,AA,aa};

P (G) represents the probability of the base type, which is the probability that we calculate by studying the error rate of the reference gene, the probability of the SNP being homozygote, the probability of the SNP being heterozygote, and the ratio of the probability of transition to transversion.

P (D/G) is the probability that the observed data D will appear given the genotype G, and is the haploid likelihood value P (D) determined in step 3j/Hn) Substituting the following equation yields:

6. the data analysis method of claim 2, wherein: and judging the corresponding index of the newly added sequence according to the chain label, the base sequence or the quality as a value, adding the corresponding subdata sequence, and storing the subdata sequence in a storage server in a distributed database.

7. A data analysis system for implementing the data analysis method of any one of claims 1 to 6, characterized in that: the method comprises the following steps:

the structure data acquisition unit is used for acquiring the structure data of genome sequencing in the original FASTQ file;

the dividing and storing unit is electrically connected with the structural data acquiring unit and is used for dividing the structural data into a plurality of subdata sequences and storing the subdata sequences into a hierarchical distributed database;

the sequence comparison unit is electrically connected with the layered distributed database and is used for filtering the quality of a plurality of subdata sequences stored in the layered distributed database and comparing the filtered subdata sequences with a reference sequence;

the probability evaluation unit is electrically connected with the sequence comparison unit and is used for evaluating the similarity between the subdata sequence and the reference sequence so as to obtain the genotyping probability;

a classification scoring unit, electrically connected to the probability evaluation unit, for generating a variant likelihood classification score for the target base position.

8. The data analysis system of claim 7, wherein: the sequence alignment unit also comprises a calibration subunit, which is used for evaluating the alignment quality of the subdata sequences and calibrating the correctness of the subdata sequences.

9. The data analysis system of claim 7, wherein: the classification scoring unit outputs classification scores of the variant possibilities by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for integrating the features and a classifier connected finally.

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a data analysis system and method for detecting genome base variation.

Background

The development of the second generation sequencing technology brings revolutionary breakthrough to the field of life science research, so that researchers can quickly and conveniently acquire genome sequence data, and unprecedented opportunities are provided for understanding life mechanisms and realizing precise medical treatment. However, how to rapidly analyze these massive sequencing data to serve clinical diagnosis and treatment is an urgent problem for biological researchers.

At present, on a cluster server generally configured, human genome data with a sequencing depth of 30X is calculated, which usually requires 30-40 hours and cannot keep up with the generation speed of sequencing data, so that the analysis efficiency of the human genome data needs to be improved.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a data analysis method for detecting genomic base variation, comprising the steps of:

s1, dividing the original genome sequenced structure data into a plurality of subdata sequences, and storing the subdata sequences in a layered distributed database;

s2, filtering the quality of the subdata sequences stored in the hierarchical distributed database, and comparing the filtered subdata sequences with the reference sequence;

s3, evaluating the similarity of the subdata sequence and the reference sequence to further obtain the probability of the base type;

s4, generating a classification score of the likelihood of the candidate variant at the target base position.

Further, in step S1, the structure data is divided by a feature extraction and division method, the structure data is divided according to a set length, the sequences having the same index are assembled into a sub-data sequence by using the IDs of the reads of the divided sequences as an index and the chain label, the base sequence, and the quality as values, the length of each sub-data sequence is calculated, and the sub-data sequences having different indices are stored in a plurality of storage servers in a distributed database based on the length of the sub-data sequence.

Further, the score was calculated using the following Smith-Waterman formula:

M=R*C;

R=a*L2+b;

wherein M represents the Smith-Waterman calculated score, R is the length of the reference sequence, C represents the length of each sub-data sequence, L represents the length of the data stored by the sub-storage server, and a and b represent constants.

Further, the step S3 specifically includes the following steps:

s31, defining an active region, and identifying at least one active region according to the sequence alignment data of the sequencing data;

s32, constructing an assembly graph for each activity area by using the De-Bruijn graph and the hash table and taking the reference sequence as a template, reading the Reads of each segment in turn to match with one segment of the graph, generating all possibly formed paths in a mode of dynamically generating nodes, and screening out paths with good support to represent possible haplotypes;

s33, adding the generated basic data into a probability calculation queue;

s34, using Pair HMM algorithm to realign the sequences, and combining the calculated score M to obtain the probability of each haplotype;

and S35, calculating the probability of the base type by adopting Bayes theorem.

Further, based on the bayesian total probability formula, in the case of observing the data D, the probability of genotype G can be obtained as follows:

wherein, for a common diploid, i is equal to 1, 2, 3, i.e. there is

Gi=H1H2∈{Aa,AA,aa};

P (G) represents the probability of the base type, which is the probability that we calculate by studying the error rate of the reference gene, the probability of the SNP being homozygote, the probability of the SNP being heterozygote, and the ratio of the probability of transition to transversion.

P (D/G) is the probability that the observed data D will appear given the genotype G, and is the haploid likelihood value P (D) determined in step 3j/Hn) Substituting the following equation yields:

further, for the newly added sequence, the corresponding index is judged according to the chain label, base sequence or quality as a value, so as to add the corresponding subdata sequence and store the subdata sequence in a storage server in a distributed database.

The invention also provides a data analysis system for implementing the data analysis method, which comprises the following steps:

the structure data acquisition unit is used for acquiring the structure data of genome sequencing in the original FASTQ file;

the dividing and storing unit is electrically connected with the structural data acquiring unit and is used for dividing the structural data into a plurality of subdata sequences and storing the subdata sequences into a hierarchical distributed database;

the sequence comparison unit is electrically connected with the layered distributed database and is used for filtering the quality of a plurality of subdata sequences stored in the layered distributed database and comparing the filtered subdata sequences with a reference sequence;

the probability evaluation unit is electrically connected with the sequence comparison unit and is used for evaluating the similarity between the subdata sequence and the reference sequence so as to obtain the genotyping probability;

a classification scoring unit, electrically connected to the probability evaluation unit, for generating a variant likelihood classification score for the target base position.

Furthermore, the sequence alignment unit further comprises a calibration subunit, which is used for evaluating the alignment quality of the subdata sequences and calibrating the correctness thereof.

Further, the classification scoring unit outputs a classification score of the variant likelihood by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for feature integration and a finally-connected classifier.

The noun explains:

a gene refers to a DNA or RNA sequence carrying genetic information (i.e., a gene is a DNA or RNA fragment having a genetic effect), also called a genetic element, which is the basic genetic unit for controlling a trait. The gene expresses the genetic information carried by the gene by guiding the synthesis of protein, thereby controlling the character expression of the organism individual.

Gene sequencing is a novel gene detection technology, and the complete sequence of genes is analyzed and determined from blood or saliva, so that the possibility of suffering from various diseases, the behavior characteristics and the behavior reasonableness of an individual and the like are predicted.

reads: the method is a short sequencing fragment, is sequencing data generated by a high-throughput sequencer, can generate hundreds of thousands of reads when sequencing the whole genome, and can obtain the complete sequence of the genome by splicing the reads.

And (3) sequence alignment analysis: the short sequences (reads) sequenced by NGS are stored in FASTQ files, and although they are originally from an ordered genome, the tandem relationship between different reads in the files is completely lost after DNA banking and sequencing. Therefore, there is no positional relationship between the two reads immediately following the FASTQ file, and they are both random from the short sequence at a position in the original genome. Therefore, we need to sort out the large stack of short sequences, compare them with the reference genome of the species one by one, find the position of each read on the reference genome, and then arrange them in order, which is called alignment of sequencing data.

And (3) comparison algorithm: the computational methods for sequence alignment are generally divided into two categories: global alignments (global alignments) and local alignments (local alignments). Computing a global route is a form of global optimization that forces all query sequences to align in terms of the entire length. In contrast, local alignments only determine local similarity while the entire long sequence is often quite different. Local alignments tend to be desirable, but may be more difficult to compute because of challenges from determining other similar regions. Various computational algorithms have been applied to the sequence alignment problem, including slow but regular optimization methods like dynamic programming, efficient but not thorough heuristic algorithms, or probabilistic methods of large database search design.

Smith-waterman is an algorithm for performing local sequence alignments (as opposed to global alignments) to find regions of similarity between two nucleotide or protein sequences. The objective of this algorithm is not to align the complete sequences, but to find the fragments with high similarity in both sequences.

Hash table, also called hash algorithm, hash function, is a method to create a small digital "fingerprint" from any kind of data. The hash function compresses a message or data into a digest so that the amount of data becomes small, fixing the format of the data. This function mixes the data in a hash, recreating a fingerprint called a hash value (hash sums, or hashes). The hash value is typically represented by a short string of random letters and numbers. Good hash functions rarely have hash collisions in the input domain. In hash tables and data processing, data is distinguished without suppressing conflicts, making database records more difficult to find.

Drawings

FIG. 1 is a flow chart of a data analysis method for detecting genomic structural variation according to the present invention;

FIG. 2 is a detailed flowchart of step S3 according to the present invention;

FIG. 3 is a schematic diagram of a data analysis system according to the present invention;

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The term "comprising", without further limitation, means that the element so defined is not excluded from the group consisting of additional identical elements in the process, method, article, or apparatus that comprises the element.

In order to solve the above problems, the present invention provides a data analysis system and an analysis method for detecting genomic structural variation.

Referring to fig. 1, a flow chart of a data analysis method for detecting genomic structural variation according to the present invention is shown, and the data analysis method mainly includes four main processes: establishing a distributed storage database, and segmenting and storing the structural data of the original genome sequencing; sequence alignment of sequencing data; probability calculation for genotyping and classification scoring.

Firstly, in the process of establishing a distributed storage database and segmenting and storing the structural data of original genome sequencing, the method specifically comprises the following steps:

acquiring structural data of genome sequencing in an original FASTQ file;

based on a characteristic extraction and segmentation principle, segmenting the structural data into a plurality of subdata sequences;

it should be noted that the segmentation and calculation of genome structure data must ensure computational accuracy in a biological sense while regulating system load balancing. For example, before mutation detection, data segmentation is effectively segmented according to biological bases such as chromosome regions and the like, the balance degree of calculation load is adjusted, and the overlapping distance is reasonably set, so that information loss cannot be caused by data segmentation. Data transmission efficiency is improved by performing reasonable data slicing in a high-throughput computing environment.

Specifically, a MapReduce frame is used for correspondingly segmenting structural data, a feature extraction (FX) segmentation method is adopted when the structural data is segmented, the method divides double-chain data of a sequencing sequence according to the length set by Hadoop by rewriting a RecordReader abstract class, sends the double-chain data to different maps for processing, uses the ID of reads of the segmented sequence as an index Key in each Map, uses a chain label, a base sequence and quality as values of Value, assembles the sequences with the same index Key into a sub-data sequence by using the sort function of Reduce after segmentation, calculates the length of each sub-data sequence, and stores the sub-data sequences with different index keys into a plurality of storage servers in a distributed database based on the length of the sub-data sequence.

Because the length of each sub data sequence has a certain difference, when the data is stored, different sub data sequences are distributed and stored to different storage servers according to the principle of average distribution as much as possible, so as to ensure load balance.

In a preferred embodiment, the total length of the plurality of sub data sequences allocated by each storage server is the same.

And judging which subdata sequence of the index Key belongs to the newly added sequence according to the Value of the chain label, the base sequence or the quality, adding the corresponding subdata sequence, and storing the subdata sequence in a storage server in a distributed database.

And finishing the building and storing process of the distributed database after all the subdata sequences are written into the distributed database.

In the preferred embodiment, in the above data structure partitioning and storing manner, if a certain specific node in the structure data has more associated nodes, the minimum cutting principle can be preferentially adopted, and the relationship between the specific node and the associated nodes is not cut, so that the specific node and the associated nodes and the relationship thereof are all divided into the same storage server, thereby ensuring that the data in one server node has relatively strong internal relevance; and has good expansibility and throughput capability.

Next, a sequence alignment of the sequencing data is required, specifically including:

and sending the well-divided sequencing structure data stored on the distributed database to each slave machine of the cluster for processing. In a preferred embodiment, the sequencing structure data is subjected to quality filtering before sequence alignment, reads with lower quality are filtered out so as to remove interference data in the sequencing structure data, and then the filtered sequences are aligned by using an alignment tool.

And correspondingly dividing the reference sequences according to the chromosome set and respectively establishing indexes. When bowtie2 is used for alignment, the sequencing structure sequence is respectively aligned with the reference sequence of each chromosome group to obtain the alignment result of the sequencing structure sequence and each chromosome reference sequence, and the alignment result is stored in the corresponding result list of the distributed database.

The sequence alignment process can obtain a series of statistics including the total number of aligned reads, the number of mismatches, the gap size, etc., which can be used to assess the quality of the data and calibrate its correctness.

In the sequence comparison process, a Smith-Waterman algorithm is used for scoring the sub-data sequences, and the scoring results in batches are integrated to obtain a final comparison result so as to realize the fine comparison of the sequencing structure sequence and the reference genome;

the score was calculated using the following Smith-Waterman formula:

M=R*C

R=a*L2+b

wherein M represents the score calculated by Smith-Waterman, R is the length of the reference sequence, C represents the length of each subdata sequence, L represents the length of the data stored by each storage server, and a and b represent constants.

Then, the similarity between the sequencing structure sequence and the reference sequence is evaluated, and the probability of possible base type is obtained.

In particular, with reference to fig. 2, implementation details are set forth:

step S31, defining active regions, and identifying at least one active region according to the sequence alignment data of the sequencing data;

after alignment with the reference genome, the active regions with high probability of variation are identified by traversing the sequencing structure data and transmitted to the next step. In the expected noise level range, the region without variation will be skipped directly in the subsequent step to speed up the analysis.

In step S320, by applying the De-Bruijn graph and the hash table, an assembly graph can be constructed for each active region using the reference sequence as a template. And then, reading each section of Reads in sequence to match with one section of the chart, generating all possible paths by dynamically generating nodes, and screening paths with good support to represent possible haplotypes.

After obtaining the possible haplotypes of an organism, we need to make an assessment of the reliability of their different haplotypes, step S330. In haplotype detection, the Pair HMM algorithm (hidden Markov matching algorithm) is usually used to realign the sequences, and the probability of each haplotype is obtained by combining the data quality information.

Step S340, calculating the probability of the base type by adopting Bayes theorem;

specifically, based on the bayesian total probability formula, in the case of observing data D, the probability of genotype G can be obtained as follows:

wherein, for a common diploid, i is equal to 1, 2, 3, i.e. there is

Gi=H1H2∈{Aa,AA,aa};

P (G) represents the probability of the base type, which is the probability that we calculate by studying the error rate of the reference gene, the probability of the SNP being homozygote, the probability of the SNP being heterozygote, and the ratio of the probability of transition to transversion.

P (D/G) is the probability that the observed data D will appear given the genotype G, and is the haploid likelihood value P (D) determined in step 3j/Hn) Substituting the following equation yields:

thus we have the probability of each possible base type for each site given the observations.

Finally, classification scoring is performed.

The probability of each base type is a number that represents a particular attribute of a candidate mutation site. The probability sequence values are provided directly to the deep convolutional neural network model.

As shown, the deep convolutional neural network model structure includes a characterization layer of the input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully connected layer for feature integration, and a finally connected classifier.

Inputting the probability sequence into a characterization layer, and extracting an upper convolution layer and a lower convolution layer of the features to give a plurality of vector sequences; the full-connection layer combines a plurality of vector sequences given by the upper convolution layer and the lower convolution layer, namely coding or constructing;

the classification layer generates a classification score for the likelihood of whether each candidate variant at the target base position is a true variant or a false variant based on the fully-connected layer encoding or constructed structure.

In other embodiments, the classification layer can generate a classification score for the likelihood that each candidate variant of the target base position is a homozygous variant, a heterozygous variant, a non-variant, or a complex variant.

The deep convolutional neural network model requires training of the classifier according to a larger training set, so that its performance can be further improved.

As shown in fig. 3, the data analysis system for implementing the data analysis method of the present invention includes:

the structure data acquisition unit is used for acquiring the structure data of genome sequencing in the original FASTQ file;

the dividing and storing unit is electrically connected with the structural data acquiring unit and is used for dividing the structural data into a plurality of subdata sequences and storing the subdata sequences into a hierarchical distributed database;

the sequence comparison unit is electrically connected with the distributed database and is used for filtering the quality of a plurality of subdata sequences stored in the distributed database and comparing the filtered subdata sequences;

in a preferred embodiment, the sequence alignment unit further comprises a calibration subunit, configured to evaluate the alignment quality of the subdata sequences and calibrate the correctness thereof;

the probability evaluation unit is electrically connected with the sequence comparison unit and is used for evaluating the similarity between the subdata sequence and the reference sequence so as to obtain the genotyping probability;

a classification scoring unit, electrically connected to the probability evaluation unit, for generating a variant likelihood classification score for the target base position.

The classification scoring unit outputs classification scores of the variant possibilities by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for integrating the features and a classifier connected finally.

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.

12页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:一种基于SNP芯片的阈性状基因组育种值估计方法及应用

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!