Bioinformatic analysis method for ATAC-seq sequencing data

Document No.: 1478004 Publication date: 2020-02-25

Abstract: This technology, "Bioinformatic analysis method for ATAC-seq sequencing data" (一种ATAC-seq测序数据的生物信息分析方法), was devised by 夏昊强, 周煌凯, 高川, 陶勇, 罗玥, 程祖福, 邢燕 and 曾川川 on 2019-11-05. The invention provides a bioinformatic analysis method for ATAC-seq sequencing data, comprising the following steps: performing analysis and quality control on the ATAC-seq sequencing data; aligning the quality-controlled data to a reference genome and analyzing the alignments; detecting peaks in each single sample and compiling statistics on them; extracting and analyzing the peaks that are consistent within each group; merging the peaks of all treatment groups and performing multi-sample cluster analysis; analyzing the peaks shared by and specific to the treatment groups; and analyzing differences in peak abundance between groups. Built around conventional ATAC-seq sequencing data, the invention establishes an analysis workflow that is rich in analysis content and can accommodate customized requirements. The results reveal the sequence information of open chromatin regions across the whole genome and can help researchers identify or predict the transcription factors involved in changes in genome accessibility.

1. A bioinformatic analysis method for ATAC-seq sequencing data, characterized in that the method comprises the following steps:

S1, performing analysis and quality control on the ATAC-seq sequencing data;

S2, aligning the analyzed and quality-controlled data to a reference genome and analyzing the alignments;

S3, detecting single-sample peaks and compiling statistics on them;

S4, extracting and analyzing intra-group consistent peaks;

S5, merging the peaks of all treatment groups and performing multi-sample cluster analysis;

S6, analyzing the peaks shared by and specific to the treatment groups;

S7, analyzing differences in peak abundance between groups.

2. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the analysis and quality control of the ATAC-seq sequencing data in step S1 comprises: filtering the raw off-machine data to remove reads containing adapter sequence, reads in which the proportion of N is more than 10%, and reads in which bases with a quality value Q less than or equal to 10 account for more than 40% of the whole read.

3. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the alignment analysis in step S2 specifically comprises: aligning the data obtained in step S1 to the reference genome with the alignment software Bowtie2 and compiling alignment statistics; filtering out reads aligned to mitochondria or chloroplasts; and, once the alignment quality is confirmed to be acceptable, using the reads aligned to a unique position on the genome for subsequent bioinformatic analysis, which comprises: cumulative distribution of genome sequencing depth, distribution of reads relative to TSS positions, distribution of reads on chromosomes, and insert-fragment analysis.

4. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the detection and statistics of single-sample peaks in step S3 specifically comprise: scanning for peaks across the whole genome and compiling single-sample peak statistics, which comprise: the length distribution of single-sample peaks, the depth distribution of single-sample peaks, the fold-enrichment distribution of single-sample peaks, the significance distribution of single-sample peaks, the distribution of single-sample peaks over gene functional elements, and the distribution of single-sample peaks on chromosomes.

5. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the extraction and analysis of intra-group consistent peaks in step S4 specifically comprise: performing IDR analysis to obtain the consistent peaks of each treatment group, and then performing analysis of the genes associated with the intra-group peaks and TF motif analysis of the peaks shared within the group, wherein the analysis of intra-group peak-associated genes comprises: the distribution of the intra-group shared peaks over gene functional elements, analysis of the genes associated with the intra-group shared peaks, GO enrichment analysis and KO enrichment analysis of those genes, and TF motif analysis of the intra-group shared peaks; and the TF motif analysis of the intra-group shared peaks comprises de novo prediction of TF motifs and enrichment analysis of known TF motifs.

6. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the merging of the treatment-group peaks and the multi-sample cluster analysis in step S5 specifically comprise: merging the peaks between groups with the DiffBind software to obtain the union of peaks across treatment groups, and calculating the abundance of each peak in each sample; then performing principal component analysis, that is, reducing the high-dimensional information contained in the samples to composite indices in a few dimensions so that the samples can be compared; and finally performing cluster analysis, that is, calculating the Pearson correlation coefficient between each pair of samples and displaying the inter-sample correlation coefficients as a heat map.

7. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the analysis of shared and specific peaks among the treatment groups in step S6 specifically comprises: first, obtaining the specific and shared peaks among the different comparison groups by Venn diagram analysis; then, for the peaks shared between or specific to groups, analyzing the peak-associated genes and performing GO and KO enrichment analysis on them; and finally, for the peaks specific to a given treatment group, performing TF motif analysis, comprising de novo prediction of TF motifs and enrichment analysis of known TF motifs.

8. The method for bioinformatic analysis of ATAC-seq sequencing data according to claim 1, wherein the analysis of differences in peak abundance between groups in step S7 specifically comprises: using the DiffBind software to compile differential peak statistics and to draw a differential peak statistics chart and a differential comparison volcano plot, and then performing differential peak gene analysis and differential peak-related TF motif analysis, wherein the differential peak gene analysis comprises: extracting the genes associated with the differential peaks and performing GO enrichment analysis and KO enrichment analysis on them; and the differential peak-related TF motif analysis comprises de novo prediction of TF motifs and enrichment analysis of known TF motifs.

Technical Field

The invention relates to the field of biotechnology, in particular to a bioinformatic analysis method for ATAC-seq sequencing data, which explores and develops an analysis workflow for conventional ATAC-seq sequencing data.

Background

Chromatin is the carrier of genetic material. Eukaryotic nuclear DNA is not naked; it is bound to histones to form nucleosomes, the basic structural unit of the chromosome, and nucleosomes are progressively compacted and folded to form the higher-order structure of the chromosome (for example, a fully extended human DNA strand is about 2 m long, yet after folding it forms chromatin structures on the nanometer to micrometer scale that fit inside the small cell nucleus). Replication and transcription of DNA require this compact structure to be opened so that regulatory factors, such as transcription factors, can bind. Such partially opened chromatin is called open chromatin, and the property of open chromatin that allows regulatory factors to bind is called chromatin accessibility. Chromatin accessibility is therefore closely related to transcriptional regulation.

The main methods for studying open chromatin are ATAC-seq and the older DNase-seq and FAIRE-seq. ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is a method developed in 2013 by the laboratories of William J. Greenleaf and Howard Y. Chang at Stanford University. It exploits the tendency of the Tn5 transposase to bind open chromatin readily and then sequences the DNA captured by the Tn5 transposase. ATAC-seq has become the method of choice for studying open chromatin.

ATAC-seq can measure the degree of chromatin openness genome-wide and identify potential protein-binding sites across the genome. It is widely used for transcription factor binding analysis, nucleosome positioning, mapping of active regulatory elements and similar tasks, and has broad application prospects in research on epigenetic mechanisms.

At present there is no established standard for analyzing the data produced by conventional ATAC-seq sequencing. Research on open chromatin with ATAC-seq therefore urgently needs an analysis method that can be run in a standardized way and can also meet customized requirements.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a bioinformatic analysis method for ATAC-seq sequencing data.

To achieve this aim, the invention adopts the following technical solution:

A bioinformatic analysis method for ATAC-seq sequencing data, comprising the following steps:

S1, performing analysis and quality control on the ATAC-seq sequencing data;

S2, aligning the analyzed and quality-controlled data to a reference genome and analyzing the alignments;

S3, detecting single-sample enriched regions (peaks) and compiling statistics on them;

S4, extracting and analyzing intra-group consistent peaks;

S5, merging the peaks of all treatment groups and performing multi-sample cluster analysis;

S6, analyzing the peaks shared by and specific to the treatment groups;

S7, analyzing differences in peak abundance between groups.

Further, the analysis and quality control of the ATAC-seq sequencing data in step S1 comprises: filtering the raw off-machine data to remove reads containing sequencing adapter sequence, reads in which the proportion of N is more than 10%, and reads in which bases with a quality value Q less than or equal to 10 account for more than 40% of the whole sequencing fragment (read).
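The three filtering rules of step S1 can be sketched as follows. This is a minimal illustration, not the filtering tool used by the invention, which the text does not name. The function names keep_read and phred33, the exact-substring adapter match, and the Phred+33 quality encoding are all assumptions made for the example.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def keep_read(seq, quals, adapter, max_n_frac=0.10, low_q=10, max_low_q_frac=0.40):
    """Return True if a read passes the three S1 filters.

    seq     -- read sequence
    quals   -- per-base quality values (integers), same length as seq
    adapter -- adapter sequence; matched as an exact substring (assumption)
    """
    quals = list(quals)
    if not seq or len(seq) != len(quals):
        return False
    # Rule 1: discard reads that contain the adapter sequence.
    if adapter and adapter in seq:
        return False
    # Rule 2: discard reads whose proportion of N is more than 10%.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # Rule 3: discard reads in which bases with Q <= 10 make up more than 40%.
    low_quality_bases = sum(1 for q in quals if q <= low_q)
    if low_quality_bases / len(quals) > max_low_q_frac:
        return False
    return True
```

A real adapter trimmer would normally tolerate mismatches and partial adapter matches at the read end; the exact match here only keeps the sketch short.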

Further, the alignment analysis in step S2 specifically comprises: aligning the data obtained in step S1 to the reference genome with the alignment software Bowtie2 and compiling alignment statistics; filtering out reads aligned to mitochondria or chloroplasts; and, once the alignment quality is confirmed to be acceptable, using the reads aligned to a unique position on the genome for subsequent bioinformatic analysis, which comprises: cumulative distribution of genome sequencing depth, distribution of reads relative to TSS positions, distribution of reads on chromosomes, and insert-fragment analysis.
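The post-alignment filtering and the insert-fragment analysis of step S2 can be sketched on SAM text records as follows. This is an illustrative sketch under stated assumptions: the organelle sequence names (chrM, MT, chrC, Pt) depend on the reference FASTA and are placeholders, and a MAPQ cutoff of 30 is used here as a stand-in for "aligned to a unique position", since the text does not say how uniqueness is decided. The function names are hypothetical.

```python
from collections import Counter

# Assumed organelle sequence names; check the names used in the reference genome.
ORGANELLE_NAMES = {"chrM", "MT", "chrC", "Pt"}


def filter_sam_lines(lines, organelles=ORGANELLE_NAMES, min_mapq=30):
    """Yield SAM records (as field lists) that are mapped, non-organellar,
    and have MAPQ >= min_mapq (used here as a proxy for unique alignment)."""
    for line in lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # a SAM record has 11 mandatory fields
            continue
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4:                    # bit 0x4: read is unmapped
            continue
        if rname in organelles:           # mitochondrial or chloroplast alignment
            continue
        if mapq < min_mapq:
            continue
        yield fields


def insert_size_histogram(records):
    """Count template lengths (SAM column 9, TLEN), once per read pair.

    Only positive TLEN values are counted, so each pair contributes one
    observation (the leftmost mate)."""
    hist = Counter()
    for fields in records:
        tlen = int(fields[8])
        if tlen > 0:
            hist[tlen] += 1
    return hist
```

In practice the same filtering is usually applied to BAM files with a dedicated toolkit; plain-text parsing is used here only so that the sketch has no external dependencies.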

Further, the detection and statistics of single-sample enriched regions (peaks) in step S3 specifically comprise: scanning for peaks across the whole genome and compiling single-sample peak statistics, which comprise: the length distribution of single-sample peaks, the depth distribution of single-sample peaks, the fold-enrichment distribution of single-sample peaks, the significance distribution of single-sample peaks, the distribution of single-sample peaks over gene functional elements, and the distribution of single-sample peaks on chromosomes.
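Two of the single-sample statistics listed for step S3, the peak length distribution and the distribution of peaks on chromosomes, can be sketched as follows. The sketch assumes that peaks are already available as (chromosome, start, end) tuples in half-open coordinates and that the list is not empty. The peak-calling software itself is not named in this section and is outside the sketch. The function name and the bin width of 100 bp are assumptions.

```python
from collections import Counter
from statistics import median


def peak_length_stats(peaks, bin_width=100):
    """Summarize peak lengths and per-chromosome peak counts.

    peaks -- non-empty list of (chrom, start, end) tuples, end exclusive
    """
    lengths = [end - start for _, start, end in peaks]
    # Histogram of lengths, keyed by the lower edge of each bin.
    hist = Counter((length // bin_width) * bin_width for length in lengths)
    per_chrom = Counter(chrom for chrom, _, _ in peaks)
    return {
        "n_peaks": len(lengths),
        "median_length": median(lengths),
        "length_hist": dict(sorted(hist.items())),
        "per_chrom": dict(per_chrom),
    }
```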

Further, the extraction and analysis of intra-group consistent peaks in step S4 specifically comprise: performing IDR (irreproducible discovery rate) analysis to obtain the consistent peaks of each treatment group, and then performing analysis of the genes associated with the intra-group peaks and TF motif analysis of the peaks shared within the group, wherein the analysis of intra-group peak-associated genes comprises: the distribution of the intra-group shared peaks over gene functional elements, analysis of the genes associated with the intra-group shared peaks, GO enrichment analysis and KO enrichment analysis of those genes, and TF motif analysis of the intra-group shared peaks; and the TF motif analysis of the intra-group shared peaks comprises: de novo prediction of TF motifs and enrichment analysis of known TF motifs.
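The GO and KO enrichment analysis of peak-associated genes can be illustrated with a one-sided hypergeometric test. The text does not state which statistical test the invention uses; the hypergeometric test is assumed here because it is a common choice for this kind of enrichment analysis. The function name and the symbols N, K, n and k are introduced only for this example.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """Return P(X >= k) for a hypergeometric variable X.

    N -- number of background genes
    K -- background genes annotated with the term (GO term or KO pathway)
    n -- number of peak-associated genes
    k -- peak-associated genes annotated with the term
    """
    upper = min(K, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible combinations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)
```

Because many terms are tested at once, the resulting p-values would normally be corrected for multiple testing before any term is reported as enriched.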

Further, the merging of the treatment-group peaks and the multi-sample cluster analysis in step S5 specifically comprise: merging the peaks between groups with the DiffBind software to obtain the union of peaks across treatment groups, and calculating the abundance of each peak in each sample; then performing principal component analysis (PCA), that is, reducing the high-dimensional information contained in the samples to composite indices in a few dimensions so that the samples can be compared; and finally performing cluster analysis, that is, calculating the Pearson correlation coefficient between each pair of samples and displaying the correlation between samples as a heat map.
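The pairwise Pearson correlation underlying the heat map of step S5 can be sketched as follows. The input is assumed to be a mapping from sample name to a list of peak abundances over the same merged peak set, in the same order. In the invention this abundance table comes from DiffBind; the sketch does not reproduce that software, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(abundance):
    """Return {sample_a: {sample_b: r}} for every pair of samples.

    abundance -- dict mapping sample name to a list of per-peak abundances
    """
    names = list(abundance)
    return {a: {b: pearson(abundance[a], abundance[b]) for b in names}
            for a in names}
```

The nested dictionary can be passed to any plotting library to draw the heat map; replicates of the same treatment group are expected to show higher coefficients with each other than with samples from other groups.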

Further, the analysis of shared and specific peaks among the treatment groups in step S6 specifically comprises: first, obtaining the specific and shared peaks among the different comparison groups by Venn diagram analysis; then, for the peaks shared between or specific to groups, analyzing the peak-associated genes and performing GO and KO enrichment analysis on them; and finally, for the peaks specific to a given treatment group, performing transcription factor (TF) motif analysis, comprising: de novo prediction of TF motifs and enrichment analysis of known TF motifs.
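The split into shared and group-specific peaks that feeds the Venn diagram of step S6 can be sketched for two groups as follows. The text does not define when two peaks count as the same; the sketch assumes that any overlap of at least 1 bp on the same chromosome makes a peak shared, with peaks given as (chromosome, start, end) tuples in half-open coordinates. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_shared_specific(group_a, group_b):
    """Classify the peaks of two groups.

    Returns (shared_a, only_a, only_b):
      shared_a -- peaks of group A that overlap some peak of group B
      only_a   -- peaks of group A with no overlap in group B
      only_b   -- peaks of group B with no overlap in group A
    """
    shared_a, only_a = [], []
    for p in group_a:
        if any(overlaps(p, q) for q in group_b):
            shared_a.append(p)
        else:
            only_a.append(p)
    only_b = [q for q in group_b if not any(overlaps(q, p) for p in group_a)]
    return shared_a, only_a, only_b
```

The all-against-all comparison is quadratic in the number of peaks and is meant only to show the logic; for genome-scale peak sets a sorted sweep or an interval tree would be used instead.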

Further, the analysis of differences in peak abundance between groups in step S7 specifically comprises: using the DiffBind software to compile differential peak statistics and to draw a differential peak statistics chart and a differential comparison volcano plot, and then performing differential peak gene analysis and differential peak-related TF motif analysis, wherein the differential peak gene analysis comprises: extracting the genes associated with the differential peaks and performing GO enrichment analysis and KO enrichment analysis on them; and the differential peak-related TF motif analysis comprises de novo prediction of TF motifs and enrichment analysis of known TF motifs.
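The data behind a differential comparison volcano plot can be sketched as follows. The input is assumed to be a list of (peak identifier, log2 fold change, FDR) records such as a differential analysis would report. The cutoffs of |log2 fold change| >= 1 and FDR < 0.05 are placeholder assumptions, since the text does not give thresholds, and the function names are hypothetical.

```python
from math import log10


def classify_peak(log2fc, fdr, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Label a peak 'up', 'down' or 'ns' (not significant).

    The default cutoffs are assumptions, not values from the invention."""
    if fdr < fdr_cutoff and log2fc >= fc_cutoff:
        return "up"
    if fdr < fdr_cutoff and log2fc <= -fc_cutoff:
        return "down"
    return "ns"


def volcano_points(results):
    """Convert (peak_id, log2fc, fdr) records into plot-ready tuples.

    Returns a list of (peak_id, x, y, label), where x is the log2 fold change
    and y is -log10(FDR). FDR is floored at 1e-300 to avoid log10(0)."""
    return [
        (peak_id, log2fc, -log10(max(fdr, 1e-300)), classify_peak(log2fc, fdr))
        for peak_id, log2fc, fdr in results
    ]
```

Counting the labels gives the numbers for the differential peak statistics chart, and plotting x against y colored by label gives the volcano plot.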

The beneficial effects of the invention are as follows. Built around conventional ATAC-seq sequencing data, the invention establishes an analysis workflow that is rich in analysis content and can accommodate customized requirements. The results reveal the sequence information of open chromatin regions across the whole genome and can help researchers identify or predict the transcription factors involved in changes in genome accessibility. The workflow is clearly ordered and logically coherent. The results are presented as a web-based final report with a clear hierarchy and hyperlinked help notes, which gives the reader a deeper understanding of the analysis content and operations. In addition, the complete statistical data behind the displayed results can be viewed from the report.

Drawings

FIG. 1 is a flow chart of the bioinformatic analysis method for ATAC-seq sequencing data according to the present invention.

FIG. 2 is a frequency distribution chart of sample filtering in Example 1 of the present invention.

FIG. 3 is a statistical chart of the cumulative distribution of genome sequencing depth for sample LA-1 in Example 1 of the present invention.

FIG. 4 is a distribution plot of the positions of sample LA-1 reads relative to the TSS in Example 1 of the present invention.

FIG. 5 is a plot of the signal distribution around the TSS for sample LA-1 in Example 1 of the present invention.

FIG. 6 is a distribution plot of insert fragments in Example 1 of the present invention.

FIG. 7 is a peak length distribution plot for sample LA-1 in Example 1 of the present invention.

FIG. 8 is a peak depth distribution plot for sample LA-1 in Example 1 of the present invention.

FIG. 9 is a peak fold-enrichment distribution plot for sample LA-1 in Example 1 of the present invention.

FIG. 10 is a peak significance distribution plot for sample LA-1 in Example 1 of the present invention.

FIG. 11 is a pie chart of the distribution of sample LA-1 peaks over gene functional elements in Example 1 of the present invention.

FIG. 12 is a chromosome distribution map of sample LA-1 peaks in Example 1 of the present invention.

FIG. 13 is a distribution plot of the consensus peaks of sample LA over gene functional elements in Example 1 of the present invention.

FIG. 14 is a histogram of the GO enrichment classification of peak-associated genes of sample LA in Example 1 of the present invention.

FIG. 15 is a bubble chart of the KO enrichment analysis of peak-associated genes of sample LA in Example 1 of the present invention.

FIG. 16 is a sequence diagram of the significant motifs in sample LA predicted by the MEME software in Example 1 of the present invention.

FIG. 17 is a bubble chart of motif enrichment for each sample in Example 1 of the present invention.

FIG. 18 is a heat map of the correlation analysis between samples in Example 1 of the present invention.

FIG. 19 is a Venn diagram of the peaks of sample groups NC and LA in Example 1 of the present invention.

FIG. 20 is a comparison plot of the differential peaks between NC and LA in Example 1 of the present invention.

FIG. 21 is a bar chart of the GO enrichment classification of genes associated with the differential peaks between NC and LA in Example 1 of the present invention.

FIG. 22 is a histogram of the KO enrichment of genes associated with the differential peaks between NC and LA in Example 1 of the present invention.

Detailed Description

The embodiments of the invention are described below through specific examples in conjunction with the accompanying drawings; those skilled in the art can readily appreciate other advantages and effects of the invention from the content disclosed here. The invention can also be implemented or applied through other, different embodiments, and the details herein can be modified in various respects without departing from the spirit and scope of the invention.
