Processing method of Pacbio third-generation sequencing data

Document No.: 1339747 Publication date: 2020-07-17

Reading note: This technology, Processing method of Pacbio third-generation sequencing data (Pacbio三代测序数据的处理方法), was designed by 田仕林, 王雪涵, and 曹丽蓉 on 2020-04-23. Main content: The invention discloses a method for processing Pacbio third-generation sequencing data. In the method, a program processes sequencing data from a Pacbio third-generation sequencing system. The input of the program comprises the sequencing data, and the program calls at least one tool to process the sequencing data. Before calling the at least one tool, the program configures the data to be input to that tool into a format matching the tool, and then passes the format-matched data to the tool as input. Applying the technical scheme of the invention yields a complete automated workflow for analyzing Pacbio third-generation sequencing data, so that Pacbio genome re-sequencing data can be analyzed conveniently and quickly. This solves the technical problem in the prior art that manually organizing third-generation sequencing result files is time-consuming and inefficient.

1. A method for processing Pacbio third-generation sequencing data, characterized by comprising the following steps:

processing sequencing data from a Pacbio third-generation sequencing system by a program, wherein the input of the program comprises the sequencing data and the program invokes at least one tool to process the sequencing data; wherein, before invoking the at least one tool, the program configures the data to be input to the at least one tool into a format that matches the tool; and the program inputs the format-matched data to the at least one tool as input data.

2. The processing method according to claim 1, wherein the at least one tool invoked by the program comprises at least one of:

a tool for reading the lengths of reads, a tool for alignment with a reference genome, a tool for processing the alignment results, a tool for detecting structural variation, a tool for detecting copy number variation, a tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and a tool for counting the number of each variation type.

3. The processing method according to claim 2, wherein the program obtains the results output by the at least one tool after invoking it and generates a report based on the results.

4. The processing method according to claim 2, characterized in that the input of the program comprises at least the data to be processed and the tool to be called, wherein the program calls the tool to be called to process the data to be processed.

5. The processing method according to claim 2, wherein the program invokes, in sequence, the tool for alignment with a reference genome, the tool for processing the alignment results, the tool for detecting structural variation, the tool for detecting copy number variation, the tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and the tool for counting the number of each variation type; and wherein, before invoking the next tool, the program configures the output of the previous tool into a format that matches the next tool and inputs it to the next tool.

6. The processing method according to claim 2, wherein the program saves the result files generated by each invoked tool to a directory.

7. The processing method according to any one of claims 1 to 6, characterized in that, after the program is run, a script for submitting tasks is generated and submitted to the SGE task system.

8. The processing method according to claim 2, characterized in that:

the tool for reading the lengths of reads is samtools; and/or

the tools for alignment with the reference genome are ngmlr and pbsmrtpipe; and/or

the tools for processing the alignment results are samtools and pbsmrtpipe; and/or

the tool for detecting structural variation is sniffles; and/or

the tool for detecting copy number variation is Control-FREEC; and/or

the tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample is pbsmrtpipe; and/or

the tool for counting the number of each variation type is annovar.

9. The processing method according to any one of claims 1 to 6, wherein the program is written in the perl language and the shell language.

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a method for processing Pacbio third-generation sequencing data.

Background

The genome is all of the genetic information in the cells of an organism and is stored in the form of nucleotides. High-throughput sequencing of the genome enables further study of an organism's genetic information and plays an important role in deciphering the relationship between genes and traits. At present there are many kinds of high-throughput sequencing technology. Among them, Pacbio sequencing, which is based on single-molecule real-time (SMRT) sequencing technology, offers long read lengths and high throughput and can ensure uniform coverage. As the volume of sequencing data grows, so does the need to analyze it efficiently.

Methods for automatically analyzing second-generation sequencing data already exist, but the analysis of third-generation data still requires each step to be organized and run manually. The analysis steps for Pacbio genome re-sequencing data are relatively fixed, yet the result files of each step must be organized by hand before they can be passed to the next step, which is time-consuming and inefficient.

Disclosure of Invention

The invention aims to provide a method for processing Pacbio third-generation sequencing data, so as to solve the technical problem in the prior art that manually organizing third-generation sequencing result files is time-consuming and inefficient.

To achieve the above object, according to one aspect of the present invention, a method for processing Pacbio third-generation sequencing data is provided. The processing method comprises the following steps: processing sequencing data from a Pacbio third-generation sequencing system by a program, wherein the input of the program comprises the sequencing data, the program calls at least one tool to process the sequencing data, and, before calling the at least one tool, the program configures the data to be input to the at least one tool into a format matching the tool; the program inputs the format-matched data to the at least one tool as input data.

Further, the at least one tool called by the program includes at least one of: a tool for reading the lengths of reads, a tool for alignment with a reference genome, a tool for processing the alignment results, a tool for detecting structural variation, a tool for detecting copy number variation, a tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and a tool for counting the number of each variation type.

Further, the program obtains a result output after calling at least one tool for processing, and generates a report according to the result.

Further, the input of the program includes at least the data to be processed and the tool to be called, and the program calls the tool to be called to process the data to be processed.

Further, the program calls, in sequence, a tool for alignment with a reference genome, a tool for processing the alignment results, a tool for detecting structural variation, a tool for detecting copy number variation, a tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and a tool for counting the number of each variation type; before calling the next tool, the program configures the output of the previous tool into a format matching the next tool and inputs it to the next tool.

Further, the program saves the result file generated by each tool called to a directory.

Further, after the program is run, a script for submitting tasks is generated and submitted to the SGE task system.

Further, the tool for reading the lengths of reads is samtools; and/or the tools for alignment with the reference genome are ngmlr and pbsmrtpipe; and/or the tools for processing the alignment results are samtools and pbsmrtpipe; and/or the tool for detecting structural variation is sniffles; and/or the tool for detecting copy number variation is Control-FREEC; and/or the tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample is pbsmrtpipe; and/or the tool for counting the number of each variation type is annovar.

Further, the program is written in the perl language and the shell language.

By applying the technical scheme of the invention, a whole set of automatic Pacbio third-generation sequencing data analysis process is built, so that the analysis of the Pacbio genome re-sequencing data is conveniently and quickly completed, and the technical problems of long time consumption and low efficiency in manual arrangement of third-generation sequencing result files in the prior art are solved.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

To address the technical problem in the prior art that manually organizing third-generation sequencing result files is time-consuming and inefficient, the invention provides the following technical scheme. It integrates the analysis workflow for third-generation Pacbio genome re-sequencing data so that, once the corresponding parameters are configured, every step of the analysis of the raw off-machine data runs automatically. Preferably, an analysis report is also generated. This saves the labor and time of the analysis process and improves analysis efficiency.

According to an exemplary embodiment of the present invention, a method for processing Pacbio third-generation sequencing data is provided. The processing method comprises the following steps: processing sequencing data from a Pacbio third-generation sequencing system by a program, wherein the input of the program comprises the sequencing data, the program calls at least one tool to process the sequencing data, and, before calling the at least one tool, the program configures the data to be input to the at least one tool into a format matching the tool; the program inputs the format-matched data to the at least one tool as input data.

In an exemplary embodiment of the invention, the at least one tool called by the program includes at least one of: a tool for reading the lengths of reads, a tool for alignment with a reference genome, a tool for processing the alignment results, a tool for detecting structural variation, a tool for detecting copy number variation, a tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and a tool for counting the number of each variation type. Using these tools to analyze the genetic variation in Pacbio third-generation sequencing data improves analysis efficiency.
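The format-matching step described above can be sketched as a small converter registry. This is a minimal illustration under stated assumptions, not the patented program (which the text says is written in perl and shell): the names `CONVERTERS`, `register`, and `prepare_input` are hypothetical, and the only conversion shown is FASTQ to FASTA for simple four-line records.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: (current format, required format) -> converter function.
CONVERTERS: Dict[Tuple[str, str], Callable[[str], str]] = {}


def register(src: str, dst: str):
    """Decorator that records a converter from format src to format dst."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        CONVERTERS[(src, dst)] = fn
        return fn
    return wrap


@register("fastq", "fasta")
def fastq_to_fasta(text: str) -> str:
    """Convert four-line FASTQ records to FASTA (assumes unwrapped records)."""
    lines = text.strip().splitlines()
    out = []
    for i in range(0, len(lines), 4):
        out.append(">" + lines[i][1:])  # '@name' header becomes '>name'
        out.append(lines[i + 1])        # sequence line; quality lines are dropped
    return "\n".join(out) + "\n"


def prepare_input(data: str, have: str, need: str) -> str:
    """Return data in the format the next tool expects, converting if necessary."""
    if have == need:
        return data
    try:
        return CONVERTERS[(have, need)](data)
    except KeyError:
        raise ValueError(f"no converter registered from {have} to {need}") from None
```

A caller would run `prepare_input` immediately before invoking each tool, so that a tool never receives data in a format it does not accept.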

Preferably, the program acquires a result output after calling at least one tool for processing, and generates a report according to the result, so that the manpower and time in the analysis process are saved, and the analysis efficiency is improved.

In an exemplary embodiment of the invention, the input of the program includes at least the data to be processed and the tool to be called, and the program calls the tool to be called to process the data to be processed. Preferably, the program invokes, in sequence, a tool for alignment with a reference genome, a tool for processing the alignment results, a tool for detecting structural variation, a tool for detecting copy number variation, a tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample, and a tool for counting the number of each variation type; before invoking the next tool, the program configures the output of the previous tool into a format matching the next tool and inputs it to the next tool.

According to an exemplary embodiment of the present invention, the program saves the result files generated by each invoked tool to a directory. Preferably, after the program is run, a script for submitting tasks is generated and submitted to the SGE task system.

In an exemplary embodiment of the invention, the tool for reading the lengths of reads is samtools; and/or the tools for alignment with the reference genome are ngmlr and pbsmrtpipe; and/or the tools for processing the alignment results are samtools and pbsmrtpipe; and/or the tool for detecting structural variation is sniffles; and/or the tool for detecting copy number variation is Control-FREEC; and/or the tool for detecting single nucleotide polymorphism sites and insertion-deletion sites of the sample is pbsmrtpipe; and/or the tool for counting the number of each variation type is annovar. Typically, the program is written in the perl language and the shell language.
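The sequential invocation described above, where each tool's output is reformatted for the next tool, might be sketched as follows. The `Step` class and `run_chain` function are hypothetical names introduced for illustration, and the `convert` callback stands in for whatever format conversion the real scripts perform.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str                  # label of the tool, e.g. "mapping"
    accepts: str               # format the tool expects as input
    produces: str              # format the tool writes as output
    run: Callable[[str], str]  # stand-in for invoking the real tool


def run_chain(steps: List[Step], data: str, fmt: str,
              convert: Callable[[str, str, str], str]) -> str:
    """Run steps in order, converting the data whenever formats differ.

    convert(data, have, need) must return data re-expressed in format `need`.
    """
    for step in steps:
        if fmt != step.accepts:
            # Match the previous output to the next tool before calling it.
            data = convert(data, fmt, step.accepts)
            fmt = step.accepts
        data = step.run(data)
        fmt = step.produces
    return data
```

Keeping the accepted and produced formats on each step lets the driver decide when a conversion is needed, instead of hard-coding conversions between specific pairs of tools.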
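Saving each tool's result files to a directory could look like the sketch below. The function name `collect_results` and the default file patterns are assumptions made for illustration; the source does not specify which files are collected or how the directory is laid out.

```python
import shutil
from pathlib import Path


def collect_results(work_dir, result_dir, patterns=("*.vcf", "*.bam")):
    """Copy files matching the patterns from work_dir (recursively) into result_dir.

    Returns the list of copied target paths. Files that already live under
    result_dir are skipped so that repeated runs do not copy a file onto itself.
    """
    work = Path(work_dir).resolve()
    dest = Path(result_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(work.rglob(pattern)):
            if dest in src.parents:
                continue  # already inside the result directory
            target = dest / src.name
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

Note that files with the same name in different subdirectories would overwrite each other in this sketch; a real pipeline would need a naming scheme per sample or per step.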
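Counting the number of each variation type from an annotation table can be sketched as below. The function `count_variant_types` is hypothetical, and the default column name `Func.refGene` is an assumption about the annotation output (it is the functional-region column that annovar typically writes when annotating against refGene); the source does not state which column is counted.

```python
import csv
import io
from collections import Counter


def count_variant_types(table_text: str, column: str = "Func.refGene") -> Counter:
    """Count how often each value occurs in one column of a tab-delimited table.

    table_text must start with a header line; rows with an empty value in the
    chosen column are ignored.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return Counter(row[column] for row in reader if row.get(column))
```

For example, passing the text of a multianno table returns a mapping such as exonic and intronic counts, which can then be written into the report.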

In an embodiment of the present invention, the variation detection and analysis method based on Pacbio third-generation sequencing comprises the following steps:

1) quality control of the raw off-machine data;
2) aligning the off-machine data with the reference genome and compiling alignment statistics;
3) detecting copy number variation (CNV) and structural variation (SV) of the sample, and annotating the variation sites;
4) detecting single nucleotide polymorphism (SNP) sites and insertion-deletion (InDel) sites of the sample, and annotating the variation sites;
5) automatically obtaining an analysis report according to the processing method of the Pacbio third-generation sequencing data;
6) automatically collecting the main result files obtained from the analysis.

Preferably, in this embodiment, the method specifically includes:

(1) reading the length of each read in the off-machine data with the software samtools, and using a shell script (stat.sh) to count the number of reads, the number of bases, the mean read length, and the N50 value obtained by sequencing;
(2) using the script Pacbio_mapping.sh to integrate the whole alignment process: aligning the off-machine data with the reference genome with the software ngmlr to obtain an aligned bam file, sorting the aligned bam file and building an index file with the software samtools, and collating the mapping-rate results obtained during alignment;
(3) detecting structural variation (SV) with sniffles, and detecting copy number variation (CNV) with the Control-FREEC software;
(4) detecting single nucleotide polymorphism (SNP) sites and insertion-deletion (InDel) sites of the sample with the pbsmrtpipe software;
(5) annotating the variation site files (vcf) obtained in steps (3) and (4) with the annovar software, and counting the number of each variation type;
(6) automatically generating the report with the perl script of the present invention (Pacbio_report.pl);
(7) automatically collecting the result files generated in each step into a directory, which makes them easy to check and to use in later analysis.
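The read statistics of step (1), namely the number of reads, the number of bases, the mean read length, and the N50, can be computed as in the sketch below. The source computes them with a shell script (stat.sh) whose contents are not shown, so this Python function `read_length_stats` is only an illustration; it uses the common definition of N50 as the length L such that reads of length at least L contain at least half of all bases.

```python
from typing import Dict, Iterable


def read_length_stats(lengths: Iterable[int]) -> Dict[str, float]:
    """Return read count, total bases, mean length, and N50 for the given lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached at least half of all bases
            n50 = length
            break
    return {
        "reads": len(ordered),
        "bases": total,
        "mean": total / len(ordered),
        "n50": n50,
    }
```

The read lengths themselves would come from the aligned or unaligned read file; how they are extracted with samtools is not specified in the source and is left out here.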

The perl script is the main program of the pipeline. From within the pipeline it can call other perl and shell scripts, including those for format conversion, result extraction, data organization, and report generation. The job script is generated from the sample information and assembles the sub-scripts of the whole workflow in the order of the corresponding analysis steps. The sub-scripts are also generated automatically and cover processing data with software, converting formats with scripts, collecting information with scripts, and so on.

The usage of the pipeline in an embodiment of the invention is explained as follows:

--infile: the input file, which contains the sample name, the gender, and the path of the sample's raw data. If a sample has multiple paths, the pipeline handles them automatically.

--analog_array: selects the analysis modules.

1: performs QC statistics on the raw data, generates a QC report, and places the raw data and the converted data (fasta) under the Result/QC path of the current path.
2: performs mapping with the NGMLR software, generates a mapping report, and collects the result file (aligned bam).
3: performs variation calling with the Smrtlink pipe, generates a variation report, and collects the variation annotation result files (vcf, annovar.hg19_multianno.xls) into Result/primary.
4 and 5: perform CNV calling and SV calling respectively, generate an SV report, and collect the variation annotation result files (gff, hg19_multianno.xls) into Result/primary.

--newjob: the name of the generated script for the SGE task submission system.

--startpoint: the start point. The point at which analysis starts in the SGE task submission script can be changed as required, so the workflow can be run from the middle.
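The parameters above can be illustrated with a small command-line parser and a reader for the input file. This is a sketch under stated assumptions: the option spellings follow the text as it appears here, the comma-separated form of the module list and the whitespace-separated three-column layout of the input file are assumptions, and `build_parser` and `parse_sample_sheet` are hypothetical names.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the four documented options."""
    parser = argparse.ArgumentParser(description="Pacbio re-sequencing pipeline (sketch)")
    parser.add_argument("--infile", required=True,
                        help="sample sheet: name, gender, raw-data path")
    parser.add_argument("--analog_array", default=[1, 2, 3, 4, 5],
                        type=lambda s: [int(x) for x in s.split(",")],
                        help="analysis modules to run, e.g. 1,2,3 (format assumed)")
    parser.add_argument("--newjob", required=True,
                        help="name of the SGE job script to generate")
    parser.add_argument("--startpoint", type=int, default=1,
                        help="step at which the generated job script starts")
    return parser


def parse_sample_sheet(text: str) -> Dict[str, dict]:
    """Group raw-data paths by sample; a sample may appear on several lines."""
    samples: Dict[str, dict] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, gender, path = line.split()[:3]
        entry = samples.setdefault(name, {"gender": gender, "paths": []})
        entry["paths"].append(path)
    return samples
```

Grouping by sample name is one way to handle a sample with multiple raw-data paths; the source only states that such samples are processed automatically.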

After the above example is executed, a file named word_ana.job is obtained, and executing this job file starts the whole workflow. After the workflow finishes, a result report can be found in the Report directory under the current directory, and the useful analysis results can be found in the Result directory under the current path.

The report-generation script is integrated into the main script. The --reporttype parameter specifies the type of report to generate (qc, sv, variation), and its other parameters are the same as those of the main script.

The invention integrates various software and scripts for data analysis, data processing, and so on, and combines them in a defined order, so that reports and results containing SV, CNV, SNP, and InDel variation information can be obtained simply by inputting the sample data and running the pipeline.

The method is written for the SGE task management system of a cluster. Running the pipeline generates a job script for task submission; once the job script is submitted to the SGE task system, the workflow processes the data automatically in the defined order, and if a run stops, resubmitting the job continues from the current position in the workflow. The invention uses authoritative and widely used software to process the data and sets suitable parameters automatically, so users do not need to research which software to use for data analysis. The invention also integrates a script that automatically generates the result report and a script that organizes the result data. Both are placed at the end of the workflow. They collate the analysis results produced by the preceding software, and the collated data is placed in the result report as the presentation of the complete set of analysis results. The script that organizes the result data selects the result files that are useful for scientific research and stores them in a specified directory, which makes the analysis results easy to look up.
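Generating a job script that can start from a chosen step and continue after an interrupted run might be sketched as follows. The function `build_job_script` and the marker-file scheme are assumptions for illustration, since the source does not describe how the workflow position is tracked. The `#$ -N` and `#$ -cwd` lines are standard SGE embedded options for the job name and the working directory.

```python
from typing import List, Tuple


def build_job_script(name: str, steps: List[Tuple[str, str]], startpoint: int = 1) -> str:
    """Return the text of a bash job script for the given (label, command) steps.

    Steps numbered below startpoint are left out. Each remaining step writes a
    marker file when it succeeds, so a resubmitted script skips finished steps.
    """
    lines = ["#!/bin/bash", f"#$ -N {name}", "#$ -cwd", "set -e"]
    for number, (label, command) in enumerate(steps, start=1):
        if number < startpoint:
            continue
        marker = f".{label}.done"  # hypothetical marker file name
        lines.append(f"# step {number}: {label}")
        lines.append(f"if [ ! -e {marker} ]; then")
        lines.append(f"  {command}")   # with set -e, a failure stops the job here
        lines.append(f"  touch {marker}")
        lines.append("fi")
    return "\n".join(lines) + "\n"
```

Because of `set -e`, a failing command ends the job before its marker file is written, so the next submission reruns that step and everything after it. The resulting text would be written to the file named by --newjob and submitted to the scheduler.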

From the above description, it can be seen that the above-described embodiments of the present invention achieve at least the following technical effects:

(1) the method analyzes third-generation Pacbio genome re-sequencing data and solves the technical problem in the prior art that manually organizing third-generation sequencing result files is time-consuming and inefficient;

(2) modules can be selected according to the analysis content and the step at which analysis starts can be specified, which makes the method highly flexible;

(3) the report is generated automatically, so that bioinformatics personnel can browse the analysis results conveniently and quickly.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
