Method for rapidly judging sample gender from FASTQ file

Document No.: 1965072 Publication date: 2021-12-14

Reading note: This technology, "Method for rapidly judging sample gender from FASTQ file", was designed by 吴星辰, 栗海波, 梁萌萌 and 余伟师 on 2021-09-29. Main content: The invention discloses a method for rapidly determining the gender of a sample from a FASTQ file, comprising: (1) generating K-mers unique to the Y chromosome from the reference genome; (2) obtaining the intersection of the design intervals of whole exome sequencing capture probes, removing K-mers that fall outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture probe design intervals, and selecting the top-ranked K-mers as the unique K-mer set; (3) randomly reading FASTQ files, counting the unique K-mers, analyzing how their counts are distributed in FASTQ files of different genders using real data from equal numbers of males and females, and determining the gender determination threshold; (4) determining the gender of a FASTQ file according to the threshold. The method is applicable to multiple NGS data types, has a simple analysis workflow, is easy to operate, and greatly improves determination efficiency.

1. A method for rapidly determining sample gender from FASTQ files, comprising the steps of:

(1) generating K-mers unique to the Y chromosome from the reference genome;

(2) obtaining the intersection of the design intervals of whole exome sequencing capture probes from different sources, removing K-mers that fall outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture probe design intervals, and selecting a preset number of top-ranked K-mers as the final unique K-mer set;

(3) randomly reading data from FASTQ files of different genders, counting the unique K-mers contained in the data, analyzing the difference in the distribution of the unique K-mers between FASTQ files of different genders using real data from equal numbers of males and females, and determining a gender determination threshold;

(4) determining the gender of a FASTQ file according to the threshold.

2. The method according to claim 1, wherein the specific operation method of step (1) is as follows:

a. acquiring a reference sequence of a FASTA format of a reference genome;

b. splitting the reference sequence by chromosome into two sequence files: the Y chromosome and all other chromosomes;

c. setting different K-mer lengths, and counting K-mers in each of the two sequence files using the Jellyfish program;

d. comparing the K-mer sets of the two sequence files to obtain the K-mers unique to the Y chromosome;

e. determining the length of the K-mers unique to the Y chromosome to be 13.

3. The method of claim 1, wherein the K-mers removed in step (2) as being outside the intersection further include K-mers with a coverage of less than 50% and K-mers with a frequency of occurrence on the Y chromosome of less than 3.

4. The method of claim 1, wherein the preset number of top-ranked K-mers in step (2) is 100.

5. The method according to claim 1, wherein the threshold in step (3) comprises an upper threshold U and a lower threshold L on the number of K-mers; data with a K-mer count greater than U are determined to be male, and data with a K-mer count less than L are determined to be female; and when the K-mer count is between L and U, contamination between samples of different genders is determined to exist.

6. The method as claimed in claim 1, wherein in step (3) the data of FASTQ files of different genders are randomly read, and the number of records read is 100,000.

7. The method of claim 1, wherein the FASTQ file is a FASTQ file generated by whole genome sequencing or whole exome sequencing.

Technical Field

The invention relates to the technical field of high-throughput sequencing and variant detection in biology and precision medicine, and in particular to a method for rapidly determining the gender of a sample from a FASTQ file.

Background

With the rapid development of modern medicine, the cost of Next-Generation Sequencing (NGS) keeps falling, and NGS is gradually becoming the method of choice for genetic testing of hereditary diseases, tumors and other conditions. FASTQ is the most common file format for storing NGS sequenced bases and their corresponding quality scores, along with other relevant information. FASTQ is also the raw data used for sequencing data delivery and genomic analysis; NGS data and results in other formats, such as BAM alignment files and VCF variant call files, are derived from it through extensive computation. When analyzing NGS data, researchers typically need to check that the gender of the sample matches the gender indicated by the data. This check is important for determining whether the data and the sample are consistent, whether contamination exists, and for the subsequent analysis of chromosome copy number and interpretation of variants.

At present, the mainstream approach to determining the gender of NGS data is to analyze the coverage of specific genes on the X and Y chromosomes from a BAM file, or to analyze the genotype distribution on the X and Y chromosomes from a VCF file. These methods have the following disadvantages:

(1) Generating BAM alignment files and VCF variant call files from FASTQ requires a large amount of computing resources and storage space, and the analysis usually takes from several hours to tens of hours depending on the amount of data. These drawbacks are even more pronounced in application scenarios where only the gender of the data needs to be determined and no further analysis is required for the time being.

(2) Most of the software used in the analysis can only run on Linux and is difficult to install and run on a Windows computer. Much data is delivered through network disk software on Windows, so the data must be uploaded to a Linux server before gender can be determined, which is inconvenient for analysts.

Therefore, analysts urgently need a new technical solution that can rapidly determine, from the FASTQ file, the gender of a sample and contamination between samples of different genders, while significantly reducing resource requirements and system dependence.

Disclosure of Invention

The present invention aims to solve the above problems in the prior art, and provide a method for rapidly determining sample gender from FASTQ files, which can significantly reduce resource requirements, reduce system dependence, and rapidly determine sample gender.

The technical scheme of the invention is detailed as follows:

a method for rapidly determining sample gender from FASTQ files, comprising the steps of:

(1) generating K-mers unique to the Y chromosome from the reference genome;

(2) obtaining the intersection of the design intervals of whole exome sequencing capture probes from different sources, removing K-mers that fall outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture probe design intervals, and selecting a preset number of top-ranked K-mers as the final unique K-mer set;

(3) randomly reading data from FASTQ files of different genders, counting the unique K-mers contained in the data, analyzing the difference in the distribution of the unique K-mers between FASTQ files of different genders using real data from equal numbers of males and females, and determining a gender determination threshold;

(4) determining the gender of a FASTQ file according to the threshold.

Optionally or preferably, in the above method, the threshold includes an upper threshold U and a lower threshold L on the number of K-mers; data with a K-mer count greater than U are determined to be male, and data with a K-mer count less than L are determined to be female; and when the K-mer count is between L and U, contamination between samples of different genders is determined to exist.

Optionally or preferably, in the above method, the FASTQ file is a FASTQ file generated by whole genome sequencing or whole exome sequencing.

Optionally or preferably, in the above method, the K-mers removed in step (2) as being outside the intersection further include K-mers with a coverage of less than 50% and K-mers with a frequency of occurrence on the Y chromosome of less than 3.

Optionally or preferably, in the above method, the preset number of top-ranked K-mers in step (2) is 100.

Optionally or preferably, in the above method, in step (3), the data of FASTQ files of different genders are randomly read, and the number of records read is 100,000.

Compared with the prior art, the invention has the following beneficial effects:

The determination method is based on K-mers unique to the Y chromosome. In theory, these K-mers exist only in data from male samples and therefore carry gender information. The difference in how often these K-mers occur in FASTQ files of different genders is used to set the thresholds that separate male and female data, so that the gender of the data, and contamination between samples of different genders, can be determined directly from the raw NGS data.

K-mers that are not covered or have low coverage, and K-mers that occur relatively rarely on the Y chromosome, are removed, which further improves the robustness and the computing speed of the K-mer set.

In addition, the invention has the following advantages:

1. Fast determination with no need for large computing resources

Conventional determination of the gender of data from a BAM alignment file or a VCF variant call file requires several hours to tens of hours on a dedicated server. The workflow designed by the invention is simple to deploy and easy to use: the whole analysis can be completed by deploying only the relevant executable file. The demand on server computing resources is low, and an ordinary notebook computer using multiple threads can determine the gender of dozens of FASTQ files per minute, so the efficiency is very high.

2. No dependence on operating system, and wide application range

The method is applicable to various data types of the current NGS, including whole genome sequencing data of different depths and whole exome sequencing data of various capture probes; the method is not only suitable for large Linux servers, but also suitable for personal Windows notebook computers.

Drawings

FIG. 1 is a flowchart of the overall determination method in Example 1;

FIG. 2 is a flowchart of the first part of Example 1;

FIG. 3 is a flowchart of the second part of Example 1;

FIG. 4 is a flowchart of the third part of Example 1;

FIG. 5 is a flowchart of the fourth part of Example 1.

Detailed Description

The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments so that those skilled in the art can better understand the present invention and implement the present invention.

Example 1

Referring to fig. 1, the method for rapidly determining the gender of a sample from a FASTQ file includes the following parts:

The first part: generating K-mers unique to the Y chromosome from the reference genome;

The second part: screening the K-mers unique to the Y chromosome according to the probe intervals and the frequency of occurrence;

The third part: analyzing, with real data, the difference in the distribution of the screened K-mers between FASTQ files of different genders, and thereby determining the thresholds for gender determination;

The fourth part: determining the gender of FASTQ files of NGS data according to the thresholds.

The detailed steps of each section are specifically described below.

The first part: generating K-mers unique to the Y chromosome from the reference genome

By comparing the K-mers on the Y chromosome with those on the other chromosomes of the reference genome, K-mers unique to the Y chromosome are found. In theory these K-mers are present only in data from male samples and therefore carry gender information. The specific process is shown in FIG. 2.

Input: a reference sequence of the human genome;

Output: the K-mers unique to the Y chromosome.

The method comprises the following steps:

(1) A reference sequence of the human genome in FASTA format, e.g., hg38.fa.gz, is downloaded from UCSC or another public database.

(2) The reference sequence is split into two parts per chromosome using a script: y chromosome sequences (y.fa) and other chromosome sequences (other.fa).

(3) Different lengths of the K-mers are set, in this embodiment, lengths of 7, 9, 11, 13, 15, 17, 19 and 21 are set, and the two sequence files in the step (2) are counted by using a Jellyfish software module.

(4) And comparing the K-mer sets of the two sequence files to find out the unique K-mer on the Y chromosome.

(5) The length of the K-mer is determined to be 13, taking into account the running time and the number of the particular K-mers.
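
The comparison in steps (3) and (4) can be illustrated with a minimal pure-Python sketch. It is not the Jellyfish workflow used in this embodiment and would be far too slow for a whole genome; the function names `canonical`, `kmers` and `y_unique_kmers` and the toy sequences are hypothetical. Treating a K-mer and its reverse complement as the same K-mer (canonical form) is an assumption that the text above does not state.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMP = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a K-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)


def kmers(seq: str, k: int) -> Counter:
    """Count canonical K-mers in seq, skipping windows that contain non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            counts[canonical(window)] += 1
    return counts


def y_unique_kmers(y_seq: str, other_seq: str, k: int) -> Counter:
    """Keep K-mers (with their Y counts) that never occur on the other chromosomes."""
    y_counts = kmers(y_seq, k)
    other_counts = kmers(other_seq, k)
    return Counter({km: n for km, n in y_counts.items() if km not in other_counts})


if __name__ == "__main__":
    # Toy stand-ins for y.fa and other.fa (assumed sequences, not real data).
    unique = y_unique_kmers("AAACCC", "AAAGGG", k=3)
    for kmer, count in sorted(unique.items()):
        print(kmer, count)
```

Repeating this comparison for each candidate length (7 to 21 in this embodiment) gives the number of unique K-mers per length, which is one of the two quantities weighed in step (5).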

The second part: screening the K-mers unique to the Y chromosome according to the probe intervals and the frequency of occurrence

To ensure that the K-mers unique to the Y chromosome are well covered across different sequencing technologies and capture probes, the design intervals of the mainstream whole exome capture probes on the market from different sources (different manufacturers) are collected and their intersection is computed. K-mers that are not covered or have low coverage are filtered out, and K-mers that occur relatively rarely on the Y chromosome are also removed, which improves the robustness and the computing speed of the K-mer set. The remaining K-mers are sorted in descending order of their number of occurrences within the capture probe design intervals, and the top 100 K-mers are selected as the final unique K-mer set. The specific flow is shown in FIG. 3.

Input: the K-mers unique to the Y chromosome, and the probe capture intervals;

Output: the unique K-mers after screening.

The method comprises the following steps:

(1) obtaining the design intervals of whole exome sequencing capture probes from different probe design companies;

(2) obtaining the intersection of the capture probe design intervals of the different companies using the program tool bedtk;

(3) removing K-mers that fall outside the intersection of the capture probe design intervals;

(4) sorting the K-mers in descending order of their number of occurrences within the capture probe design intervals;

(5) selecting the top 100 K-mers as the final unique K-mer set.
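
The screening steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bedtk-based workflow of this embodiment: intervals are assumed to be sorted, non-overlapping, half-open `(start, end)` pairs on the Y chromosome (as in BED files), strand is ignored, and the coverage-based filter is not modeled. The names `intersect`, `screen_kmers`, `min_y_count` and `top_n` are hypothetical.

```python
from collections import Counter


def intersect(a, b):
    """Intersect two sorted lists of non-overlapping half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def screen_kmers(y_seq, y_unique_counts, regions, k, min_y_count=3, top_n=100):
    """Rank Y-unique K-mers by how often they occur fully inside the regions.

    y_unique_counts maps each Y-unique K-mer to its number of occurrences on
    the Y chromosome; K-mers below min_y_count are dropped before ranking.
    """
    in_regions = Counter()
    for start, end in regions:
        for pos in range(start, end - k + 1):
            kmer = y_seq[pos:pos + k]
            if y_unique_counts.get(kmer, 0) >= min_y_count:
                in_regions[kmer] += 1
    # Descending by count inside the regions; ties broken alphabetically.
    ranked = sorted(in_regions.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:top_n]]


if __name__ == "__main__":
    vendor_a = [(0, 10), (20, 30)]   # assumed probe intervals from one vendor
    vendor_b = [(5, 25)]             # assumed probe intervals from another vendor
    shared = intersect(vendor_a, vendor_b)
    print(shared)
```

With more than two vendors, `intersect` can be applied repeatedly, feeding each result into the next call.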

The third part: analyzing, with real data, the difference in the distribution of the screened K-mers between FASTQ files of different genders, and thereby determining the thresholds for gender determination

100,000 records are randomly read from FASTQ files (covering both genders), and a script counts the unique K-mers screened in the second part, i.e., calculates the number of unique K-mers in the FASTQ data. A large amount of real data from equal numbers of males and females is used for the statistics: the difference in the distribution of the unique K-mers between FASTQ files of different genders is analyzed, and an upper threshold (U, data above this threshold are male) and a lower threshold (L, data below this threshold are female) that separate the genders well are set. If the number of K-mers lies between L and U, contamination between samples of different genders may exist. The specific process is shown in FIG. 4.

Input: the screened unique K-mers, the FASTQ files, and the true gender of each file;

Output: the thresholds for gender determination.

The method comprises the following steps:

(1) randomly reading 100,000 records from a FASTQ file;

(2) counting the screened K-mers by using a script;

(3) threshold partitioning is performed according to the true gender of the data.
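
A minimal sketch of the counting and threshold-setting steps follows. The text does not specify how L and U are placed once the male and female distributions are known, so the rule used here (placing both thresholds inside the gap between the highest female count and the lowest male count, a fraction `margin` of the gap away from each side) is purely an assumption for illustration. The names `count_target_kmers`, `fit_thresholds` and `margin` are hypothetical, and reads are matched on the forward strand only.

```python
def count_target_kmers(reads, targets, k):
    """Total number of length-k windows across all reads that match a target K-mer."""
    total = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in targets:
                total += 1
    return total


def fit_thresholds(female_counts, male_counts, margin=0.25):
    """Return (L, U) inside the gap between the female and male count distributions.

    Assumed rule: L sits `margin` of the gap above the largest female count and
    U sits `margin` of the gap below the smallest male count, so every training
    female is below L and every training male is above U.
    """
    max_female = max(female_counts)
    min_male = min(male_counts)
    if min_male <= max_female:
        raise ValueError("male and female counts overlap; no clean gap to place L and U")
    gap = min_male - max_female
    return max_female + margin * gap, min_male - margin * gap


if __name__ == "__main__":
    # Assumed per-file counts from labeled training FASTQ files (not real data).
    lower, upper = fit_thresholds([0, 2, 4], [40, 60, 80])
    print(lower, upper)
```

In practice a set of the screened K-mers would be passed as `targets`, and one count per labeled FASTQ file would be collected before calling `fit_thresholds`.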

The fourth part: determining the gender of FASTQ files of NGS data according to the thresholds

For FASTQ files generated by Whole Genome Sequencing (WGS) or Whole Exome Sequencing (WES), the screened unique K-mers obtained in the second part are counted, and the gender is determined using the threshold interval obtained in the third part; see FIG. 5.

Input: the screened unique K-mers, the FASTQ file, and the thresholds for gender determination;

Output: the result of the gender determination.

The method comprises the following steps:

(1) randomly reading 100,000 records from the FASTQ file;

(2) counting the screened unique K-mers using a script;

(3) determining the gender according to the thresholds.
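
The random read and the final decision can be sketched as follows. The text does not say how records are selected at random; reservoir sampling is used here as one possible technique and is an assumption, as is the requirement that each FASTQ record occupies exactly four lines. Counts exactly equal to L or U are treated as falling between the thresholds, which the text leaves open. The names `sample_reads` and `classify` are hypothetical.

```python
import io
import random


def sample_reads(handle, n, seed=0):
    """Reservoir-sample up to n read sequences from a FASTQ text stream.

    Assumes four lines per record, with the bases on the second line.
    """
    rng = random.Random(seed)
    sample = []
    for idx, line in enumerate(handle):
        if idx % 4 != 1:
            continue
        seen = idx // 4 + 1          # number of records seen so far
        seq = line.strip()
        if len(sample) < n:
            sample.append(seq)
        else:
            # Keep the new record with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                sample[j] = seq
    return sample


def classify(count, lower, upper):
    """Apply the rule: above U is male, below L is female, otherwise flag contamination."""
    if count > upper:
        return "male"
    if count < lower:
        return "female"
    return "possible contamination between samples of different genders"


if __name__ == "__main__":
    fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
    reads = sample_reads(io.StringIO(fastq), n=100_000)
    print(len(reads), classify(50, lower=13.0, upper=31.0))
```

For a gzip-compressed file, `gzip.open(path, "rt")` from the standard library yields a text handle that can be passed to `sample_reads` in the same way.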

The method uses K-mers unique to the Y chromosome as the basis for determination and determines the gender of NGS data by randomly sampling the raw FASTQ data. It is applicable to multiple NGS data types, has a simple analysis workflow and is easy to operate: the whole analysis can be completed by deploying only the relevant executable file. An ordinary notebook computer using multiple threads can determine the gender of dozens of FASTQ files per minute, which is far more efficient than the conventional approach of computing for several hours to tens of hours on a dedicated server.

The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.
