Method for exploring disease subtype affinity by using genome data

Document No.: 1143092 Published: 2020-09-11

Overview: This technology, "Method for exploring disease subtype affinity by using genome data", was designed by 侯群星, 袁卫兰, 高军晖, 林灵, 吴昊天, 蒋丽莎, 李无霜, 王瑶瑶, 吴守信 and 许骋 on 2020-05-28. The invention provides a method for exploring disease subtype affinity by using genome data, comprising the step of calculating a gene non-silent mutation enrichment value using a hypergeometric distribution test. The gene non-silent mutation enrichment value is calculated as $\frac{n_f / n}{N_f / N}$, where $n_f$ is the number of samples in the disease subtype in which the gene has a non-silent mutation, $N_f$ is the number of samples among all samples in which the gene has a non-silent mutation, $n$ is the number of samples in the disease subtype, and $N$ is the total number of samples. The number of disease subtype groups is not less than 3. Before the disease subtype affinity analysis is performed, the invention first calculates the non-silent mutation enrichment score of each gene in each tumor subtype using the hypergeometric distribution test and then performs the affinity analysis on these enrichment scores. This reduces the influence of background factors, such as the total number of samples and the number of samples per disease subtype, on the analysis result and improves the accuracy of the method.

1. A method for analyzing disease subtype affinity, characterized by comprising the step of obtaining a gene non-silent mutation enrichment value using a hypergeometric distribution test;

the gene non-silent mutation enrichment value is the ratio of the proportion of samples in the disease subtype in which the gene has a non-silent mutation to the proportion of all samples in which the gene has a non-silent mutation;

the number of subgroups of disease subtypes is not less than 3.

2. Method according to claim 1, characterized in that it comprises the following steps:

(1) sequencing the tumor and normal samples to obtain sequencing data of all exons;

(2) analyzing the somatic mutation condition of the sample according to the sequencing data;

(3) annotation of the mutation sites;

(4) carrying out format conversion on the annotation result, and annotating the mutation type;

(5) screening out non-silent mutation types according to the annotated mutation types;

(6) calculating the enrichment value of the gene non-silent mutation in the sample;

(7) according to the gene non-silent mutation enrichment value, the affinity between disease subtypes is calculated by using a hierarchical clustering method.

3. The method of claim 2, wherein the analysis in step (2) comprises:

1) filtering the obtained whole-exome sequencing data, and retaining the sequencing data with Q20 of at least 90% and Q30 of at least 80%;

2) building an alignment index for a reference genome, and aligning the sequencing data retained in step 1) to the reference genome to obtain aligned data;

3) counting the proportion of sequencing reads in the sequencing data that align to the reference genome;

4) calculating the depth, average mapping quality and coverage of the aligned data obtained in step 2);

5) counting the proportion, average coverage depth and coverage of the aligned data obtained in step 2) within the target region of the reference genome;

6) marking the PCR duplicates in the aligned data obtained in step 2), and removing them;

7) performing locus correction on the deduplicated data obtained in step 6);

8) grouping the corrected data from step 7);

9) filtering the data grouped in step 8) to obtain the initial somatic mutations with a variant frequency above 5%.

4. The method of claim 2, wherein step (3) comprises: screening the initial somatic mutations with a variant frequency above 5% obtained in step (2), and annotating the screened somatic mutations.

5. The method according to claim 2, wherein the non-silent mutation type of step (5) comprises any one or a combination of at least two of a frameshift deletion mutation, a frameshift insertion mutation, an in-frame deletion, an in-frame insertion, a missense mutation, a nonsense mutation, a stop codon mutation, or a splice site.

6. The method of claim 2, further comprising, prior to step (6): counting, from the screened non-silent mutation data, whether each gene has a non-silent mutation in each sample, to obtain a gene-by-sample matrix of non-silent mutation status.

7. An apparatus for analyzing disease subtype affinity, characterized by comprising a gene non-silent mutation enrichment value calculation module, wherein the module is used for counting the non-silent mutation status of a gene in the samples and calculating the gene non-silent mutation enrichment value, and the gene non-silent mutation enrichment value is the ratio of the proportion of samples in the disease subtype in which the gene has a non-silent mutation to the proportion of all samples in which the gene has a non-silent mutation;

the number of subgroups of disease subtypes is not less than 3.

8. The apparatus of claim 7, further comprising:

the sequencing module is used for acquiring sequencing data of all exons of the tumor and normal samples;

a sample somatic mutation analysis module;

the mutation site annotation module is used for screening the initial somatic cell mutation result and annotating the screened somatic cell mutation site;

the format conversion and mutation type annotation module is used for annotating mutation types;

a non-silent mutation type screening module for screening the mutation type as any one or combination of at least two of frameshift deletion mutation, frameshift insertion mutation, in-frame deletion, in-frame insertion, missense mutation, nonsense mutation, stop codon mutation or splice site;

and the affinity calculation module, used for calculating the affinity between the disease subtypes by a hierarchical clustering method according to the gene non-silent mutation enrichment value.

9. The apparatus of claim 8, wherein the sample somatic mutation analysis module comprises:

the sequencing data quality control unit, used for filtering the obtained whole-exome sequencing data and retaining the sequencing data with Q20 of at least 90% and Q30 of at least 80%;

the sequence alignment unit, used for building a reference genome alignment index and aligning the quality-controlled data to the reference genome to obtain aligned data;

the alignment data analysis unit, used for counting the proportion of sequencing reads in the sequencing data that align to the reference genome, calculating the depth, average mapping quality and coverage of the aligned data, and counting the proportion, average coverage depth and coverage of the aligned data within the target region of the reference genome;

the alignment data processing unit, used for marking the PCR duplicates in the aligned data, removing them, performing locus correction on the deduplicated data, and grouping the corrected data;

and the initial somatic mutation site acquisition unit, used for filtering the grouped data to obtain the initial somatic mutations with a variant frequency above 5%.

10. Use of a device according to any of claims 7-9 for analyzing the relatedness of disease subtypes.

Technical Field

The invention belongs to the technical field of biological information analysis, and relates to a method for exploring disease subtype affinity by using genome data.

Background

Cancer is a group of diseases caused by disorders in the mechanisms that regulate cell division and differentiation, and it usually presents as malignant tumors. Because early diagnosis of cancer is inaccurate and recurrence and mortality rates are high, cancer has become one of the most serious threats to human health. In recent years, the occurrence and metastasis of tumors have come to be recognized as the result of a continuous, multistep process involving the interaction of multiple genes, so studying tumors in an overall, comprehensive and dynamic way is fundamental to preventing and treating them. Tumors exist in different subtypes, and because of their clinical heterogeneity, different subtypes require different clinical treatment strategies. Exploring the affinity between tumor subtypes is therefore of great importance for the clinical treatment and prognosis of tumors.

At present, the main steps of the method for researching the affinity of tumor subtypes are as follows: 1) acquiring WES sequencing data according to a patient sample; 2) analyzing the somatic mutation condition of the patient according to the sequencing data; 3) annotation of the mutation sites; 4) carrying out format conversion on the annotation result and annotating the mutation type; 5) screening mutation types; 6) counting whether each gene has non-silent mutation in each sample; 7) according to the non-silent mutation statistical results of the samples, the affinity among the disease subtypes is calculated by using a hierarchical clustering method.

However, the prior art only records whether a non-silent mutation is present or absent in each sample and clusters the disease subtypes directly on those counts, without considering the influence of background factors such as the number of samples. The results obtained by the existing method may therefore be inaccurate.

Providing a more accurate method for analyzing the affinity of disease subtypes is therefore of great significance for the clinical treatment and prognosis monitoring of tumors.

Disclosure of Invention

To address the shortcomings of the prior art and meet practical needs, the invention provides a method for exploring disease subtype affinity by using genome data. Before the disease subtype affinity analysis is carried out, the method first calculates the non-silent mutation enrichment score of each gene in each tumor subtype using a hypergeometric distribution test, and then performs the affinity analysis on these enrichment scores, thereby reducing the influence of factors such as the total number of samples and the number of samples per disease subtype on the analysis result.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a method for analyzing the affinity of disease subtypes, the method comprising the step of calculating a gene non-silent mutation enrichment value using a hypergeometric distribution test;

the calculation formula of the gene non-silent mutation enrichment value is as follows:

$$\frac{n_f / n}{N_f / N}$$

wherein $n_f$ is the number of samples in the disease subtype in which the gene has a non-silent mutation, $N_f$ is the number of samples among all samples in which the gene has a non-silent mutation, $n$ is the number of samples in the disease subtype, and $N$ is the total number of samples;

the number of subgroups of disease subtypes is not less than 3.

Before the disease subtype affinity analysis, the invention first calculates the non-silent mutation enrichment score of each gene in each tumor subtype using a hypergeometric distribution test, and then performs the affinity analysis on these enrichment scores. This reduces the influence of factors such as the total number of samples and the number of samples per disease subtype on the analysis result and improves the accuracy of the method.
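As a minimal sketch, assuming the enrichment value is the ratio of the subtype mutation proportion to the overall mutation proportion (as worded in claim 1), the score could be computed as below. The function names, the zero-score convention for a gene that is never mutated in the cohort, and the companion upper-tail hypergeometric probability are illustrative assumptions and are not taken from the patent text.

```python
from math import comb


def enrichment_value(n_f: int, n: int, N_f: int, N: int) -> float:
    """Ratio of the subtype mutation proportion (n_f / n) to the overall one (N_f / N)."""
    if n <= 0 or N <= 0:
        raise ValueError("n and N must be positive")
    if N_f == 0:
        # Assumption: a gene with no non-silent mutation in any sample scores 0.
        return 0.0
    # (n_f / n) / (N_f / N), rearranged to a single division.
    return (n_f * N) / (n * N_f)


def hypergeom_upper_tail(n_f: int, n: int, N_f: int, N: int) -> float:
    """P(X >= n_f) when n samples are drawn without replacement from N samples,
    N_f of which carry a non-silent mutation in the gene."""
    upper = min(n, N_f)
    favourable = sum(
        comb(N_f, k) * comb(N - N_f, n - k) for k in range(n_f, upper + 1)
    )
    return favourable / comb(N, n)


# Hypothetical numbers: 6 of 10 subtype samples mutated, 20 of 100 samples overall.
score = enrichment_value(n_f=6, n=10, N_f=20, N=100)
p_value = hypergeom_upper_tail(n_f=6, n=10, N_f=20, N=100)
```

With these hypothetical numbers the enrichment value is 3.0: the gene is mutated three times as often in the subtype as in the whole cohort, which the raw count of 6 alone would not convey.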

Preferably, the method comprises the steps of:

(1) sequencing the tumor and normal samples to obtain sequencing data of all exons;

(2) analyzing the somatic mutation condition of the sample according to the sequencing data;

(3) annotation of the mutation sites;

(4) carrying out format conversion on the annotation result, and annotating the mutation type;

(5) screening out non-silent mutation types according to the annotated mutation types;

(6) calculating the enrichment value of the non-silent mutation of the gene;

(7) according to the non-silent mutation enrichment value of the gene, the affinity between disease subtypes is calculated by using a hierarchical clustering method.
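Step (7) can be sketched as follows. The patent text only names "a hierarchical clustering method", so the Euclidean distance, the average linkage, the function names and the example profiles are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def average_distance(a, b, profiles):
    """Mean Euclidean distance between all members of clusters a and b."""
    pairs = [(x, y) for x in a for y in b]
    return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)


def average_linkage(profiles):
    """Agglomerative clustering of subtypes; returns merges in the order they occur.

    profiles maps a subtype name to its vector of per-gene enrichment values.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest to each other.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], profiles),
        )
        merges.append((a, b, average_distance(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical enrichment values of three genes in three subtypes.
profiles = {
    "subtype_A": [2.0, 0.5, 1.0],
    "subtype_B": [1.9, 0.6, 1.1],
    "subtype_C": [0.2, 2.5, 0.9],
}
merges = average_linkage(profiles)
```

Subtypes merged early have similar enrichment profiles and can be read as having closer affinity; the merge order corresponds to the dendrogram that a clustering library would draw.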

Preferably, the analysis in step (2) comprises:

1) filtering the obtained whole-exome sequencing data, and retaining the sequencing data with Q20 of at least 90% and Q30 of at least 80%;

2) building an alignment index for a reference genome, and aligning the sequencing data retained in step 1) to the reference genome to obtain aligned data;

3) counting the proportion of sequencing reads in the sequencing data that align to the reference genome;

4) calculating the depth, average mapping quality and coverage of the aligned data obtained in step 2);

5) counting the proportion, average coverage depth and coverage of the aligned data obtained in step 2) within the target region of the reference genome;

6) marking the PCR duplicates in the aligned data obtained in step 2), and removing them;

7) performing locus correction on the deduplicated data obtained in step 6);

8) grouping the corrected data from step 7);

9) filtering the data grouped in step 8) to obtain the initial somatic mutations with a variant frequency above 5%.
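The two numeric thresholds in these steps (the Q20/Q30 quality filter in step 1 and the 5% frequency filter in step 9) can be sketched as simple predicates. This is only an illustration: the record fields `alt_reads` and `total_reads`, the function names, and the reading of Q20 and Q30 as fractions of bases at or above the given quality are assumptions, and the alignment, deduplication and correction steps in between are not modelled.

```python
Q20_MIN = 0.90  # at least 90% of bases with quality >= 20
Q30_MIN = 0.80  # at least 80% of bases with quality >= 30
VAF_MIN = 0.05  # keep mutations with variant frequency above 5%


def passes_quality_control(q20_fraction: float, q30_fraction: float) -> bool:
    """Step 1): keep sequencing data that meets both quality thresholds."""
    return q20_fraction >= Q20_MIN and q30_fraction >= Q30_MIN


def variant_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    return alt_reads / total_reads if total_reads > 0 else 0.0


def initial_somatic_mutations(calls):
    """Step 9): keep calls whose variant frequency is strictly above 5%."""
    return [
        c for c in calls
        if variant_frequency(c["alt_reads"], c["total_reads"]) > VAF_MIN
    ]


# Hypothetical calls: one well supported, one below the threshold.
calls = [
    {"site": "site_1", "alt_reads": 12, "total_reads": 100},
    {"site": "site_2", "alt_reads": 3, "total_reads": 100},
]
kept = initial_somatic_mutations(calls)
```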

Preferably, step (3) comprises: screening the initial somatic mutations with a variant frequency above 5% obtained in step (2), and annotating the screened somatic mutations.

Preferably, the non-silent mutation type of step (5) includes any one of frameshift deletion (Frame_Shift_Del), frameshift insertion (Frame_Shift_Ins), in-frame deletion (In_Frame_Del), in-frame insertion (In_Frame_Ins), missense mutation (Missense_Mutation), nonsense mutation (Nonsense_Mutation), stop codon mutation (Nonstop_Mutation), or splice site (Splice_Site), or a combination of at least two thereof.
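The screening in step (5) amounts to a set-membership test on the annotated mutation type. The sketch below assumes each annotated mutation is a dictionary with a `Variant_Classification` field holding one of the labels listed above (written without spaces); the field names, the record layout and the gene and sample names are illustrative assumptions.

```python
NON_SILENT_CLASSES = frozenset({
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Splice_Site",
})


def select_non_silent(records):
    """Keep only records whose annotated type is in the non-silent set."""
    return [
        r for r in records
        if r.get("Variant_Classification") in NON_SILENT_CLASSES
    ]


# Hypothetical annotated mutations.
records = [
    {"gene": "GENE_X", "sample": "s1", "Variant_Classification": "Missense_Mutation"},
    {"gene": "GENE_X", "sample": "s2", "Variant_Classification": "Silent"},
    {"gene": "GENE_Y", "sample": "s1", "Variant_Classification": "Splice_Site"},
]
non_silent = select_non_silent(records)
```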

Preferably, before step (6), the method further comprises: counting, from the screened non-silent mutation data, whether each gene has a non-silent mutation in each sample, to obtain a gene-by-sample matrix of non-silent mutation status.
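One way to picture this matrix is as a 0/1 table with one row per gene and one column per sample, from which the per-subtype counts of mutated samples needed for the enrichment value can be read off. The function names, the sample-to-subtype mapping and the example data below are hypothetical and serve only as an illustration.

```python
from collections import Counter


def build_mutation_matrix(pairs, samples):
    """Return {gene: [0/1 per sample]} from (gene, sample) non-silent mutation pairs."""
    mutated = set(pairs)  # several mutations of one gene in one sample count once
    genes = sorted({gene for gene, _ in mutated})
    return {gene: [int((gene, s) in mutated) for s in samples] for gene in genes}


def mutated_samples_per_subtype(matrix, samples, subtype_of):
    """For each gene, count the mutated samples in each disease subtype."""
    counts = {}
    for gene, row in matrix.items():
        counts[gene] = Counter(
            subtype_of[s] for s, flag in zip(samples, row) if flag
        )
    return counts


# Hypothetical cohort of four samples in three subtypes.
samples = ["s1", "s2", "s3", "s4"]
subtype_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
pairs = [("GENE_X", "s1"), ("GENE_X", "s2"), ("GENE_Y", "s3"), ("GENE_X", "s1")]

matrix = build_mutation_matrix(pairs, samples)
counts = mutated_samples_per_subtype(matrix, samples, subtype_of)
```

Passing the full sample list explicitly keeps samples without any non-silent mutation in the matrix, so the subtype sizes and the total sample number stay correct.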

In a second aspect, the present invention provides an apparatus for analyzing disease subtype affinity. The apparatus comprises a gene non-silent mutation enrichment value calculation module, which is used for counting the non-silent mutation status of a gene in the samples and calculating the gene non-silent mutation enrichment value of the gene according to the formula

$$\frac{n_f / n}{N_f / N}$$

wherein $n_f$ is the number of samples in the disease subtype in which the gene has a non-silent mutation, $N_f$ is the number of samples among all samples in which the gene has a non-silent mutation, $n$ is the number of samples in the disease subtype, and $N$ is the total number of samples;

the number of subgroups of disease subtypes is not less than 3.

Preferably, the apparatus further comprises:

the sequencing module is used for acquiring sequencing data of all exons of the tumor and normal samples;

a sample somatic mutation analysis module;

the mutation site annotation module is used for screening the initial somatic cell mutation result and annotating the screened somatic cell mutation site;

a format conversion and mutation type annotation module;

a non-silent mutation type screening module, used for screening mutation types that are any one or a combination of at least two of frameshift deletion (Frame_Shift_Del), frameshift insertion (Frame_Shift_Ins), in-frame deletion (In_Frame_Del), in-frame insertion (In_Frame_Ins), missense mutation (Missense_Mutation), nonsense mutation (Nonsense_Mutation), stop codon mutation (Nonstop_Mutation) or splice site (Splice_Site);

and the affinity calculation module, used for calculating the affinity between the disease subtypes by a hierarchical clustering method according to the gene non-silent mutation enrichment value.

Preferably, the sample somatic mutation analysis module comprises:

the sequencing data quality control unit, used for filtering the obtained whole-exome sequencing data and retaining the sequencing data with Q20 of at least 90% and Q30 of at least 80%;

the sequence alignment unit, used for building a reference genome alignment index and aligning the quality-controlled data to the reference genome to obtain aligned data;

the alignment data analysis unit, used for counting the proportion of sequencing reads in the sequencing data that align to the reference genome, calculating the depth, average mapping quality and coverage of the aligned data, and counting the proportion, average coverage depth and coverage of the aligned data within the target region of the reference genome;

the alignment data processing unit, used for marking the PCR duplicates in the aligned data, removing them, performing locus correction on the deduplicated data, and grouping the corrected data;

and the initial somatic mutation site acquisition unit, used for filtering the grouped data to obtain the initial somatic mutations with a variant frequency above 5%.

In a third aspect, the present invention provides the use of a device according to the second aspect for analysing the relatedness of disease subtypes.

Compared with the prior art, the invention has the following beneficial effects:

Before the disease subtype affinity analysis, the invention first calculates the non-silent mutation enrichment score of each gene in each tumor subtype using a hypergeometric distribution test, and then performs the affinity analysis on these enrichment scores. This reduces the influence of factors such as the total number of samples and the number of samples per disease subtype on the analysis result and improves the accuracy of the method.

Drawings

FIG. 1 is a flow chart of a method for analyzing the relatedness of disease subtypes;

FIG. 2 is a schematic diagram of an apparatus for analyzing the affinity of disease subtypes;

FIG. 3A is the result of cluster analysis based on the counted number of sample mutations of each gene in disease subtypes, and FIG. 3B is the result of cluster analysis based on the mutation enrichment scores of each gene calculated by the method of the present invention.

Detailed Description

To further illustrate the technical means adopted by the present invention and the effects thereof, the present invention is further described below with reference to the embodiments and the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
