Screening method of mutant candidate gene

Document No.: 139162. Publication date: 2021-10-22.

Abstract: The present technology, "Screening method of mutant candidate gene" (一种突变候选基因的筛选方法), was designed by 刘珍, 刘志岩 and 王海宁 on 2021-08-11. The invention relates to the technical field of genome sequencing analysis, and in particular to a screening method for mutant candidate genes. The method builds a normal mutation classification model from existing public normal-sample data and uses it to simulate the occurrence of mutations, rather than relying only on known mutation sites and an overall mutation frequency, which avoids the bias introduced by using published conclusions directly. When the number of research samples is small or no control samples exist, so that a study-specific mutation classification model cannot be constructed, the normal mutation classification model can be used directly for mutation simulation and calculation. The method can quickly and accurately find and screen out major genes of significance from a large amount of non-synonymous mutation data, so that they can be taken forward for further research and validation.

1. A screening method for a mutant candidate gene, comprising the steps of:

S1: taking samples of the normal healthy population in a public database as control samples and, gene by gene, classifying the known non-synonymous mutation sites together with the base types on both sides of each site, to construct a mutation classification model that serves as the normal mutation classification model data;

S2: taking the control sample data of the disease under study as research samples and, gene by gene, classifying the mutation sites found together with the base types on both sides of each site, and merging the result with the normal mutation classification model data to give the final normal mutation classification model data;

S3: for each gene, traversing every base site of the coding region and randomly predicting, according to the final normal mutation classification model data, whether the site mutates and the type of mutation; counting the non-synonymous mutations on the gene and calculating the background non-synonymous mutation frequency of that gene for the real disease sample group in the simulated environment; and repeating the simulation multiple times to obtain the background non-synonymous mutation frequency of the gene in each simulation;

S4: using the background non-synonymous mutation frequencies from the multiple simulations and the number of non-synonymous mutations that actually occur in the gene, applying a binomial distribution to determine whether the gene is a major gene.

2. The screening method for a mutant candidate gene according to claim 1, wherein the database in step S1 is at least one of the 1000 Genomes (1000G) database, the gnomAD database, the ExAC database, and the ESP6500 database.

3. The screening method for a mutant candidate gene according to claim 1, wherein the database in step S1 is the 1000 Genomes (1000G) database.

Technical Field

The invention relates to the technical field of genome sequencing analysis, in particular to a screening method of a mutant candidate gene.

Background

With the development and maturation of second-generation sequencing technology and the continuing fall in sequencing costs, second-generation sequencing is increasingly applied to research on human diseases, with DNA sequencing being the most widely used form. However, as sequencing depth increases, more and more non-synonymous mutations are discovered, and effectively screening meaningful mutations and genes out of a large number of non-synonymous mutations has become a difficult point of research.

The current solutions mainly include:

1) Screening and filtering against known public databases (dbSNP, gnomAD and the like). Because the population size of the public databases is limited, only a small proportion of non-synonymous mutations can be removed; the number of mutations is reduced, but the more meaningful genes cannot be accurately screened from the mutations that remain.

2) Eliminating non-synonymous mutations by means of a control study. This approach is only suitable for diseases that have control tissue (such as tumors) and cannot be applied to autoimmune diseases or to the many complex diseases of unknown cause. Moreover, even where control-based elimination is possible, a large amount of mutation data still remains for non-specific tumors and when the number of research samples is large, so the major gene is difficult to screen accurately.

3) Performing statistical analysis on genes using the background non-synonymous mutation rate reported by existing studies in order to find the major genes. Although this approach can estimate the background mutation frequency, populations and diseases differ greatly, so calculations that use a published background mutation frequency directly can deviate considerably and bias the result.

Disclosure of Invention

To solve the above technical problems, the present invention aims to provide a screening method for mutant candidate genes that can quickly and accurately find and screen out major genes of significance from a large amount of non-synonymous mutation data, so that they can be taken forward for further research and validation.

To achieve this technical effect, the invention adopts the following technical solution:

A screening method for a mutant candidate gene, comprising steps S1-S4:

S1: taking samples of the normal healthy population in a public database as control samples and, gene by gene, classifying the known non-synonymous mutation sites together with the base types on both sides of each site, to construct a mutation classification model that serves as the normal mutation classification model data.

S2: taking the control sample data of the disease under study as research samples and, gene by gene, classifying the mutation sites found together with the base types on both sides of each site, and merging the result with the normal mutation classification model data to give the final normal mutation classification model data.

S3: for each gene, traversing every base site of the coding region and randomly predicting, according to the final normal mutation classification model data, whether the site mutates and the type of mutation; counting the non-synonymous mutations on the gene and calculating the background non-synonymous mutation frequency of that gene for the real disease sample group in the simulated environment. The simulation is repeated multiple times to obtain the background non-synonymous mutation frequency of the gene in each simulation.

S4: using the background non-synonymous mutation frequencies from the multiple simulations and the number of non-synonymous mutations that actually occur in the gene, applying a binomial distribution to determine whether the gene is a major gene.

Further, the database in step S1 is at least one of the 1000 Genomes (1000G) database, the gnomAD database, the ExAC database, and the ESP6500 database, preferably the 1000G database.

Compared with the prior art, the invention has the following beneficial effects:

In a first aspect, the screening method provided by the invention constructs a normal mutation classification model from the data of existing public normal samples and simulates the occurrence of mutations, rather than using only known mutation sites and an overall mutation frequency, thereby avoiding the bias caused by using published conclusions directly. In addition, when the number of research samples is small or no control samples exist, so that a study-specific mutation classification model cannot be constructed, the normal mutation classification model can be used directly for mutation simulation and calculation.

In a second aspect, the method merges the actual mutations of the research samples (namely the non-synonymous mutations of the control samples) with the normal mutation classification model to construct the final normal mutation classification model, so that the simulated mutation data can accurately reflect the actual mutation status of the affected population in the specific disease setting.

In a third aspect, the mutation classification model tallies each mutated base site together with the base sites on both sides of it as one unit, so that the actual sequence context is taken into account and the environment in which mutations really occur is reflected.

In a fourth aspect, the method can run the simulation multiple times and calculate the background non-synonymous mutation frequency from the repeated simulations, which avoids the bias caused by a small sample size or by a chance simulation result and makes the result more stable and credible.

Drawings

FIG. 1 is a flow chart of the construction of the final normal mutation classification model data provided by the present invention;

FIG. 2 is a flow chart of the mutation classification model simulation provided by the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.

As shown in FIGS. 1-2, the screening method for a mutant candidate gene provided by the present invention comprises steps S1-S4, wherein:

Step S1: using sample data of normal healthy individuals from the 1000G database, take the known non-synonymous mutations together with the bases on both sides of each mutated site, and tally each mutation site by type of change to construct a normal mutation classification model.

Specifically: suppose that a site in the coding region of a gene has the sequence context C (left base) - G (site) - G (right base), and that non-synonymous mutations are found at this site in the 1000G normal samples, with 13 samples showing G > A, 8 samples showing G > T, 4 samples showing G > C, and 5 samples unchanged. The classification information of the site is then tallied over the four possible states of the base, namely: CGG > CAG 13 (samples), CGG > CTG 8 (samples), CGG > CCG 4 (samples), CGG > CGG 5 (samples).

The non-synonymous mutations of all normal samples are tallied according to this rule. If two sites at different positions in a gene have the same base and the same bases on both sides, their counts are added together. For example, for two sites in the TP53 gene:

Site 1: chr17-7579707, G (left base) - T (site) - T (right base),

mutation classification list of site 1: [GTT > GAT 3, GTT > GTT 17, GTT > GCT 10, GTT > GGT 0];

Site 2: chr17-7579710, G (left base) - T (site) - T (right base),

mutation classification list of site 2: [GTT > GAT 1, GTT > GTT 24, GTT > GCT 0, GTT > GGT 5].

Adding the two gives the final mutation classification list [GTT > GAT 3+1, GTT > GTT 17+24, GTT > GCT 10+0, GTT > GGT 0+5], i.e. [GTT > GAT 4, GTT > GTT 41, GTT > GCT 10, GTT > GGT 5].
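By way of illustration only, the tallying described above can be sketched in Python as follows. The function name add_site and the layout of the model (a counter keyed by "context>outcome") are assumptions made for this sketch; they are not specified by the invention.

```python
from collections import Counter

def add_site(model, left, ref, right, alt_counts, unchanged):
    """Add one site's per-sample counts to the model (hypothetical helper)."""
    context = left + ref + right
    for base in "ACGT":
        # The reference base itself records the samples with no change.
        n = unchanged if base == ref else alt_counts.get(base, 0)
        model[f"{context}>{left}{base}{right}"] += n
    return model

model = Counter()
# TP53 site 1 (chr17-7579707) and site 2 (chr17-7579710), both G-T-T.
add_site(model, "G", "T", "T", {"A": 3, "C": 10, "G": 0}, unchanged=17)
add_site(model, "G", "T", "T", {"A": 1, "C": 0, "G": 5}, unchanged=24)
print(dict(model))
```

Because both sites share the same three-base context, their counts accumulate under the same four keys, which is the pooling behaviour the TP53 example describes.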

Step S2: for the research samples, construct a mutation classification model from the non-synonymous mutations in the control samples and add it to the normal mutation classification model to increase the robustness of the model.

Specifically: suppose that non-synonymous mutations are found in the study control samples at a site in the coding region of a gene with the sequence context C (left base) - G (site) - G (right base), with 8 samples showing G > A, 2 samples showing G > T, 0 samples showing G > C, and 20 samples unchanged. The site is then tallied over the four possible states of the base, namely: CGG > CAG 8 (samples), CGG > CTG 2 (samples), CGG > CCG 0 (samples), CGG > CGG 20 (samples).

The non-synonymous mutations of all control samples are tallied according to this rule. If two sites at different positions in a gene have the same base and the same bases on both sides, their counts are added together. For example, for two sites in the TP53 gene:

Site 1: chr17-7579707, G (left base) - T (site) - T (right base),

mutation classification list of site 1: [GTT > GAT 3, GTT > GTT 17, GTT > GCT 10, GTT > GGT 0];

Site 2: chr17-7579710, G (left base) - T (site) - T (right base),

mutation classification list of site 2: [GTT > GAT 1, GTT > GTT 24, GTT > GCT 0, GTT > GGT 5].

Adding the two gives the final mutation classification list [GTT > GAT 3+1, GTT > GTT 17+24, GTT > GCT 10+0, GTT > GGT 0+5].

The mutation classification model data of the research samples are then merged with the normal mutation classification model data of the 1000G normal samples, namely:

CGG > CAG 13+8 (samples), CGG > CTG 8+2 (samples), CGG > CCG 4+0 (samples), CGG > CGG 5+20 (samples), i.e. 21, 10, 4 and 25 samples respectively.
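As a minimal sketch of this merging step, assuming each model is held as a mapping from "context>outcome" to a sample count (a representation chosen here for illustration, not one stated in the source), the two models can be added entry by entry:

```python
from collections import Counter

# Counts for the C-G-G context taken from the two worked examples.
normal_model = Counter({"CGG>CAG": 13, "CGG>CTG": 8, "CGG>CCG": 4, "CGG>CGG": 5})
control_model = Counter({"CGG>CAG": 8, "CGG>CTG": 2, "CGG>CCG": 0, "CGG>CGG": 20})

# Copy the normal model, then add the control counts to it.
# update() is used rather than "+" because "+" would drop zero-count entries.
final_model = Counter(normal_model)
final_model.update(control_model)
print(dict(final_model))
```

Keeping zero-count entries means every context retains all four outcomes in the final model, which simplifies the later random draw.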

Step S3: for each gene, traverse every base of the coding region using the mutation classification model and randomly draw a change for it (the more often a change occurs in the mutation classification model, the more likely it is to be drawn, although it is never certain), thereby predicting whether a mutation occurs.

Specifically: suppose the coding region of a gene is 5 bp long, i.e. it contains the 5 bases AGTCA flanked by intronic bases: c (intron) AGTCA g (intron). Each base is traversed with the base and the bases on both sides of it taken as one unit, giving the unit list [cAG, AGT, GTC, TCA, CAg].
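The traversal into three-base units can be sketched as below. The function name context_units and the convention of writing flanking intronic bases in lower case follow the example in the text; the function itself is an illustrative assumption.

```python
def context_units(coding, left_flank, right_flank):
    """Return one three-base unit per coding base (hypothetical helper).

    Flanking intronic bases are kept in lower case, as in the example.
    """
    padded = left_flank.lower() + coding.upper() + right_flank.lower()
    # Position i is the central base; positions i-1 and i+1 are its neighbours.
    return [padded[i - 1:i + 2] for i in range(1, len(padded) - 1)]

units = context_units("AGTCA", "c", "g")
print(units)  # ['cAG', 'AGT', 'GTC', 'TCA', 'CAg']
```

One unit is produced for each coding base, so the length of the list equals the length of the coding region.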

According to the constructed normal mutation classification model, each unit consisting of a base and the bases on both sides of it has four possible outcomes for the central base, for example:

[cAG > cAG 10, cAG > cTG 3, cAG > cCG 1, cAG > cGG 1,

AGT > AAT 0, AGT > ATT 3, AGT > ACT 2, AGT > AGT 10,

GTC > GAC 1, GTC > GTC 9, GTC > GCC 2, GTC > GGC 3,

TCA > TAA 3, TCA > TTA 3, TCA > TCA 8, TCA > TGA 1,

CAg > CAg 12, CAg > CTg 1, CAg > CCg 0, CAg > CGg 2]

Each traversed unit is looked up in the normal mutation classification model to find the entries whose bases match it exactly; for example, the first unit cAG finds [cAG > cAG 10, cAG > cTG 3, cAG > cCG 1, cAG > cGG 1].

This is treated as a list in which 10, 3, 1 and 1 are the numbers of times each change occurs; one value is then drawn at random from the list, for example cGG.
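One way to realise this count-weighted random draw is Python's random.choices, which samples in proportion to the supplied weights. The sketch below is an assumption about how the draw could be implemented, not the source's own code; draw_change is a hypothetical name.

```python
import random

def draw_change(entries, rng):
    """Draw one outcome with probability proportional to its count."""
    outcomes = list(entries)
    weights = [entries[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Entries found in the model for the unit cAG.
cag_entries = {"cAG>cAG": 10, "cAG>cTG": 3, "cAG>cCG": 1, "cAG>cGG": 1}

rng = random.Random(0)  # seeded only to make the sketch repeatable
draws = [draw_change(cag_entries, rng) for _ in range(1000)]
print({o: draws.count(o) for o in cag_entries})
```

An outcome with a count of 0 is never drawn, and an unchanged outcome such as cAG > cAG is drawn most often when it has the largest count, which matches the statement that frequent changes are more likely but not certain.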

If cAG > cGG is a non-synonymous mutation in the coding region of the gene, the site is marked 1; any other outcome is marked 0. The marks are summed to give the predicted background non-synonymous mutation frequency of the gene in that simulation. The simulation is repeated multiple times, and the background non-synonymous mutation frequency is calculated for each simulation.

In the example, the coding region of the gene is 5 bp long, i.e. the total number of coding bases of the gene is 5. Suppose that simulating mutations by the traversal above produces 2 non-synonymous mutations among the 5 bases; then:

background non-synonymous mutation frequency of the gene = 2/5.

The mutation simulation can be repeated as many times as needed by the same procedure, with the background non-synonymous mutation frequency calculated each time.
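The repeated simulation can be sketched as follows. Several things here are assumptions for illustration: the names simulate_once and any_change, the treatment of every base change as non-synonymous (a real implementation would need a codon-aware test), and the reading of the GTC row in which the unchanged entry is GTC > GTC.

```python
import random

# Example model, keyed "unit>outcome" -> count.
# Assumption: the unchanged entry for GTC is written GTC>GTC.
MODEL = {
    "cAG>cAG": 10, "cAG>cTG": 3, "cAG>cCG": 1, "cAG>cGG": 1,
    "AGT>AAT": 0, "AGT>ATT": 3, "AGT>ACT": 2, "AGT>AGT": 10,
    "GTC>GAC": 1, "GTC>GTC": 9, "GTC>GCC": 2, "GTC>GGC": 3,
    "TCA>TAA": 3, "TCA>TTA": 3, "TCA>TCA": 8, "TCA>TGA": 1,
    "CAg>CAg": 12, "CAg>CTg": 1, "CAg>CCg": 0, "CAg>CGg": 2,
}
UNITS = ["cAG", "AGT", "GTC", "TCA", "CAg"]

def simulate_once(units, model, is_nonsynonymous, rng):
    """Return the background non-synonymous mutation frequency of one run."""
    marks = 0
    for unit in units:
        entries = {k: v for k, v in model.items() if k.split(">")[0] == unit}
        change = rng.choices(list(entries), weights=list(entries.values()), k=1)[0]
        outcome = change.split(">")[1]
        # Mark 1 only for a real change that is non-synonymous.
        if outcome != unit and is_nonsynonymous(unit, outcome):
            marks += 1
    return marks / len(units)

def any_change(unit, outcome):
    # Placeholder assumption: treat every base change as non-synonymous.
    return True

rng = random.Random(42)  # seeded only to make the sketch repeatable
frequencies = [simulate_once(UNITS, MODEL, any_change, rng) for _ in range(5)]
print(frequencies)
```

Each value in frequencies is the number of marked sites divided by the coding length of 5, so a run with 2 marked sites gives 2/5 as in the worked example.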

Step S4: using the background non-synonymous mutation frequencies calculated from the multiple simulations, together with the actual non-synonymous mutation frequency of the gene (the number of non-synonymous mutations that actually occur in the gene divided by the length of its coding region), calculate a P value from the binomial distribution and thereby judge whether the gene is a major gene.

Continuing the example, suppose that 5 simulations give the list of background non-synonymous mutation frequencies [0.3, 0.1, 0.5, 0.2, 0.2], and that the non-synonymous mutation frequency actually observed for the gene is 0.4.

Comparing the actual frequency with the simulated frequencies shows that 1 simulation gives a background mutation frequency greater than 0.4. The binomial distribution is then evaluated with the following parameters:

k is the number of simulations in which the simulated background mutation frequency exceeds the actual frequency, 1 in the example;

n is the number of simulations, 5 in the example;

p is the probability of that outcome in a single simulation, fixed at 0.5.
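The binomial formula itself is not reproduced in this text. As a hedged sketch, assuming the standard binomial probability C(n, k) p^k (1 - p)^(n - k) and taking k as the number of simulations whose frequency exceeds the actual frequency, the example can be evaluated as below. Whether the P value is this point probability or the cumulative probability of k or fewer such simulations is not stated, so both are shown.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

simulated = [0.3, 0.1, 0.5, 0.2, 0.2]
actual = 0.4

n = len(simulated)
k = sum(f > actual for f in simulated)  # simulations exceeding the actual frequency

print(k, n)                      # 1 5
print(binomial_pmf(k, n, 0.5))   # 0.15625
print(binomial_cdf(k, n, 0.5))   # 0.1875
```

With only 5 simulations the smallest attainable value is 0.5^5 = 0.03125, so in practice a much larger number of simulations would be needed for the 0.05 threshold to be informative.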

If the calculated P value is less than 0.05, the gene is considered an important or major gene, i.e. the number of non-synonymous mutations that actually occur in it is the result of selection (for example, associated with the occurrence of the disease) rather than of chance.

The P values calculated for all genes are ranked; the smaller the P value, the greater the role of the gene in the development of the disease.
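The ranking step can be sketched as a simple sort. The gene names and P values below are made-up placeholders used only to show the ordering, and rank_genes is a hypothetical helper.

```python
def rank_genes(p_values, threshold=0.05):
    """Sort genes by ascending P value and flag those below the threshold."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return [(gene, p, p < threshold) for gene, p in ranked]

# Placeholder values for illustration only; not results from the source.
example = {"GENE_A": 0.20, "GENE_B": 0.01, "GENE_C": 0.04}
ranking = rank_genes(example)
for gene, p, is_major in ranking:
    print(gene, p, is_major)
```

Genes at the top of the list with the flag set are the candidates that would be taken forward for further research and validation.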

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims. The techniques, shapes, and configurations not described in detail in the present invention are all known techniques.
