Pathogenic gene locus database and establishment method thereof

Document No.: 1088703. Publication date: 2020-10-20.

Summary: This technology, "Pathogenic gene locus database and establishment method thereof", was created by Liu Jingxing (刘晶星), Yu Shihui (于世辉), and Yu Changshun (喻长顺) on 2020-06-30. The invention relates to a pathogenic gene locus database and a method for establishing it, and belongs to the technical field of disease gene detection. The method comprises the following steps: acquiring clinically verified pathogenic gene locus data as reference data; identifying gene sites in the reference data that are pathogenic due to an amino acid change, and expanding the codons of the amino acid at each such site; identifying gene sites in the reference data that are pathogenic due to a splice-site change, and expanding the other mutation forms of each such site; and screening the resulting data to remove sites whose population mutation frequency is higher than a predetermined threshold, the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites being combined with the reference data to form the pathogenic gene locus database. The database contains a large number of records for sites with high pathogenic risk, which reduces the chance of omissions and greatly improves the accuracy and efficiency of clinical interpretation.

1. A method for establishing a pathogenic gene locus database, characterized by comprising the following steps:

acquiring reference data: acquiring clinically verified pathogenic gene locus data as reference data;

expanding to obtain mutation site data: identifying gene sites in the reference data that are pathogenic due to an amino acid change, expanding the codons of the amino acid at each such site, and analyzing the expanded codons against predetermined mutation conditions to obtain high-risk pathogenic mutation site data, which is compiled for later use;

expanding to obtain splice site data: identifying gene sites in the reference data that are pathogenic due to a splice-site change, and expanding the other mutation forms of each such site to obtain high-risk pathogenic splice site data, which is compiled for later use;

screening the expanded sites: screening the obtained high-risk pathogenic mutation site data and high-risk pathogenic splice site data, removing sites whose population mutation frequency is higher than a predetermined threshold, and combining the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites with the reference data to form the pathogenic gene locus database.

2. The method for creating a database of pathogenic loci according to claim 1, wherein the reference data is derived from the HGMD database and/or the ClinVar database.

3. The method for creating a pathogenic loci database according to claim 1, wherein in the step of expanding to obtain mutation site data, the predetermined mutation conditions include the following three classes:

class I mutation: the amino acid encoded by the mutated codon is the same as the one recorded in the reference data;

class II mutation: the mutated codon is a stop codon;

class III mutation: the amino acid encoded by the mutated codon differs from the one recorded in the reference data, and the mutated codon is not a stop codon.

4. The method for creating a pathogenic loci database according to claim 3, wherein a mutation that satisfies both the class I and the class II conditions is determined to be a class I mutation.

5. The method for creating a pathogenic gene locus database according to claim 3, wherein in the step of expanding to obtain splice site data, the splice site is expanded by mutating the mutation site in the reference data into a nucleotide different from the one recorded in the reference data, the result being a class Is mutation.

6. The method for creating a database of pathogenic loci according to claim 1, wherein in the extended locus screening step, the predetermined threshold is 5%.

7. The method for creating a database of pathogenic loci according to claim 1, wherein in the step of screening the expanded sites, the following filtering is further performed on sites that have no definite population mutation frequency and on sites whose population mutation frequency is below the predetermined threshold after screening:

1) searching a local sample library for samples carrying the site; if the number of such samples is less than a predetermined number of samples, retaining the site as a high-risk pathogenic site; if the number is greater than or equal to the predetermined number of samples, marking the samples as to be confirmed and proceeding to the next step;

2) acquiring the clinical information corresponding to the samples to be confirmed; if the proportion of samples whose clinical information is related to the gene function of the site exceeds a predetermined ratio, retaining the site as a high-risk pathogenic site; if that proportion is less than or equal to the predetermined ratio, rejecting the site.

8. The method for creating a database of pathogenic loci according to claim 7, wherein the predetermined number of samples is 10 and the predetermined ratio is 1/3.

9. The database of pathogenic gene loci obtained by the method for creating a database of pathogenic gene loci according to any one of claims 1 to 8.

10. An automatic analysis system for a pathogenic gene, comprising:

a data acquisition module for acquiring gene detection data of a sample to be tested;

a data analysis module for comparing the gene detection data, after bioinformatics analysis, against the pathogenic gene locus database of claim 5, so as to obtain information on class I, class II, class III and/or class Is mutations in the sample to be tested;

and a judgment output module for outputting the site mutation information by risk grade, the risk grades being, in order: class I mutations, class Is mutations, class II mutations, class III mutations.

Technical Field

The invention relates to the technical field of disease gene detection, in particular to a pathogenic gene locus database and an establishment method thereof.

Background

Genetic variation is both polymorphic and potentially pathogenic. Each person's genome carries about 4 million variants, most of which are normal, non-pathogenic sites, that is, polymorphic sites. Establishing that a site is pathogenic requires a complex verification process and a long-term accumulation of evidence.

At present there are several databases that record pathogenic loci, such as HGMD and ClinVar. However, the entries they record are all mutations that have actually occurred, that is, mutations supported by real sample cases and verified against clinical symptoms. As a result, most of the loci recorded in these databases are relatively common ones.

In practice, it is difficult to collect enough samples of uncommon loci for pathogenicity studies, so these loci are not included in the databases. Yet because the relationship between gene mutations and disease symptoms is both diverse (different mutations of the same gene may cause different symptoms) and heterogeneous (one symptom may be caused by several different gene mutations), the proportion of pathogenic loci discovered so far is very low, and the significance of many mutations is unknown. Each such locus is rare on its own, but taken together they are numerous.

These variants of unverified significance play a very important role in the detection of pathogenic gene mutations. If genetic testing relies only on the common loci recorded in a database, many meaningful loci may be overlooked. The impact is especially large for compound heterozygous pathogenic genes: it greatly increases the difficulty of detection and reduces diagnostic efficiency.

Disclosure of Invention

In view of the above, there is a need for a pathogenic gene locus database that mines unverified high-risk loci for later use. By raising the risk weight of such loci when detected mutation sites are analyzed, the database makes it easier for an analyst to notice them, thereby reducing the difficulty of detection and improving diagnostic efficiency.

A method for establishing a pathogenic gene locus database comprises the following steps:

acquiring reference data: acquiring clinically verified pathogenic gene locus data information as reference data;

expanding to obtain mutation site data: identifying gene sites in the reference data that are pathogenic due to an amino acid change, expanding the codons of the amino acid at each such site, and analyzing the expanded codons against predetermined mutation conditions to obtain high-risk pathogenic mutation site data, which is compiled for later use;

expanding to obtain splice site data: identifying gene sites in the reference data that are pathogenic due to a splice-site change, and expanding the other mutation forms of each such site to obtain high-risk pathogenic splice site data, which is compiled for later use;

screening the expanded sites: screening the obtained high-risk pathogenic mutation site data and high-risk pathogenic splice site data, removing sites whose population mutation frequency is higher than a predetermined threshold, and combining the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites with the reference data to form the pathogenic gene locus database.

The inventors found in practice that, because the sites recorded in the various pathogenic site databases are all sites that have occurred in real samples and been verified, a large number of sites related to them also carry a high pathogenic risk. Although these high-risk sites have not been verified, the method above can mine them for later use, thereby reducing the difficulty of detection and improving diagnostic efficiency.

It will be understood that, in the step of expanding to obtain mutation site data, the gene sites considered are those that are pathogenic due to an amino acid change. The core idea is to consider single-base substitutions: a substitution changes the codon, the changed codon changes the amino acid, and the changed amino acid causes disease. The predetermined mutation conditions are therefore classified and analyzed according to the codon corresponding to the amino acid and the single-base substitutions that can occur. If an amino acid corresponds to 3 codons, at most 9 codon patterns are possible by permutation and combination; each pattern is then mapped to its corresponding amino acid (or stop codon), and the pathogenic risk of the site is analyzed and evaluated on that basis.
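The single-base substitution idea described above can be illustrated with a minimal Python sketch. It assumes that expansion means changing exactly one of the three bases of a codon, so that a codon has 3 × 3 = 9 single-base variants. The function name `single_base_substitutions` and the example codon are hypothetical and are not taken from the patent.

```python
BASES = "ACGT"


def single_base_substitutions(codon):
    """Return every codon that differs from `codon` at exactly one position."""
    codon = codon.upper()
    variants = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                # Replace only the base at `pos`, keep the other two bases.
                variants.append(codon[:pos] + base + codon[pos + 1:])
    return variants


# Hypothetical example codon, chosen only for illustration.
variants = single_base_substitutions("CGA")
print(len(variants), variants)
```

Each returned codon would then be mapped to its amino acid (or stop codon) before any risk assessment.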

In one embodiment, the reference data is derived from the HGMD database and/or the ClinVar database. It will be appreciated that the reference data source is not limited and need only be a database that is as authoritative and comprehensive as possible.

In one embodiment, in the step of expanding to obtain mutation site data, the predetermined mutation conditions include the following three classes:

class I mutation: the amino acid encoded by the mutated codon is the same as the one recorded in the reference data;

class II mutation: the mutated codon is a stop codon;

class III mutation: the amino acid encoded by the mutated codon differs from the one recorded in the reference data, and the mutated codon is not a stop codon.

It will be understood that class III covers the missense mutations that fall into neither class I nor class II.

In one embodiment, a mutation that satisfies both the class I and the class II conditions is determined to be a class I mutation. Both conditions are satisfied at once when the mutation in the original database is itself a stop mutation; an expanded new mutation that is also a stop mutation is therefore preferentially assigned to class I. Put another way, class II is the class assigned to an expanded stop mutation when the mutation in the original database is not a stop mutation, and class II carries a lower pathogenic risk than class I.
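The three classes and the class I priority rule can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard genetic code (written as DNA codons), it treats "the reference data" as the mutated codon recorded for the known pathogenic variant, and it returns `None` for synonymous changes and for the known variant itself, neither of which the text assigns to a class. The function name `classify_expanded_codon` and the example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def classify_expanded_codon(ref_codon, known_codon, new_codon):
    """Classify an expanded codon as 'I', 'II' or 'III' (or None).

    ref_codon   -- wild-type codon at the site
    known_codon -- mutated codon of the verified pathogenic variant
    new_codon   -- expanded (candidate) mutated codon
    """
    ref_aa = CODON_TABLE[ref_codon]
    known_aa = CODON_TABLE[known_codon]
    new_aa = CODON_TABLE[new_codon]
    if new_codon == known_codon or new_aa == ref_aa:
        return None  # assumption: known variant and synonymous changes are skipped
    if new_aa == known_aa:
        return "I"   # checked first, so a stop matching a known stop is class I
    if new_aa == "*":
        return "II"
    return "III"


# Hypothetical site: wild-type AAA (Lys), known pathogenic change to AAT (Asn).
print(classify_expanded_codon("AAA", "AAT", "AAC"))
print(classify_expanded_codon("AAA", "AAT", "TAA"))
print(classify_expanded_codon("AAA", "AAT", "GAA"))
```

Testing the class I condition before the class II condition is what implements the priority rule: when the known variant is itself a stop mutation, a new stop codon matches it and never reaches the class II branch.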

In one embodiment, in the step of expanding to obtain splice site data, the splice site is expanded by mutating the mutation site in the reference data into a nucleotide different from the one recorded in the reference data; the result is a class Is mutation.
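A minimal Python sketch of this splice-site expansion is given below. It assumes a single-nucleotide variant described by chromosome, position, wild-type base and known mutated base, and it generates the remaining alternative bases as class Is candidates. The function name `expand_splice_site`, the tuple layout and the example coordinates are hypothetical.

```python
def expand_splice_site(chrom, pos, ref_base, known_alt):
    """Return class Is candidates: the same position mutated to every base
    other than the wild-type base and the already recorded mutated base."""
    return [
        (chrom, pos, ref_base, alt, "Is")
        for alt in "ACGT"
        if alt not in (ref_base, known_alt)
    ]


# Hypothetical recorded splice-site variant: G>A at chr1:100.
candidates = expand_splice_site("chr1", 100, "G", "A")
print(candidates)
```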

In one embodiment, in the step of screening the expanded sites, the predetermined threshold is 5%. After screening and tuning on the inventors' in-house large-sample data, a threshold of 5% was found to work well: it surfaces as many potential high-risk sites as possible while avoiding the loss of risk-flagging value that too many meaningless mutations would cause.
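The frequency screen can be sketched in Python as below. The sketch assumes each expanded site is a dictionary with a hypothetical `pop_freq` key holding the population mutation frequency as a fraction, or `None` when no definite frequency is available. Sites without a frequency are kept here so that they can go through the later local-sample filtering.

```python
def filter_by_population_frequency(sites, threshold=0.05):
    """Drop sites whose population mutation frequency exceeds `threshold`."""
    kept = []
    for site in sites:
        freq = site.get("pop_freq")
        # Keep sites with no definite frequency or a frequency not above the threshold.
        if freq is None or freq <= threshold:
            kept.append(site)
    return kept


# Hypothetical expanded sites, for illustration only.
sites = [
    {"id": "site_a", "pop_freq": 0.20},
    {"id": "site_b", "pop_freq": 0.001},
    {"id": "site_c", "pop_freq": None},
]
kept_ids = [s["id"] for s in filter_by_population_frequency(sites)]
print(kept_ids)
```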

In one embodiment, in the step of screening the expanded sites, the following filtering is further performed on sites that have no definite population mutation frequency and on sites whose population mutation frequency is below the predetermined threshold after screening:

1) searching a local sample library for samples carrying the site; if the number of such samples is less than a predetermined number of samples, retaining the site as a high-risk pathogenic site; if the number is greater than or equal to the predetermined number of samples, marking the samples as to be confirmed and proceeding to the next step;

2) acquiring the clinical information corresponding to the samples to be confirmed; if the proportion of samples whose clinical information is related to the gene function of the site exceeds a predetermined ratio, retaining the site as a high-risk pathogenic site; if that proportion is less than or equal to the predetermined ratio, rejecting the site.

It will be understood that, because of biological polymorphism, treating every mutation related to a verified mutation as a high-risk pathogenic site and adding it to the database would dilute the value of the risk flag. The initially screened site data should therefore be filtered so that only high-risk sites are retained, which increases the practical value of the pathogenic gene locus database.

In one embodiment, the predetermined number of samples is 10 and the predetermined ratio is 1/3. After screening and tuning on the inventors' in-house large-sample data, a database built with these parameters was found to work well.
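The two-step local-sample filter with these parameters can be sketched in Python as follows. The sketch assumes the caller already knows how many local samples carry the site and how many of those have clinical information related to the gene function of the site; how that relatedness is judged is outside the sketch. The function name `keep_after_local_check` and its parameter names are hypothetical.

```python
from fractions import Fraction


def keep_after_local_check(n_samples, n_related,
                           min_samples=10, min_ratio=Fraction(1, 3)):
    """Return True if the site is retained as a high-risk pathogenic site.

    n_samples -- local samples carrying the site
    n_related -- of those, samples whose clinical information is related
                 to the gene function of the site
    """
    # Step 1: a site seen in fewer than `min_samples` samples is retained.
    if n_samples < min_samples:
        return True
    # Step 2: otherwise retain only if the related proportion exceeds `min_ratio`.
    return Fraction(n_related, n_samples) > min_ratio


print(keep_after_local_check(5, 0))
print(keep_after_local_check(12, 5))
print(keep_after_local_check(12, 4))
```

Using `Fraction` keeps the comparison with 1/3 exact, so a proportion of exactly one third is rejected, as the "less than or equal to" wording requires.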

The invention also discloses a pathogenic gene locus database obtained by the establishment method of the pathogenic gene locus database.

The invention also discloses an automatic pathogenic gene analysis system, which comprises:

a data acquisition module for acquiring gene detection data of a sample to be tested;

a data analysis module for comparing the gene detection data, after bioinformatics analysis, against the pathogenic gene locus database, so as to obtain information on class I, class II, class III and/or class Is mutations in the sample to be tested;

and a judgment output module for outputting the site mutation information by risk grade, the risk grades being, in order: class I mutations, class Is mutations, class II mutations, class III mutations.
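The ordering applied by the judgment output module can be sketched in Python as below. The sketch assumes each matched site is a dictionary with a hypothetical `class` key holding one of the four class labels. The names `RISK_ORDER` and `sort_by_risk` and the example records are illustrative only.

```python
# Risk grades from highest to lowest, as listed for the output module.
RISK_ORDER = {"I": 0, "Is": 1, "II": 2, "III": 3}


def sort_by_risk(hits):
    """Return matched sites ordered by risk grade: I, Is, II, III."""
    # sorted() is stable, so sites of the same class keep their input order.
    return sorted(hits, key=lambda hit: RISK_ORDER[hit["class"]])


# Hypothetical matches from comparing a sample against the database.
hits = [
    {"site": "s1", "class": "III"},
    {"site": "s2", "class": "Is"},
    {"site": "s3", "class": "I"},
    {"site": "s4", "class": "II"},
]
ordered = [h["site"] for h in sort_by_risk(hits)]
print(ordered)
```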

Compared with the prior art, the invention has the following beneficial effects:

In the method for establishing the pathogenic gene locus database, the pathogenic gene locus data is enriched by expanding amino-acid-change mutations and splice-site mutations, and the expanded loci are then screened and filtered. The result is a pathogenic gene locus database that is both rich in high-risk pathogenic loci and of good practical value. It makes it easier for an analyst to notice the additional pathogenic risks associated with verified pathogenic sites, thereby reducing the difficulty of detection and improving diagnostic efficiency.

The pathogenic gene locus database of the invention contains a large number of records for loci with high pathogenic risk. By matching detected gene loci against these records, high-risk pathogenic loci can be located quickly, which reduces the chance of omissions and greatly improves the accuracy and efficiency of clinical interpretation.

The pathogenic gene locus database can be used in an automatic pathogenic gene analysis system. Mutation sites with possible pathogenic risk are obtained through an automated analysis process, which lowers the experience required of analysts during bioinformatics analysis, reduces the difficulty of detection and analysis, and improves diagnostic efficiency.

Drawings

FIG. 1 is a table of amino acid codons.

Detailed Description

To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

The data used in the following examples were collected and collated from the company's routine samples.
