Method for establishing genetic risk assessment model of focal epilepsy

Document No.: 117074 | Published: 2021-10-19

Reading note: This technology, "Method for establishing a genetic risk assessment model of focal epilepsy" (一种局灶性癫痫遗传风险评估模型的建立方法), was designed by 张晓芳, 王佳, 王小冬 and 梁萌萌 on 2021-07-13. Abstract: The invention discloses a method for establishing a genetic risk assessment model of focal epilepsy, comprising the following steps: (1) enrolling patients with focal epilepsy, collecting samples from them, genotyping the samples, performing quality control on the genotyping results, determining the enrolled samples, and establishing a study cohort that is divided into a training set and a test set; (2) constructing a focal epilepsy genetic risk locus database; (3) based on the database from step (2) and on the genotyping, selecting the model features and counting them, and generating the feature matrices of the training set and the test set; then, taking the polygenic risk score as the hypothesis parameter, constructing the focal epilepsy genetic risk assessment model and training it. The invention provides a genetic risk score model for focal epilepsy that is suited to the Chinese population.

1. A method for establishing a genetic risk assessment model of focal epilepsy, characterized by comprising the following steps:

(1) enrolling patients with focal epilepsy and an unaffected control population, collecting samples from the enrolled subjects and genotyping them, then performing quality control on the genotyping results of all samples, determining the final enrolled samples, establishing a study cohort, and dividing the cohort into a training set and a test set;

(2) constructing a focal epilepsy genetic risk locus database;

(3) constructing a multivariate focal epilepsy genetic risk assessment model based on the focal epilepsy genetic risk locus database of step (2).

2. The method for establishing the genetic risk assessment model for focal epilepsy according to claim 1, wherein in step (1) the enrolled population samples comprise a case group and a control group, and the inclusion criteria of the case group are as follows: diagnosed with focal epilepsy by two or more neurologists according to the clinical diagnosis and treatment guidelines for epilepsy established by the International League Against Epilepsy (ILAE); age greater than 2 years and less than 90 years; no comorbid psychiatric disorders; no history of pseudoseizures; no history of tobacco or alcohol abuse; no psychogenic or systemic degenerative disorders; no kinship with other enrolled members;

the inclusion criteria of the control group are as follows: healthy, with no psychiatric disease; age greater than 2 years and less than 90 years; ethnicity consistent with the case cohort; no comorbid psychiatric disorders; no history of pseudoseizures; no history of tobacco or alcohol abuse; no psychogenic or systemic degenerative disorders; no kinship with other enrolled members.

3. The method for establishing the genetic risk assessment model for focal epilepsy according to claim 2, wherein in step (1) genotyping the samples and performing quality control on the genotyping results comprises the following steps:

A. performing whole-genome sequencing on all collected samples of the case group and the control group, performing quality control on the raw sequencing data, aligning the sequences with the BWA software, processing the alignment data, and calling SNP/Indel variants with the GATK software to obtain a variant vcf file, wherein the alignment data processing comprises sorting and removing duplicate sequences;

B. performing population-level data quality control on the vcf file from step A with the software plink, and removing from the case group and the control group individuals with a genotype missing rate higher than 0.05, individuals with high heterozygosity, and related individuals;

C. determining the enrolled samples in the case group and the control group, establishing a study cohort, randomly dividing all enrolled samples into a training set and a test set at a ratio of 7:3, and merging the genotypes of the training-set samples and of the test-set samples into their respective storage files.

4. The method for establishing the genetic risk assessment model for focal epilepsy according to claim 1, wherein in step (2) the summary file of a large-scale GWAS meta-analysis led by the International League Against Epilepsy is used to construct a focal epilepsy genetic risk locus database containing the information and effect sizes of a plurality of genetically associated loci.

5. The method for establishing the genetic risk assessment model for focal epilepsy according to claim 1, wherein in step (3) constructing the genetic risk assessment model for focal epilepsy comprises the following steps:

a. based on the genotyping, selecting the features of the model and counting them, and generating the feature matrices of the two cohort data sets, the training set and the test set;

b. taking the polygenic risk score as the hypothesis parameter, constructing the focal epilepsy genetic risk assessment model and training it.

Technical Field

The invention belongs to the field of genetic risk assessment, and particularly relates to a method for establishing a genetic risk assessment model of focal epilepsy.

Background

Currently, in the field of precision medicine, molecular genetic diagnosis of epilepsy relies mainly on next-generation sequencing (gene panels, whole-exome sequencing, etc.) combined with interpretation of genetic variants to identify disease-causing genes and variants. This approach has a limited diagnostic yield in patients with focal epilepsy. A number of previous studies have shown that only a very few types of focal epilepsy can be explained by a single gene or variant (see PMID: 30568546). On the other hand, multiple GWAS have shown that common types of focal epilepsy are associated with low-effect risk alleles at polymorphic sites (see PMID: 24014518).

A polygenic risk score (PRS) is a weighted sum of a subject's low-effect risk alleles and can be used to assess an individual's genetic risk of a disease. The weight of each allele is taken from the results of previous genome-wide association studies (GWAS) of the disease in question. PRS is currently widely used for genetic risk assessment of psychiatric disorders. In the field of epilepsy, Marie et al. (see PMID: 33090489) attempted to assess the genetic heterogeneity of focal epilepsy using the PRS approach in order to account for its clinical heterogeneity. However, no researchers have yet applied PRS to risk stratification and assisted diagnosis of focal epilepsy.
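The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name, the three sites, and their weights are hypothetical and do not come from any GWAS.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 copies per site)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical example: three sites with made-up effect sizes (log odds ratios).
example_weights = [0.10, -0.05, 0.20]
example_dosages = [2, 1, 0]
example_score = polygenic_risk_score(example_dosages, example_weights)
```

Sites at which the subject carries no risk allele contribute nothing, so the score is driven by the carried alleles and their weights.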

Because a polygenic risk score evaluates only the genetic factors of a disease, its assessment performance is limited by the heritability of the disease. In addition, the weights of the risk loci used to calculate a PRS come from previous GWAS, so the assessment performance is also affected by the limitations of existing GWAS, such as the "winner's curse" and population bias.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

The invention aims to provide a method for establishing a genetic risk assessment model of focal epilepsy, overcoming the current shortcoming that PRS has not been applied to risk stratification and assisted diagnosis of focal epilepsy.

In order to achieve the above object, the present invention provides a method for establishing a genetic risk assessment model of focal epilepsy, comprising the following steps:

(1) enrolling patients with focal epilepsy and an unaffected control population, collecting samples from the enrolled subjects and genotyping them, then performing quality control on the genotyping results of all samples, determining the final enrolled samples, establishing a study cohort, and dividing the cohort into a training set and a test set;

(2) constructing a focal epilepsy genetic risk locus database;

(3) constructing a multivariate focal epilepsy genetic risk assessment model based on the focal epilepsy genetic risk locus database of step (2).

Preferably, in step (1), the population samples comprise a case group and a control group, and the inclusion criteria of the case group are as follows: diagnosed with focal epilepsy by two or more neurologists according to the clinical diagnosis and treatment guidelines for epilepsy established by the International League Against Epilepsy (ILAE); age greater than 2 years and less than 90 years; no comorbid psychiatric disorders; no history of pseudoseizures; no history of tobacco or alcohol abuse; no psychogenic or systemic degenerative disorders; no kinship with other enrolled members;

the inclusion criteria of the control group are as follows: healthy, with no psychiatric disease; age greater than 2 years and less than 90 years; ethnicity consistent with the case cohort; no comorbid psychiatric disorders; no history of pseudoseizures; no history of tobacco or alcohol abuse; no psychogenic or systemic degenerative disorders; no kinship with other enrolled members.

Further, in step (1), genotyping the samples and performing quality control on the genotyping results comprises the following steps:

A. performing whole-genome sequencing on all collected samples of the case group and the control group, performing quality control on the raw sequencing data, aligning the sequences with the BWA software, processing the alignment data, and calling SNP/Indel variants with the GATK software to obtain a variant vcf file, wherein the alignment data processing comprises sorting and removing duplicate sequences;

B. performing data quality control on the vcf file from step A with the software plink, and removing from the case group and the control group individuals with a genotype missing rate higher than 0.05, individuals with high heterozygosity, and related individuals;

C. determining the enrolled samples in the case group and the control group, establishing a study cohort, randomly dividing all enrolled samples into a training set and a test set at a ratio of 7:3, and merging the genotypes of the training-set samples and of the test-set samples into their respective storage files.

Further, in step (2), the summary file of a large-scale GWAS meta-analysis of epilepsy led by the International League Against Epilepsy is used to construct a focal epilepsy genetic risk locus database containing the information and effect sizes of a plurality of genetically associated loci.

Further, in step (3), constructing the genetic risk assessment model of focal epilepsy comprises the following steps:

a. based on the genotyping, selecting the features of the model and counting them, and generating the feature matrices of the two cohort data sets, the training set and the test set;

b. taking the polygenic risk score as the hypothesis parameter, constructing the focal epilepsy genetic risk assessment model and training it.

The method for establishing the genetic risk assessment model of focal epilepsy provided by the invention has the following beneficial effects:

It provides a genetic risk assessment model for focal epilepsy that is suited to the Chinese population. Epilepsy is a disease with strong heterogeneity in clinical phenotype and etiology; this patent focuses on focal epilepsy, which has a low genetic diagnosis rate and a relatively high heritability, in order to assess genetic risk and support genetic diagnosis.

Drawings

Fig. 1 is a flowchart of the steps of the method for establishing a genetic risk assessment model of focal epilepsy according to this embodiment.

Fig. 2 is a flowchart of the analysis for whole-genome sequencing of all collected samples of the case group and the control group in step 1(3)A of the method for establishing a genetic risk assessment model of focal epilepsy in this embodiment.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand its technical solution.

As shown in fig. 1, a method for establishing a genetic risk assessment model of focal epilepsy includes the following steps:

1. Patients with focal epilepsy are enrolled, patient samples are collected, and the samples are genotyped.

(1) Enroll patients with focal epilepsy and establish a study cohort.

Inclusion criteria for the case group: a. diagnosed with focal epilepsy by two or more neurologists according to the clinical diagnosis and treatment guidelines for epilepsy established by the International League Against Epilepsy (ILAE); b. age greater than 2 years and less than 90 years. Exclusion criteria: a. comorbid psychiatric disorders; b. a history of pseudoseizures; c. a history of tobacco or alcohol abuse; d. psychogenic or systemic degenerative disorders; e. kinship with other enrolled members.

Inclusion criteria for the control group: a. healthy, with no psychiatric disease; b. age greater than 2 years and less than 90 years; c. ethnicity consistent with the case cohort. The exclusion criteria are the same as for the case group.

(2) Organize the information of the enrolled subjects.

In accordance with the principle of informed consent, peripheral blood samples were collected from the enrolled members and their basic information was compiled; 1300 subjects were enrolled in the case group and 1400 in the control group.

(3) Perform genotyping and quality control of the genotyping results, and determine the final enrolled samples.

A. Perform whole-genome sequencing on all collected samples of the case group and the control group, perform quality control on the raw sequencing data, align the sequences with BWA (Burrows-Wheeler Aligner), process the alignment data (sorting, removing duplicate sequences, cleaning the data, etc.), and call SNP/Indel variants with GATK to obtain a variant vcf file. The analysis flow is shown in Fig. 2.

B. Use the software plink 1.9 to perform data quality control on the vcf result obtained in step A, removing individuals with a genotype missing rate higher than 0.05, individuals with high heterozygosity, and related individuals. The specific steps are as follows:

Genotype missing-rate quality control:

plink --vcf all.vcf.gz --make-bed --out genotypes

plink --bfile genotypes --geno 0.05 --make-bed --out genotypes

Removal of individuals with high heterozygosity:

plink --bfile genotypes --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP

plink --bfile genotypes --extract indepSNP.prune.in --het --out R_check

Rscript --no-save check_heterozygosity_rate.R

Rscript --no-save heterozygosity_outliers_list.R

sed 's/"//g' fail-het-qc.txt | awk '{print $1,$2}' > het_fail_ind.txt

plink --bfile genotypes --remove het_fail_ind.txt --make-bed --out genotypes

Removal of related individuals:

plink --bfile genotypes --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2

plink --bfile genotypes --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders

plink --bfile genotypes --missing
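As an illustration of what the first two filters compute, the following Python sketch applies a per-individual missing-rate cutoff and a heterozygosity outlier rule to a small in-memory genotype table. It follows the textual description of step B rather than reproducing plink, and several things are assumptions: genotypes are coded 0/1/2 with None for a missing call, and an individual counts as a heterozygosity outlier when its rate lies more than `n_sd` standard deviations from the cohort mean (the source does not show the rule used by its R scripts).

```python
from statistics import mean, pstdev


def missing_rate(genotypes):
    """Fraction of missing calls (None) in one individual's genotype list."""
    return sum(g is None for g in genotypes) / len(genotypes)


def heterozygosity_rate(genotypes):
    """Fraction of heterozygous calls (coded 1) among the non-missing calls."""
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)


def qc_filter(samples, max_missing=0.05, n_sd=3.0):
    """Return the IDs that pass the missing-rate and heterozygosity filters.

    samples maps an individual ID to a list of genotype codes.
    The n_sd outlier rule is an assumed convention, not taken from the source.
    """
    kept = {sid: g for sid, g in samples.items() if missing_rate(g) <= max_missing}
    if not kept:
        return []
    rates = {sid: heterozygosity_rate(g) for sid, g in kept.items()}
    mu = mean(rates.values())
    sd = pstdev(rates.values())
    return [sid for sid, r in rates.items() if abs(r - mu) <= n_sd * sd]
```

Relatedness filtering is left out of the sketch because it requires pairwise identity-by-descent estimates, which plink computes with `--genome`.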

C. Randomly divide the full data set (2247 samples) into a training set (1573) and a test set (674) at a ratio of 7:3, and merge the sample genotypes of the training set and of the test set into vcf-format storage files.

plink --bfile genotype --export vcf --out dataset_vcf
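The random 7:3 division in step C can be sketched as follows. The function name and the fixed seed are assumptions added for reproducibility; the source does not say how the random split was drawn.

```python
import random


def split_cohort(sample_ids, train_fraction=0.7, seed=0):
    """Shuffle the sample IDs and split them into training and test lists."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 2247 samples remain after quality control, as stated in step C.
train_ids, test_ids = split_cohort(range(2247))
```

With 2247 samples this yields 1573 training and 674 test samples, matching the counts given in step C.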

2. Construction of the focal epilepsy genetic risk locus weight database and the risk assessment model

(1) Download the summary files of the large-scale GWAS meta-analysis of epilepsy led by the International League Against Epilepsy Consortium on Complex Epilepsies (ILAE Consortium on Complex Epilepsies), and construct a risk locus database containing the information and effect sizes of 4,833,539 genetically associated loci. The construction process is as follows:

wget http://www.epigad.org/gwas_ilae2018_16loci/focal_epilepsy_METAL.gz

wget http://www.epigad.org/gwas_ilae2018_16loci/focal_lesion_negative_BOLT-LMM_final.gz

awk '{if ($15 < 1e-1) print $0}' focal_epilepsy_METAL > FE.effect.snp

for i in `cat FE.effect.snp`; do grep $i focal_lesion_negative_BOLT-LMM_final >> FE.snp.effect.db; done
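The same two-stage selection (keep sites whose meta-analysis p-value is below 0.1, then look up their effect sizes in the second summary file) can be sketched in Python on in-memory rows. The column names `SNP`, `P` and `BETA` and the function name are hypothetical; the real summary files may use different headers.

```python
def build_risk_site_db(meta_rows, effect_rows, p_threshold=0.1):
    """Map each selected site ID to its effect size.

    meta_rows: rows of the meta-analysis summary (hypothetical keys "SNP", "P").
    effect_rows: rows of the second summary file (hypothetical keys "SNP", "BETA").
    """
    selected = {row["SNP"] for row in meta_rows if float(row["P"]) < p_threshold}
    return {
        row["SNP"]: float(row["BETA"])
        for row in effect_rows
        if row["SNP"] in selected
    }
```

One design choice in this sketch is exact matching on the site ID. A plain substring search would also match longer IDs that merely contain the query (for example `rs1` inside `rs10`), and a set lookup avoids rescanning the second file once per site.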

(2) Construct a model comprising multiple variables based on the risk locus database FE.snp.effect.db from the previous step. The features of the model are $(x_1, x_2, x_3, \ldots, x_n)$,

where $n = 4{,}833{,}539$ is the number of features, i.e. a total of 4,833,539 polymorphic sites.

In the model:

$x^{(i)}$ denotes the i-th individual; it is the i-th row of the feature matrix and is a vector of 4,862,783 feature effect sizes;

$x_j^{(i)}$ denotes the effect size of the j-th feature (i.e. SNP genotype) of the i-th individual; its value depends on whether the genotype of the i-th individual at locus j contains 0, 1, or 2 copies of the at-risk allele.

A feature matrix (dataset.risk.matrix) is thus generated from the genotyping for each of the two data sets, the training set and the test set, for later analysis. The matrix format is as follows:

x0 x1 x2 ... xn y
0.9996477 1.0007279 1.0007296 ... 1 1
0.9992955 1 1.00146 ... 1 1
1 1.0007279 1 ... 0.9587425 0
... ... ... ... ... ...

y is the predicted value: y = 1 represents a high genetic risk of the disease, and y = 0 represents a low genetic risk of the disease.
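The text does not state how the number of risk-allele copies is turned into a feature value. The sample rows above are consistent with raising a per-site odds ratio to the number of copies (for example, 0.9996477 squared is approximately 0.9992955, and 0 copies gives 1), so the sketch below uses that encoding. This is an inference from the example values, not a formula given in the source, and the function names are hypothetical.

```python
def encode_site(risk_allele_copies, odds_ratio):
    """Feature value for one site: odds ratio raised to the copy number (assumed)."""
    if risk_allele_copies not in (0, 1, 2):
        raise ValueError("risk_allele_copies must be 0, 1 or 2")
    return odds_ratio ** risk_allele_copies


def build_feature_row(copies_per_site, odds_ratios, label):
    """One row of the feature matrix: encoded sites followed by the label y."""
    if len(copies_per_site) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per site")
    return [encode_site(c, o) for c, o in zip(copies_per_site, odds_ratios)] + [label]
```

Under this assumed encoding, a site at which the individual carries no risk allele contributes a neutral value of 1.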

(3) Model construction, taking the PRS (polygenic risk score) as the hypothesis parameter.

$$\mathrm{PRS}(i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

During model training, when y = 1, h(i) ≈ 1 and PRS(i) >> 0; when y = 0, h(i) ≈ 0 and PRS(i) << 0. Model training introduces a logistic regression cost function.
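The cost function itself is not reproduced in the text. For reference, the standard logistic regression formulation, assumed here and written in the notation used above with $m$ training individuals, is:

$$h(i) = \frac{1}{1 + e^{-\mathrm{PRS}(i)}}$$

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(i) + \left(1 - y^{(i)}\right) \log\left(1 - h(i)\right) \right]$$

Under this form, h(i) approaches 1 when PRS(i) >> 0 and approaches 0 when PRS(i) << 0, which matches the behaviour described for y = 1 and y = 0.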

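The implementation is as follows:

A total of 2247 enrolled samples were randomly partitioned, with 70% of the samples used for training and 30% for testing.

% Assumes the libsvm MATLAB interface: svmtrain(labels, data, options)
model = svmtrain(trainlabel, traindata, '-s 0 -t 0 -c 1.2');

% svmpredict(labels, data, model) returns the predicted labels
q = svmpredict(trainlabel, traindata, model);

P = svmpredict(testlabel, testdata, model);

% Accuracy is the percentage of predictions that match the true labels
training_accuracy = mean(double(q == trainlabel)) * 100

test_accuracy = mean(double(P == testlabel)) * 100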
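The text describes a logistic-regression formulation, whereas the listing above calls SVM-style training functions. The following self-contained Python sketch follows the text: it fits the parameters θ by batch gradient descent on the logistic cost and reports accuracy as a percentage. The learning rate, the number of epochs, and all function names are illustrative assumptions, not values from the source.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prs(theta, x):
    """PRS = theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit theta by batch gradient descent on the logistic regression cost."""
    m, n = len(rows), len(rows[0])
    theta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(rows, labels):
            err = sigmoid(prs(theta, x)) - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta


def accuracy(theta, rows, labels):
    """Percentage of individuals whose thresholded prediction matches the label."""
    preds = [1 if sigmoid(prs(theta, x)) >= 0.5 else 0 for x in rows]
    return 100.0 * sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In use, `train_logistic` would be fitted on the training rows only, and `accuracy` would then be evaluated separately on the training and test rows, mirroring the two accuracy figures computed in the listing.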

Most existing large-scale GWAS cohorts are of Caucasian ancestry, so polygenic risk assessment systems built on them generally perform poorly in the Chinese population. The focal epilepsy genetic risk assessment model established by this method is a genetic risk scoring model suited to the Chinese population. Epilepsy is a disease with strong heterogeneity in clinical phenotype and etiology; the method focuses on focal epilepsy, which has a low genetic diagnosis rate and a relatively high heritability, to assess genetic risk, with the aim of providing an additional method for genetic diagnosis.

Specific examples are used herein to explain the inventive concept in detail; they are given only to aid understanding of the core idea of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention fall within the scope of the present invention.
