Family coronary heart disease risk assessment and risk factor identification system

Document No.: 193413. Published: 2021-11-02.

Note: this technology, "Family coronary heart disease risk assessment and risk factor identification system", was designed by 马玉昆李�根贾寒韩仕伟孙琼琳李伟华 on 2021-08-23. Abstract: The invention discloses a family coronary heart disease risk assessment and risk factor identification system. The claimed device for assessing family coronary heart disease risk and identifying its risk factors comprises a data collecting and sorting module, a polygenic risk score calculation module, a personal risk prediction model building module, a family risk prediction model building module, and a module for assessing factors favorable and harmful to a specific disease. Specifically, starting from sample genotype data, a polygenic risk score algorithm is used to calculate each sample's personal risk score; a prediction model is then built with machine learning algorithms; a disease risk assessment for each family within a pedigree is given by calculating the average disease probability; and Mendelian randomization is used to identify favorable and harmful factors that have a significant causal association with the disease. This helps families better avoid the risk of coronary heart disease and stay healthy, and provides supporting evidence and related methods for the prevention, treatment and prognosis of coronary heart disease.

1. A device for family specific disease risk prediction and risk factor identification, characterized in that the device comprises the following modules:

A. a data collecting and sorting module: for obtaining whole genome genotype data of an individual sample associated with the particular disease, GWAS data for the particular disease and whole genome genotype data of a family sample;

B. a polygenic risk score calculation module: for obtaining a polygenic risk score (PRS) for each of the individual samples;

C. a personal risk prediction model building module: for determining an optimal individual specific disease risk prediction model based on the polygenic risk scores obtained in module B;

the C module comprises the following modules:

C1) a model building module: for building a plurality of individual specific disease risk prediction models;

C2) model training and testing module: for obtaining an optimal individual specific disease risk prediction model;

D. a family risk prediction model building module: for obtaining a family risk prediction result from a family risk prediction model;

the D module comprises the following modules:

D1) a family map calculation module: for determining the genetic relationships among the family samples and identifying the families within the family samples;

D2) an individual disease risk prediction module: for obtaining a personal disease risk prediction value for each sample in the family samples;

D3) a family disease risk prediction module: for predicting the disease risk of each family;

E. disease-specific favorable and harmful factor assessment module: for determining risk factors and beneficial factors for the family-related specific disease;

the E module comprises the following modules:

E1) a specific disease-related exposure factor data acquisition module: for obtaining GWAS data for the exposure factors and GWAS data for the outcome variable; the outcome variable is the specific disease;

E2) an instrumental variable screening and determination module: for determining candidate instrumental variables;

E3) a causal relationship evaluation module of the exposure factors and the outcome variables: for evaluating a causal relationship of the exposure factor to the outcome variable;

E4) disease-specific favorable and harmful factor assessment module: for assessing risk factors and beneficial factors of the specific disease associated with the family.

2. The apparatus of claim 1, wherein: the whole genome genotype data of module A is qualified SNP locus data of qualified samples obtained through quality control and genotype imputation.

3. The apparatus of claim 1 or 2, wherein: C1) the model building module is built by a method comprising the following steps: based on the polygenic risk score of each sample obtained in module B, combined with the characteristic data of the sample, disease risk prediction models for the personal specific disease are built using multiple machine learning methods; the characteristic data comprises age and gender information of the sample;

and/or, C2) the model training and testing module is built by a method comprising the steps of:

splitting the individual samples in the module A, randomly selecting 80% of the individual samples as a training sample set, and selecting the remaining 20% of the individual samples as a test sample set; determining the data of the training sample set as training data, and determining the data of the test sample set as test data;

training the disease risk prediction model of the personal specific disease obtained in C1 by using the training data to obtain a regression coefficient of the disease risk prediction model;

testing the disease risk prediction model by using the test data, drawing an ROC curve, and calculating an area value under the ROC curve; and selecting the disease risk prediction model with the largest area value under the ROC curve as an optimal individual specific disease risk prediction model.

4. The apparatus of claim 3, wherein: the multiple machine learning methods are logistic regression, k nearest neighbor, decision tree, random forest and/or SVM; the personal specific disease risk prediction model is a logistic regression prediction model, a k-nearest neighbor prediction model, a decision tree prediction model, a random forest prediction model and/or an SVM prediction model.

5. The apparatus of any one of claims 1-4, wherein: D2) the individual disease risk prediction module of the family sample is established by a method comprising the following steps:

based on the optimal individual specific disease risk prediction model obtained in the module C, individual specific disease risk prediction is carried out on the samples in the family samples, and an individual specific disease risk prediction value of each sample in the family samples is obtained;

and/or, D3) the family disease risk prediction module is established by a method comprising the following steps:

determining a decision threshold for family disease risk from the individual specific disease risk prediction value of each sample in the family samples obtained in module D2), and predicting the specific disease risk of each family according to the decision threshold.

6. The apparatus of any one of claims 1-5, wherein: the specific disease is coronary heart disease; the optimal individual specific disease risk prediction model is an SVM prediction model.

7. The apparatus of any one of claims 1-6, wherein: the exposure factor is a micronutrient.

8. The apparatus of claim 6 or 7, wherein: the causal relationship between the exposure factor and the outcome variable is a significant causal relationship between reduced zinc content and coronary heart disease; the family coronary heart disease risk factor is zinc.

9. A family specific disease risk prediction device comprising modules A, B, C and D of the device of any one of claims 1-7.

10. A computer readable storage medium having stored thereon a computer program for causing a computer to perform the steps of establishing the apparatus of any one of claims 1-7 or the apparatus of claim 9.

Technical Field

The invention relates to the field of bioinformatics, in particular to a family coronary heart disease risk assessment and risk factor identification system.

Background

Coronary atherosclerotic heart disease, often abbreviated as "coronary heart disease", is a heart disease in which atherosclerotic lesions of the coronary arteries narrow or block the vascular lumen, causing myocardial ischemia, hypoxia or necrosis. The World Health Organization classifies coronary heart disease into 5 clinical phenotypes: asymptomatic myocardial ischemia, angina pectoris, myocardial infarction, ischemic heart failure and sudden death. The polygenic risk score (PRS) is a number calculated from the variation at multiple genetic loci and their corresponding weights. When a trait is shaped by variation in many genes, the polygenic risk score is the best predictor of that trait. In genome-wide association studies (GWAS), the polygenic risk score predicts far better than approaches that rely only on genes reaching genome-wide statistical significance, because the studied traits are affected not only by the statistically significant genes but also by many other genes, and the larger the sample size, the more trait-affecting genes are found. For traits with high heritability, other research methods can explain only a small part of the overall variance. With the polygenic risk method, once a polygenic score that explains at least a few percent of the overall variance has been calculated, the score can serve as a lower bound for checking whether the heritability estimate is biased, which gives a reasonable prediction of the trait.

Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory and the study of complex algorithms. It uses computers as tools to simulate human learning in real time, and organizes existing knowledge into structures to improve learning efficiency. Machine learning is a branch of artificial intelligence; its main research object is artificial intelligence, and in particular how to keep improving the performance of a specific algorithm through learning from experience so that its results become more and more accurate. The research directions of traditional machine learning mainly include decision trees, random forests, support vector machines, artificial neural networks, Bayesian networks and the like. Different methods suit different scenarios, and choosing an appropriate method makes the prediction more accurate.

Mendelian randomization is a method that uses genetic variation in non-experimental data to estimate causal relationships between exposure factors and outcome variables, and is now widely used in disease research. In a Mendelian randomization study, the exposure factor refers to a putative causal risk factor, also known as an intermediate phenotype, which can be a biomarker, an anthropometric measure, or any other risk factor that may affect the outcome; the outcome variable usually refers to a disease, but is not limited to disease.

Disclosure of Invention

The invention aims to solve the technical problem of how to evaluate the family coronary heart disease risk and/or identify the family coronary heart disease risk factors.

In order to solve the technical problems, the invention firstly provides a device for predicting the risk of the family specific diseases and identifying the risk factors of the family specific diseases. The apparatus may include the following modules:

A. a data collecting and sorting module: for obtaining whole genome genotype data of an individual sample associated with the specific disease, GWAS data of the specific disease and whole genome genotype data of a family sample.

B. A polygenic risk score calculation module: for obtaining a polygenic risk score (PRS) for each of the individual samples.

C. A personal risk prediction model building module: for determining an optimal individual specific disease risk prediction model based on the polygenic risk scores obtained in module B.

The C module may specifically include the following modules:

C1) a model building module: for building a plurality of individual specific disease risk prediction models;

C2) model training and testing module: for obtaining an optimal individual specific disease risk prediction model.

D. A family risk prediction model building module: for obtaining a family risk prediction result from a family risk prediction model.

The D-module may specifically include the following modules:

D1) a family map calculation module: for determining the genetic relationships among the family samples and identifying the families within the family samples;

D2) an individual disease risk prediction module: for obtaining a personal disease risk prediction value for each sample in the family samples;

D3) the family disease risk prediction module: for predicting the risk of disease in said family.

E. Disease-specific favorable and harmful factor assessment module: for determining risk factors and beneficial factors for the family-related specific disease.

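The E-module may specifically include the following modules:

E1) a specific disease-related exposure factor data acquisition module: for obtaining GWAS data for the exposure factors and GWAS data for the outcome variable; the outcome variable is the specific disease;

E2) an instrumental variable screening and determination module: for determining candidate instrumental variables;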

E3) a causal relationship evaluation module of the exposure factors and the outcome variables: for evaluating a causal relationship of the exposure factor to the outcome variable;

E4) disease-specific favorable and harmful factor assessment module: for assessing risk factors and beneficial factors of the specific disease associated with the family.

In the above device, the whole genome genotype data in module A may be qualified SNP locus data of qualified samples obtained through quality control and genotype imputation. The GWAS data in module A may be standardized GWAS data obtained after quality control.

The qualified sample may be a sample having a call rate (detection rate) greater than or equal to 97%. The qualified samples may include qualified individual samples and qualified family samples. The qualified SNP loci may be non-coincident SNP loci, SNP loci with an imputation quality greater than or equal to 0.3, and SNP loci that conform to Hardy-Weinberg equilibrium, have a genotype missing rate less than or equal to 2% and have a minor allele frequency greater than or equal to 1%.

The polygenic risk score (PRS) described above may be calculated as follows: the standardized GWAS data and the qualified SNP locus data of the qualified individual samples are harmonized using the coord function of the LDpred software, so that the two data sets share the same reference linkage disequilibrium (LD) information; the effect sizes of the different SNP sites within the same study are adjusted using the gibbs function of the LDpred software; and the PRS is calculated using the score function of the LDpred software to obtain a PRS for each qualified individual sample.
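The final scoring step can be illustrated with a minimal sketch. It assumes the per-SNP weights have already been adjusted (for example by LDpred) and shows only the weighted sum of allele dosages; the function name, the SNP identifiers and the numbers are hypothetical and are not part of LDpred.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one sample.

    dosages: dict mapping SNP id -> effect-allele dosage (0 to 2).
    weights: dict mapping SNP id -> adjusted effect size for that SNP.
    SNPs missing from the sample's dosages are skipped (an assumption).
    """
    return sum(dosages[snp] * w for snp, w in weights.items() if snp in dosages)


# Hypothetical example values, for illustration only.
example_weights = {"snp_a": 0.2, "snp_b": -0.1, "snp_c": 0.05}
example_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
example_prs = polygenic_risk_score(example_dosages, example_weights)
```

In practice the score would be computed for every qualified individual sample and passed, together with age and gender, to the prediction models.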

In the above apparatus, C1) the model building module may be built by a method including the following steps: based on the polygenic risk score of each sample obtained in module B, combined with the characteristic data of the sample, disease risk prediction models for the personal specific disease are built using multiple machine learning methods; the characteristic data includes age and gender information of the sample.

In the above apparatus, C2) the model training and testing module may be built by a method comprising the steps of:

The individual samples in module A are split: 80% of the individual samples are randomly selected as the training sample set, and the remaining 20% serve as the test sample set. The data of the training sample set is the training data, and the data of the test sample set is the test data.

The disease risk prediction models for the personal specific disease obtained in C1) are trained using the training data to obtain the regression coefficients of each disease risk prediction model.

The disease risk prediction models are tested using the test data, an ROC curve is drawn, and the area under the ROC curve is calculated. The disease risk prediction model with the largest area under the ROC curve is selected as the optimal individual specific disease risk prediction model.
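The selection criterion can be sketched in plain Python. The area under the ROC curve equals the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties count one half), so it can be computed by pairwise comparison. The function names and the dictionary layout below are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def pick_best_model(test_labels, scores_by_model):
    """Return the model name with the largest test-set AUC, plus all AUCs."""
    aucs = {name: roc_auc(test_labels, s) for name, s in scores_by_model.items()}
    return max(aucs, key=aucs.get), aucs
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small test set; a rank-based implementation would scale better.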

The individual sample can be a qualified individual sample obtained by quality control. The data of the training sample set may be PRS scores and feature data of samples in the training sample set. The data of the test sample set may be PRS scores and feature data of samples in the test sample set.

In the apparatus described above, the plurality of machine learning methods may be logistic regression, k-nearest neighbors, decision trees, random forests, and/or SVMs. The personal specific disease risk prediction model can be a logistic regression prediction model, a k-nearest neighbor prediction model, a decision tree prediction model, a random forest prediction model and/or an SVM prediction model.

The machine learning methods described above may specifically be those implemented in the sklearn (scikit-learn) module in Python.

In the above device, D1) the family map calculation module may be established by a method comprising the steps of:

According to the genotype data of the family samples in module A, the corresponding family map is calculated using the build function of the KING software, the genetic similarity is calculated using the related function, the number of identical-by-descent (IBD) segments is counted, an IBD segment map is obtained using the KING_segments_plot function, and finally the genetic relationships among the family samples are determined to obtain the families in the family samples. The genotype data of the family samples may be the qualified SNP locus data of the qualified family samples obtained through quality control.

A family line (pedigree) may be understood as the individuals and ordinary families descended from a common ancestor or earlier generations. It records the number of members in each generation, their kinship, and the distribution of the relevant genetic traits or hereditary diseases among them, and generally spans three generations or more. The family line can serve as a tool for displaying the structure, relationships and genetic history of a family. A family (household) may be understood as a unit of social life based on marriage and blood relations, including parents, children and other relatives living together.

In the above device, D2) the individual disease risk prediction module of the family samples may be established by a method comprising the following steps:

Based on the optimal individual specific disease risk prediction model obtained in module C, the personal specific disease risk of the samples in the family samples is predicted to obtain the personal specific disease risk prediction value of each sample in the family samples.

In the above device, D3) the family disease risk prediction module may be established by a method comprising the following steps:

A decision threshold for family disease risk is determined from the individual specific disease risk prediction value of each sample in the family samples obtained in module D2), and the specific disease risk of each family is predicted according to the decision threshold.

The decision threshold described above may be divided into a high-risk decision threshold and a low-risk decision threshold, both determined from the distribution of the family average predicted disease values. That distribution can be calculated from the individual specific disease risk prediction values in the family samples. The high-risk decision threshold may be the cutoff for the top five percent of the distribution when ranked from high to low. The low-risk decision threshold may be the cutoff for the bottom five percent of the distribution.
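A minimal sketch of how the two thresholds might be derived is given below. It assumes each sample already has a predicted risk and a family label, and it takes the "critical value" to be the boundary value inside each five percent tail; the text does not fix that detail, so it is an assumption, as are all function and variable names.

```python
from statistics import mean


def family_means(individual_risk, family_of):
    """Average the individual predicted risks within each family.

    individual_risk: dict mapping sample id -> predicted risk.
    family_of: dict mapping sample id -> family id.
    """
    grouped = {}
    for sample, risk in individual_risk.items():
        grouped.setdefault(family_of[sample], []).append(risk)
    return {family: mean(values) for family, values in grouped.items()}


def risk_thresholds(means, tail=0.05):
    """Return (low, high) cutoffs for the bottom and top `tail` fraction."""
    ranked = sorted(means.values())
    k = max(1, int(len(ranked) * tail))  # number of families in each tail
    low = ranked[k - 1]   # largest mean inside the bottom tail
    high = ranked[-k]     # smallest mean inside the top tail
    return low, high
```

With 100 families, for example, `k` is 5, so the low cutoff is the fifth-smallest family mean and the high cutoff is the fifth-largest.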

The specific disease described above may be coronary heart disease. The optimal personal specific disease risk prediction model described above may be an SVM prediction model.

The exposure factor described above may be a micronutrient. The micronutrients may be calcium, iron, zinc, copper, magnesium, vitamin D, etc. The exposure factor may also be other non-genetic factors.

In the above device, E3) the causal relationship assessment module of exposure factors and outcome variables may be established by a method comprising:

Based on the GWAS results for the micronutrients and the GWAS results for coronary heart disease, the causal relationship between the micronutrients and coronary heart disease is evaluated with a two-sample Mendelian randomization strategy, using the inverse variance weighted method and the MR-Egger method.
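As an illustration of the inverse variance weighted step, the sketch below uses the standard fixed-effect formula: the causal estimate is sum(bx * by / se^2) divided by sum(bx^2 / se^2), where bx and by are the per-variant effects on the exposure and the outcome and se is the standard error of by. The text does not say which implementation was used, MR-Egger is not shown, and the function name is hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted causal estimate.

    Each argument is a list with one entry per instrumental variable:
    effect on the exposure, effect on the outcome, and the standard
    error of the effect on the outcome.
    Returns (estimate, standard_error).
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("all exposure effects are zero")
    return numerator / denominator, math.sqrt(1.0 / denominator)
```

A negative estimate for a micronutrient would indicate that genetically lower levels go together with higher disease risk, which is the direction reported for zinc in this document.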

In the above device, the significant association in E2) may specifically be P less than or equal to 5 × 10^-8.
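The screening rule can be sketched as a simple filter over exposure GWAS summary rows. The row layout (dictionaries with "snp" and "p" keys) and the function name are assumptions; any further steps a real analysis might apply to the candidates are not described in the text and are not shown.

```python
GENOME_WIDE_P = 5e-8  # significance threshold stated in the text


def select_instruments(gwas_rows, p_threshold=GENOME_WIDE_P):
    """Keep SNPs whose exposure-association P value is at or below the threshold."""
    return [row["snp"] for row in gwas_rows if row["p"] <= p_threshold]
```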

In the above-described device, the causal relationship between the exposure factor and outcome variable may be a significant causal link between a reduction in zinc element content and coronary heart disease. The family coronary heart disease risk factor can be zinc element.

In order to solve the above technical problem, the invention also provides a family specific disease risk prediction device. The device may include modules A, B, C and D of the device described above.

The specific disease described above may be coronary heart disease.

In order to solve the above technical problem, the present invention also provides a computer-readable storage medium storing a computer program. The computer program causes a computer to perform the steps of establishing the device described above.

Using the established family specific disease risk prediction and risk factor identification device, the invention predicts family coronary heart disease risk in 1000 families. A family average disease probability of 0.89 is taken as the high-risk decision threshold: a family to be identified is marked as a high-risk family if its average disease probability is greater than 0.89. A family average disease probability of 0.03 is taken as the low-risk decision threshold: the family is marked as a low-risk family if its average disease probability is less than 0.03. If the average disease probability of the family is less than or equal to 0.89 and greater than or equal to 0.03, the family is marked as a general-risk family. The device also predicts that, among the micronutrients, the risk factor for family coronary heart disease is zinc. This means that, in the existing trace element studies, any association between the genetic variants and coronary heart disease must act through the association between those variants and the trace element zinc, which suggests a causal effect of zinc on coronary heart disease and provides supporting evidence and related methods for the prevention, treatment and prognosis of family coronary heart disease.
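The three-way labelling rule reads directly as a small function. The thresholds 0.89 and 0.03 are the values reported above; the function name and the label strings are hypothetical.

```python
HIGH_RISK_THRESHOLD = 0.89  # reported high-risk decision threshold
LOW_RISK_THRESHOLD = 0.03   # reported low-risk decision threshold


def classify_family(mean_probability,
                    low=LOW_RISK_THRESHOLD,
                    high=HIGH_RISK_THRESHOLD):
    """Label a family from its average predicted disease probability."""
    if mean_probability > high:
        return "high risk"
    if mean_probability < low:
        return "low risk"
    return "general risk"  # low <= mean_probability <= high
```

Values exactly equal to either threshold fall in the general-risk group, matching the inequalities stated in the text.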

Drawings

FIG. 1 is a flow chart of the system for family risk assessment and risk factor identification of coronary heart disease based on chip data provided by the present invention.

FIG. 2 shows the ROC curve and the AUC value of the SVM method, which gave the best prediction of individual coronary heart disease risk.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, which are given for the purpose of illustration only and are not intended to limit the scope of the invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.

The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Embodiment I, family coronary heart disease risk assessment and risk factor identification system

First, establishment of family coronary heart disease risk assessment and risk factor identification system

1. Data gathering and sorting

Whole genome genotype data of coronary heart disease related samples and coronary heart disease genome-wide association study (GWAS) data are collected. Quality control is performed on the collected raw genotype data, and genotype imputation is performed on the genotype data after quality control to obtain the qualified SNP site data of the qualified samples. Quality control is also performed on the collected GWAS data to obtain standardized GWAS data.

1.1. Data gathering

1.1.1 Individual sample Whole genome genotype data acquisition

Chip sequencing:

Whole genome genotype data of the individuals is obtained. The specific steps are as follows:

(1) collecting individual sample data: samples of patients with coronary heart disease and of healthy individuals, wherein the healthy individuals serve as control samples for the patients with coronary heart disease;

(2) acquiring whole genome genotype data of the coronary heart disease patient samples and the healthy individual samples using the Illumina ASA_CHIA chip platform of the million chip plan, designed and customized by Beijing nutshell biotechnology limited.

1.1.2 acquisition of GWAS data

Gathering coronary heart disease whole genome association analysis (GWAS) data

1.1.3 collecting family sample data

Collecting whole genome genotype data of the family samples.

1.2 data quality control and genotype filling

1.2.1 Individual sample genotype data quality control and genotype filling

Sample quality control is performed on the whole genome genotype data (chip data) obtained in step 1.1.1: samples with a call rate (detection rate) lower than 97% are removed, and individuals with inconsistent sex information are removed, to obtain the whole genome SNP site information data of the qualified samples.

Genotype imputation is performed on the whole genome SNP site information data of the qualified samples to obtain imputed SNP sites: imputation is performed using the IMPUTE2 software, with the genome data of the 1000 Genomes Project Phase 3 as the reference. Quality control is then performed on the imputed SNP sites: sites with low imputation quality are deleted (the threshold is 0.3, and sites with an imputation quality lower than 0.3 are deleted), SNP sites that do not conform to Hardy-Weinberg equilibrium are removed (sites with a P value less than 1 × 10^-5 are deleted), sites with a genotype missing rate greater than 2% are deleted, and SNP sites with a minor allele frequency less than 1% are removed, to obtain the final qualified SNP site data of the qualified individual samples.
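The SNP-level filters can be sketched as a single predicate over per-site summary statistics. The thresholds are the ones stated in this step; the `SnpStats` record, its field names and the function names are hypothetical, and computing the statistics themselves is left to the genotyping and imputation tools.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    snp_id: str
    imputation_quality: float  # quality score reported by the imputation tool
    hwe_p: float               # Hardy-Weinberg equilibrium test P value
    missing_rate: float        # fraction of samples with a missing genotype
    maf: float                 # minor allele frequency


def passes_qc(s, min_quality=0.3, min_hwe_p=1e-5, max_missing=0.02, min_maf=0.01):
    """True if the site survives all four filters described in the text."""
    return (
        s.imputation_quality >= min_quality
        and s.hwe_p >= min_hwe_p
        and s.missing_rate <= max_missing
        and s.maf >= min_maf
    )


def qualified_snps(stats):
    """Return the ids of the sites that pass quality control."""
    return [s.snp_id for s in stats if passes_qc(s)]
```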

1.2.2 GWAS data quality control

The GWAS data collected in step 1.1.2 is standardized to obtain the standardized GWAS data.

1.2.3 family sample data quality control and genotype filling

Sample quality control is performed on the whole genome genotype data (chip data) of the family samples obtained in step 1.1.3: samples with a call rate (detection rate) lower than 97% are removed, and individuals with inconsistent sex information are removed, to obtain the whole genome SNP site information data of the qualified family samples.

Genotype imputation is performed on the whole genome SNP site information data of the qualified family samples to obtain imputed SNP sites: imputation is performed using the IMPUTE2 software, with the genome data of the 1000 Genomes Project Phase 3 as the reference. Quality control is then performed on the imputed SNP sites: sites with low imputation quality are deleted (the threshold is 0.3, and sites with an imputation quality lower than 0.3 are deleted), SNP sites that do not conform to Hardy-Weinberg equilibrium are removed (sites with a P value less than 1 × 10^-5 are deleted), sites with a genotype missing rate greater than 2% are deleted, and SNP sites with a minor allele frequency less than 1% are removed, to obtain the final qualified SNP site data of the qualified family samples.

2. Multi-gene risk score calculation

Based on the standardized GWAS data obtained in step 1.2.2 and the qualified SNP locus data of the qualified samples obtained in step 1.2.1, the polygenic risk score (PRS) is calculated using the LDpred software to obtain the PRS of each individual sample.

3. Setting up personal risk prediction model

Splitting qualified individual samples into a training sample set and a testing sample set without sample intersection according to the PRS score of the samples obtained in the step 2; and constructing a plurality of disease risk evaluation models by adopting a plurality of machine learning methods, respectively training and testing in an independent training sample set and a test sample set, and selecting a model with optimal evaluation indexes as a final personal risk evaluation system. The method comprises the following specific steps:

3.1. construction of multiple disease risk assessment models

Based on the PRS data of each individual sample obtained in step 2, combined with the characteristic data of the sample (age and gender information), multiple personal risk prediction models are constructed with the sklearn (scikit-learn) module of the Python language, using machine learning methods such as logistic regression, k-nearest neighbors, decision tree, random forest and SVM;

3.2. model training and testing

The qualified individual samples obtained in step 1.2.1 are split: 80% of the samples are randomly selected as the training sample set, and the remaining 20% serve as the test sample set. The data of the training sample set (the PRS data and the characteristic data of each sample) is the training data, and the data of the test sample set (the PRS data and the characteristic data of each sample) is the test data;

the multiple personal risk prediction models obtained in step 3.1 are trained using the training data to obtain the corresponding regression coefficients of the models;

the performance of the personal risk prediction models is tested using the test data, the ROC curve is drawn, and the area under the ROC curve (AUC) is calculated; the personal risk prediction model built with the machine learning method that gives the largest AUC is selected as the optimal prediction model (namely the personal risk evaluation system).

4. Building a family risk prediction model to predict the family risk of diseases

Based on the family samples collected in step 1.1.3, a decision standard for family disease risk assessment is established from the genetic relationships among the members of the family samples, and the family risk assessment result is given in combination with the results of the individual disease risk evaluation system obtained in step 3.2. The specific steps are as follows:

4.1. calculating family map and analyzing family data

According to the qualified SNP site data of the qualified family sample obtained in the step 1.2.3, calculating a corresponding family map by using a KING software build function, calculating the genetic similarity of the family map by using a related function, counting the number of homologous identical fragments (IBD), obtaining a homologous identical fragment (IBD) map by using a KING _ segments _ plot function, and finally determining the genetic relationship in the family sample to obtain the family (unit) in the family sample.

4.2. Personal risk prediction of familial samples

Based on the optimal prediction model obtained in the step 3.2, carrying out individual risk prediction on the samples in the family to obtain an individual risk prediction value of each sample in the family samples;

4.3. predicting risk of disease in families in family group

Constructing a family risk assessment judgment standard based on the individual disease risk prediction value of each sample in the family obtained in the step 4.2, and giving a family risk assessment result in the family; the method comprises the following specific steps: and (4) counting the average family disease probability in the family, determining a judgment threshold value of the family disease risk, and predicting the family disease risk in the family according to the judgment threshold value.

5. Assessment of beneficial and detrimental factors of coronary heart disease

GWAS results for micronutrients (exposure factors) are downloaded, and significantly associated genetic susceptibility sites are screened as instrumental variables; GWAS results for coronary heart disease (the outcome variable) are also downloaded. A two-sample Mendelian randomization method is then used to evaluate risk factors and beneficial factors for coronary heart disease, yielding non-genetic factors with a significant causal relationship that can be applied to subsequent prevention of or intervention in coronary heart disease. Step 5 comprises the following steps:

5.1. Downloading exposure factor data related to coronary heart disease

GWAS results for micronutrients (exposure factors) and GWAS results for coronary heart disease (the outcome variable) are downloaded;

5.2. Screening and determining instrumental variables

Genetic susceptibility sites that are significantly associated in the exposure factor file are screened as candidate instrumental variables, palindromic sequences are adjusted, and sites in linkage disequilibrium are removed;

5.3. Assessing the causal relationship between exposure factors and coronary heart disease

The causal relationship between each exposure factor and coronary heart disease (the outcome variable) is evaluated with a two-sample Mendelian randomization strategy, using the inverse variance weighted method and the MR-Egger method;
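The two estimators named in step 5.3 can be sketched from summary statistics as follows. This is a textbook-style illustration rather than the software used by the invention: the function names and the toy effect sizes are hypothetical, the inverse variance weighted (IVW) estimator is the fixed-effect form, and standard errors for MR-Egger are omitted for brevity.

```python
import math


def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effect IVW: weighted regression of SNP-outcome effects on
    SNP-exposure effects through the origin, with weights 1 / se_out^2.
    Returns (causal estimate, standard error)."""
    w = [1.0 / s ** 2 for s in se_out]
    num = sum(wi * bx * by for wi, bx, by in zip(w, beta_exp, beta_out))
    den = sum(wi * bx * bx for wi, bx in zip(w, beta_exp))
    return num / den, math.sqrt(1.0 / den)


def mr_egger(beta_exp, beta_out, se_out):
    """MR-Egger: the same weighted regression with a free intercept.
    SNPs are first oriented so that the exposure effect is positive.
    Returns (intercept, slope); a non-zero intercept suggests
    directional pleiotropy."""
    bx = [abs(b) for b in beta_exp]
    by = [o if b >= 0 else -o for b, o in zip(beta_exp, beta_out)]
    w = [1.0 / s ** 2 for s in se_out]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, bx)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, by)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, bx))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, bx, by))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy summary statistics (hypothetical): outcome effect is exactly 0.5 x exposure effect.
beta_exposure = [0.10, 0.20, 0.15, 0.30]
beta_outcome = [0.5 * b for b in beta_exposure]
se_outcome = [0.02, 0.03, 0.025, 0.04]

ivw_beta, ivw_se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
egger_intercept, egger_slope = mr_egger(beta_exposure, beta_outcome, se_outcome)
print(ivw_beta, egger_intercept, egger_slope)
```

When the outcome is binary and the SNP-outcome effects are log odds ratios, the exponential of the causal estimate gives an odds ratio per unit change in the exposure.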

5.4. Assessment of beneficial and detrimental factors of coronary heart disease

Risk factors and beneficial factors are evaluated based on the results of step 5.3 to obtain non-genetic factors with a significant causal relationship, which can be used for subsequent prevention of and intervention in coronary heart disease.

Embodiment two, application example of the family coronary heart disease risk assessment and risk factor identification system

1. Data collection and sorting

1.1. Data collection

1.1.1 Acquisition of whole-genome genotype data of individual samples

Chip genotyping:

Whole-genome genotype data of each individual in the sample are obtained with the Illumina ASA_CHIA chip platform of the million-chip plan, which is led and customized by Beijing Nutshell Biotechnology Limited;

The specific contents are as follows:

(1) De-identified data from 239 coronary heart disease patients are collected as the case group (case), and 500 healthy individuals are randomly selected from the chip database of Beijing Nutshell Biotechnology Limited as the control group (control), matched to the age and sex information of the case group. In the specific implementation, the two groups are required to be matched on characteristics such as age and sex and to come from the Han Chinese population.

(2) Whole-genome genotype data of the coronary heart disease patients and the healthy individuals are obtained with the Illumina ASA_CHIA chip platform of the million-chip plan, which is led and customized by Beijing Nutshell Biotechnology Limited;

1.1.2 Acquisition of GWAS data

Coronary heart disease genome-wide association study (GWAS) data are obtained, comprising two sets of GWAS data from the following literature: (Yamaji T, Sawada N, Iwasaki M. Transethnic Meta-Analysis of Genome-Wide Association Studies Identifies Three New Loci and Characterizes Population-Specific Differences for Coronary Artery Disease. Circ Genom Precis Med. 2020 Jun;13(3):e002670. doi:10.1161/CIRCGEN.119.002670. Epub 2020 May 29. PMID:32469254) (Nikpay Majid, Goel Anuj, Won Hong-Hee, & Leo-. (2015). A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nature genetics (10), doi:10.1038/ng.3396.);

1.1.3 Collection of family sample data

Whole-genome genotype data of 1000 families are collected (from the Beijing Nutshell Biotechnology database);

1.2 Data quality control and genotype filling (imputation)

1.2.1 Genotype data quality control and genotype filling (imputation)

Quality control is performed on the whole-genome genotype data (chip data) of the 739 individual samples, namely the 239 coronary heart disease patients collected in step 1.1.1 and the 500 healthy individuals randomly selected from the chip database of Beijing Nutshell Biotechnology Limited, in order to standardize the data. SNP sites that cannot be aligned are removed, leaving 738980 sites; samples with a call rate below 97% are removed (0 samples were removed in total). This yields the whole-genome SNP site data of 739 qualified individual samples.

Genotype imputation is then performed on the whole-genome SNP site data of the 739 individual samples obtained above: imputation is carried out with the IMPUTE2 software (https://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook), using the genomes of the 1000 Genomes Project Phase 3 (https://genome.sph.umich.edu/wiki/Minimac:_1000_Genomes_Imputation_Cookbook) as the reference, which yields 2157223 SNP sites in total. Quality control is then applied to the imputed SNP sites: sites with low imputation quality are deleted (threshold 0.3; sites with imputation quality below 0.3 are deleted); SNP sites that do not satisfy Hardy-Weinberg equilibrium are removed (threshold 1 × 10^-5; sites with a P value below 1 × 10^-5 are deleted); sites with a genotype missing rate above 2% are removed; and SNP sites with a minor allele frequency below 1% are removed. This gives the qualified SNP site data (2150395 sites) of the final qualified individual samples (739 cases).
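The four post-imputation filters can be expressed as a single predicate, as in the sketch below. The record layout, field names and example variants are hypothetical; only the four thresholds come from the text, and in practice such filters are usually applied with dedicated genetics tools rather than hand-written code.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    snp_id: str          # hypothetical identifier
    info: float          # imputation quality score
    hwe_p: float         # P value of the Hardy-Weinberg equilibrium test
    missing_rate: float  # fraction of samples with a missing genotype
    maf: float           # minor allele frequency


def passes_qc(v, min_info=0.3, min_hwe_p=1e-5, max_missing=0.02, min_maf=0.01):
    """Keep a variant only if it passes all four filters described above."""
    return (
        v.info >= min_info                   # imputation quality not below 0.3
        and v.hwe_p >= min_hwe_p             # HWE P value not below 1e-5
        and v.missing_rate <= max_missing    # missing rate not above 2%
        and v.maf >= min_maf                 # minor allele frequency not below 1%
    )


# Hypothetical example variants, one failing each filter in turn.
variants = [
    Variant("snp_ok", info=0.95, hwe_p=0.40, missing_rate=0.001, maf=0.20),
    Variant("snp_low_info", info=0.25, hwe_p=0.40, missing_rate=0.001, maf=0.20),
    Variant("snp_hwe_fail", info=0.95, hwe_p=1e-7, missing_rate=0.001, maf=0.20),
    Variant("snp_missing", info=0.95, hwe_p=0.40, missing_rate=0.05, maf=0.20),
    Variant("snp_rare", info=0.95, hwe_p=0.40, missing_rate=0.001, maf=0.005),
]

kept = [v.snp_id for v in variants if passes_qc(v)]
print(kept)  # ['snp_ok']
```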

1.2.2 GWAS data quality control

The GWAS data collected in step 1.1.2 are standardized to obtain the standardized GWAS data.

1.2.3 Family sample data quality control and genotype filling (imputation)

Genotype imputation is performed on the whole-genome SNP site data of the qualified family samples to obtain the imputed SNP sites: imputation is carried out with the IMPUTE2 software, using the genome data of the 1000 Genomes Project Phase 3 as the reference. Quality control is then applied to the imputed SNP sites: sites with low imputation quality are deleted (threshold 0.3; sites with imputation quality below 0.3 are deleted); SNP sites that do not satisfy Hardy-Weinberg equilibrium are removed (sites with a P value below 1 × 10^-5 are deleted); sites with a genotype missing rate above 2% are removed; and SNP sites with a minor allele frequency below 1% are removed. This gives the qualified SNP site data (2150395 sites) of the final qualified family samples (4000 cases).

2. Multi-gene risk score calculation

The coord function of the LDpred software (https://github.com/bvilhjal/ldpred) is used to harmonize the standardized GWAS data obtained in step 1.2.2 with the qualified SNP site data of the qualified samples obtained in step 1.2.1 and to obtain the reference linkage disequilibrium (LD) information of the two data sets; the gibbs function of LDpred is used to adjust the effect sizes of the different SNP sites from the same GWAS data source; and the score function of LDpred is used to calculate the multi-gene (polygenic) risk score (PRS), giving the PRS of each individual sample.
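The final scoring step amounts to a weighted sum of allele dosages, which the sketch below illustrates. It does not reproduce LDpred: the LD-aware adjustment of effect sizes is assumed to have been done already, and the function name, SNP identifiers, weights and dosages are all hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of (adjusted effect size x effect-allele dosage) over the SNPs
    that are present in both the weight table and the sample genotype."""
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)


# Hypothetical adjusted effect sizes (for example, as produced by an LD-aware method).
adjusted_weights = {"snp1": 0.20, "snp2": -0.10, "snp3": 0.05}

# Hypothetical effect-allele dosages (0, 1 or 2 copies) for one sample.
sample_dosages = {"snp1": 2, "snp2": 1, "snp3": 0}

prs = polygenic_risk_score(sample_dosages, adjusted_weights)
print(round(prs, 6))  # 0.3
```

The resulting score is then used together with age and sex as an input feature of the personal risk prediction models.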

3. Building the personal risk prediction model

The 739 qualified samples obtained by quality control in step 1.2.1 are split: 80% of all samples are randomly selected as the training sample set, and the remaining 20% form the test sample set. The data of the training sample set (PRS scores and feature data of the samples) are used as training data, and the data of the test sample set (PRS scores and feature data of the samples) are used as test data. Several disease risk assessment models are built with different machine learning methods, trained on the training set and tested on the independent test set, and the model with the best evaluation metric is selected as the final personal risk assessment system.

The specific contents are as follows:

3.1. Construction of multiple disease risk assessment models

Based on the PRS score of each sample obtained in step 2, combined with the age and sex information of the sample, several machine learning methods in the sklearn module of Python (https://www.python.org/) are used to build personal risk prediction models, including a logistic regression prediction model, a k-nearest neighbors prediction model, a decision tree prediction model, a random forest prediction model and an SVM prediction model;

3.2. Model training and testing

The 739 qualified individual samples obtained in step 1.2.1 are split: 80% of all samples (591 samples) are randomly selected as the training sample set, and the remaining 20% (148 samples) form the test sample set. The data of the training sample set (PRS scores and feature data of the samples) are used as training data, and the data of the test sample set (PRS scores and feature data of the samples) are used as test data;
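The 591/148 split can be reproduced with a simple shuffle, as sketched below. Rounding the test set size up (0.2 × 739 = 147.8, rounded to 148) is an assumption chosen because it matches the counts stated above; the function name and the random seed are hypothetical.

```python
import math
import random


def split_train_test(sample_ids, test_fraction=0.2, seed=0):
    """Shuffle the sample IDs and split them into disjoint training and
    test sets; the test set size is rounded up (an assumption)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(test_fraction * len(ids))
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_train_test(range(739))
print(len(train_ids), len(test_ids))  # 591 148
```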

The personal risk prediction models obtained in step 3.1 are trained on the training data to obtain the fitted parameters (for example, regression coefficients) of each model;

The test data are used to evaluate the performance of each personal risk prediction model: ROC curves are drawn and AUC values are calculated. The results show that the SVM prediction model performs best, with an AUC of 0.792, so the SVM prediction model is selected as the optimal prediction model (namely the optimal personal risk assessment system);

4. Building a family risk prediction model to predict the family risk of diseases

4.1. Calculating the pedigree and analyzing the family data

The KING software (https://www.chen.kingrelatedness.com/#pedigree) is used to determine the genetic relationships from the qualified SNP site data of the qualified family samples obtained in step 1.2.3, and the results show that the genetic relationships are accurate. Specifically, the build function of KING is used to reconstruct the corresponding pedigree and draw the pedigree plot; the related function is used to calculate genetic similarity; the number of identical-by-descent (IBD) segments is counted, and an IBD segment plot is obtained with the king_segments_plot function. The genetic relationships in the family samples are finally determined through mutual verification of these two dimensions, giving the family relationships within the family samples.
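As background for how genetic similarity translates into relationships, the sketch below maps an estimated kinship coefficient to a relationship degree using the inference ranges documented for KING (above 0.354, 0.177 to 0.354, 0.0884 to 0.177, and 0.0442 to 0.0884). The function itself is a hypothetical helper, not part of KING, and the text does not state which cut-offs were applied in the invention.

```python
def relationship_from_kinship(kinship):
    """Map an estimated kinship coefficient to a relationship category,
    following the ranges given in the KING documentation."""
    if kinship > 0.354:
        return "duplicate or monozygotic twin"
    if kinship >= 0.177:
        return "first-degree"
    if kinship >= 0.0884:
        return "second-degree"
    if kinship >= 0.0442:
        return "third-degree"
    return "unrelated or more distant"


# Expected kinship values for typical pairs: parent-child 0.25,
# grandparent-grandchild 0.125, first cousins 0.0625.
for value in (0.5, 0.25, 0.125, 0.0625, 0.0):
    print(value, relationship_from_kinship(value))
```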

4.2. Personal disease risk prediction for the family samples

Based on the optimal prediction model screened in step 3.2 (the SVM prediction model), personal disease risk prediction is performed for the samples in each family to obtain the personal disease risk prediction value of each sample in the family samples, namely the probability that the individual suffers from coronary heart disease;

4.3. Predicting the disease risk of each family in the family samples

Based on the personal disease risk prediction values of the family samples obtained in step 4.2, the average disease probability of each family is determined, thresholds for the risk levels are defined, and the family risk assessment result is given.

The specific method is as follows: the personal disease risk prediction value of each person in the 1000 families is calculated, and the mean of these values within a family is taken as the average disease probability of that family. The distribution of the average disease probability across the 1000 families is then examined. The boundary value of the top five percent, 0.89, is taken as the high-risk threshold: if the average disease probability of the family under assessment is greater than 0.89, the family is marked as a high-risk family. The boundary value of the bottom five percent, 0.03, is taken as the low-risk threshold: if the average disease probability of the family under assessment is less than 0.03, the family is marked as a low-risk family. If the average disease probability is less than or equal to 0.89 and greater than or equal to 0.03, the family is marked as a family with general disease risk;
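The decision rule above can be written as a short function. The two thresholds (0.89 and 0.03) are taken from the text; the function name, the label strings and the example probabilities are hypothetical.

```python
from statistics import mean

HIGH_RISK_THRESHOLD = 0.89  # boundary of the top five percent of families
LOW_RISK_THRESHOLD = 0.03   # boundary of the bottom five percent of families


def classify_family(member_probabilities):
    """Average the members' predicted disease probabilities and assign
    the family to a risk level using the two thresholds."""
    average = mean(member_probabilities)
    if average > HIGH_RISK_THRESHOLD:
        level = "high risk"
    elif average < LOW_RISK_THRESHOLD:
        level = "low risk"
    else:
        level = "general risk"
    return average, level


# Hypothetical families with per-member predicted probabilities.
print(classify_family([0.95, 0.92, 0.90]))      # high risk
print(classify_family([0.01, 0.02]))            # low risk
print(classify_family([0.50, 0.30, 0.40, 0.60]))  # general risk
```

A family whose average equals a threshold exactly falls into the general-risk group, in line with the inclusive bounds stated above.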

5. Assessment of beneficial and detrimental factors of coronary heart disease

GWAS result data for non-genetic risk factors (exposure factors) related to coronary heart disease are downloaded, and genetic susceptibility SNP sites that are significantly associated with the exposure factors are screened as instrumental variables. With coronary heart disease as the outcome variable, a two-sample Mendelian randomization method is used to evaluate the related risk factors and beneficial factors, yielding non-genetic factors with a significant causal relationship that can be applied to subsequent prevention of or intervention in coronary heart disease.

The specific contents are as follows:

5.1. Downloading exposure factor data related to coronary heart disease

The GWAS results for micronutrients (as exposure factors) are downloaded and quality controlled, comprising five sets of meta-analysis data: calcium, iron, copper and zinc, magnesium, and vitamin D. The genetic variants for calcium come from a European meta-analysis of 39400 individuals from 17 population-based cohorts (O'SEAGHDHA C M, WU H, YANG Q, et al. Meta-analysis of genome-wide association studies identifies six new loci for serum calcium concentrations [J]. PLoS genetics, 2013, 9(9): e1003796.); the genetic variants for iron come from a serum iron meta-analysis of 12000 people (RAFFIELD L M, LOUIE T, SOFER T, et al. Genome-wide association study of iron traits and relation to diabetes in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL): potential genomic intersection of iron and glucose regulation [J]. Human molecular genetics, 2017, 26(10): 1966-78.); the genetic variants for magnesium come from a serum magnesium meta-analysis of 15366 participants of the international CHARGE consortium (MEYER T E, VERWOERT G C, HWANG S J, et al. Genome-wide association studies of serum magnesium, potassium, and sodium concentrations identify six loci influencing serum magnesium levels [J]. PLoS genetics, 2010, 6(8).); the genetic variants for copper and zinc come from a GWAS of 2603 adults (EVANS D M, ZHU G, DY V, et al. Genome-wide association study identifies loci affecting blood copper, selenium and zinc [J]. Human molecular genetics, 2013, 22(19): 3998-4006.); the genetic variants for vitamin D come from a meta-analysis of 79366 Europeans (JIANG X, O'REILLY P F, ASCHARD H, et al. Genome-wide association study in 79,366 European-ancestry individuals informs the genetic architecture of 25-hydroxyvitamin D levels [J]. Nature communications, 2018, 9(1): 260.).

The GWAS results for coronary heart disease (as the outcome variable) are downloaded and quality controlled, comprising two sets of GWAS data: (Yamaji T, Sawada N, Iwasaki M. Transethnic Meta-Analysis of Genome-Wide Association Studies Identifies Three New Loci and Characterizes Population-Specific Differences for Coronary Artery Disease. Circ Genom Precis Med. 2020 Jun;13(3):e002670. doi:10.1161/CIRCGEN.119.002670. Epub 2020 May 29. PMID:32469254.) (Nikpay Majid, Goel Anuj, Won Hong-Hee, & Leo-. (2015). A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nature genetics (10), doi:10.1038/ng.3396.).

5.2. Screening and determining instrumental variables

From the exposure factor files (the micronutrient-related GWAS data downloaded in step 5.1), SNP sites significantly associated with each micronutrient (P ≤ 5 × 10^-8) are selected as candidate instrumental variables; sites whose palindromic sequence cannot be adjusted are then excluded, sites in linkage disequilibrium are removed, and the remaining SNPs are confirmed as instrumental variables to obtain the instrumental variable file;
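The first two screening steps can be sketched as follows. This is a simplified illustration: it drops every palindromic SNP (A/T or C/G allele pairs), whereas the text excludes only those whose strand cannot be adjusted, and it does not perform the removal of sites in linkage disequilibrium, which requires LD reference data. The record layout and the example SNPs are hypothetical.

```python
# Allele pairs that read the same on both DNA strands.
PALINDROMIC_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_palindromic(effect_allele, other_allele):
    """True for A/T and C/G SNPs, whose strand is ambiguous."""
    pair = frozenset((effect_allele.upper(), other_allele.upper()))
    return pair in PALINDROMIC_PAIRS


def select_candidate_instruments(snps, p_threshold=5e-8):
    """Keep SNPs that are genome-wide significant for the exposure and are
    not palindromic (a conservative simplification of the text)."""
    return [
        s for s in snps
        if s["p"] <= p_threshold
        and not is_palindromic(s["effect_allele"], s["other_allele"])
    ]


# Hypothetical exposure GWAS records.
exposure_snps = [
    {"snp": "snp1", "effect_allele": "A", "other_allele": "G", "p": 1e-9},
    {"snp": "snp2", "effect_allele": "A", "other_allele": "T", "p": 1e-12},
    {"snp": "snp3", "effect_allele": "C", "other_allele": "T", "p": 1e-4},
    {"snp": "snp4", "effect_allele": "g", "other_allele": "c", "p": 2e-8},
]

instruments = [s["snp"] for s in select_candidate_instruments(exposure_snps)]
print(instruments)  # ['snp1']
```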

5.3. Assessing the causal relationship between exposure factors and coronary heart disease

Based on the instrumental variable file for the exposure factors (micronutrients) obtained in step 5.2 and the quality-controlled coronary heart disease GWAS data downloaded in step 1.1.2, the causal relationship between micronutrients and coronary heart disease is evaluated with a two-sample Mendelian randomization strategy, using the inverse variance weighted method and the MR-Egger method. The results show a significant causal relationship between reduced zinc levels and coronary heart disease (OR = 1.06, P = 0.04, 95% CI 1.001-1.126): within the normal range, each one-unit decrease in zinc (0.5 md/dL) increases the risk of coronary heart disease by 0.06-fold. The other trace elements show no significant causal relationship with coronary heart disease.

5.4. Assessment of beneficial and detrimental factors of coronary heart disease

From the results of step 5.3, reduced zinc is a detrimental factor for coronary heart disease, so each family member should take care to maintain a normal zinc level, with appropriate supplementation, in order to reduce the risk of coronary heart disease and stay healthy; the other trace elements have no significant causal relationship with coronary heart disease and are neither harmful nor beneficial.

Embodiment three, a device for family coronary heart disease risk assessment (prediction) and risk factor identification

Based on the family coronary heart disease risk assessment and risk factor identification system of Embodiment one and its application example in Embodiment two, a device for family coronary heart disease risk assessment (prediction) and risk factor identification is obtained. The device comprises the following modules:

A. Data collecting and sorting module

A1) Data collection module: used to collect genotype data of individual samples, coronary heart disease related GWAS data, and genotype data of family samples. It is established by the following steps:

Whole-genome genotype data of individual samples related to coronary heart disease are collected; coronary heart disease genome-wide association study (GWAS) data are collected; and whole-genome genotype data of the family samples are collected.

If the collected data are raw genotype data and raw GWAS data, the data sorting process of A2) is required; if the collected data have already been sorted, that is, the genotype data have passed quality control and genotype imputation and the GWAS data have been standardized, module B can be run directly.

A2) Data sorting module: used for quality control of the data collected in module A1). It is established by the following steps:

Quality control and genotype imputation are performed on the whole-genome genotype data of the individual samples, quality control is performed on the GWAS data, and quality control and genotype imputation are performed on the genotype data of the family samples. For the individual samples, quality control is performed on the raw genotype data of the collected coronary heart disease related individual samples, and genotype imputation is performed on the quality-controlled genotype data, finally giving the qualified SNP site data of the qualified individual samples. For the GWAS data, quality control is performed on the collected GWAS data to obtain the standardized GWAS data. For the family samples, quality control is performed on the collected raw genotype data, and genotype imputation is performed on the quality-controlled genotype data, finally giving the qualified SNP site data of the qualified family samples.

B. Multi-gene risk score calculation module

Used to obtain the multi-gene (polygenic) risk score (PRS) of each individual sample. It is established by the following steps:

Based on the data collected and sorted in module A (the standardized GWAS data and the qualified SNP site data of the qualified individual samples), the LDpred software is used to calculate the PRS of each sample among the qualified individual samples.

C. Individual risk prediction model building module

Used to determine the optimal personal disease risk prediction model based on the PRS obtained by module B.

C1) Model building module: used to build several personal risk prediction models for the specific disease. It is established by the following steps:

Based on the PRS of each sample obtained by module B, combined with the feature data of the sample (age and sex), the sklearn module of the Python language is used with machine learning methods such as logistic regression, k-nearest neighbors, decision tree, random forest and SVM to build the following personal risk prediction models: a logistic regression prediction model, a k-nearest neighbors prediction model, a decision tree prediction model, a random forest prediction model and an SVM prediction model.

C2) Model training and testing module: used to obtain the optimal personal risk prediction model for the specific disease. It is established by the following steps:

The qualified individual samples obtained by quality control in module A2) are split: 80% of the individual samples are randomly selected as the training sample set, and the remaining 20% form the test sample set. The data of the training sample set (PRS scores and feature data of the samples) are used as training data, and the data of the test sample set (PRS scores and feature data of the samples) are used as test data;

The personal risk prediction models built in module C1) are trained on the training data to obtain the fitted parameters (for example, regression coefficients) of each model;

D. Family risk prediction model building module

Used to obtain the family risk assessment result.

D1) Pedigree calculation module: used to determine the genetic relationships of the family samples and obtain the families within the family samples. It is established by the following steps:

Based on the qualified SNP site data of the qualified family samples obtained by quality control in module A2), the build function of the KING software is used to reconstruct the corresponding pedigree, the related function is used to calculate genetic similarity, the number of identical-by-descent (IBD) segments is counted, an IBD segment plot is obtained with the king_segments_plot function, and the genetic relationships of the family samples are finally determined.

D2) Personal disease risk prediction module: used to obtain the personal disease risk prediction value of each sample in the family samples. It is established by the following steps:

Based on the optimal personal risk prediction model obtained in module C, personal risk prediction is performed for the samples in each family to obtain the personal risk prediction value of each sample in the family samples.

D3) Family disease risk prediction module: used to predict the disease risk of each family in the family samples.

A family risk assessment criterion is built from the personal disease risk prediction values obtained in module D2), and the family risk assessment result is given. Specifically: the average disease probability of each family is calculated, decision thresholds for family disease risk are determined, and the disease risk of each family is predicted according to these thresholds.

E. Disease-specific favorable and harmful factor assessment module: used to determine family-related risk factors and beneficial factors for a specific disease.

E1) Specific disease-related exposure factor data acquisition module: used to obtain GWAS data related to the exposure factors and GWAS data related to the outcome variable (the specific disease). It is established by the following steps:

The GWAS results (data files) for micronutrients (exposure factors) and the GWAS results (data files) for coronary heart disease (the outcome variable) are downloaded.

E2) Instrumental variable screening and determination module: used to determine candidate instrumental variables. It is established by the following steps:

Genetic susceptibility sites that are significantly associated with the trace elements in the micronutrient GWAS results are screened as candidate instrumental variables, palindromic sequences are adjusted, and sites in linkage disequilibrium are removed.

E3) Module for evaluating the causal relationship between exposure factors and the outcome variable: used to assess the causal relationship of the exposure factors to the outcome variable. It is established by the following steps:

Based on the micronutrient GWAS results and the coronary heart disease GWAS results, the causal relationship between micronutrients and coronary heart disease is evaluated with a two-sample Mendelian randomization strategy, using the inverse variance weighted method and the MR-Egger method.

E4) Disease-specific favorable and harmful factor assessment module: used to evaluate family-related risk factors and beneficial factors of the specific disease. It is established by the following steps:

Based on the results obtained by module E3), risk factors and beneficial factors related to coronary heart disease are evaluated to obtain non-genetic factors with a significant causal relationship, which can be used for subsequent prevention of and intervention in coronary heart disease.

In summary, the invention provides a family specific disease risk assessment and risk factor identification system and device. Specifically, based on family data, a multi-gene (polygenic) risk scoring algorithm is used to calculate personal risk scores, a prediction model is built with machine learning algorithms, the risk assessment of each family is given by calculating the average disease probability of the family, and beneficial and harmful factors with a significant causal relationship are provided through the Mendelian randomization method, helping families to better avoid the risk of the specific disease and stay healthy. This further provides evidence support and related methods for the prevention, treatment and prognosis of specific diseases. Taking coronary heart disease as an example, the invention carries out family coronary heart disease risk assessment and risk factor identification, obtains the family coronary heart disease risk assessment result, and identifies reduced levels of the micronutrient zinc as a family-related risk factor for coronary heart disease, which can further provide evidence support and related methods for the prevention, treatment and prognosis of family coronary heart disease.

The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is possible within the scope of the claims attached below.
