Method, computing device and storage medium for predicting risk of predetermined disease

Document No.: 1891650. Publication date: 2021-11-26.

Abstract: This technology, "Method, computing device and storage medium for predicting risk of predetermined disease" (用于预测预定疾病风险的方法、计算设备和存储介质), was created by 邢传华 (Xing Chuanhua) and 王湧 (Wang Yong) on 2021-07-28. The present disclosure relates to a method, computing device and storage medium for predicting the risk of a predetermined disease. The method comprises: generating gene mutation data for a plurality of genes of each sample to be tested; acquiring a plurality of non-genetic data associated with each sample to be tested; determining, via a regression algorithm, a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested; calculating, for each sample to be tested, a risk score of the subject to which the sample belongs for developing the predetermined disease, based on the gene mutations on each gene, the non-genetic data and the corresponding weights; and determining, based on the calculated risk score, the absolute risk of the subject developing the predetermined disease within a predetermined time, so as to predict the predetermined disease risk of the sample to be tested. The present disclosure enables an earlier and more accurate prediction of an individual's risk of developing a predetermined disease.

1. A method for predicting risk of a predetermined disease, comprising:

acquiring gene sequencing data of a plurality of samples to be detected so as to generate gene mutation data of a plurality of genes of each sample to be detected;

acquiring a plurality of non-genetic data associated with each sample to be tested;

determining, via a regression algorithm, a weight function indicative of the impact of the genetic mutation and the non-genetic data on the phenotype of each test sample;

calculating, for each sample to be tested, a risk score of the subject to which the sample belongs for developing the predetermined disease, based on the gene mutation data, the non-genetic data and the corresponding weights on each gene; and

determining, based on the calculated risk score, an absolute risk of the subject developing the predetermined disease within a predetermined time, so as to predict the predetermined disease risk of the sample to be tested based on a comparison between the absolute risk and a predetermined threshold.

2. The method of claim 1, further comprising:

obtaining a plurality of predetermined omics data associated with each sample to be tested; and

normalizing the generated gene mutation data of the plurality of genes of each sample to be tested, the acquired non-genetic data and the predetermined omics data, so as to determine the synergistic influence of the gene mutation data, the non-genetic data and the predetermined omics data on the phenotype of each sample to be tested.

3. The method of claim 1, wherein determining a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested comprises:

determining a weight function of the plurality of rare variations and the synergistic effect of the plurality of rare variations on the phenotype of each test sample based on a sequence kernel association test algorithm.

4. The method of claim 1, wherein predicting the predetermined disease risk of the sample to be tested based on the comparison between the absolute risk and the predetermined threshold comprises:

calculating the absolute risk of the subject of the sample to be tested developing the predetermined disease within a predetermined time period, based on the risk score, risk factor distribution data, and the incidence and mortality of the predetermined disease in a target population of a predetermined age;

determining whether the absolute risk is greater than or equal to a predetermined threshold;

in response to determining that the absolute risk is greater than or equal to a predetermined threshold, determining that the subject of the test sample satisfies a high risk condition with respect to a predetermined disease; and

in response to determining that the absolute risk is less than a predetermined threshold, determining that the subject of the test sample does not satisfy a high risk condition with respect to a predetermined disease.

5. The method of claim 2, wherein calculating a risk score for the subject of the test sample to develop the predetermined disease comprises:

calculating the risk score of the subject of the sample to be tested for developing the predetermined disease based on the weighted sum of the gene mutation data on each gene and the corresponding gene mutation weights, the product of the non-genetic data and the corresponding non-genetic data weights, and the product of the predetermined omics data and the corresponding omics data weights.

6. The method of claim 1, wherein the non-genetic data is generated based on at least one of the following items associated with the sample to be tested: microbial data, cellular data, clinical information, immune marker information, and image data of a target region of the subject to which the sample to be tested belongs.

7. The method of claim 2, wherein the predetermined omics data comprises at least one of transcriptomic data, proteomic data, and metabolomic data.

8. The method of claim 5, further comprising:

for each non-genetic data, superimposing the non-genetic data sub-weights corresponding to each gene to generate a corresponding non-genetic data weight for each non-genetic data; and

for each predetermined omics data, superimposing the omics data sub-weights corresponding to each gene to generate a corresponding omics data weight for each predetermined omics data.

9. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit causing the computing device to perform the method of any of claims 1-8.

10. A computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the method of any one of claims 1 to 8.

Technical Field

The present disclosure relates generally to biological information processing, and in particular, to methods, computing devices, and storage media for predicting risk of a predetermined disease.

Background

Even with the great progress made in modern medicine, some complex predetermined diseases (such as, without limitation, cancer) remain an important cause of human death. Cancer results from a multi-stage process in which normal cells become tumor cells; progression from precancerous lesions to malignant tumors takes time, and clinical manifestations commonly appear late. If cases of the predetermined disease can be detected and treated early, the mortality rate of patients (e.g., cancer patients) can be reduced to some extent. Therefore, early and accurate prediction of the risk of a predetermined disease (e.g., cancer) is of particular importance.

Conventional approaches for early screening or predicting the risk of a predetermined disease (e.g., cancer) mainly include: imaging screening, endoscopic screening, tumor marker screening, and polygenic risk score (PRS) prediction. Imaging screening usually requires a tumor to reach a certain size before it can be detected effectively, so detection lags behind onset; in addition, of the ultrasound, X-ray, CT, MRI and other testing means used for imaging screening, some (e.g., X-ray and CT) expose the subject to radiation. For endoscopic screening, the gastroscopes, colonoscopes, cystoscopes and the like that are used are highly invasive, tend to cause discomfort and cross infection, and have a relatively limited detection range. For tumor marker screening, the false negative rate and the false positive rate are high. The polygenic risk score (PRS) identifies different levels of risk of developing a predetermined disease by computing a score in which a plurality of single nucleotide polymorphisms (SNPs) are weighted by their effect sizes. However, PRS reflects only genetic risk, whereas the vast majority of complex diseases (including cancer, for example) arise from the interaction of genetic and environmental factors such as smoking, drinking and air pollution; PRS techniques that consider only genetic risk therefore have difficulty accurately predicting an individual's risk of developing a predetermined disease.

In summary, conventional schemes for predicting the risk of a predetermined disease suffer from strong invasiveness, significant lag, poor accuracy and the like, and have difficulty predicting an individual's risk of developing the predetermined disease earlier and more accurately.

Disclosure of Invention

The present disclosure provides a method, computing device and computer storage medium for predicting the risk of a predetermined disease, which can predict an individual's risk of developing the predetermined disease earlier and more accurately.

According to a first aspect of the present disclosure, a method for predicting the risk of a predetermined disease is provided. The method comprises: acquiring gene sequencing data of a plurality of samples to be tested so as to generate gene mutation data of a plurality of genes of each sample to be tested; acquiring a plurality of non-genetic data associated with each sample to be tested; determining, via a regression algorithm, a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested; calculating, for each sample to be tested, a risk score of the subject to which the sample belongs for developing the predetermined disease, based on the gene mutation data, the non-genetic data and the corresponding weights on each gene; and determining, based on the calculated risk score, the absolute risk of the subject developing the predetermined disease within a predetermined time, so as to predict the predetermined disease risk of the sample to be tested based on a comparison between the absolute risk and a predetermined threshold.

According to a second aspect of the present disclosure, there is also provided a computing device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a computer-readable storage medium. The computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

In some embodiments, the method for predicting a predetermined disease risk further comprises: obtaining a plurality of predetermined omics data associated with each sample to be tested; and normalizing the generated gene mutation data of the plurality of genes of each sample to be tested, the acquired non-genetic data and the predetermined omics data, so as to determine the synergistic influence of the gene mutation data, the non-genetic data and the predetermined omics data on the phenotype of each sample to be tested.

In some embodiments, determining a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested comprises: determining a weight function of the plurality of rare variations and the synergistic effect of the plurality of rare variations on the phenotype of each sample to be tested based on a sequence kernel association test algorithm.

In some embodiments, predicting the predetermined disease risk of the sample to be tested based on the comparison between the absolute risk and the predetermined threshold comprises: calculating the absolute risk of the subject of the sample to be tested developing the predetermined disease within a predetermined time period, based on the risk score, risk factor distribution data, and the incidence and mortality of the predetermined disease in a target population of a predetermined age; determining whether the absolute risk is greater than or equal to a predetermined threshold; in response to determining that the absolute risk is greater than or equal to the predetermined threshold, determining that the subject of the sample to be tested satisfies a high risk condition with respect to the predetermined disease; and in response to determining that the absolute risk is less than the predetermined threshold, determining that the subject of the sample to be tested does not satisfy the high risk condition with respect to the predetermined disease.

In some embodiments, calculating a risk score of the subject of the sample to be tested for developing the predetermined disease comprises: calculating the risk score based on the weighted sum of the gene mutation data on each gene and the corresponding gene mutation weights, the product of the non-genetic data and the corresponding non-genetic data weights, and the product of the predetermined omics data and the corresponding omics data weights.

In some embodiments, the non-genetic data is generated based on at least one of the following items associated with the sample to be tested: microbial data, cellular data, clinical information, immune marker information, and image data of a target region of the subject to which the sample to be tested belongs.

In some embodiments, the predetermined omics data comprises at least one of transcriptomic data, proteomic data, and metabolomic data.

In some embodiments, the method further comprises: for each non-genetic data, superimposing the non-genetic data sub-weights corresponding to each gene to generate a corresponding non-genetic data weight for each non-genetic data; and for each predetermined omics data, superimposing the omics data sub-weights corresponding to each gene to generate a corresponding omics data weight for each predetermined omics data.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

Fig. 1 shows a schematic diagram of a system for a method for predicting risk of a predetermined disease according to an embodiment of the present disclosure.

Fig. 2 shows a flow diagram of a method for predicting risk of a predetermined disease according to an embodiment of the present disclosure.

Fig. 3 shows a flow diagram of a method for determining the impact of the gene mutations, the obtained plurality of non-genetic data and the plurality of predetermined omics data on the phenotype, according to an embodiment of the present disclosure.

Fig. 4 shows a flow chart of a method for predicting a predetermined disease risk of a test sample according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of a method for comparing an absolute risk to a predetermined threshold according to an embodiment of the present disclosure.

Fig. 6 schematically shows a block diagram of an electronic device suitable for implementing an embodiment of the disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As mentioned previously, conventional approaches for early screening or predicting the risk of a predetermined disease (e.g., without limitation, cancer) mainly include: imaging screening, endoscopic screening, tumor marker screening, and polygenic risk score (PRS) prediction. These approaches are highly invasive, lag significantly behind disease onset, or have poor accuracy; they therefore have difficulty predicting an individual's risk of developing a predetermined disease earlier and more accurately.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for predicting the risk of a predetermined disease. In the scheme, gene mutation data of a plurality of genes of a sample to be tested is generated based on gene sequencing data, and a plurality of non-genetic data associated with the sample to be tested is obtained; a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested is then determined via a regression algorithm. The scheme can thereby capture the synergistic influence of different data sources, such as gene mutation data and non-genetic data, on the phenotype of the sample to be tested. In addition, the present disclosure calculates, for each sample to be tested, a risk score of the subject to which the sample belongs for developing the predetermined disease, based on the gene mutation data, the non-genetic data and their corresponding weights on each gene. The scheme thus goes beyond risk prediction in the single dimension of genes: following the principles of systems medicine, a model algorithm aggregates multi-source information, such as multiple genes and multiple non-genetic factors, so that signals of the predetermined disease are captured to the greatest possible extent. It is understood that the onset of the predetermined disease results from the combined action of multiple genes and multiple causative factors and progresses through multiple stages. At the early stage of the predetermined disease, weak signals of the disease may appear in various parts of the body and be reflected in different data sources. Therefore, the scheme can predict the risk of developing the predetermined disease earlier and more accurately for the individual to be tested.

Fig. 1 shows a schematic diagram of a system 100 for a method of predicting risk of a predetermined disease according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes, for example, a computing device 110, a sequencing device 130, a server 140, a user terminal 150, and a network 160. The computing device 110 may interact with the sequencing device 130, the server 140, and the user terminal 150 in a wired or wireless manner via the network 160.

The computing device 110, for example, a confidence server, is used to predict the risk of a predetermined disease. In particular, the computing device 110 can obtain gene sequencing data for a plurality of samples to be tested, e.g., from the sequencing device 130, and can obtain a plurality of non-genetic data and predetermined omics data associated with each sample to be tested, e.g., from the server 140 and the user terminal 150. The computing device 110 may also generate gene mutation data for a plurality of genes of each sample to be tested based on the obtained gene sequencing data; determine, via a regression algorithm, a weight function indicative of the impact of the gene mutations and the non-genetic data on the phenotype of each sample to be tested; and calculate the risk score of the subject of the sample to be tested for developing the predetermined disease based on the gene mutation data, the non-genetic data and the corresponding weights on each gene. The computing device 110 may also determine, based on the calculated risk score, the absolute risk of the subject developing the predetermined disease within a predetermined time, so as to predict the predetermined disease risk of the sample to be tested.

In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device.

The computing device 110 includes, for example, a gene mutation data generation unit 112, a non-genetic data acquisition unit 114, a weight function determination unit 116, a test sample risk score calculation unit 118, and a predetermined disease risk prediction unit 120. These units may be configured on one or more computing devices 110.

The gene mutation data generation unit 112 is used for acquiring gene sequencing data of the plurality of samples to be tested so as to generate gene mutation data of the plurality of genes of each sample to be tested.

The non-genetic data acquisition unit 114 is used for acquiring a plurality of non-genetic data associated with each sample to be tested.

The weight function determination unit 116 is used for determining, via a regression algorithm, a weight function indicative of the influence of the gene mutation data and the non-genetic data on the phenotype of each sample to be tested.

The test sample risk score calculation unit 118 is used for calculating, for each sample to be tested, a risk score of the subject of the sample for developing the predetermined disease, based on the gene mutation data, the non-genetic data, and the corresponding weights on each gene.

The predetermined disease risk prediction unit 120 is used for determining, based on the calculated risk score, the absolute risk of the subject of the sample to be tested developing the predetermined disease within a predetermined time, so as to predict the predetermined disease risk of the sample to be tested based on a comparison between the absolute risk and a predetermined threshold.

As for the sequencing device 130, it is used for performing gene sequencing based on a sample to be tested (e.g., a blood sample or a tissue sample) of a subject to be tested, so as to generate gene sequencing data on a plurality of samples to be tested based on the sequencing result, and transmit the generated gene sequencing data to the computing device 110.

The server 140, for example and without limitation a server of a medical institution, may send non-genetic data (e.g., at least one of microbial data, cellular data, clinical information, immune marker information, image data of a target region of a subject to which a test sample belongs, which is associated with the test sample) and predetermined omics data (at least one of transcriptomic data, proteomic data, and metabolomic data) about a user to the computing device 110.

The user terminal 150 may transmit to the computing device 110, for example, part of the non-genetic data of the user, such as age, gender, family history of a predetermined disease, and the like.

A method for predicting risk of a predetermined disease according to an embodiment of the present disclosure will be described below with reference to Fig. 2. Fig. 2 shows a flow diagram of a method 200 for predicting risk of a predetermined disease according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 600 depicted in Fig. 6, or at the computing device 110 depicted in Fig. 1. It should be understood that the method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 202, the computing device 110 obtains gene sequencing data for a plurality of samples to be tested to generate gene mutation data for a plurality of genes for each sample to be tested.

Gene sequencing data on a plurality of samples to be tested (e.g., tissue samples or blood samples) is generated based on high-throughput sequencing, for example.

Generating the gene mutation data of the plurality of genes of each sample to be tested includes, for example, data preprocessing, alignment and quality control, mutation identification, and mutation annotation of the obtained gene sequencing data. Tumor gene mutation types include point mutations, insertions, deletions, gene rearrangements, copy number abnormalities, and other gene mutations.

Methods for quality control of gene sequencing data include, for example, quality control based on base call quality scores, the number of effective reads after filtering, the effective sequencing depth, and the like. In some embodiments, to improve the quality of the generated gene mutation data, mutation identification may be performed twice. For example, after the gene mutation data is first generated, extraction, filtering and base quality score checking are performed for single-site, insertion and deletion mutations. The regenerated gene mutation data is then used to predict the predetermined disease risk.
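The quality control step can be illustrated with a minimal Python sketch. It assumes each called mutation carries a base call quality score and an effective sequencing depth; the `VariantCall` record, the placeholder gene labels, and the thresholds (quality 30, depth 20) are hypothetical illustrations and are not values given in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record for one called mutation (not defined in the disclosure)."""
    gene: str
    kind: str            # "single-site", "insertion" or "deletion"
    base_quality: float  # base call quality score at the site
    depth: int           # effective sequencing depth at the site


def passes_qc(call, min_quality=30.0, min_depth=20):
    """Keep a call only if both assumed thresholds are met."""
    return call.base_quality >= min_quality and call.depth >= min_depth


def filter_calls(calls, min_quality=30.0, min_depth=20):
    """Second-pass filtering of the mutations produced by the first pass."""
    return [c for c in calls if passes_qc(c, min_quality, min_depth)]


# Illustrative first-pass calls; gene labels are placeholders.
calls = [
    VariantCall("GENE_A", "single-site", 37.0, 120),
    VariantCall("GENE_B", "deletion", 18.0, 95),   # fails on quality
    VariantCall("GENE_C", "insertion", 35.0, 8),   # fails on depth
]
kept = filter_calls(calls)
```

Only the calls that survive this filter would be carried forward as the regenerated gene mutation data.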

At step 204, the computing device 110 obtains a plurality of non-genetic data associated with each sample to be tested. In some embodiments, the computing device 110 obtains predetermined omics data associated with each test sample in addition to the plurality of non-genetic data associated with each test sample. The non-genetic data is generated based on at least one of microbial data, cellular data, clinical information, immune marker information, and image data of a target region of a subject to which the test sample belongs, for example, in association with the test sample.

At step 206, the computing device 110 determines, via a regression algorithm, a weight function indicative of the impact of the genetic mutation and the non-genetic data on the phenotype of each test sample.

It will be appreciated that the occurrence of a predetermined disease, such as cancer, is often the result of a combination of genetic and environmental factors. A representative regression model for determining the associated impact of a plurality of genetic variations and non-genetic data on the sample phenotype is described below in connection with formula (1). The regression model is, for example, a logistic regression model.

logit P(Y_i = 1) = τ S_i + γ^T Z_i (1)

In the above formula (1), i denotes the i-th sample, i = 1, ..., n. S_i denotes a weighted linear combination of the gene mutation data G_i, that is, S_i = ξ^T G_i. ξ denotes the weights and is an m × 1 weight vector, i.e., ξ = (ξ_1, ..., ξ_m)^T. G_i denotes the gene mutations observed on the i-th sample, which may be rare gene mutations or common gene mutations. G_ij denotes the j-th gene mutation on the i-th sample, j = 1, ..., m. τ denotes a coefficient, which is a constant. γ denotes the vector of regression coefficients associated with the non-genetic data Z_i of the i-th sample. Z_i denotes the non-genetic data associated with the i-th sample, and Z_ij denotes the j-th non-genetic data on the i-th sample.
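As a minimal sketch of how formula (1) is evaluated for one sample, the Python below computes the weighted mutation score S_i and converts the linear predictor into a disease probability through the inverse logit. The function names and the numeric inputs are illustrative assumptions, not part of the disclosure.

```python
import math


def burden_score(xi, g):
    """S_i = xi^T G_i: weighted linear combination of one sample's mutations."""
    if len(xi) != len(g):
        raise ValueError("weight vector and mutation vector must have equal length")
    return sum(w * x for w, x in zip(xi, g))


def disease_probability(tau, xi, g, gamma, z):
    """P(Y_i = 1) implied by logit P(Y_i = 1) = tau * S_i + gamma^T Z_i."""
    if len(gamma) != len(z):
        raise ValueError("gamma and Z_i must have equal length")
    eta = tau * burden_score(xi, g) + sum(c * v for c, v in zip(gamma, z))
    return 1.0 / (1.0 + math.exp(-eta))


# Illustrative values only: two mutations, one non-genetic covariate.
s = burden_score([0.5, 2.0], [1, 2])                           # 0.5*1 + 2.0*2
p = disease_probability(0.3, [0.5, 2.0], [1, 2], [0.1], [1.0])
```

Under the null hypothesis τ = 0, the mutation term drops out and the probability depends on the non-genetic data alone.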

The method for determining the weight function is described below in conjunction with formulas (2) to (5). Under the null hypothesis H₀: τ = 0, the score statistic U may be expressed as formula (2).

In formula (2), γ̂ denotes the restricted maximum likelihood estimate of γ, and U denotes the score statistic. γ̂ is obtained by solving equation (3), and the variance V of U can be estimated using formulas (4) and (5).

Under the null hypothesis, the test statistic T = U/V^{1/2} approximately follows a standard normal distribution.
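Formulas (2) to (5) are not reproduced in this text. For orientation only, the standard score test for the coefficient τ in a logistic model of the form (1) can be written as follows; this is a sketch under that assumption, and the assignment of the numbers (2) to (5) to the individual expressions is likewise an assumption rather than the disclosure's own wording.

```latex
% Score statistic under H_0: \tau = 0
U = \sum_{i=1}^{n} S_i \left( Y_i - \hat{\mu}_i \right),
\qquad
\hat{\mu}_i = \frac{e^{\hat{\gamma}^{T} Z_i}}{1 + e^{\hat{\gamma}^{T} Z_i}}
\tag{2}

% Estimating equation whose solution is the restricted estimate \hat{\gamma}
\sum_{i=1}^{n} \left( Y_i - \frac{e^{\gamma^{T} Z_i}}{1 + e^{\gamma^{T} Z_i}} \right) Z_i = 0
\tag{3}

% Variance estimate of U
V = \sum_{i=1}^{n} v_i S_i^{2}
  - \left( \sum_{i=1}^{n} v_i S_i Z_i^{T} \right)
    \left( \sum_{i=1}^{n} v_i Z_i Z_i^{T} \right)^{-1}
    \left( \sum_{i=1}^{n} v_i S_i Z_i \right)
\tag{4}

% Per-sample variance of Y_i under the null model
v_i = \hat{\mu}_i \left( 1 - \hat{\mu}_i \right)
\tag{5}
```

With these definitions, T = U/V^{1/2} is compared against the standard normal distribution, as stated above.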

In some embodiments, the method of determining a weighting function for indicating the impact of mutation data on a phenotype, for example, comprises: determining a weight function of the plurality of rare variations and the synergistic effect of the plurality of rare variations on the phenotype of each test sample based on a sequence kernel association test algorithm.

It is understood that traditional genome-wide association studies (GWAS) typically identify common variations (e.g., genetic variations with a minor allele frequency (MAF) greater than 5%) associated with a predetermined disease (e.g., a complex disease), but miss the impact of rare variations (e.g., without limitation, genetic variations with a MAF less than or equal to 5% or 1%) on the complex disease. This is mainly because GWAS examines genetic variations one by one based on mixed linear models; because rare variations occur only at low frequency, such single-variant tests have little power to detect them. Research shows that rare variations and common variations act jointly on the phenotype of the sample to be tested and are correlated with the non-genetic data.
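The frequency classes mentioned above can be sketched in Python as follows. The sketch assumes diploid genotypes coded as alternate-allele counts (0, 1 or 2) and uses 1% and 5% as cutoffs; the handling of values exactly at a cutoff is an assumption, since the text allows either 5% or 1% as the boundary for rare variations.

```python
def minor_allele_frequency(genotypes):
    """MAF from per-sample alternate-allele counts (0, 1 or 2) at one site."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def classify_variant(maf, rare_cutoff=0.01, common_cutoff=0.05):
    """Assumed classes: common (> 5%), low-frequency (1% to 5%), rare (<= 1%)."""
    if maf > common_cutoff:
        return "common"
    if maf > rare_cutoff:
        return "low-frequency"
    return "rare"


# One heterozygous carrier among four samples: 1 of 8 alleles.
maf_example = minor_allele_frequency([0, 0, 0, 1])
label_example = classify_variant(maf_example)
```

A site carried by only a handful of samples yields very few informative observations, which is why testing such sites one at a time has little power.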

The Sequence Kernel Association Test (SKAT) is a variance component test. SKAT assumes that the regression coefficient of each genetic variation follows an arbitrary distribution with a mean of 0 and a variance of w_j τ, where w_j is the weight of the given variation j and τ is the variance component.

Under the null hypothesis, τ = 0, so a variance component score test may be used. The corresponding variance component score statistic may be expressed as shown in formula (6) below.

Q = (y - μ̂)' K (y - μ̂) (6)

In the above formula (6), K = GWG' is a weighted linear kernel function that represents the genetic similarity between individuals. G is an n × m matrix whose (i, j) entry is the value of gene mutation j on sample i, and W = diag(ξ_1, ..., ξ_m) contains the weights of the m mutation sites. μ̂ is the estimated mean of y under the null hypothesis: for a dichotomous phenotype (for example, whether the subject has the predetermined disease), μ̂ is the inverse logit of the null-model linear predictor; for a continuous phenotype y, μ̂ is the null-model linear predictor itself. The coefficients of the null model are estimated by fitting a regression model under the null hypothesis. Under the null hypothesis, Q follows a mixture of chi-squared distributions, whose tail probability can be computed with the computationally efficient Davies method. ξ_m is the pre-given weight of variation m, which reflects the relative contribution of variation m to the variance component score statistic.
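A minimal Python sketch of the variance component score statistic follows, assuming the usual SKAT form Q = (y - μ̂)' K (y - μ̂) with K = GWG'. Expanding the product gives Q as a weighted sum of squared per-variation scores, which avoids forming the n × n kernel. The function name and the toy data are illustrative assumptions, and the p-value step (Davies method) is deliberately left out.

```python
def skat_q(y, mu_hat, G, weights):
    """Q = (y - mu)' G W G' (y - mu), with W = diag(weights).

    y, mu_hat: length-n lists; G: n rows of m mutation values; weights: length m.
    """
    n = len(y)
    if len(mu_hat) != n or len(G) != n:
        raise ValueError("y, mu_hat and G must describe the same n samples")
    residuals = [yi - mi for yi, mi in zip(y, mu_hat)]
    q = 0.0
    for j, w in enumerate(weights):
        # Score of variation j: sum over samples of G[i][j] * residual_i.
        score_j = sum(G[i][j] * residuals[i] for i in range(n))
        q += w * score_j ** 2
    return q


# Toy data: four samples, two variations, null-model mean 0.5 for everyone.
y = [1, 0, 1, 0]
mu_hat = [0.5, 0.5, 0.5, 0.5]
G = [[1, 0], [0, 0], [1, 1], [0, 1]]
q_example = skat_q(y, mu_hat, G, weights=[1.0, 4.0])
```

In the toy data the first variation is carried only by cases, so it contributes to Q, while the second is carried by one case and one control and contributes nothing.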

SKAT may take the weight values from a beta distribution. The calculation of the weight value $\xi_m$ is explained below in conjunction with equation (7).

In the above formula (7), a1 and a2 are determined in advance based on experience. In the SKAT algorithm, the defaults are a1 = 1 and a2 = 25, which, for example, increase the weight of rare variations and low-frequency variations (1% < MAF < 5%). In this way, the genetic variation weight corresponding to each rare variation can be determined.
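A minimal sketch of a beta-density weight is shown below, assuming that the weight of a variation is derived from the Beta(a1, a2) density evaluated at that variation's MAF, as in the standard SKAT formulation. Whether the density itself or its square enters $W$ depends on the convention used and is left open here; the function name `beta_weight` is hypothetical.

```python
import math


def beta_weight(maf, a1=1.0, a2=25.0):
    """Beta(a1, a2) probability density evaluated at a minor allele frequency.

    Assumes 0 < maf < 1. With the defaults a1 = 1 and a2 = 25 the density
    reduces to 25 * (1 - maf) ** 24, which decreases as the MAF grows.
    """
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    # Log of the normalising constant 1 / B(a1, a2), computed via log-gamma.
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    log_density = (
        log_norm
        + (a1 - 1.0) * math.log(maf)
        + (a2 - 1.0) * math.log(1.0 - maf)
    )
    return math.exp(log_density)


# Illustrative MAF values: rare, low-frequency and common variations.
weights = [beta_weight(maf) for maf in (0.001, 0.03, 0.2)]
```

With the default parameters the rare variation receives the largest weight and the common variation the smallest, which is the behaviour the defaults are chosen for.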

By adopting the above means, the present disclosure can comprehensively consider the influence of both common and rare variations on the predetermined disease risk, and can therefore predict the predetermined disease risk more accurately. Moreover, by using SKAT, the present disclosure can directly fit the relationship between the phenotype and the genetic variation data within a gene (including common and rare variations) while adjusting for the non-genetic data, and can thus significantly improve the power of the method to detect the synergistic effect of rare and common variations on the phenotype.

At step 208, the computing device 110 calculates, for each sample to be tested, a risk score for the subject of the sample to be tested to develop a predetermined disease based on the genetic mutation data, the non-genetic data, and the corresponding weights on each gene.

A method for calculating the risk score of the subject of a test sample for developing a predetermined disease includes, for example: calculating the risk score based on the weighted sum of the gene mutation data and the corresponding gene mutation weights on each gene, the product of the non-genetic data and the corresponding non-genetic data weights, and the product of the predetermined omics data and the corresponding omics data weights. It is understood that the predetermined disease arises from the combined action of multiple genes and multiple pathogenic factors and develops through multiple stages. Genetic information is expressed in different states, which are studied as genomics, transcriptomics, proteomics, metabolomics, and the like. When the risk score of a sample to be tested is calculated, the present disclosure considers not only the combined influence of the polygenic mutations obtained by sequencing and the non-genetic data on the predetermined disease risk, but also the combined influence of the polygenic mutations, the non-genetic factors, and other predetermined omics data, so that the risk of the predetermined disease is predicted more accurately.

The algorithm for calculating the risk score of the sample to be tested is described below in connection with equation (8). Assume a total of $n$ samples, $q$ non-genetic data items, and $L$ genes, with $p_l$ mutation sites on gene $l$ ($l = 1, \ldots, L$).

$$\mathrm{MRS} = \alpha_0 + \alpha_1' G_{i1} + \alpha_2' G_{i2} + \cdots + \alpha_L' G_{iL} + \beta' X_i + \gamma' Z_i + \epsilon_i \qquad (8)$$

In the above formula (8), $\alpha_0$ represents the intercept, and $\alpha_l$ represents the gene mutation weights for the $l$-th gene; for example, there are $p_l$ gene mutations on the $l$-th gene. $G_{ij}$ represents the $j$-th gene mutation on the $i$-th sample. $\beta = [\beta_1, \ldots, \beta_q]'$ represents the non-genetic data weights. $X_i$ represents the non-genetic data associated with the $i$-th sample. $\gamma = [\gamma_1, \ldots, \gamma_m]'$ represents the omics data weights. $Z_{ij}$ represents the $j$-th predetermined omics data item on the $i$-th sample. MRS represents the risk score for the test sample to develop a predetermined disease, that is, a risk score for the predetermined disease based on multi-gene, multi-non-genetic, and multi-omics data.
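The linear combination in formula (8) can be sketched as follows for a single sample. The function and argument names are hypothetical, the residual term $\epsilon_i$ is omitted because it is not observed at prediction time, and all numbers are illustrative rather than fitted weights.

```python
def dot(weights, values):
    """Inner product of two equally long sequences."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))


def risk_score(alpha0, gene_weights, gene_mutations, beta, x, gamma, z):
    """Risk score of one sample following the structure of formula (8).

    gene_weights[l]   : weights of the mutation sites on gene l
    gene_mutations[l] : mutation values of this sample on gene l
    beta, x           : non-genetic data weights and values
    gamma, z          : omics data weights and values
    """
    if len(gene_weights) != len(gene_mutations):
        raise ValueError("one weight vector is required per gene")
    score = alpha0
    # Per-gene weighted sums over the mutation sites of each gene.
    for alpha_l, g_il in zip(gene_weights, gene_mutations):
        score += dot(alpha_l, g_il)
    score += dot(beta, x)   # non-genetic contribution
    score += dot(gamma, z)  # omics contribution
    return score


# Illustrative values: two genes (2 and 1 mutation sites), two non-genetic
# items (e.g. age and family history) and one omics measurement.
example_score = risk_score(
    alpha0=0.1,
    gene_weights=[[0.5, 0.2], [0.3]],
    gene_mutations=[[1, 0], [2]],
    beta=[0.01, 0.4],
    x=[50, 1],
    gamma=[0.2],
    z=[1.5],
)
```

Keeping the genes as separate vectors mirrors the per-gene terms of the formula and allows each gene to have a different number of mutation sites.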

The calculation of the non-genetic data weights $\beta = [\beta_1, \ldots, \beta_q]'$ is described below with reference to equation (9).

In the above formula (9), $q$ represents the number of non-genetic data items, and $\beta_q$ represents the non-genetic data weight corresponding to the $q$-th non-genetic data item. The weight $\beta_q$ is generated by superimposing, for each non-genetic data item, the sub-weights of that item corresponding to each gene. It will be appreciated that the $q$ non-genetic data items (e.g., age, family history of the predetermined disease, microbial data, clinical information, immunological marker information, image data of a target region of the subject to which the test sample belongs, etc.) are the same for a given subject regardless of the gene considered. However, because the influence of the covariate data (the non-genetic data) is taken into account separately for each of the $L$ genes, different non-genetic data sub-weights are generated for each gene; the sub-weights for the $L$ genes are therefore superimposed to generate the $q$ non-genetic data weights associated with the $q$ non-genetic data items.
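The superposition of per-gene sub-weights can be sketched as below. The sketch assumes that "superimposing" means a plain sum over the $L$ genes; the disclosure's formula (9) may use a different combination (for example, an average), so this is an illustration only. The function name `superimpose_weights` is hypothetical.

```python
def superimpose_weights(sub_weights_per_gene):
    """Combine per-gene sub-weights into one weight per data item.

    sub_weights_per_gene[l][k] is the sub-weight of data item k obtained
    when fitting gene l. Assumption: the combination is a plain sum.
    """
    if not sub_weights_per_gene:
        return []
    n_items = len(sub_weights_per_gene[0])
    if any(len(row) != n_items for row in sub_weights_per_gene):
        raise ValueError("every gene must provide one sub-weight per data item")
    return [
        sum(row[k] for row in sub_weights_per_gene)
        for k in range(n_items)
    ]


# Illustrative sub-weights for three genes and two non-genetic data items.
beta = superimpose_weights([[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]])
```

The same helper would apply unchanged to sub-weights of predetermined omics data.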

The calculation of the omics data weights $\gamma = [\gamma_1, \ldots, \gamma_m]'$ is described below with reference to equation (10).

In the above equation (10), $m$ represents the number of predetermined omics data items, and $\gamma_m$ represents the omics data weight corresponding to the $m$-th predetermined omics data item. The omics data weight $\gamma_m$ is generated by superimposing the omics data sub-weights corresponding to each of the $L$ genes in the sample to be tested.

At step 210, the computing device 110 determines, based on the calculated risk score, the absolute risk that the subject of the sample to be tested develops the predetermined disease within a predetermined time, so that the predetermined disease risk of the sample to be tested can be predicted by comparing the absolute risk with a predetermined threshold.

Absolute risk indicates the probability that a subject of a test sample with a certain set of risk factors (including genetic variation data, non-genetic data, and their interaction terms), who has not developed the predetermined disease by a predetermined age, develops the predetermined disease within a predetermined period of time thereafter.

The predetermined period of time is, for example, 5 years, 10 years, 15 years, or a lifetime (i.e., the life span of the subject to which the sample to be tested belongs).

A method for predicting the predetermined disease risk of a test sample includes, for example: calculating the absolute risk that the subject of the test sample develops the predetermined disease within a predetermined time period, based on the risk score, the risk factor distribution data, and the incidence of the predetermined disease and the mortality rate of the target population at the predetermined age; determining whether the absolute risk is greater than or equal to a predetermined threshold; in response to determining that the absolute risk is greater than or equal to the predetermined threshold, determining that the subject of the test sample satisfies a high risk condition with respect to the predetermined disease; and in response to determining that the absolute risk is less than the predetermined threshold, determining that the subject of the test sample does not satisfy the high risk condition with respect to the predetermined disease. This method is described in detail below with reference to Fig. 4 and is not elaborated here.

In the above-described solution, gene mutation data on a plurality of genes of the sample to be tested is generated based on the gene sequencing data, and a plurality of non-genetic data associated with the sample to be tested is acquired; a weight function indicating the impact of the gene mutations and the non-genetic data on the phenotype of each test sample is then determined via a regression algorithm. The present disclosure can thereby capture the synergistic influence of different data sources, such as gene mutation data and non-genetic data, on the phenotype of the sample to be tested. In addition, the present disclosure calculates, for each test sample, the risk score of the subject of the test sample for developing a predetermined disease based on the gene mutations on each gene, the non-genetic data, and their corresponding weights. In this way, the present disclosure goes beyond risk prediction in the single dimension of genes: based on the principles of systems medicine, multi-source information such as multi-gene and multi-non-genetic factors is aggregated through a model algorithm so that the signal of the predetermined disease is captured to the greatest possible extent. It is understood that the predetermined disease arises from the combined action of multiple genes and multiple pathogenic factors and develops through multiple stages. In the early stage of the predetermined disease, weak disease signals may appear in various parts of the body and are reflected in different data sources. Therefore, the present disclosure can predict the risk of developing the predetermined disease earlier and more accurately for the individual to be tested.

A method 300 for determining the effect of multi-gene data, multi-non-genetic data, and multi-omics data on the phenotype according to an embodiment of the present disclosure will be described below in conjunction with Fig. 3. Fig. 3 shows a flow diagram of the method 300 for determining the effect of the gene mutation data, the plurality of acquired non-genetic data, and the plurality of predetermined omics data on the phenotype, according to an embodiment of the present disclosure. It should be understood that the method 300 may be performed, for example, at the electronic device 600 depicted in Fig. 6, or at the computing device 110 depicted in Fig. 1. It should be understood that the method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 302, the computing device 110 obtains a plurality of predetermined omics data associated with each sample under test. Predetermined omics data include, for example and without limitation: transcriptomic data, proteomic data, and metabolomic data.

At step 304, the computing device 110 normalizes, for each test sample, the generated gene mutation data for the plurality of genes, the acquired non-genetic data, and the predetermined omics data, in order to determine the synergistic effect of the gene mutations, the non-genetic data, and the predetermined omics data on the phenotype of each test sample. By adopting the above technical means, the present disclosure accounts for the combined action of multiple pathogenic factors, such as multiple genes, multiple non-genetic risk factors, and multi-omics data, and can thereby predict the risk of developing the predetermined disease earlier and more accurately for the individual to be tested.
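The normalization step can be illustrated with a column-wise z-score, which places gene mutation counts, non-genetic measurements, and omics measurements on a comparable scale before their weights are fitted. The choice of z-score standardization is an assumption of this sketch; the text does not specify which normalization is used, and the function name `standardize_columns` is hypothetical.

```python
import math


def standardize_columns(rows):
    """Column-wise z-score of a samples-by-features table (list of rows).

    Each column is shifted to mean 0 and scaled to unit (population)
    standard deviation. Constant columns are mapped to 0.0 to avoid a
    division by zero.
    """
    if not rows:
        return []
    n = len(rows)
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of columns")
    result = [[0.0] * n_cols for _ in rows]
    for k in range(n_cols):
        column = [row[k] for row in rows]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        for i, value in enumerate(column):
            result[i][k] = (value - mean) / std if std > 0 else 0.0
    return result


# Illustrative table: column 0 varies across samples, column 1 is constant.
normalized = standardize_columns([[1, 10], [2, 10], [3, 10]])
```

After this step, a one-unit change means one standard deviation in every column, so the fitted weights of different data sources become directly comparable.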

Methods for predicting the risk of developing a disease within a predetermined time period according to embodiments of the present disclosure will be described below in conjunction with Figs. 4 and 5. Fig. 4 shows a flow chart of a method 400 for predicting the predetermined disease risk of a sample to be tested according to embodiments of the present disclosure. Fig. 5 shows a schematic diagram of a method 500 for comparing an absolute risk to a predetermined threshold according to an embodiment of the present disclosure. It should be understood that the method 400 may be performed, for example, at the electronic device 600 depicted in Fig. 6, or at the computing device 110 depicted in Fig. 1. It should be understood that the method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 402, the computing device 110 calculates the absolute risk that the subject of the test sample develops the predetermined disease within a predetermined time period, based on the risk score of the test sample, the risk factor distribution data, and the incidence of the predetermined disease and the mortality rate of the target population at a predetermined age.

The method for predicting the age-specific incidence is described below in conjunction with formula (11).

In the above formula (11), $I(a \mid Z)$ represents the probability of occurrence of the disease at a predetermined age $a$, where $a$ represents the predetermined age. $Z = (Z_1, \ldots, Z_k)$ represents the risk factors, which include genetic variation (e.g., genetic risk factors), non-genetic data (e.g., environmental risk factors), and their interaction terms. In some embodiments, the risk factors further comprise predetermined omics data. The formula also contains a term representing the risk score for the sample to be tested. $I_0(a)$ represents the average baseline hazard for the target population of the predetermined age to develop the predetermined disease, which indicates, for example, the average risk level at which members of the target population of the same age develop the predetermined disease. $E_a$ represents the risk factor distribution data, which typically needs to be calculated from the distribution of the risk factors $Z$ in the underlying disease-free population (i.e., those who have not died of other causes by age $a$).

A method for determining absolute risk over a predetermined time period is described below in connection with equation (12).
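A commonly used competing-risk expression that is consistent with the symbols defined for formula (12), namely the age-specific incidence $I$, the other-cause mortality $m$, the ages $u$ and $v$, and the absolute risk $R_{a,a+s}$, is sketched below. It is offered as an assumption for orientation and is not necessarily the exact formula of the disclosure.

```latex
R_{a,a+s} = \int_{a}^{a+s} I(u \mid Z)\,
            \exp\!\left( -\int_{a}^{u} \bigl[ I(v \mid Z) + m(v \mid Z) \bigr]\, dv \right) du
```

The exponential factor is the probability that the subject neither develops the disease nor dies of other causes between age $a$ and age $u$.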

In the above formula (12), $m(v \mid Z)$ represents the mortality from other causes at age $v$. $u$ represents a given age, where it is assumed that before the given age the subject has neither developed the predetermined disease nor died of other causes. $R_{a,a+s}$ represents the absolute risk that a subject currently of age $a$ develops the disease within the predetermined time period $[a, a+s]$. The absolute risk over the predetermined time period is determined by summing the probabilities of disease occurrence over all given ages $u$ within the predetermined time period.
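The summation over the ages $u$ can be sketched numerically as follows. The sketch assumes a standard competing-risk model with hazards that are constant within each one-year age interval; the function name `absolute_risk` and the callables `incidence` and `other_mortality` are hypothetical stand-ins for the age-specific incidence $I(u \mid Z)$ and the other-cause mortality $m(u \mid Z)$, and the rates used in the example are illustrative, not population data.

```python
import math


def absolute_risk(age, years, incidence, other_mortality):
    """Probability of developing the disease within [age, age + years].

    incidence(u)       : annual disease hazard at integer age u
    other_mortality(u) : annual hazard of death from other causes at age u
    Assumption: both hazards are constant within each one-year interval.
    """
    risk = 0.0
    cumulative_hazard = 0.0  # total hazard accumulated from `age` up to u
    for u in range(age, age + years):
        lam = incidence(u)
        mu = other_mortality(u)
        total = lam + mu
        # Probability of being alive and disease-free at the start of year u.
        event_free = math.exp(-cumulative_hazard)
        if total > 0:
            # Probability that the disease is the first event during year u.
            risk += event_free * (lam / total) * (1.0 - math.exp(-total))
        cumulative_hazard += total
    return risk


# Illustrative constant rates: 1% annual incidence, with and without a
# 2% annual hazard of death from other causes.
risk_without_competing = absolute_risk(40, 10, lambda u: 0.01, lambda u: 0.0)
risk_with_competing = absolute_risk(40, 10, lambda u: 0.01, lambda u: 0.02)
```

Accounting for death from other causes lowers the absolute risk, because a subject who dies of another cause can no longer develop the disease within the period.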

At step 404, the computing device 110 determines whether the absolute risk is greater than or equal to a predetermined threshold.

As shown in Fig. 5, markers 510 to 530 indicate the absolute risk curves for developing breast cancer within 10 years for women between 25 and 70 years of age, with one curve for each range of the risk score of the sample to be tested. Marker 510 corresponds to a risk score of less than 1%; marker 512 to 1-5%; marker 514 to 5-10%; marker 516 to 10-20%; marker 518 to 20-40%; marker 520 to 40-60%; marker 522 to 60-80%; marker 524 to 80-90%; marker 526 to 90-95%; marker 528 to 95-99%; and marker 530 to greater than 99%. Marker 540 indicates the predetermined threshold, which is, for example, 2.6%.

At step 406, if the computing device 110 determines that the absolute risk is greater than or equal to the predetermined threshold, it determines that the subject of the sample to be tested satisfies a high risk condition with respect to the predetermined disease. For example, if the calculated absolute risk that the subject to which the test sample belongs develops breast cancer within 10 years is greater than or equal to 2.6%, the risk of developing the predetermined disease is determined to be high.

At step 408, if the computing device 110 determines that the absolute risk is less than the predetermined threshold, it determines that the subject of the sample to be tested does not satisfy the high risk condition with respect to the predetermined disease. That is, if the calculated absolute risk that the subject to which the sample to be tested belongs develops breast cancer within 10 years is less than 2.6%, the risk of developing the predetermined disease is determined to be low.
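Steps 404 to 408 amount to a single comparison, sketched below. The function name `satisfies_high_risk_condition` is hypothetical; the default threshold of 0.026 corresponds to the 2.6% example given for 10-year breast cancer risk and would differ for other diseases or time periods.

```python
def satisfies_high_risk_condition(absolute_risk, threshold=0.026):
    """Return True when the absolute risk reaches the predetermined threshold.

    Both arguments are probabilities in [0, 1]. The comparison is inclusive,
    matching "greater than or equal to the predetermined threshold".
    """
    if not 0.0 <= absolute_risk <= 1.0:
        raise ValueError("absolute_risk must be a probability between 0 and 1")
    return absolute_risk >= threshold


# Illustrative absolute risks for two subjects.
high = satisfies_high_risk_condition(0.031)
low = satisfies_high_risk_condition(0.012)
```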

Fig. 5 thus shows the 10-year absolute risk of breast cancer at different ages, as well as the age at which women at different risk score levels of the sample to be tested reach the predetermined threshold (e.g., 2.6%).

Fig. 6 schematically illustrates a block diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure. The device 600 may be a device for implementing the methods 200, 300, and 400 shown in Figs. 2, 3, and 4. As shown in Fig. 6, the device 600 includes a central processing unit (CPU) 601 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The central processing unit 601 performs the various methods and processes described above, such as the methods 200, 300, and 400. For example, in some embodiments, the methods 200, 300, and 400 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM and/or the communication unit 609. When loaded into the RAM and executed by the CPU, the computer program may perform one or more of the operations of the methods 200, 300, and 400 described above. Alternatively, in other embodiments, the CPU may be configured by any other suitable means (e.g., by way of firmware) to perform one or more of the acts of the methods 200, 300, and 400.

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing the state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are only alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
