Disease prediction method and system based on spatial separability and using gene detection

Document No.: 1044855 Publication date: 2020-10-09

Reading note: this technology, "Disease prediction method and system based on spatial separability and using gene detection", was designed by 杜强, 李德轩, 郭雨晨, 聂方兴, 张兴 and 唐超 on 2020-07-01. Main content: The invention discloses a disease prediction method and system based on spatial separability and using gene detection. The method judges whether a gene can distinguish each disease species by checking whether the gene's numerical ranges for different disease species overlap and how dispersed they are; a discrete value is calculated from the mean values of the numerical ranges of the disease species corresponding to each gene; the genes are then sorted by this discrete value to obtain the significant genes, and a threshold is calculated for each significant gene; finally, the value of each significant gene is compared with its threshold to obtain a final prediction score for each disease species, from which the probability of disease is judged. The invention proposes a significant gene extraction method based on spatial separability that greatly reduces the number of genes to be detected, lowers the cost of gene detection, relieves the pressure on doctors, balances computational cost against accuracy, and to a certain extent promotes the development and popularization of gene detection.

1. A disease prediction method using gene detection based on spatial separability, comprising the steps of:

obtaining human genes and gene detection data; the system determines, through the MAX function, the value range of each disease species corresponding to each gene, compares whether the value ranges of the disease species of each gene overlap, and thereby obtains the genes capable of screening a specific disease species; the system extracts the numerical values of each gene capable of screening a specific disease species and normalizes the row data of each such gene; from the normalized row data of the genes capable of screening a specific disease species, the system calculates, for each such gene, the mean value of the numerical range of each disease species and the mean value of the numerical ranges of all disease species corresponding to the gene; the system calculates the expected distance between the mean value of the numerical ranges of all disease species corresponding to the gene and the mean value of the numerical range of the disease species that the gene can screen, which is the discrete value of the gene; the discrete values are sorted, and the genes capable of screening a specific disease species whose discrete values are significantly lower than 0.1 are removed to obtain the significant genes; the significant genes are classified according to the disease species they can screen, and the discrete values of significant genes of the same class are regularized to obtain regularized discrete values; the system calculates, through basic Python command statements, the value range of the disease species that each significant gene can screen and the value range of the remaining disease species, and determines the threshold of each significant gene; an initial prediction score of 0 is set for the disease species corresponding to each significant gene; the system compares the numerical value of each significant gene with its threshold to obtain a final prediction score for each disease species, and judges the probability of disease according to the prediction scores.

2. The method of claim 1, wherein the step in which the system determines, through the MAX function, the value range of each disease species corresponding to each gene, compares whether the value ranges of the disease species of each gene overlap, and obtains the genes capable of screening a specific disease species specifically comprises:

traversing the gene detection data gene by gene, determining the maximum value and the minimum value of each disease species corresponding to each gene through the MAX function, and thereby determining each value range; and comparing whether the value ranges of the disease species corresponding to a single gene overlap, wherein if they do not overlap, the gene is a gene capable of screening a specific disease species.

3. The method of claim 1, wherein normalizing the row data of each corresponding gene comprises the steps of:

first calculating, through the max function, the maximum value and the minimum value of the single row of data corresponding to a gene capable of screening a specific disease species; subtracting the minimum value from the maximum value to obtain a difference; and then subtracting the minimum value from each value in the row and dividing by the difference to obtain the normalized values of the row data.

4. The method of claim 1, wherein regularizing the discrete values of significant genes of the same class comprises the steps of:

determining the maximum of the discrete values of all significant genes corresponding to the disease species, and dividing the discrete value of each significant gene by that maximum to obtain the regularized discrete value.

5. The method of claim 1, wherein determining the threshold of each significant gene comprises the steps of:

the system calculates, through basic Python command statements, the maximum value a and the minimum value b of the numerical range of the disease species that the significant gene can screen;

and then calculates the maximum value c and the minimum value d of the numerical range of the remaining disease species, wherein the threshold of each significant gene is determined in one of the following two cases:

in the first case, b > c, and the threshold of the significant gene is equal to b - (a - b)/((a - b) + (c - d)) × (b - c);

in the second case, a < d, and the threshold of the significant gene is equal to a + (a - b)/((a - b) + (c - d)) × (d - a).

6. The method of claim 1, wherein the system comparing the value of each significant gene with its threshold to obtain a final prediction score for each disease species comprises the following two cases:

in the first case, where the maximum value of the numerical range of the disease species that the significant gene can screen is smaller than the threshold, the final score of the disease species is the prediction score plus the regularized discrete value corresponding to the significant gene; otherwise, the significant gene is discarded;

in the second case, where the minimum value of the numerical range of the disease species that the significant gene can screen is larger than the threshold, the final score of the disease species is the prediction score plus the regularized discrete value corresponding to the significant gene; otherwise, the significant gene is discarded.

7. A disease prediction system using gene detection based on spatial separability, comprising:

a gene data input unit, used for inputting human genes and gene detection data;

a gene screening unit, used for determining, through the MAX function, the value range of each disease species corresponding to each gene input by the gene data input unit, and comparing whether the value ranges of the disease species of each gene overlap, so as to obtain the genes capable of screening a specific disease species;

a normalization processing unit, used for extracting the numerical values of each gene capable of screening a specific disease species and normalizing the row data of each such gene;

a discrete value calculation unit, used for calculating, from the row data normalized by the normalization processing unit, the mean value of the numerical range of each disease species corresponding to each gene capable of screening a specific disease species and the mean value of the numerical ranges of all disease species corresponding to the gene, and for calculating the expected distance between the mean value of the numerical ranges of all disease species corresponding to the gene and the mean value of the numerical range of the disease species that the gene can screen, which is the discrete value of the gene;

a significant gene acquisition unit, used for sorting the discrete values obtained by the discrete value calculation unit and removing the genes whose discrete values are significantly lower than 0.1 to obtain the significant genes;

a significant gene classification unit, used for classifying the significant genes obtained by the significant gene acquisition unit according to the disease species they can screen, placing the significant genes that screen the same disease species in the same class, and regularizing their discrete values to obtain regularized discrete values;

a significant gene threshold calculation unit, used for calculating, through basic Python command statements, the value range of the disease species that each significant gene can screen and the value range of the remaining disease species, and determining the threshold of each significant gene;

a disease prediction unit, used for comparing the numerical value of each significant gene with its threshold, starting from the preset initial prediction score of the disease species corresponding to each significant gene, to obtain the final prediction score of each disease species, and judging the probability of disease according to the prediction scores.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the disease prediction method using gene detection based on spatial separability according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for disease prediction using genetic testing based on spatial separability according to any one of claims 1 to 6.

Technical Field

The invention relates to a disease prediction technology, in particular to a disease prediction method and a system based on spatial separability and utilizing gene detection.

Background

Genes govern human birth, aging and death, underlie health, disease and longevity, and are a determining factor in quality of life. Genetic testing can diagnose existing physical diseases and predict disease risk, so the risk of a disease can be detected before the disease develops. In recent years, with the progress of science, genetic testing has become more and more accessible and its cost has fallen, and scientists have therefore begun to use genetic testing to predict diseases that may exist in the human body.

As scientists' understanding of the genetic origin of diseases grows, the number of known pathogenic mutant genes is increasing rapidly, and thousands of defective genes are known so far. Many diseases can be predicted by means of genetic testing. For example, detecting multiple tumor susceptibility genes makes it possible to predict gastric cancer, breast cancer, prostate cancer and the like, and detecting susceptibility genes for metabolic and nutritional abilities makes it possible to predict anemia, systemic lupus erythematosus, diabetes and the like. However, current gene detection methods usually need to detect thousands or even tens of thousands of genes in order to predict potential diseases accurately, which is very costly and consumes a great deal of manpower. These drawbacks have limited the development and popularization of gene testing.

In addition, when faced with tens of thousands of genes and a limited number of samples, deep learning methods are severely limited: first, because no mechanism points to specific genes, and second, because there is not enough case data to support model iteration, so overfitting occurs easily. Moreover, because the number of genes is large, many of them may act as interference terms, and direct analysis may be inaccurate as a result.

In view of the above, it is desirable to provide a disease prediction method for gene detection with high accuracy on the premise of reducing the number of detected genes.

Disclosure of Invention

In order to solve the technical problems, the technical scheme adopted by the invention is to provide a disease prediction method based on spatial separability and using gene detection, which comprises the following steps:

obtaining human genes and gene detection data; the system determines, through the MAX function, the value range of each disease species corresponding to each gene, compares whether the value ranges of the disease species of each gene overlap, and thereby obtains the genes capable of screening a specific disease species; the system extracts the numerical values of each gene capable of screening a specific disease species and normalizes the row data of each such gene; from the normalized row data of the genes capable of screening a specific disease species, the system calculates, for each such gene, the mean value of the numerical range of each disease species and the mean value of the numerical ranges of all disease species corresponding to the gene; the system calculates the expected distance between the mean value of the numerical ranges of all disease species corresponding to the gene and the mean value of the numerical range of the disease species that the gene can screen, which is the discrete value of the gene; the discrete values are sorted, and the genes capable of screening a specific disease species whose discrete values are significantly lower than 0.1 are removed to obtain the significant genes; the significant genes are classified according to the disease species they can screen, and the discrete values of significant genes of the same class are regularized to obtain regularized discrete values; the system calculates, through basic Python command statements, the value range of the disease species that each significant gene can screen and the value range of the remaining disease species, and determines the threshold of each significant gene; an initial prediction score of 0 is set for the disease species corresponding to each significant gene; the system compares the numerical value of each significant gene with its threshold to obtain a final prediction score for each disease species, and judges the probability of disease according to the prediction scores.

In the above method, the step in which the system determines, through the MAX function, the value range of each disease species corresponding to each gene, compares whether the value ranges of the disease species of each gene overlap, and obtains the genes capable of screening a specific disease species is specifically as follows:

traversing the gene detection data gene by gene, determining the maximum value and the minimum value of each disease species corresponding to each gene through the MAX function, and thereby determining each value range; and comparing whether the value ranges of the disease species corresponding to a single gene overlap, wherein if they do not overlap, the gene is a gene capable of screening a specific disease species.

In the above method, normalizing the row data of each corresponding gene comprises the following steps:

first calculating, through the max function, the maximum value and the minimum value of the single row of data corresponding to a gene capable of screening a specific disease species; subtracting the minimum value from the maximum value to obtain a difference; and then subtracting the minimum value from each value in the row and dividing by the difference to obtain the normalized values of the row data.

In the above method, regularizing the discrete values of significant genes of the same class comprises the following steps:

determining the maximum of the discrete values of all significant genes corresponding to the disease species, and dividing the discrete value of each significant gene by that maximum to obtain the regularized discrete value.

In the above method, determining the threshold of each significant gene comprises the following steps:

the system calculates, through basic Python command statements, the maximum value a and the minimum value b of the numerical range of the disease species that the significant gene can screen;

and then calculates the maximum value c and the minimum value d of the numerical range of the remaining disease species, wherein the threshold of each significant gene is determined in one of the following two cases:

in the first case, b > c, and the threshold of the significant gene is equal to b - (a - b)/((a - b) + (c - d)) × (b - c);

in the second case, a < d, and the threshold of the significant gene is equal to a + (a - b)/((a - b) + (c - d)) × (d - a).
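The two threshold cases can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `significant_gene_threshold` is hypothetical, and the fallback weight of 0.5 when both ranges have zero width is an assumption added here to avoid division by zero, since the text does not address that situation.

```python
def significant_gene_threshold(a, b, c, d):
    """Threshold for one significant gene.

    a, b: maximum and minimum of the range of the disease species the gene screens.
    c, d: maximum and minimum of the range of the remaining disease species.
    """
    total_width = (a - b) + (c - d)
    # Assumption (not in the source): use the midpoint of the gap when both
    # ranges collapse to a single value and the ratio is undefined.
    weight = (a - b) / total_width if total_width else 0.5
    if b > c:
        # First case: the screened range lies wholly above the remaining range.
        return b - weight * (b - c)
    if a < d:
        # Second case: the screened range lies wholly below the remaining range.
        return a + weight * (d - a)
    raise ValueError("ranges overlap, so the gene is not a screening gene")


print(significant_gene_threshold(0.9, 0.7, 0.5, 0.1))
print(significant_gene_threshold(0.3, 0.1, 0.9, 0.5))
```

In both cases the threshold falls inside the gap between the two ranges, and the gap is divided in proportion to the widths of the two ranges.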

In the above method, the system comparing the value of each significant gene with its threshold to obtain a final prediction score for each disease species comprises the following two cases:

in the first case, where the maximum value of the numerical range of the disease species that the significant gene can screen is smaller than the threshold, the final score of the disease species is the prediction score plus the regularized discrete value corresponding to the significant gene; otherwise, the significant gene is discarded;

in the second case, where the minimum value of the numerical range of the disease species that the significant gene can screen is larger than the threshold, the final score of the disease species is the prediction score plus the regularized discrete value corresponding to the significant gene; otherwise, the significant gene is discarded.
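One possible reading of the scoring step is sketched below in Python. The interpretation is an assumption: a gene whose screened range lies below its threshold contributes when the tested person's value is below the threshold, and a gene whose screened range lies above its threshold contributes when the value is above it. The function name `prediction_scores` and the dictionary keys `disease`, `threshold`, `weight` and `below` are hypothetical names chosen for illustration.

```python
def prediction_scores(sample, genes):
    """Accumulate a prediction score per disease species.

    sample: {gene name: measured value for the tested person}
    genes:  {gene name: {"disease": str, "threshold": float,
                         "weight": regularized discrete value,
                         "below": True if the screened range lies below the threshold}}
    """
    scores = {}
    for gene, info in genes.items():
        # Every disease species starts from an initial prediction score of 0.
        scores.setdefault(info["disease"], 0.0)
        value = sample.get(gene)
        if value is None:
            continue  # gene not measured for this person
        if info["below"]:
            hit = value < info["threshold"]
        else:
            hit = value > info["threshold"]
        if hit:
            scores[info["disease"]] += info["weight"]
        # Otherwise the gene is discarded and adds nothing to the score.
    return scores


genes = {
    "A1": {"disease": "A", "threshold": 0.4, "weight": 1.0, "below": True},
    "A2": {"disease": "A", "threshold": 0.6, "weight": 0.5, "below": False},
    "B1": {"disease": "B", "threshold": 0.5, "weight": 1.0, "below": False},
}
print(prediction_scores({"A1": 0.2, "A2": 0.7, "B1": 0.3}, genes))
```

A higher accumulated score for a disease species indicates that more of its significant genes, weighted by their regularized discrete values, fall on the disease side of their thresholds.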

The present invention also provides a disease prediction system using gene detection based on spatial separability, comprising:

a gene data input unit, used for inputting human genes and gene detection data;

a gene screening unit, used for determining, through the MAX function, the value range of each disease species corresponding to each gene input by the gene data input unit, and comparing whether the value ranges of the disease species of each gene overlap, so as to obtain the genes capable of screening a specific disease species;

a normalization processing unit, used for extracting the numerical values of each gene capable of screening a specific disease species and normalizing the row data of each such gene;

a discrete value calculation unit, used for calculating, from the row data normalized by the normalization processing unit, the mean value of the numerical range of each disease species corresponding to each gene capable of screening a specific disease species and the mean value of the numerical ranges of all disease species corresponding to the gene, and for calculating the expected distance between the mean value of the numerical ranges of all disease species corresponding to the gene and the mean value of the numerical range of the disease species that the gene can screen, which is the discrete value of the gene;

a significant gene acquisition unit, used for sorting the discrete values obtained by the discrete value calculation unit and removing the genes whose discrete values are significantly lower than 0.1 to obtain the significant genes;

a significant gene classification unit, used for classifying the significant genes obtained by the significant gene acquisition unit according to the disease species they can screen, placing the significant genes that screen the same disease species in the same class, and regularizing their discrete values to obtain regularized discrete values;

a significant gene threshold calculation unit, used for calculating, through basic Python command statements, the value range of the disease species that each significant gene can screen and the value range of the remaining disease species, and determining the threshold of each significant gene;

a disease prediction unit, used for comparing the numerical value of each significant gene with its threshold, starting from the preset initial prediction score of the disease species corresponding to each significant gene, to obtain the final prediction score of each disease species, and judging the probability of disease according to the prediction scores.

The invention also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the disease prediction method based on the spatial separability and utilizing the gene detection.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for predicting a disease using gene detection based on spatial separability as set forth in any one of the above.

The invention provides a significant gene extraction method based on spatial separability. Using the degree of dispersion of gene values across different disease species, it predicts the risk of various diseases for a tested person from that person's significant genes, so that the risks can be avoided in advance. By detecting only the significant genes, the invention achieves the effect of detecting tens of thousands of genes: a subset of significant genes replaces the full set of genes when predicting disease risk. This greatly reduces the number of genes to be detected, lowers the cost of gene detection, relieves the pressure on doctors, balances computational cost against accuracy, and to a certain extent promotes the development and popularization of gene detection.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic flow diagram of a method provided by the present invention;

FIG. 2 is a schematic diagram of a system according to the present invention;

FIG. 3 is a schematic block diagram of a computer device structure provided by the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted" and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

The invention is described in detail below with reference to specific embodiments and the accompanying drawings.

As shown in fig. 1, the present invention provides a disease prediction method using gene detection based on spatial separability, comprising the steps of:

S1, obtaining human genes and gene detection data;

S2, the system determines, through the MAX function, the value range of each disease species corresponding to each gene, and compares whether the value ranges of the disease species of each gene overlap, thereby obtaining the genes capable of screening a specific disease species.

The gene detection data are traversed gene by gene (the gene data are stored in the system as an Excel table, with one gene per row). The maximum value and the minimum value of each disease species corresponding to each gene are determined through the MAX function, which fixes each value range. The value ranges of the disease species corresponding to a single gene are then compared; if they do not overlap, the gene is a gene capable of screening a specific disease species. Specifically, in this embodiment, Python's built-in max function is used to find automatically, for each gene, the maximum value and the minimum value of the values of each disease species, which determines the value ranges. Once the value ranges are obtained, it is judged whether the value ranges of the disease species corresponding to a single gene overlap. Taking disease species A as an example: if the value range of disease species A overlaps the value range of any other disease species, the gene is dropped from subsequent calculation; if the value range of disease species A lies wholly below or wholly above the value ranges of all other disease species, the gene is considered useful for screening disease species A, that is, it is a gene capable of screening a specific disease species.
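The range comparison of step S2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `screenable_disease` and the dictionary layout (disease species mapped to the values of one gene) are hypothetical, and when more than one disease species is separable on the same gene the sketch simply returns the first one found, a choice the text does not specify.

```python
def screenable_disease(values_by_disease):
    """Return the disease species whose value range for this gene lies wholly
    below or wholly above the ranges of all other disease species, or None."""
    ranges = {d: (min(v), max(v)) for d, v in values_by_disease.items()}
    for disease, (low, high) in ranges.items():
        others = [r for d, r in ranges.items() if d != disease]
        if not others:
            continue  # nothing to compare against
        below_all = all(high < other_low for other_low, _ in others)
        above_all = all(low > other_high for _, other_high in others)
        if below_all or above_all:
            return disease
    return None


# Hypothetical values of two genes, grouped by disease species.
gene_a1 = {"A": [0.1, 0.2, 0.3], "B": [0.5, 0.9], "C": [0.6, 0.8]}
gene_x = {"A": [0.1, 0.6], "B": [0.5, 0.9], "C": [0.6, 0.8]}
print(screenable_disease(gene_a1))  # A is separated from B and C
print(screenable_disease(gene_x))   # every range overlaps another one
```

Strict inequalities are used, so ranges that merely touch at an endpoint are treated as overlapping.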

S3, the system extracts the numerical values of the genes capable of screening a specific disease species and normalizes the row data of each corresponding gene (the data of each gene occupy one row of the table) so that the values lie between 0 and 1. The normalization is performed as follows:

first, the maximum value and the minimum value of the row of data corresponding to a single gene are calculated through the max function; the minimum value is subtracted from the maximum value to obtain a difference; and the minimum value is then subtracted from each value in the row and the result divided by the difference to obtain the normalized values of the row data.
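The normalization of step S3 is ordinary min-max scaling and can be sketched in Python as follows. The function name `min_max_normalize` is hypothetical, and returning zeros for a constant row is an assumption added here, since the text does not say what happens when the maximum equals the minimum.

```python
def min_max_normalize(row):
    """Scale one gene's row of values into the interval [0, 1]."""
    low, high = min(row), max(row)
    difference = high - low
    if difference == 0:
        # Assumption (not in the source): a constant row maps to all zeros.
        return [0.0 for _ in row]
    return [(value - low) / difference for value in row]


print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```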

S4, from the normalized row data of the genes capable of screening a specific disease species, the system calculates, through basic Python command statements, the mean value of the numerical range of each disease species corresponding to each such gene and the mean value of the numerical ranges of all disease species corresponding to the gene. For example, suppose disease species A is readily screened by genes A1, A2 and so on, and the row of data corresponding to gene A1 contains, say, 24 values. Assuming 6 of the 24 values come from people with disease species A, those 6 values have a maximum and a minimum, and the corresponding mean value can be calculated.

The system then calculates the expected distance (the discrete value) between the mean value of the numerical ranges of all disease species corresponding to the gene and the mean value of the numerical range of the disease species that the gene can screen, giving the discrete value of each gene capable of screening a specific disease species; the larger the discrete value, the stronger the gene's ability to screen the disease species. For example, gene A1 can screen the specific disease species A: the mean value a of the data corresponding to disease species A is calculated, together with the mean values b, c and d of the data corresponding to disease species B, C and D. The average of the differences between a and each of b, c and d is then calculated; the larger this value, the stronger the gene's ability to screen the disease species.
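The discrete value of step S4 can be sketched in Python under one reading of the text: the average absolute difference between the mean of the screened disease species and the mean of each other disease species. The function name `discrete_value` and the use of the absolute difference are assumptions made for this illustration.

```python
from statistics import mean


def discrete_value(values_by_disease, target):
    """Average distance between the target disease mean and every other disease mean.

    values_by_disease: {disease species: normalized values of one gene}
    target: the disease species this gene can screen
    """
    target_mean = mean(values_by_disease[target])
    distances = [
        abs(target_mean - mean(values))
        for disease, values in values_by_disease.items()
        if disease != target
    ]
    return mean(distances)


# Hypothetical normalized values of gene A1: means are 0.1 (A), 0.6 (B), 0.9 (C).
gene_a1 = {"A": [0.0, 0.2], "B": [0.5, 0.7], "C": [0.8, 1.0]}
print(discrete_value(gene_a1, "A"))  # about 0.65, the average of 0.5 and 0.8
```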

S5, the system sorts the discrete values of the genes capable of screening a specific disease species and removes the genes whose discrete value is significantly lower than 0.1, to obtain the significant genes; the removed genes are too weak to discriminate the disease species.

The genes that remain after this removal are the significant genes, which have the greatest influence on the prediction result.
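A minimal sketch of the sorting and removal in step S5 follows. The text says "significantly lower than 0.1" without defining the margin, so the sketch assumes a plain cutoff at 0.1; the function name `select_significant` and the sample values are hypothetical.

```python
def select_significant(discrete_values, cutoff=0.1):
    """Sort genes by discrete value (largest first) and drop weak ones."""
    ranked = sorted(discrete_values.items(), key=lambda kv: kv[1], reverse=True)
    # Assumption: "significantly lower than 0.1" is treated as value < cutoff.
    return [(gene, value) for gene, value in ranked if value >= cutoff]


values = {"A1": 0.75, "A2": 0.05, "B1": 0.3}   # hypothetical discrete values
significant = select_significant(values)
print(significant)   # [('A1', 0.75), ('B1', 0.3)]
```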

S6, the system classifies the significant genes according to disease types which can be screened, classifies the significant genes which can be screened from the same disease type into the same type, and regularizes discrete values of the significant genes of the same type to obtain regularized discrete values. The specific way of regularization is:

the maximum of the discrete values of all the significant genes corresponding to the disease species is determined, and the discrete value of each significant gene is divided by this maximum to obtain the regularized discrete value. For example, for a certain disease A, the maximum of the discrete values of all the significant genes of disease A is determined, and then the discrete value of each significant gene of disease A is divided by this maximum to obtain the regularized discrete value.
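The per-disease regularization can be sketched as below. The nested-dictionary layout and the name `regularize_by_disease` are assumptions chosen for the example.

```python
def regularize_by_disease(significant):
    """significant maps disease -> {gene: discrete value}.
    Each value is divided by the largest value within its disease."""
    result = {}
    for disease, genes in significant.items():
        top = max(genes.values())
        result[disease] = {gene: value / top for gene, value in genes.items()}
    return result


grouped = {"A": {"A1": 0.8, "A2": 0.4}, "B": {"B1": 0.3}}   # hypothetical
weights = regularize_by_disease(grouped)
print(weights)   # {'A': {'A1': 1.0, 'A2': 0.5}, 'B': {'B1': 1.0}}
```

After this step the strongest gene of every disease has weight 1.0, so scores of different diseases are built from weights on the same scale.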

S7, the system calculates the value range of the disease species which can be screened by each significant gene and the value range of the remaining disease species through the basic command statement of python, and determines the threshold value of each significant gene. The method comprises the following specific steps:

the maximum value a and the minimum value b of the numerical range of the disease species that the significant gene can screen are calculated with basic Python statements. Then the maximum value c and the minimum value d of the numerical range of the remaining disease species are calculated (for example, if significant gene A1 can screen disease A, then diseases B and C are the remaining disease species for A1). The threshold of each significant gene is determined according to the following two cases:

in the first case, b > c, and the threshold of the significant gene is equal to b - (a - b)/((a - b) + (c - d)) × (b - c);

in the second case, a < d, and the threshold of the significant gene is equal to a + (a - b)/((a - b) + (c - d)) × (d - a).
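The two threshold formulas can be written as one small function. This is a sketch: the name `gene_threshold`, the fallback weight of 0.5 when both ranges have zero width, and the error raised for overlapping ranges are assumptions that the text does not specify.

```python
def gene_threshold(a, b, c, d):
    """a, b: max and min of the screenable disease's range.
    c, d: max and min of the remaining diseases' range."""
    total = (a - b) + (c - d)
    # Assumption: if both ranges collapse to a point, split the gap evenly.
    weight = (a - b) / total if total != 0 else 0.5
    if b > c:                      # first case: target range lies above
        return b - weight * (b - c)
    if a < d:                      # second case: target range lies below
        return a + weight * (d - a)
    raise ValueError("ranges overlap; the gene is not a screening gene")


upper = gene_threshold(a=0.9, b=0.7, c=0.5, d=0.1)   # 0.7 - (1/3) * 0.2
lower = gene_threshold(a=0.3, b=0.1, c=0.9, d=0.5)   # 0.3 + (1/3) * 0.2
print(upper, lower)
```

In both cases the threshold falls inside the gap between the two ranges, at a position set by the relative widths of the ranges.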

S8, an initial prediction score of 0 is set for the disease species corresponding to each significant gene; the system compares the value of each significant gene with its threshold to obtain the final score corresponding to each disease species, and the disease probability is judged from the final score: the higher the score, the higher the disease probability. The specific steps are as follows:

the present embodiment distinguishes the following two cases:

in the first case, the maximum value of the numerical range of the disease species that the significant gene can screen is smaller than the threshold. The patient is then very likely to have that disease species, so the final score of the disease species is the prediction score (initially 0) plus the regularized discrete value corresponding to the significant gene. Otherwise, the patient probably does not have the disease species that the significant gene can screen, and the calculation for that significant gene is discarded.

In the second case, the minimum value of the numerical range of the disease species that the significant gene can screen is larger than the threshold. The final score of the disease species is then the prediction score plus the regularized discrete value corresponding to the significant gene; otherwise, the subsequent calculation for that significant gene is discarded. After all the significant genes have been processed, the final score corresponding to each disease species is obtained, and the disease probability can be judged from the score.
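The scoring loop of step S8 can be sketched as follows. The sketch rests on an interpretation that the text does not state outright: the tested person's value for each significant gene is compared with that gene's threshold, and the regularized discrete value is added when the value falls on the same side of the threshold as the disease's own range. The record layout, the `side` field and the name `score_diseases` are hypothetical.

```python
def score_diseases(patient_values, genes):
    """genes: list of dicts with keys gene, disease, threshold, weight, side.
    side is 'below' if the disease's range lies under the threshold,
    'above' if it lies over the threshold (assumed encoding)."""
    scores = {}
    for g in genes:
        scores.setdefault(g["disease"], 0.0)   # initial prediction score of 0
        value = patient_values[g["gene"]]
        if g["side"] == "below":
            hit = value < g["threshold"]
        else:
            hit = value > g["threshold"]
        if hit:
            scores[g["disease"]] += g["weight"]  # add regularized discrete value
        # otherwise this gene is skipped for this patient
    return scores


genes = [
    {"gene": "A1", "disease": "A", "threshold": 0.6, "weight": 1.0, "side": "above"},
    {"gene": "A2", "disease": "A", "threshold": 0.4, "weight": 0.5, "side": "below"},
    {"gene": "B1", "disease": "B", "threshold": 0.5, "weight": 1.0, "side": "above"},
]
patient = {"A1": 0.8, "A2": 0.7, "B1": 0.2}    # hypothetical normalized values
scores = score_diseases(patient, genes)
print(scores)   # {'A': 1.0, 'B': 0.0}
```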

The embodiment provides a significant gene extraction method based on spatial separability. It uses the degree of dispersion of gene values across different disease species to predict, from the significant genes of the person tested, the risk of various diseases, so that the risks can be avoided in advance. By detecting only the significant genes, the invention can achieve the effect of detecting tens of thousands of genes: a subset of significant genes replaces the full set of genes in predicting disease risk. This greatly reduces the number of genes to be detected, lowers the cost of gene detection, relieves the pressure on doctors, balances computational cost against accuracy, and promotes the development and popularization of gene detection to a certain extent.

As shown in FIG. 2, the present invention also provides a disease prediction system using gene detection based on spatial separability, comprising:

gene data input unit: used for obtaining human genes and gene detection data;

a gene screening unit for screening a specific disease species: used for determining, with the max and min functions, the value range of each disease species corresponding to each gene, and comparing whether the value ranges of the disease species of each gene overlap, so as to obtain the genes capable of screening a specific disease species.

The gene detection data are traversed gene by gene. For each gene, the maximum value and the minimum value of each disease species are determined, which gives the value range of that disease species; whether the gene can screen a specific disease species is then decided by checking whether the value ranges of the disease species corresponding to that gene overlap. Specifically, in this embodiment, Python's built-in max and min functions are used to find the maximum and minimum of the values of each disease species corresponding to each gene, which determines the value ranges. If the value range of a disease species does not overlap the value ranges of the other disease species for that gene, the gene is determined to be capable of screening that disease species.
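The overlap test for a single gene can be sketched as below. Treating ranges that merely touch at an endpoint as overlapping is an assumption, and the name `separable_diseases` and the sample values are hypothetical.

```python
def separable_diseases(values_by_disease):
    """Return the diseases whose [min, max] range for this gene
    overlaps no other disease's range."""
    ranges = {d: (min(v), max(v)) for d, v in values_by_disease.items()}
    result = []
    for disease, (lo, hi) in ranges.items():
        overlaps = any(
            lo <= other_hi and other_lo <= hi   # closed-interval overlap test
            for other, (other_lo, other_hi) in ranges.items()
            if other != disease
        )
        if not overlaps:
            result.append(disease)
    return result


gene_row = {"A": [8, 9, 10], "B": [1, 3], "C": [2, 5]}   # hypothetical values
print(separable_diseases(gene_row))   # ['A']
```

Here the ranges of B and C overlap each other, so the gene can only screen disease A.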

A normalization processing unit: the method is used for extracting numerical values of genes capable of identifying specific disease species, and the corresponding line data of the genes are subjected to standardization processing, so that the range of the line data is 0-1.

The normalization process is specifically as follows:

the maximum value and the minimum value of the row data corresponding to a single gene are first calculated with the max and min functions; the minimum is subtracted from the maximum to obtain the difference; then the minimum is subtracted from each value in the row and the result is divided by the difference to obtain the normalized value of the row data.

A discrete value calculation unit: used for calculating, with basic Python statements and from the standardized row data of the genes capable of screening a specific disease species, the mean value of the numerical range of each disease species corresponding to each such gene and the mean value of the numerical ranges of all disease species corresponding to the gene; and then calculating the distance expectation (the discrete value) between the mean value of the numerical range of the disease species that the gene can screen and the mean values of the numerical ranges of the other disease species, to obtain the discrete value of each gene capable of screening a specific disease species. The larger the discrete value, the stronger the ability of the gene to discriminate the disease species.

A significant gene acquisition unit: used for sorting the discrete values of the genes capable of screening a specific disease species obtained by the discrete value calculation unit, and removing the genes whose discrete value is significantly lower than 0.1, to obtain the significant genes.

Significant gene classification unit: the method is used for classifying the significant genes obtained by the significant gene obtaining unit according to the disease species which can be screened, classifying the significant genes which can be screened from the same disease species into the same class, and regularizing the discrete values of the significant genes to obtain the regularized discrete values. The specific way of regularization is:

the maximum of the discrete values of all the significant genes corresponding to the disease species is determined, and the discrete value of each significant gene is divided by this maximum to obtain the regularized discrete value. For example, for a certain disease A, the maximum of the discrete values of all the significant genes of disease A is determined, and then the discrete value of each significant gene of disease A is divided by this maximum to obtain the regularized discrete value.

Significant gene threshold calculation unit: used for calculating, with basic Python statements, the numerical range of the disease species that each significant gene can screen and the numerical range of the remaining disease species, and determining the threshold of each significant gene. The specific steps are as follows:

the maximum value a and the minimum value b of the numerical range of the disease species that the significant gene can screen are calculated with basic Python statements. Then the maximum value c and the minimum value d of the numerical range of the remaining disease species are calculated, and the threshold of each significant gene is determined. Two cases are distinguished. In the first case, b > c, and the threshold is equal to b - (a - b)/((a - b) + (c - d)) × (b - c);

in the second case, a < d, and the threshold is equal to a + (a - b)/((a - b) + (c - d)) × (d - a).

A disease prediction unit: used for setting an initial prediction score of 0 for the disease species corresponding to each significant gene, comparing the value of each significant gene with its threshold to obtain the final prediction score corresponding to each disease species, and judging the disease probability according to the prediction score.

The present embodiment distinguishes the following two cases:

in the first case, the maximum value of the numerical range of the disease species that the significant gene can screen is smaller than the threshold. The patient is then very likely to have that disease species, so the regularized discrete value corresponding to the significant gene is added to the prediction score of the disease species. Otherwise, the patient probably does not have the disease species that the significant gene can screen, and the calculation for that significant gene is discarded.

In the second case, the minimum value of the numerical range of the disease species that the significant gene can screen is larger than the threshold. The regularized discrete value corresponding to the significant gene is then added to the prediction score of the disease species; otherwise, the subsequent calculation for that significant gene is discarded. After all the significant genes have been processed, the final prediction score corresponding to each disease species is obtained, and the disease probability can be judged from the score.

The method and system are described below by way of specific examples.

The method was verified on 24 cases provided by a biotechnology company in Nanjing, Jiangsu. Each of the 24 cases has data for 57821 genes; the cases cover 6 cancers, with 4 cases per cancer. The algorithm was first trained on the 24 cases, and in the first batch of 3 additional test cases the predicted cancer types were all correct. The first test batch was then added to the original 24 cases as available data, the algorithm was trained again, and in the second batch of 10 additional test cases the prediction was correct for 9. In the one wrong case, the prediction scores of two disease species were too close, which caused the error. Overall, the algorithm shows a certain effectiveness.

Table 1 below shows the data stored in tabular form, where columns B-F are the first case and columns G-M are the second case. Table 2 shows an intermediate result of the algorithm during prediction. The gene numbers and the gene names are in one-to-one correspondence. The discrete proportion is the weighted-average prediction score of each gene of a given patient for a given disease. The probability of disease is determined from the scores: for example, if two diseases are predicted in total, with a prediction score of 0.2 for the first and 0.1 for the second, the probability of the first disease is 0.2/(0.2 + 0.1) ≈ 0.667. The true color boundary is the maximum or minimum value in the training data for a given color, for example that of the first case.
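The probability in the example above is each disease's share of the total score. A minimal sketch follows; the name `disease_probabilities`, the disease labels and the all-zero fallback are assumptions.

```python
def disease_probabilities(scores):
    """Convert per-disease prediction scores into shares of the total."""
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no score at all, report zero for every disease.
        return {disease: 0.0 for disease in scores}
    return {disease: s / total for disease, s in scores.items()}


probs = disease_probabilities({"first": 0.2, "second": 0.1})
print(round(probs["first"], 3))    # 0.667
print(round(probs["second"], 3))   # 0.333
```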

TABLE 1 case data

TABLE 2 intermediate results of the operations

As shown in fig. 3, the present invention further provides a computer device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the disease prediction method using gene detection based on spatial separability in the above embodiments.

The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the disease prediction method using gene detection based on spatial separability of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant details can be found in the corresponding parts of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
