Fast Wrapper Gene Selection Method Based on Maximum Relevance and Minimum Redundancy

Document No.: 1773629    Published: 2019-12-03

Summary: This technology, "Fast wrapper gene selection method based on maximum relevance and minimum redundancy" (基于最大相关最小冗余的快速封装式基因选择方法), was devised by Yang Jing (杨静), Shen Anbo (沈安波), Fang Baofu (方宝富), and Wang Hao (王浩) on 2019-08-29. The invention discloses a fast wrapper gene selection method based on maximum relevance and minimum redundancy, with the following steps: 1. use a relevance method to find, in the gene vector group, the gene with the maximum relevance to the class label, and add it to the candidate gene subset; 2. use the maximum relevance minimum redundancy method to find, in the gene vector group, the gene with the maximum relevance-redundancy value, and add it to the candidate gene subset; 3. use ten-fold cross-validation to determine whether the classification accuracy dropped between the candidate gene subsets before and after the update; 4. if it dropped, output the candidate gene subset from before the update, otherwise repeat step 2. The invention obtains gene subsets of high quality while significantly reducing the time complexity of general wrapper methods, so that gene subsets are obtained with good time performance and have good classification performance.

1. A fast wrapper gene selection method based on maximum relevance and minimum redundancy, applied to a data set Data consisting of n microarray gene samples, denoted Data = {inst₁, inst₂, ..., inst_i, ..., inst_n}, where inst_i denotes the i-th microarray gene sample and consists of m genes together with a class variable C_i, for example cancer abnormal/normal. The j-th genes of the n microarray gene samples form the j-th gene vector of Data, denoted f_j. The m gene vectors form the gene vector group of Data, denoted F = {f₁, f₂, ..., f_j, ..., f_m}. The class variables of the n microarray gene samples form the class vector, denoted C = {C₁, C₂, ..., C_i, ..., C_n}. This yields the gene vector data set consisting of m gene vectors and one class vector, denoted D = {f₁, f₂, ..., f_j, ..., f_m, C}, with 1 ≤ i ≤ n and 1 ≤ j ≤ m. The method is characterized in that it proceeds by the following steps:

Step 1: Define a candidate gene subset S and initialize it as empty, S = ∅;

Step 2: From the gene vector group F and the class vector C, use the maximum relevance method to calculate and record, in turn, the relevance of each gene vector in F to C, giving the relevance set {I(f₁; C), I(f₂; C), ..., I(f_j; C), ..., I(f_m; C)}, where I(f_j; C) denotes the relevance of the j-th gene vector to the class vector C;

Step 3: Define a loop counter k and initialize k = 1;

Step 4: From the relevance set, find the gene vector with the maximum relevance to the class vector C, and denote it s_k;

Step 5: Add the gene vector s_k to the candidate gene subset S to obtain the k-th updated candidate gene subset S_k, and delete s_k from the gene vector group F to obtain the gene vector group F_k after the k-th deletion;

Step 6: Use the maximum relevance minimum redundancy method to calculate, for each gene vector in F_k, its relevance-redundancy value with respect to the gene vectors in S_k and the class vector C; the results form the relevance-redundancy value set;

Step 7: From the relevance-redundancy value set, find the gene vector that has the maximum relevance-redundancy value with respect to the gene vectors in S_k and the class vector C, and denote it s_{k+1};

Step 8: Add the gene vector s_{k+1} to S_k to obtain the (k+1)-th updated candidate gene subset S_{k+1}, and delete s_{k+1} from F_k to obtain the gene vector group F_{k+1} after the (k+1)-th deletion;

Step 9: Use cross-validation to determine whether the classification accuracy on the class vector C drops from S_k to S_{k+1}. If it drops, gene selection is complete and the k-th updated candidate gene subset S_k is output; otherwise, assign k+1 to k and return to step 6.

2. The fast wrapper gene selection method according to claim 1, characterized in that the maximum relevance method in step 2 proceeds by the following steps:

Step 2.1: Initialize j = 1;

Step 2.2: Use formula (1) to calculate and save the relevance I(f_j; C) between the j-th gene vector f_j in the gene vector group F and the class vector C:

I(f_j; C) = H(C) - H(C | f_j)    (1)

In formula (1), H(C) denotes the information entropy of the class vector C, and H(C | f_j) denotes the conditional information entropy of C given the j-th gene vector f_j;

Step 2.3: Assign j+1 to j and determine whether j ≤ m holds; if so, return to step 2.2; otherwise go to step 3.

3. The fast wrapper gene selection method according to claim 1, characterized in that the maximum relevance minimum redundancy method in step 6 proceeds by the following steps:

Step 6.1: Let |S_k| be the number of genes in the k-th updated candidate gene subset S_k, and |F_k| the number of genes in the gene vector group F_k after the k-th deletion;

Step 6.2: Initialize p = 1;

Step 6.3: Use formula (2) to calculate and save the relevance-redundancy value J_S(f_p), with respect to the class vector C, of the p-th gene vector f_p in F_k and the candidate gene subset S_k:

J_S(f_p) = I(f_p; C) - (1/|S_k|) Σ_{s_q ∈ S_k} I(f_p; s_q)    (2)

Step 6.4: Assign p+1 to p and determine whether p ≤ |F_k| holds; if so, return to step 6.3; otherwise go to step 7.

4. The fast wrapper gene selection method according to claim 1, characterized in that the cross-validation method in step 9 proceeds by the following steps:

Step 9.1: Project the microarray gene data set Data onto the k-th updated candidate gene subset S_k together with the class vector C, and onto the (k+1)-th updated candidate gene subset S_{k+1} together with C, to obtain the k-th reduced data set Data_k and the (k+1)-th reduced data set Data_{k+1};

Step 9.2: Randomly divide the microarray gene samples of Data_k and of Data_{k+1} into δ parts each. Take each part in turn as the test set and use the remaining δ-1 parts as the training set for the classifier, which gives δ classification accuracies for each data set. Their means are the k-th average classification accuracy, denoted Avg_k, and the (k+1)-th average classification accuracy, denoted Avg_{k+1};

Step 9.3: Determine whether Avg_k > Avg_{k+1} holds; if so, the accuracy has dropped; otherwise it has not.

Technical field

The invention belongs to the field of data mining and specifically relates to a fast wrapper gene selection method based on maximum relevance and minimum redundancy.

Background art

Gene selection is a data dimensionality reduction technique that is widely used in gene data analysis and genetic disease prediction. High-dimensional gene data may contain redundant and irrelevant genes, so training and predicting with all genes often yields a classifier that performs poorly in two respects: classification performance and time performance. An effective gene selection method reduces the dimensionality of the original gene space, improves the classifier's classification and time performance, and can also improve its generalization. In addition, gene selection helps researchers find a group of genes highly relevant to the classification task, which enhances the interpretability of the classification model. For example, in gene screening based on microarray data, feature selection finds the genes highly relevant to a particular requirement, which helps improve the accuracy of the corresponding prediction, and the candidate target genes it selects can effectively reduce the cost of finding biological targets.

A wrapper-based gene selection method applies the classification model within the feature selection process and uses the classifier's classification accuracy as the evaluation criterion for each generated gene subset. Wrapper-based methods therefore usually adapt well to the classifier and usually obtain gene subsets of higher quality. However, a general wrapper method performs a large number of classifier evaluations before it obtains a high-quality gene subset, and the resulting high time complexity seriously limits its range of application.

The main drawbacks of such methods are:

(1) During gene selection, the classifier's classification accuracy is used as the index for judging each gene subset, so a large number of classifier evaluations must be executed: every newly generated gene subset requires training and testing the classifier, which severely degrades the time performance of feature selection;

(2) The relationship between the size of the selected feature subset and its classification accuracy is not balanced well. For example, given two feature subsets with similar classification performance, the one with higher accuracy is preferred without regard to how many features each contains, so a subset with slightly lower accuracy but fewer features is discarded.

Summary of the invention

To overcome the shortcomings of the prior art, the invention proposes, for microarray gene expression data, a fast wrapper gene selection method based on maximum relevance and minimum redundancy. It aims to obtain gene subsets of high quality while significantly reducing the time complexity of general wrapper methods, so that gene subsets are obtained with good time performance and have good classification performance.

To solve the technical problem, the invention adopts the following technical solution:

The fast wrapper gene selection method based on maximum relevance and minimum redundancy of the invention is applied to a data set Data consisting of n microarray gene samples, denoted Data = {inst₁, inst₂, ..., inst_i, ..., inst_n}, where inst_i denotes the i-th microarray gene sample and consists of m genes together with a class variable C_i, for example cancer abnormal/normal. The j-th genes of the n microarray gene samples form the j-th gene vector of Data, denoted f_j. The m gene vectors form the gene vector group of Data, denoted F = {f₁, f₂, ..., f_j, ..., f_m}. The class variables of the n microarray gene samples form the class vector, denoted C = {C₁, C₂, ..., C_i, ..., C_n}. This yields the gene vector data set consisting of m gene vectors and one class vector, denoted D = {f₁, f₂, ..., f_j, ..., f_m, C}, with 1 ≤ i ≤ n and 1 ≤ j ≤ m. The method is characterized in that it proceeds by the following steps:

Step 1: Define a candidate gene subset S and initialize it as empty, S = ∅;

Step 2: From the gene vector group F and the class vector C, use the maximum relevance method to calculate and record, in turn, the relevance of each gene vector in F to C, giving the relevance set {I(f₁; C), I(f₂; C), ..., I(f_j; C), ..., I(f_m; C)}, where I(f_j; C) denotes the relevance of the j-th gene vector to the class vector C;

Step 3: Define a loop counter k and initialize k = 1;

Step 4: From the relevance set, find the gene vector with the maximum relevance to the class vector C, and denote it s_k;

Step 5: Add the gene vector s_k to the candidate gene subset S to obtain the k-th updated candidate gene subset S_k, and delete s_k from the gene vector group F to obtain the gene vector group F_k after the k-th deletion;

Step 6: Use the maximum relevance minimum redundancy method to calculate, for each gene vector in F_k, its relevance-redundancy value with respect to the gene vectors in S_k and the class vector C; the results form the relevance-redundancy value set;

Step 7: From the relevance-redundancy value set, find the gene vector that has the maximum relevance-redundancy value with respect to the gene vectors in S_k and the class vector C, and denote it s_{k+1};

Step 8: Add the gene vector s_{k+1} to S_k to obtain the (k+1)-th updated candidate gene subset S_{k+1}, and delete s_{k+1} from F_k to obtain the gene vector group F_{k+1} after the (k+1)-th deletion;

Step 9: Use cross-validation to determine whether the classification accuracy on the class vector C drops from S_k to S_{k+1}. If it drops, gene selection is complete and the k-th updated candidate gene subset S_k is output; otherwise, assign k+1 to k and return to step 6.
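Steps 1 to 9 can be sketched as a single greedy loop. The following is a minimal illustration, not the patented implementation: the names `fast_wrapper_select` and `evaluate` are hypothetical, gene values are assumed to be discretized already, the relevance-redundancy value is assumed to be relevance minus mean redundancy, and the loop also stops when no genes remain (a case the steps do not address).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), which equals H(Y) - H(Y | X)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def fast_wrapper_select(genes, labels, evaluate):
    """genes: list of m discretized gene vectors; labels: class vector C.

    evaluate(indices) is a caller-supplied (hypothetical) function that
    returns the classification accuracy of the subset with those indices.
    """
    remaining = list(range(len(genes)))                       # F, as indices
    relevance = {j: mutual_information(genes[j], labels)      # step 2
                 for j in remaining}
    first = max(remaining, key=relevance.get)                 # step 4
    selected = [first]                                        # step 5: S_1
    remaining.remove(first)
    accuracy = evaluate(selected)
    while remaining:
        def score(j):                                         # step 6 (assumed form)
            redundancy = sum(mutual_information(genes[j], genes[s])
                             for s in selected)
            return relevance[j] - redundancy / len(selected)
        best = max(remaining, key=score)                      # step 7
        candidate = selected + [best]                         # step 8: S_{k+1}
        new_accuracy = evaluate(candidate)
        if accuracy > new_accuracy:                           # step 9: accuracy dropped
            break
        selected, accuracy = candidate, new_accuracy
        remaining.remove(best)
    return selected
```

In practice `evaluate` would be the δ-fold cross-validation accuracy of step 9. The sketch caches the accuracy of S_k from the previous round, so each round trains the classifier for one new subset only; that caching is a design choice of this sketch.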

The fast wrapper gene selection method of the invention is further characterized in that the maximum relevance method in step 2 proceeds by the following steps:

Step 2.1: Initialize j = 1;

Step 2.2: Use formula (1) to calculate and save the relevance I(f_j; C) between the j-th gene vector f_j in the gene vector group F and the class vector C:

I(f_j; C) = H(C) - H(C | f_j)    (1)

In formula (1), H(C) denotes the information entropy of the class vector C, and H(C | f_j) denotes the conditional information entropy of C given the j-th gene vector f_j;

Step 2.3: Assign j+1 to j and determine whether j ≤ m holds; if so, return to step 2.2; otherwise go to step 3.
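Steps 2.1 to 2.3 amount to evaluating formula (1) once per gene vector. A minimal sketch follows, assuming the expression values have already been discretized (the text does not say how); the function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """H(target | given): entropy of target inside each group of equal
    `given` values, weighted by the relative size of the group."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


def mutual_information(gene, labels):
    """Formula (1): I(f_j; C) = H(C) - H(C | f_j)."""
    return entropy(labels) - conditional_entropy(labels, gene)


def relevance_set(genes, labels):
    """Steps 2.1-2.3: relevance of every gene vector f_1..f_m to C."""
    return [mutual_information(f, labels) for f in genes]
```

For example, `relevance_set([[1, 1, 0, 0], [0, 1, 0, 1]], ["a", "a", "n", "n"])` gives a relevance of 1 bit for the first gene, which determines the class exactly, and 0 for the second, which is independent of it.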

The maximum relevance minimum redundancy method in step 6 proceeds by the following steps:

Step 6.1: Let |S_k| be the number of genes in the k-th updated candidate gene subset S_k, and |F_k| the number of genes in the gene vector group F_k after the k-th deletion;

Step 6.2: Initialize p = 1;

Step 6.3: Use formula (2) to calculate and save the relevance-redundancy value J_S(f_p), with respect to the class vector C, of the p-th gene vector f_p in F_k and the candidate gene subset S_k:

J_S(f_p) = I(f_p; C) - (1/|S_k|) Σ_{s_q ∈ S_k} I(f_p; s_q)    (2)

Step 6.4: Assign p+1 to p and determine whether p ≤ |F_k| holds; if so, return to step 6.3; otherwise go to step 7.
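The scoring loop of steps 6.2 to 6.4, followed by the selection in step 7, can be sketched as below. This assumes that formula (2) has the standard mRMR difference form, relevance to C minus the mean mutual information with the genes already in S_k, and that gene values are discretized; `mrmr_score` and `best_candidate` are hypothetical names.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), equal to H(Y) - H(Y | X)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mrmr_score(candidate, selected, labels):
    """Assumed form of formula (2): relevance minus mean redundancy with S_k."""
    relevance = mutual_information(candidate, labels)
    redundancy = sum(mutual_information(candidate, s) for s in selected)
    return relevance - redundancy / len(selected)


def best_candidate(remaining, selected, labels):
    """Steps 6.2-7: score every gene vector in F_k, return the index of the
    maximum score together with all scores."""
    scores = [mrmr_score(f, selected, labels) for f in remaining]
    return max(range(len(remaining)), key=scores.__getitem__), scores
```

A gene that duplicates one already in S_k is penalized by its full entropy, so it scores low even when it is individually relevant; this is how redundancy is handled implicitly, without a separate redundant-gene test.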

The cross-validation method in step 9 proceeds by the following steps:

Step 9.1: Project the microarray gene data set Data onto the k-th updated candidate gene subset S_k together with the class vector C, and onto the (k+1)-th updated candidate gene subset S_{k+1} together with C, to obtain the k-th reduced data set Data_k and the (k+1)-th reduced data set Data_{k+1};

Step 9.2: Randomly divide the microarray gene samples of Data_k and of Data_{k+1} into δ parts each. Take each part in turn as the test set and use the remaining δ-1 parts as the training set for the classifier, which gives δ classification accuracies for each data set. Their means are the k-th average classification accuracy, denoted Avg_k, and the (k+1)-th average classification accuracy, denoted Avg_{k+1};

Step 9.3: Determine whether Avg_k > Avg_{k+1} holds; if so, the accuracy has dropped; otherwise it has not.
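Steps 9.1 to 9.3 can be sketched with the standard library alone. The text does not name a classifier, so a 1-nearest-neighbor rule is used here purely as a stand-in; all function names are hypothetical, samples are assumed to be rows of numeric gene values, and the number of samples is assumed to be at least δ so that no fold is empty.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier (an assumption): label of the closest training sample."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]


def project(samples, gene_indices):
    """Step 9.1: keep only the selected genes of every sample."""
    return [[row[j] for j in gene_indices] for row in samples]


def cv_accuracy(samples, labels, delta=10, seed=0):
    """Step 9.2: mean accuracy over delta random folds."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::delta] for i in range(delta)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        hits = sum(nearest_neighbor_predict(train_x, train_y, samples[i]) == labels[i]
                   for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / delta


def precision_reduced(samples, labels, subset_k, subset_k1, delta=10, seed=0):
    """Step 9.3: True when Avg_k > Avg_{k+1}."""
    avg_k = cv_accuracy(project(samples, subset_k), labels, delta, seed)
    avg_k1 = cv_accuracy(project(samples, subset_k1), labels, delta, seed)
    return avg_k > avg_k1
```

Using the same `seed` for both reduced data sets means S_k and S_{k+1} are compared on identical folds. That is a choice made in this sketch to reduce noise in the comparison; the text only requires a random division into δ parts.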

Compared with the prior art, the beneficial effects of the invention are as follows:

1. The fast wrapper gene selection method of the invention is based on the maximum relevance minimum redundancy method and can quickly calculate the relevance and redundancy between genes. On the one hand, the method does not explicitly define relevant genes and redundant genes; instead it folds the measures of relevance and redundancy into the relevance-redundancy value, which expresses both relationships implicitly. It thereby excludes highly redundant genes, whose classification information about the target variable is already contained in the selected genes. On the other hand, a heuristic gene subset search significantly reduces the size of the search space and therefore the number of gene subsets the classifier has to evaluate, which significantly speeds up wrapper gene selection.

2. The proposed method is in essence a wrapper gene selection method combined with a filter method. Through a heuristic gene subset search strategy it combines the speed of filter methods with the accuracy of wrapper methods and quickly obtains a high-quality gene subset.

3. In each round, the invention adds the newly generated best gene vector to the set of selected genes and evaluates it immediately, instead of evaluating genes one by one after all of them have been scored. This significantly improves the time efficiency of the filtering step and gives a better balance between classification accuracy and gene subset size.

4. The invention evaluates gene vectors with the maximum relevance minimum redundancy method, so it can capture both linear and nonlinear relationships between variables. The method is therefore suitable for analyzing many kinds of gene data and helps researchers find genes relevant to their research goal and better understand the object of study.

Detailed description of embodiments

In this embodiment, a fast wrapper gene selection method based on maximum relevance and minimum redundancy is applied to a data set Data consisting of n microarray gene samples, denoted Data = {inst₁, inst₂, ..., inst_i, ..., inst_n}, where inst_i denotes the i-th microarray gene sample and consists of m genes together with a class variable C_i, for example abnormal/normal. The j-th genes of the n microarray gene samples form the j-th gene vector of Data, denoted f_j. The m gene vectors form the gene vector group of Data, denoted F = {f₁, f₂, ..., f_j, ..., f_m}. The class variables of the n microarray gene samples form the class vector, denoted C = {C₁, C₂, ..., C_i, ..., C_n}. This yields the gene vector data set consisting of m gene vectors and one class vector, denoted D = {f₁, f₂, ..., f_j, ..., f_m, C}, with 1 ≤ i ≤ n and 1 ≤ j ≤ m. The method proceeds by the following steps:

Step 1: Define a candidate gene subset S and initialize it as empty, S = ∅;

Specifically, using the entropy concepts of information theory, the relevance of each gene vector in the gene vector group F to the class vector C is calculated in turn and saved in the relevance set.

Step 2.1: Initialize j = 1;

Step 2.2: Use formula (1) to calculate and save the relevance I(f_j; C) between the j-th gene vector f_j in the gene vector group F and the class vector C:

I(f_j; C) = H(C) - H(C | f_j)    (1)

In formula (1), H(C) denotes the information entropy of the class vector C, that is, the degree of uncertainty of C, and H(C | f_j) denotes the conditional information entropy of C given the j-th gene vector f_j. The advantage of information entropy is that it reflects both linear and nonlinear correlation between variables. For the details of computing the entropy, see the description in the reference "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy". The relevance I(f_j; C), also called the mutual information, quantifies how strongly the j-th gene vector f_j is related to the class vector C: the larger the value, the more classification information about C is contained in f_j.

Step 2.3: Assign j+1 to j and determine whether j ≤ m holds; if so, return to step 2.2; otherwise go to step 3.
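The statement that information entropy reflects nonlinear as well as linear correlation can be illustrated with a small made-up example. Here a hypothetical discretized gene takes the values -1, 0, 1 and the class is abnormal (1) whenever the expression is extreme, so the class is a function of the gene but not a linear one. The Pearson coefficient is included only for contrast and is not part of the method.

```python
from collections import Counter
from math import log2, sqrt


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(gene, labels):
    """Formula (1): H(C) - H(C | f_j), using H(C | f_j) = H(C, f_j) - H(f_j)."""
    joint = entropy(list(zip(gene, labels)))
    return entropy(labels) - (joint - entropy(gene))


def pearson(x, y):
    """Pearson linear correlation coefficient, for comparison only."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


gene = [-1, 0, 1, -1, 0, 1]     # hypothetical discretized expression levels
labels = [1, 0, 1, 1, 0, 1]     # class equals gene ** 2, a nonlinear dependence

print("Pearson correlation:", pearson(gene, labels))
print("Mutual information :", mutual_information(gene, labels))
```

The linear correlation is zero, while the mutual information equals H(C), about 0.918 bits, because knowing the gene value removes all uncertainty about the class. A relevance measure based on linear correlation would rank this gene as uninformative.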

Step 3: Define a loop counter k and initialize k = 1;

Step 4: From the relevance set, find the gene vector with the maximum relevance to the class vector C, and denote it s_k;

Step 6.1: Let |S_k| be the number of genes in the k-th updated candidate gene subset S_k, and |F_k| the number of genes in the gene vector group F_k after the k-th deletion;

Step 6.2: Initialize p = 1;

In formula (2), I(f_p; C) denotes the relevance of the p-th gene vector f_p in the gene vector group F_k to the class vector C, and the subtracted term, (1/|S_k|) Σ_{s_q ∈ S_k} I(f_p; s_q), denotes the redundancy between f_p and the candidate gene subset S_k. The difference J_S(f_p) between relevance and redundancy is therefore a combined measure of the relevance and redundancy of f_p: the larger the value, the more classification information about C is contained in f_p and the less redundant information it carries.

Step 6.4: Assign p+1 to p and determine whether p ≤ |F_k| holds; if so, return to step 6.3; otherwise go to step 7.

Step 7: From the relevance-redundancy value set, find the gene vector that has the maximum relevance-redundancy value with respect to the gene vectors in S_k and the class vector C, and denote it s_{k+1};

Step 9.2: Randomly divide the microarray gene samples of the k-th reduced data set Data_k and of the (k+1)-th reduced data set Data_{k+1} into δ parts each (δ ≥ 2; in practice δ is usually 5 or 10). Take each part in turn as the test set and use the remaining δ-1 parts as the training set for the classifier, which gives δ classification accuracies for each data set. Their means are the k-th average classification accuracy, denoted Avg_k, and the (k+1)-th average classification accuracy, denoted Avg_{k+1}. Cross-validation makes the resulting accuracy more objective, that is, it better reflects the classification performance of the feature subset being evaluated.

Step 9.3: Determine whether Avg_k > Avg_{k+1} holds; if so, the accuracy has dropped; otherwise it has not.
