Efficient whole genome selection method capable of realizing parallel operation and high accuracy

文档序号:1650364 发布日期:2019-12-24 浏览:39次 中文

阅读说明:本技术 一种高效可并行运算且高准确性的全基因组选择方法 (Efficient whole genome selection method capable of realizing parallel operation and high accuracy ) 是由 赵书红 尹立林 刘小磊 李新云 余梅 朱猛进 唐振双 许婧雅 殷东 于 2019-09-11 设计创作,主要内容包括:本发明涉及动植物育种及人类疾病预测技术领域,提供一种高效可并行运算且高准确性的全基因组选择方法。首先读取原始基因型文件和表型文件,构建新的基因型文件和表型文件,计算所有个体间的亲缘关系矩阵;然后提取新的表型文件中所有个体作为参考群,提取原始基因型文件中无表型数据的所有个体作为预测群;接着利用参考群数据进行全基因组关联分析,提取全基因组关联分析的结果特征;构建性状特异的模型库,采用交叉验证策略,依次优化最佳固定效应、最佳随机效应,从模型库中选取最优预测模型;最后,利用最优预测模型,计算预测群的基因组估计育种值。本发明能够快速准确且稳定地预测出个体基因组育种值,提升全基因组选择的准确性及效率。(The invention relates to the technical field of animal and plant breeding and human disease prediction, and provides a high-efficiency parallel operation and high-accuracy whole genome selection method. Reading an original genotype file and a phenotype file, constructing a new genotype file and a new phenotype file, and calculating a genetic relationship matrix among all individuals; then extracting all individuals in the new phenotype file as a reference group, and extracting all individuals without phenotype data in the original genotype file as a prediction group; then, performing whole genome association analysis by using the reference group data, and extracting the result characteristics of the whole genome association analysis; constructing a model base with specific characters, sequentially optimizing an optimal fixed effect and an optimal random effect by adopting a cross validation strategy, and selecting an optimal prediction model from the model base; and finally, calculating the genome estimated breeding value of the prediction group by using the optimal prediction model. The method can quickly, accurately and stably predict the individual genome breeding value and improve the accuracy and efficiency of whole genome selection.)

1. A high-efficiency parallel operation and high-accuracy whole genome selection method is characterized by comprising the following steps:

step 1: reading an original genotype file and an original phenotype file, extracting genotype data and phenotype data of the same individuals in the original genotype file and the original phenotype file to form a new genotype file and a new phenotype file, and calculating an affinity relationship matrix G between all individuals by using the new genotype file;

step 2: extracting all individuals in the new phenotype file as a reference group, extracting all individuals without phenotype data in the original genotype file as a prediction group to obtain reference group data and prediction group data, and randomly dividing the reference group into M sub-reference groups with the same scale; wherein the reference population data comprises genotype data and phenotype data for each individual in the reference population and the prediction population data comprises genotype data for each individual in the prediction population;

and step 3: performing whole genome association analysis by using the reference group data, and extracting the result characteristics of the whole genome association analysis; constructing a model base with specific characters, sequentially optimizing an optimal fixed effect and an optimal random effect by adopting a cross validation strategy, and selecting an optimal prediction model from the model base;

and 4, step 4: calculating a genomic estimated breeding value for the prediction population using the optimal prediction model.

2. The method for efficient parallel-operable and highly accurate genome wide selection according to claim 1, wherein the step 3 comprises the steps of:

step 3.1: performing correlation coefficient calculation in parallel using the L threads; wherein performing correlation coefficient calculations using the L ∈ {1, 2.,. L } threads comprises:

step 3.1.1: randomly selecting M-1 sub-reference groups to synthesize a test group, taking the unselected sub-reference groups as verification groups, performing whole genome association analysis by using the test groups, and extracting significant P values of all loci in the genotype as { P by using a selected model1l,P2l,...,Pnl,...,PNl}; wherein N is the total number of loci in the genotype, PnlCalculating a significant P value for the nth site in the genotype for the lth thread, N ∈ {1, 2.., N };

step 3.1.2: dividing all the sites by using a window with a preset size according to the distribution sequence on the genome to obtain x sites in the window; sequencing the sites in each window from small to large according to the significant P value, and selecting the site with the maximum significant P value in each window to form a site set X;

step 3.1.3: calculation of the genome estimated breeding value GEBV of the validation population Using the test population and GBLUP model1,GEBV1Calculating GEBV only containing random effect part1Correlation coefficient C with the true phenotype of the validation population0l(ii) a Testing a fixed effect model FLM by using the site set X, sequentially taking out sites one by one without putting back the sites from the site set X, adding the sites into the FLM as covariates, and calculating a genome estimated breeding value GEBV of the verification group2,GEBV2Containing only fixed effect parts, and calculating GEBV2The set of correlation coefficients between true phenotypes of the verification clusters is { Cf1l,Cf2l,...,Cfil,...,Cfxl}; testing a mixed effect model MLM by using the locus set X, sequentially taking out the loci from the locus set X one by one without putting back the loci, adding the loci into the MLM as covariates, and calculating a genome estimated breeding value GEBV of a verification group3,GEBV3Including two parts of fixed effect and random effect, and calculating GEBV3The set of correlation coefficients between true phenotypes of the verification clusters is { Cm1l,Cm2l,...,Cmil,...,Cmxl}; wherein i ∈ {1,2,..., x };

step 3.1.4: if Cfil>Cf,i-1,lAnd Cfil>C0lIf the ith site in the site set X is an FLM effective site, the window corresponding to the ith site is an FLM effective window, and an FLM effective window set F is obtainedl(ii) a If Cmil>Cm,i-1,lAnd Cmil>C0lIf the ith site in the site set X is an MLM effective site, the window corresponding to the ith site is an MLM effective window, and an MLM effective window set M is obtainedl

Step 3.2: meterCalculating { C01,C02,...,C0l,...,C0LMean value ofCalculation of { Cfi1,Cfi2,...,Cfil,...,CfiLMean value of{Cmi1,Cmi2,...,Cmil,...,CmiLMean value ofi belongs to {1, 2.,. x }, and a first set of mean values is obtainedThe second set of mean values isCalculating the maximum value of the elements in the first set of mean values asMaximum of the elements in the second set of means isIf it isSelecting the optimal prediction model as FLM; if it isSelecting an optimal prediction model as an MLM; if it isAnd isSelecting an optimal prediction model as GBLUP;

step 3.3: if the optimal prediction model is FLM, the L FLM valid window sets { F1,F2,...,Fl,...,FLCounting windows in the FLM, and selecting an FLM effective window with the occurrence frequency of more than or equal to L multiplied by 95 percent as a final selection FLM effective window; if the optimal prediction model is MLM, the L MLM valid window sets { M1,M2,...,Ml,...,MLCounting windows in the window selection method, and selecting an MLM effective window with the occurrence frequency of which is more than or equal to L multiplied by 95 percent as a final selected MLM effective window;

step 3.4: calculating { Pn1,Pn2,...,Pnl,...,PnLThe assigned value of { is used as the final associated P value of the nth siteThe final correlation P value of all the sites is obtained asSelecting a site with the maximum final associated P value from a final FLM effective window as an FLM optimal covariate site or selecting a site with the maximum final associated P value from a final selected MLM effective window as an MLM optimal covariate site;

step 3.5: performing gradient lower relation number calculation in parallel by using L threads; wherein performing gradient-wise correlation computation using the L ∈ {1, 2.., L } threads comprises:

step 3.5.1: if the optimal prediction model is GBLUP or MLM, initializing a diagonal weight matrix W of NxN to diag (W) based on Vanraden algorithm1,w2,...,wN) Diag (1, 1.., 1); significant P values { P ] for all sites obtained in step 3.1.11l,P2l,...,Pnl,...,PNlSorting according to the sequence from small to large to obtain a sorted significant P value sequence (P)1l',P2l',...,Pnl',...,PNl' } setting the weight corresponding to the element of the first alpha% in the sorted significant P value sequenceSetting the magnification factor to be the magnification factor, keeping the weight corresponding to the element of the back (1-alpha%) unchanged and still being 1, and obtaining a new diagonal weight matrix W'; wherein n1 gradients are set to α as { α12,...,αp,...,αn1And setting an amplification function to be logβP, setting n2 gradients [ beta ] for beta12,...,βq,...,βn2Gradient betaqLower front alphap% of the elements corresponds to a magnification ofCalculating a new genetic relationship matrix as T by combining a Vanraden algorithm; if the optimal prediction model is MLM, adding the MLM optimal covariate locus into the MLM model;

step 3.5.2: calculating the genome estimated breeding value GEBV of the validation population in step 3.1.1 using the matrix T and the test population in step 3.1.14And calculating the gradient { alpha }pqLower GEBV4Correlation coefficient C with the true phenotype of the validation population in step 3.1.1pqlThen, the correlation coefficient under all gradients is obtained as { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,l};

Step 3.6: calculation of { Cpq1,Cpq2,...,Cpql,...,CpqLMean value ofGet a third set of mean values ofCalculate the maximum value in the third set of mean values asIf it isOptimizing the genetic relationship matrix G is not needed; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of the elements is less than L multiplied by 95 percent, the genetic relationship matrix G does not need to be optimized; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of elements is greater than or equal to L x 95%, the genetic relationship matrix G needs to be optimized and the optimal pre-percentage is alphapPercent, the optimum amplification function is

Step 3.7: if the genetic relationship matrix G needs to be optimized, the optimal pre-percentage is alphapPercent, the optimum amplification function isThen will beSequencing the final correlation P values from small to large to obtain a new final correlation P value sequenceFor new final associated P value sequenceMiddle front alphap% of the elements the weight corresponding to the kth element is set to the magnificationRear (1-alpha)p%) are not changed and are kept as 1, a diagonal weight matrix is formed, and an optimal character specific inter-individual genetic relationship matrix G' is calculated by combining a Vanraden algorithm.

3. The method for efficient parallel-operable and highly accurate genome wide selection according to claim 2, wherein in the step 3.1.1, the selected model is GBLUP model or fixed-effect model FLM or mixed-effect model MLM.

4. The method for efficient parallel-operable and highly accurate genome wide selection according to claim 2, wherein in the step 3.4, the specified value is a maximum value or a minimum value or a mean value or a median value.

5. The method for efficient parallel-operable and highly accurate genome wide selection according to claim 2, wherein the step 4 comprises the steps of:

step 4.1: if the optimal prediction model is FLM, calculating a genome estimated breeding value of the prediction population by using the FLM optimal covariate locus and the reference population;

step 4.2: if the optimal prediction model is GBLUP and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G and the reference group;

step 4.3: if the optimal prediction model is GBLUP and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G' between individuals with specific optimal traits and the reference group;

step 4.4: if the optimal prediction model is MLM and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the MLM optimal covariate locus, the genetic relationship matrix G and the reference group;

step 4.5: and if the optimal prediction model is MLM and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction population by using the MLM optimal covariate locus, the optimal trait specific inter-individual genetic relationship matrix G' and the reference population.

Technical Field

The invention relates to the technical field of animal and plant breeding and human disease prediction, in particular to a high-efficiency parallel operation and high-accuracy whole genome selection method.

Background

With the development of high-density Single Nucleotide Polymorphism (SNP) genotyping technology covering the whole genome, whole genome selection (prediction) is widely applied to genetic value (species value) prediction and evaluation of complex traits in plant and animal breeding and human genetics research as a powerful tool for genome statistical analysis.

Existing whole genome selection methods fall into two categories: a direct method taking best linear unbiased prediction GBLUP (genomic linear unbiased prediction) of a whole genome as a representative method only needs to construct a genome relation matrix among individuals, and an individual breeding value can be obtained by solving a mixed model after a variance component is obtained; another indirect method represented by Bayes B combines Bayes theory and hidden Markov iteration process to obtain marking effect value, and then accumulates the marking effect according to individual genotype to obtain individual breeding value. The direct method has high calculation efficiency, but because the assumption of the trait genetic construction is simple, the estimated breeding value has poor accuracy; the indirect method has relatively complex hypothesis on the character inheritance construction, better accords with a character inheritance mechanism, and has better prediction accuracy, but as the hypothesis introduces a plurality of unknown parameters, the parameter solving process is extremely complex, the calculation efficiency is poor, and the application of the indirect method in actual prediction is limited.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides the high-efficiency parallel operation and high-accuracy whole genome selection method, which can quickly, accurately and stably predict the individual genome breeding value and improve the accuracy and the calculation efficiency of whole genome selection.

The technical scheme of the invention is as follows:

a high-efficiency parallel operation and high-accuracy whole genome selection method is characterized by comprising the following steps:

step 1: reading an original genotype file and an original phenotype file, extracting genotype data and phenotype data of the same individuals in the original genotype file and the original phenotype file to form a new genotype file and a new phenotype file, and calculating an affinity relationship matrix G between all individuals by using the new genotype file;

step 2: extracting all individuals in the new phenotype file as a reference group, extracting all individuals without phenotype data in the original genotype file as a prediction group to obtain reference group data and prediction group data, and randomly dividing the reference group into M sub-reference groups with the same scale; wherein the reference population data comprises genotype data and phenotype data for each individual in the reference population and the prediction population data comprises genotype data for each individual in the prediction population;

and step 3: performing whole genome association analysis by using the reference group data, and extracting the result characteristics of the whole genome association analysis; constructing a model base with specific characters, sequentially optimizing an optimal fixed effect and an optimal random effect by adopting a cross validation strategy, and selecting an optimal prediction model from the model base;

and 4, step 4: calculating a genomic estimated breeding value for the prediction population using the optimal prediction model.

The step 3 comprises the following steps:

step 3.1: performing correlation coefficient calculation in parallel using the L threads; wherein performing correlation coefficient calculations using the L ∈ {1, 2.,. L } threads comprises:

step 3.1.1:randomly selecting M-1 sub-reference groups to synthesize a test group, taking the unselected sub-reference groups as verification groups, performing whole genome association analysis by using the test groups, and extracting significant P values of all loci in the genotype as { P by using a selected model1l,P2l,...,Pnl,...,PNl}; wherein N is the total number of loci in the genotype, PnlCalculating a significant P value for the nth site in the genotype for the lth thread, N ∈ {1, 2.., N };

step 3.1.2: dividing all the sites by using a window with a preset size according to the distribution sequence on the genome to obtain x sites in the window; sequencing the sites in each window from small to large according to the significant P value, and selecting the site with the maximum significant P value in each window to form a site set X;

step 3.1.3: calculation of the genome estimated breeding value GEBV of the validation population Using the test population and GBLUP model1,GEBV1Calculating GEBV only containing random effect part1Correlation coefficient C with the true phenotype of the validation population0l(ii) a Testing a fixed effect model FLM by using the site set X, sequentially taking out sites one by one without putting back the sites from the site set X, adding the sites into the FLM as covariates, and calculating a genome estimated breeding value GEBV of the verification group2,GEBV2Containing only fixed effect parts, and calculating GEBV2The set of correlation coefficients between true phenotypes of the verification clusters is { Cf1l,Cf2l,...,Cfil,...,Cfxl}; testing a mixed effect model MLM by using the locus set X, sequentially taking out the loci from the locus set X one by one without putting back the loci, adding the loci into the MLM as covariates, and calculating a genome estimated breeding value GEBV of a verification group3,GEBV3Including two parts of fixed effect and random effect, and calculating GEBV3The set of correlation coefficients between true phenotypes of the verification clusters is { Cm1l,Cm2l,...,Cmil,...,Cmxl}; wherein i ∈ {1,2,..., x };

step 3.1.4: if Cfil>Cf,i-1,lAnd Cfil>C0lIf the ith site in the site set X is the FLM effective site, the ith site corresponds toThe window of (A) is an FLM effective window to obtain an FLM effective window set Fl(ii) a If Cmil>Cm,i-1,lAnd Cmil>C0lIf the ith site in the site set X is an MLM effective site, the window corresponding to the ith site is an MLM effective window, and an MLM effective window set M is obtainedl

Step 3.2: calculation of { C01,C02,...,C0l,...,C0LMean value ofCalculation of { Cfi1,Cfi2,...,Cfil,...,CfiLMean value of{Cmi1,Cmi2,...,Cmil,...,CmiLMean value ofi belongs to {1, 2.,. x }, and a first set of mean values is obtainedThe second set of mean values isCalculating the maximum value of the elements in the first set of mean values asMaximum of the elements in the second set of means isIf it isSelecting the optimal prediction model as FLM; if it isSelecting an optimal prediction model as an MLM; if it isAnd isSelecting an optimal prediction model as GBLUP;

step 3.3: if the optimal prediction model is FLM, the L FLM valid window sets { F1,F2,...,Fl,...,FLCounting windows in the FLM, and selecting an FLM effective window with the occurrence frequency of more than or equal to L multiplied by 95 percent as a final selection FLM effective window; if the optimal prediction model is MLM, the L MLM valid window sets { M1,M2,...,Ml,...,MLCounting windows in the window selection method, and selecting an MLM effective window with the occurrence frequency of which is more than or equal to L multiplied by 95 percent as a final selected MLM effective window;

step 3.4: calculating { Pn1,Pn2,...,Pnl,...,PnLThe assigned value of { is used as the final associated P value of the nth siteThe final correlation P value of all the sites is obtained asSelecting a site with the maximum final associated P value from a final FLM effective window as an FLM optimal covariate site or selecting a site with the maximum final associated P value from a final selected MLM effective window as an MLM optimal covariate site;

step 3.5: performing gradient lower relation number calculation in parallel by using L threads; wherein performing gradient-wise correlation computation using the L ∈ {1, 2.., L } threads comprises:

step 3.5.1: if the optimal prediction model is GBLUP or MLM, initializing a diagonal weight matrix W of NxN to diag (W) based on Vanraden algorithm1,w2,...,wN) Diag (1, 1.., 1); significant P values { P ] for all sites obtained in step 3.1.11l,P2l,...,Pnl,...,PNlProceeding in descending orderSequencing to obtain a sequenced significant P value sequence { P1l',P2l',...,Pnl',...,PNl'}, setting the weight corresponding to the first alpha% element in the sorted significant P value sequence as the magnification factor, and keeping the weight corresponding to the rear (1-alpha%) element as 1 to obtain a new diagonal weight matrix W'; wherein n1 gradients are set to α as { α12,...,αp,...,αn1And setting an amplification function to be logβP, setting n2 gradients [ beta ] for beta12,...,βq,...,βn2Gradient betaqLower front alphap% of the elements corresponds to a magnification ofCalculating a new genetic relationship matrix as T by combining a Vanraden algorithm; if the optimal prediction model is MLM, adding the MLM optimal covariate locus into the MLM model;

step 3.5.2: calculating the genome estimated breeding value GEBV of the validation population in step 3.1.1 using the matrix T and the test population in step 3.1.14And calculating the gradient { alpha }pqLower GEBV4Correlation coefficient C with the true phenotype of the validation population in step 3.1.1pqlThen, the correlation coefficient under all gradients is obtained as { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,l};

Step 3.6: calculation of { Cpq1,Cpq2,...,Cpql,...,CpqLMean value ofGet a third set of mean values ofCalculate the maximum value in the third set of mean values asIf it isOptimizing the genetic relationship matrix G is not needed; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of the elements is less than L multiplied by 95 percent, the genetic relationship matrix G does not need to be optimized; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of elements is greater than or equal to L x 95%, the genetic relationship matrix G needs to be optimized and the optimal pre-percentage is alphapPercent, the optimum amplification function is

Step 3.7: if the genetic relationship matrix G needs to be optimized, the optimal pre-percentage is alphapPercent, the optimum amplification function isThen will beSequencing the final correlation P values from small to large to obtain a new final correlation P value sequenceFor new final associated P value sequenceMiddle front alphap% of the elements the weight corresponding to the kth element is set to the magnificationRear (1-alpha)p%) are not changed and are kept as 1, a diagonal weight matrix is formed, and an optimal character specific inter-individual genetic relationship matrix G' is calculated by combining a Vanraden algorithm.

In step 3.1.1, the selected model is a GBLUP model or a fixed effect model FLM or a mixed effect model MLM.

In step 3.4, the specified value is a maximum value, a minimum value, a mean value or a median value.

The step 4 comprises the following steps:

step 4.1: if the optimal prediction model is FLM, calculating a genome estimated breeding value of the prediction population by using the FLM optimal covariate locus and the reference population;

step 4.2: if the optimal prediction model is GBLUP and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G and the reference group;

step 4.3: if the optimal prediction model is GBLUP and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G' between individuals with specific optimal traits and the reference group;

step 4.4: if the optimal prediction model is MLM and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the MLM optimal covariate locus, the genetic relationship matrix G and the reference group;

step 4.5: and if the optimal prediction model is MLM and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction population by using the MLM optimal covariate locus, the optimal trait specific inter-individual genetic relationship matrix G' and the reference population.

The invention has the beneficial effects that:

the invention integrates the prior hypothesis of Bayes theory on the basis of a GBLUP model, combines the information of global genome association analysis (GWAS), adopts strategies of multiple regression, grid search, binary extreme value solving and the like based on parallelizable cross validation, automatically selects the optimal prediction model aiming at different properties, accurately screens large-effect markers to integrate model covariates, simultaneously gives proper weights to different markers to construct an individual relationship matrix, and jointly analyzes complex property genetic construction (genetic architecture), so that individual genome breeding values can be predicted quickly, accurately and stably, and the accuracy and the calculation efficiency of global genome selection are improved.

Drawings

FIG. 1 is a schematic diagram of the efficient parallel computing and highly accurate genome wide selection method of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the method for selecting a genome-wide array capable of efficient parallel operation and high accuracy according to the present invention comprises the following steps:

step 1: reading the original genotype file and the original phenotype file, extracting the genotype data and the phenotype data of the same individuals in the original genotype file and the original phenotype file to form a new genotype file and a new phenotype file, and calculating the genetic relationship matrix G between all the individuals by using the new genotype file.

In this embodiment, an S4 data format in the R language is used to establish data mapping between the disk and the memory; the genotype files and phenotype files are read and stored in the big matrix format of the R CRAN software package, and the conversions of the four data formats of the plink format [ ped/. map,. bed/. bim/. fam ], the hapmap format, the VCF format and the numeric format [0/1/2] are provided. And checking and adjusting the sequence of individuals in the original genotype file and the original phenotype file to be consistent, and selecting and reserving individuals existing in the two files at the same time to obtain a new phenotype file and a new genotype file.

Step 2: extracting all individuals in the new phenotype file as a reference group, extracting all individuals without phenotype data in the original genotype file as a prediction group to obtain reference group data and prediction group data, and randomly dividing the reference group into M sub-reference groups with the same scale; wherein the reference population data comprises genotype data and phenotype data for each individual in the reference population and the prediction population data comprises genotype data for each individual in the prediction population.

And step 3: performing whole genome association analysis by using the reference group data, and extracting the result characteristics of the whole genome association analysis; and constructing a model base with specific characters, sequentially optimizing the optimal fixed effect and the optimal random effect by adopting a cross validation strategy, and selecting an optimal prediction model from the model base.

The step 3 comprises the following steps:

step 3.1: performing correlation coefficient calculation in parallel using the L threads; wherein performing correlation coefficient calculations using the L ∈ {1, 2.,. L } threads comprises:

step 3.1.1: randomly selecting M-1 sub-reference groups to synthesize a test group, taking the unselected sub-reference groups as verification groups, performing whole genome association analysis by using the test groups, and extracting significant P values of all loci in the genotype as { P by using a selected model1l,P2l,...,Pnl,...,PNl}; wherein N is the total number of loci in the genotype, PnlCalculating a significant P value for the nth site in the genotype for the lth thread, N ∈ {1, 2.., N }; in this embodiment, the selected model is a GBLUP model, a fixed effect model FLM, or a mixed effect model MLM;

step 3.1.2: dividing all the sites by using a window with a preset size according to the distribution sequence on the genome to obtain x sites in the window; sequencing the sites in each window from small to large according to the significant P value, and selecting the site with the maximum significant P value in each window to form a site set X; in this embodiment, the window size is 1 MB;

step 3.1.3: calculation of the genome estimated breeding value GEBV of the validation population Using the test population and GBLUP model1,GEBV1Calculating GEBV only containing random effect part1Correlation coefficient C with the true phenotype of the validation population0l(ii) a Testing a fixed effect model FLM (fixed effect Linear model) by using the site set X, sequentially taking out sites one by one from the site set X without putting back, adding the sites into the FLM as covariates, and calculating a genome estimated breeding value GEBV of the verification group2,GEBV2Containing only fixed effect parts, and calculating GEBV2The set of correlation coefficients between true phenotypes of the verification clusters is { Cf1l,Cf2l,...,Cfil,...,Cfxl}; using the locus set X to test a mixed effect model MLM (Mixed effect Linear model), sequentially taking out loci from the locus set X one by one without putting back the loci, adding the loci into the MLM as covariates, and calculating a genome estimated breeding value GEBV of a verification group3,GEBV3Including two parts of fixed effect and random effect, and calculating GEBV3The set of correlation coefficients between true phenotypes of the verification clusters is { Cm1l,Cm2l,...,Cmil,...,Cmxl}; wherein i ∈ {1,2,..., x };

step 3.1.4: if Cfil>Cf,i-1,lAnd Cfil>C0lIf the ith site in the site set X is an FLM effective site, the window corresponding to the ith site is an FLM effective window, and an FLM effective window set F is obtainedl(ii) a If Cmil>Cm,i-1,lAnd Cmil>C0lIf the ith site in the site set X is an MLM effective site, the window corresponding to the ith site is an MLM effective window, and an MLM effective window set M is obtainedl

Step 3.2: calculation of { C01,C02,...,C0l,...,C0LMean value ofCalculation of { Cfi1,Cfi2,...,Cfil,...,CfiLMean value of{Cmi1,Cmi2,...,Cmil,...,CmiLMean value ofi belongs to {1, 2.,. x }, and a first set of mean values is obtainedThe second set of mean values isCalculating the maximum value of the elements in the first set of mean values asMaximum of the elements in the second set of means isIf it isSelecting the optimal prediction model as FLM; if it isSelecting an optimal prediction model as an MLM; if it isAnd isThe optimal prediction model is selected as GBLUP.

Step 3.3: if the optimal prediction model is FLM, the L FLM valid window sets { F1,F2,...,Fl,...,FLCounting windows in the FLM, and selecting an FLM effective window with the occurrence frequency of more than or equal to L multiplied by 95 percent as a final selection FLM effective window; if the optimal prediction model is MLM, the L MLM valid window sets { M1,M2,...,Ml,...,MLCounting windows in the window selection method, and selecting an MLM effective window with the occurrence frequency of being more than or equal to L multiplied by 95 percent as a final selection MLM effective window.

Step 3.4: calculating { Pn1,Pn2,…,Pnl,…,PnLThe assigned value of { is used as the final associated P value of the nth siteThe final correlation P value of all the sites is obtained asAnd selecting the site with the maximum final associated P value from the final FLM effective window as the FLM optimal covariate site or selecting the site with the maximum final associated P value from the final MLM effective window as the MLM optimal covariate site. Wherein the specified value is a maximum value or a minimum value or a mean or a median value. In this embodiment, the specified value is the maximum value, i.e., calculating { P }n1,Pn2,…,Pnl,…,PnLThe maximum value of the position is used as the final associated P value of the nth position

Step 3.5: performing gradient lower relation number calculation in parallel by using L threads; wherein performing gradient-downward correlation coefficient calculations using the L ∈ {1,2, …, L } th thread comprises:

step 3.5.1: if the optimal prediction model is GBLUP or MLM, initializing a diagonal weight matrix W of NxN to diag (W) based on Vanraden algorithm1,w2,…,wN) Biag (1,1, …, 1); significant P values { P ] for all sites obtained in step 3.1.11l,P2l,…,Pnl,…,PNlSorting according to the sequence from small to large to obtain a sorted significant P value sequence (P)1l',P2l',…,Pnl',...,PNl'}, setting the weight corresponding to the first alpha% element in the sorted significant P value sequence as the magnification factor, and keeping the weight corresponding to the rear (1-alpha%) element as 1 to obtain a new diagonal weight matrix W'; wherein n1 gradients are set to α as { α12,...,αp,...,αn1And setting an amplification function to be logβP, toBeta set n2 gradients [ beta ]12,...,βq,...,βn2Gradient betaqLower front alphap% of the elements corresponds to a magnification ofCalculating a new genetic relationship matrix as T by combining a Vanraden algorithm; if the optimal prediction model is MLM, adding the MLM optimal covariate locus into the MLM model;

step 3.5.2: calculating the genome estimated breeding value GEBV of the validation population in step 3.1.1 using the matrix T and the test population in step 3.1.14And calculating the gradient { alpha }pqLower GEBV4Correlation coefficient C with the true phenotype of the validation population in step 3.1.1pqlThen, the correlation coefficient under all gradients is obtained as { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,l}。

In this embodiment, n1 gradients are set to α as {0.01,0.1,. 1,5 }; n2 gradients {1.015,1.035,. 3,5,10} are set for β.

Step 3.6: calculation of { Cpq1,Cpq2,...,Cpql,...,CpqLMean value ofGet a third set of mean values ofCalculate the maximum value in the third set of mean values asIf it isOptimizing the genetic relationship matrix G is not needed; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of the elements is less than L multiplied by 95 percent, the genetic relationship matrix G does not need to be optimized; if it isAnd { C11l,C12l,...,C1,n2,l,...,Cpql,Cp,q+1,l,...,Cp,n2,l,...,Cn1,1,l,Cn1,2,l,...,Cn1,n2,lI ∈ C is satisfied in {1, 2., L } }pql>C0lIf the number of elements is greater than or equal to L x 95%, the genetic relationship matrix G needs to be optimized and the optimal pre-percentage is alphapPercent, the optimum amplification function is logβq(P)。

Step 3.7: if the genetic relationship matrix G needs to be optimized, the optimal pre-percentage is alphapPercent, the optimum amplification function isThen will beSequencing the final correlation P values from small to large to obtain a new final correlation P value sequenceFor new final associated P value sequenceMiddle front alphap% of the elements the weight corresponding to the kth element is set to the magnificationRear (1-alpha)p%) are not changed and are kept as 1, a diagonal weight matrix is formed, and the calculation is combined with a Vanraden algorithmAnd (3) an optimal character specific inter-individual genetic relationship matrix G'.

And 4, step 4: calculating a genome estimated breeding value of the prediction population by using the optimal prediction model, and specifically comprising the following steps:

step 4.1: if the optimal prediction model is FLM, calculating a genome estimated breeding value of the prediction population by using the FLM optimal covariate locus and the reference population;

step 4.2: if the optimal prediction model is GBLUP and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G and the reference group;

step 4.3: if the optimal prediction model is GBLUP and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction group by using the genetic relationship matrix G' between individuals with specific optimal traits and the reference group;

step 4.4: if the optimal prediction model is MLM and the genetic relationship matrix G does not need to be optimized, calculating the genome estimated breeding value of the prediction group by using the MLM optimal covariate locus, the genetic relationship matrix G and the reference group;

step 4.5: and if the optimal prediction model is MLM and the genetic relationship matrix G needs to be optimized, calculating the genome estimated breeding value of the prediction population by using the MLM optimal covariate locus, the optimal trait specific inter-individual genetic relationship matrix G' and the reference population.

It is to be understood that the above-described embodiments are only a few embodiments of the present invention, and not all embodiments. The above examples are only for explaining the present invention and do not constitute a limitation to the scope of protection of the present invention. All other embodiments, which can be derived by those skilled in the art from the above-described embodiments without any creative effort, namely all modifications, equivalents, improvements and the like made within the spirit and principle of the present application, fall within the protection scope of the present invention claimed.

12页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:一种基于增强采样分子动力学模拟的混合拟、抗雌激素干扰物的识别方法

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!