Cross-Validation with Phylobayes

Phylobayes CV in Practice

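This article describes how to run cross-validation with Phylobayes, covering the theoretical basis, the workflow, and suggested parameter settings. The dataset is split into a training set and a test set to assess each model's predictive ability, and the procedure is repeated several times to make the result more reliable.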

Principle
Cross-validation (CV) is a general method for evaluating the fit of alternative models. The rationale is as follows: the dataset is randomly split into two (possibly unequal) parts, the training (or learning) set and the test set. The parameters of the model are estimated on the learning set (i.e. the model is 'trained' on this subset of empirical observations), and these parameter values are then used to compute the likelihood of the test set (which measures how well the test set is 'predicted' by the model). The overall procedure has to be repeated (and the resulting log likelihood scores averaged) over several random splits.
In short, CV is used here to find the best-fitting substitution model: the dataset is split into a training set and a test set, the model parameters are estimated on the training set, and those parameters are then used to compute the likelihood of the test set. The procedure is repeated over several random splits and the resulting log-likelihood scores are averaged.

Typically, 10-fold cross-validation (such that D2 represents 10% and D1 90% of the original dataset) has been used (e.g. Philippe et al., 2011), and ten replicates have been run (although ideally, 100 replicates would certainly be more adequate). However, alternative schemes are possible.
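The user manual therefore describes 10-fold cross-validation as the typical scheme: 90% of the data as the training (learning) set and 10% as the test set, repeated 10 times.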

Workflow

cvrep: prepare the replicates
pb: run each model under each replicated learning set
readcv: compute the cross-validation scores on each replicate
sumcv: pool the cv-scores and combine them into a global scoring of the models

Step I:

cvrep -nrep 10 -nfold 10 -d 13PCG123.phy pcg

This generates 10 pairs of learn and test files.
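
Before moving on, it can help to confirm that every replicate was actually written. The sketch below is a hypothetical helper (not part of Phylobayes) that lists any missing file. It assumes the files are named `<prefix><rep>_learn.ali` and `<prefix><rep>_test.ali` with replicates numbered from 0, which matches the `pcg0_learn.ali` style used in Step II; adjust the pattern if your files are named differently.

```shell
# Hypothetical helper: print the name of every expected cvrep output file
# that is missing from the current directory.
# Assumption: files are named <prefix><rep>_learn.ali / <prefix><rep>_test.ali,
# with replicates numbered from 0.
missing_cv_files() {
    prefix=$1   # prefix given to cvrep, e.g. pcg
    nrep=$2     # number of replicates, e.g. 10
    rep=0
    while [ "$rep" -lt "$nrep" ]; do
        for kind in learn test; do
            f="${prefix}${rep}_${kind}.ali"
            [ -f "$f" ] || echo "$f"
        done
        rep=$((rep + 1))
    done
}
```

Calling `missing_cv_files pcg 10` in the directory where cvrep was run prints nothing when all 20 files are present.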

Step II:

pb -d pcg0_learn.ali -T tree.nwk -x 10 11000 CATpcg0_learn.ali
pb -d pcg0_learn.ali -T tree.nwk -x 1 1100 -wag WAGpcg0_learn.ali

Run each model on all 10 *_learn.ali files.
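
Typing 20 commands by hand is error-prone, so the two commands above can be generated for every replicate with a small loop. The following is a minimal sketch under stated assumptions: the function name is made up for illustration, the learning sets are assumed to be named `<prefix><rep>_learn.ali` with replicates numbered from 0, and the `-x` settings are copied unchanged from the two commands above. It only prints the commands and does not run them.

```shell
# Hypothetical helper: print one pb command per model and replicate (dry run).
# Assumptions: learning sets are named <prefix><rep>_learn.ali, numbered from 0;
# the -x values are copied from the CAT and WAG commands shown in Step II.
print_pb_commands() {
    prefix=$1   # prefix given to cvrep, e.g. pcg
    nrep=$2     # number of replicates, e.g. 10
    tree=$3     # fixed tree topology, e.g. tree.nwk
    rep=0
    while [ "$rep" -lt "$nrep" ]; do
        learn="${prefix}${rep}_learn.ali"
        # The chain name is the model prefix followed by the learning set name.
        echo "pb -d $learn -T $tree -x 10 11000 CAT$learn"
        echo "pb -d $learn -T $tree -x 1 1100 -wag WAG$learn"
        rep=$((rep + 1))
    done
}
```

For example, `print_pb_commands pcg 10 tree.nwk > pb_jobs.txt` writes the 20 commands to a file, which can then be inspected and run one by one or handed to a job scheduler, since each chain can take a long time.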

Step III: Calculate cross-validated likelihoods

readcv -nrep 10 -x 100 1 CAT pcg
readcv -nrep 10 -x 100 1 WAG pcg

Note that, when used with the -nrep option such as above, readcv will process each replicate successively, which may take a very long time. Alternatively readcv can be called on individual replicates. For instance:
readcv -rep 2 -x 100 10 CAT pcg
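
Since each replicate can be processed on its own, the per-replicate form lends itself to running several readcv jobs side by side. The sketch below prints one `readcv -rep` command per replicate. The function name is hypothetical, the two values passed to `-x` are parameters that you should set to match your own runs, and replicate numbering from 0 is an assumption based on the `pcg0_learn.ali` file naming.

```shell
# Hypothetical helper: print one readcv command per replicate (dry run).
# Assumption: replicates are numbered from 0, as in pcg0_learn.ali.
print_readcv_commands() {
    model=$1    # model prefix used in the chain names, e.g. CAT
    prefix=$2   # prefix given to cvrep, e.g. pcg
    nrep=$3     # number of replicates, e.g. 10
    first=$4    # first value passed to -x, e.g. 100
    second=$5   # second value passed to -x, e.g. 10
    rep=0
    while [ "$rep" -lt "$nrep" ]; do
        echo "readcv -rep $rep -x $first $second $model $prefix"
        rep=$((rep + 1))
    done
}
```

For example, `print_readcv_commands CAT pcg 10 100 10` prints ten commands, one per replicate, which can be distributed over several cores or cluster jobs.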

Step IV: Average the cv-log-likelihood scores over replicates

sumcv -nrep 10 WAG CAT pcg
sumcv -nrep 10 WAG CAT GTR pcg

The first model in the list (here WAG) is used as the reference.