Weka开发 —J48源代码介绍
这次介绍一下J48的源码,分析J48的源码似乎真还是有用的,同学改造J48写过VFDT,我自己用J48进行特征选择(当然很失败)。
J48的buildClassfier函数:
public voidbuildClassifier(Instances instances)throws Exception {
ModelSelection modSelection;
if (m_binarySplits)
modSelection = newBinC45ModelSelection(m_minNumObj, instances);
else
modSelection = newC45ModelSelection(m_minNumObj, instances);
if (!m_reducedErrorPruning)
m_root = new C45PruneableClassifierTree(modSelection,
!m_unpruned, m_CF, m_subtreeRaising, !m_noCleanup);
else
m_root = new PruneableClassifierTree(modSelection, !m_unpruned,
m_numFolds, !m_noCleanup, m_Seed);
m_root.buildClassifier(instances);
if (m_binarySplits) {
((BinC45ModelSelection) modSelection).cleanup();
} else {
((C45ModelSelection) modSelection).cleanup();
}
}
在NBTree中已经介绍过了,ModelSelection是决定决策树的模型类,前面两个if,一个是判断连续属性是否只分出两个子结点,另一个判断是否最后剪枝。m_root是一个ClassifierTree对象,它调用buildClassifier函数。这里列出这个函数:
public void buildClassifier(Instances data) throws Exception {
// can classifier tree handle the data?
getCapabilities().testWithFail(data);
// remove instances with missing class
data = new Instances(data);
data.deleteWithMissingClass();
buildTree(data, false);
}
有注释也没什么好说的,直接看最后一个函数buildTree:
public void buildTree(Instances data, boolean keepData) throws Exception {
Instances[] localInstances;
if (keepData) {
m_train = data;
}
m_test = null;
m_isLeaf = false;
m_isEmpty = false;
m_sons = null;
m_localModel = m_toSelectModel.selectModel(data);
if (m_localModel.numSubsets() > 1) {
localInstances = m_localModel.split(data);
data = null;
m_sons = new ClassifierTree[m_localModel.numSubsets()];
for (int i = 0; i < m_sons.length; i++) {
m_sons[i] = getNewTree(localInstances[i]);
localInstances[i] = null;
}
} else {
m_isLeaf = true;
if (Utils.eq(data.sumOfWeights(), 0))
m_isEmpty = true;
data = null;
}
}
这里的selectModel函数,如果看过NBTree一篇的读者应该不会太陌生,selectModel简单地说就是如果不符合分裂的条件就返回NoSplit,如果符合分裂的条件,则从currentModel数组中选出bestModel返回。
这最要注意的是selectModel也不只是决定哪个属性分裂,其实到底如何分裂已经在这个函数里算里出来了。
我把selectModel拆开来讲解
// Check if all Instances belong to one class or if not
// enough Instances to split.
checkDistribution = new Distribution(data);
noSplitModel = new NoSplit(checkDistribution);
if (Utils.sm(checkDistribution.total(), 2 * m_minNoObj)
|| Utils.eq(checkDistribution.total(), checkDistribution
.perClass(checkDistribution.maxClass())))
return noSplitModel;
2 * m_minNoObj表示至有有这么多样本才可以分裂,原因很简单,因为一个结点至少分出两个子结点,每个子结点至少有m_minNoObj个样本,第二个是或条件是表示是否这个结点上所有的样本都属于同一类别,也就是这个结点总的权重是否等于这个最多类别的权重。
// Check if all attributes are nominal and have a lot of values.
if (m_allData != null) {
Enumeration enu = data.enumerateAttributes();
while (enu.hasMoreElements()) {
attribute = (Attribute) enu.nextElement();
if ((attribute.isNumeric())
|| (Utils.sm((double) attribute.numValues(),
(0.3 * (double) m_allData.numInstances())))) {
multiVal = false;
break;
}
}
}
判断是否有很多不同的属性值,标准就是如果有一个属性的属性值小多于总样本数*0.3,那么就是不是multiVal。
currentModel = new C45Split[data.numAttributes()];
sumOfWeights = data.sumOfWeights();
// For each attribute.
for (i = 0; i < data.numAttributes(); i++) {
// Apart from class attribute.
if (i != (data).classIndex()) {
// Get models for current attribute.
currentModel[i] = new C45Split(i, m_minNoObj, sumOfWeights);
currentModel[i].buildClassifier(data);
// Check if useful split for current attribute
// exists and check for enumerated attributes with