Weka Algorithm Source Analysis, Classifier-tree-J48 (Part 2): ClassifierTree

Article link: https://blog.youkuaiyun.com/ROger__wonG/article/details/39119701

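This article examines the implementation of Weka's ClassifierTree and C45PruneableClassifierTree in detail, including how missing values are handled, how continuous values are discretized, and how pruning is done. By analyzing the `buildClassifier` method, it walks through the key steps of building a classification tree, namely the `buildTree`, `collapse`, and `prune` functions. It also addresses how the accuracy of the tree is controlled and what tree pruning is based on.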


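I. Questions

I studied the J48 implementation with four main questions in mind.

1. How is the accuracy of the classification tree controlled?

2. How are missing values (MissingValue) handled?

3. How are continuous values discretized?

4. How is the classification tree pruned?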

II. buildClassifier

Every classifier implements this method. It takes an Instances object and builds the classification tree from it. The core code is as follows:

public void buildClassifier(Instances instances)
    throws Exception {

  ModelSelection modSelection;

  // Choose the split-selection strategy: binary or multiway splits.
  if (m_binarySplits)
    modSelection = new BinC45ModelSelection(m_minNumObj, instances);
  else
    modSelection = new C45ModelSelection(m_minNumObj, instances);

  // Choose the tree class, which determines the pruning method.
  if (!m_reducedErrorPruning)
    m_root = new C45PruneableClassifierTree(modSelection, !m_unpruned, m_CF,
                                            m_subtreeRaising, !m_noCleanup);
  else
    m_root = new PruneableClassifierTree(modSelection, !m_unpruned, m_numFolds,
                                         !m_noCleanup, m_Seed);

  // Build the actual model on the chosen tree.
  m_root.buildClassifier(instances);

  // Release the selector's reference to the training data.
  if (m_binarySplits) {
    ((BinC45ModelSelection)modSelection).cleanup();
  } else {
    ((C45ModelSelection)modSelection).cleanup();
  }
}

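The logic of this code is clear. It first constructs a ModelSelection according to whether binary splits are requested (that is, whether every node has exactly two branches). It then constructs the corresponding ClassifierTree according to the m_reducedErrorPruning flag and builds the actual model on that tree. Finally it cleans up, which mainly means releasing references, so that the tree does not keep holding the Instances object and prevent the GC from reclaiming it when the caller wants to release it.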
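The three steps described above (pick a split strategy, pick a pruning style, release the data reference) can be sketched in plain Java without Weka. Everything in this sketch is a hypothetical stand-in: `BuildFlowSketch`, `Selector`, `chooseSelector`, and `choosePruning` are illustration names, not Weka classes or methods, and the strings only label the two branches.

```java
import java.util.ArrayList;
import java.util.List;

public class BuildFlowSketch {

    /** Hypothetical stand-in for a split-selection strategy that keeps a reference to the training data. */
    static final class Selector {
        final String kind;
        private List<double[]> data;

        Selector(String kind, List<double[]> data) {
            this.kind = kind;
            this.data = data;
        }

        /** Drop the reference so the caller's dataset can be garbage collected. */
        void cleanup() {
            data = null;
        }

        boolean holdsData() {
            return data != null;
        }
    }

    /** Mirrors the first branch: binary splits versus multiway splits. */
    static Selector chooseSelector(boolean binarySplits, List<double[]> data) {
        return new Selector(binarySplits ? "binary" : "multiway", data);
    }

    /** Mirrors the second branch: which pruning style the tree is built with. */
    static String choosePruning(boolean reducedErrorPruning) {
        return reducedErrorPruning ? "reduced-error" : "confidence-based";
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        data.add(new double[] {5.1, 3.5});

        Selector selector = chooseSelector(false, data);
        String pruning = choosePruning(false);
        // The tree itself would be built at this point.
        selector.cleanup();

        System.out.println(selector.kind + ", " + pruning
            + ", holdsData=" + selector.holdsData());
    }
}
```

The point of the sketch is the last step: as long as the selector keeps its `data` field, the dataset stays reachable through the model, so the reference is cleared explicitly once building is done.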

III. C45PruneableClassifierTree

(1) This class also implements the buildClassifier method to build the classifier. Start with the main logic of this method:

public void buildClassifier(Instances data) throws Exception {

  // can classifier tree handle the data?
  getCapabilities().testWithFail(data);

  // remove instances with missing class
  data = new Instances(data);
  data.deleteWithMissingClass();

  // grow the tree; keep the data at the nodes if subtree raising
  // needs it or if cleanup is disabled
  buildTree(data, m_subtreeRaising || !m_cleanup);
  collapse();
  if (m_pruneTheTree) {
    prune();
  }
  if (m_cleanup) {
    // replace the stored data with an empty, header-only copy
    cleanup(new Instances(data, 0));
  }
}
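Note that the method copies the dataset before deleting rows whose class value is missing, so the caller's dataset is left untouched. A minimal plain-Java sketch of that copy-then-filter pattern follows. It does not use Weka: `MissingClassSketch`, `Row`, and `withoutMissingClass` are hypothetical names, and a `null` label is only an assumed way to represent a missing class value.

```java
import java.util.ArrayList;
import java.util.List;

public class MissingClassSketch {

    /** Hypothetical row type: a null label stands for a missing class value. */
    static final class Row {
        final double[] features;
        final String label;

        Row(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Copy the list first, then drop rows with a missing class from the copy only. */
    static List<Row> withoutMissingClass(List<Row> data) {
        List<Row> copy = new ArrayList<>(data); // shallow copy: rows are shared, the list is not
        copy.removeIf(row -> row.label == null);
        return copy;
    }

    public static void main(String[] args) {
        List<Row> original = new ArrayList<>();
        original.add(new Row(new double[] {1.0}, "yes"));
        original.add(new Row(new double[] {2.0}, null));
        original.add(new Row(new double[] {3.0}, "no"));

        List<Row> training = withoutMissingClass(original);
        System.out.println("original=" + original.size() + ", training=" + training.size());
    }
}
```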

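First, testWithFail checks whether the data passed in can be handled by this classifier; for example, C4.5 can only classify instances whose class attribute takes discrete values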