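Picking up where the last note left off, it is time to analyze the run method. It is more than 1000 lines long, and its main job is to take the data and parameters and train a set of trees, that is, a decision forest.
The first thing it does is this: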
val metadata =
DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, numTrees, featureSubsetStrategy)
This builds the metadata for the decision trees.
private[spark] class DecisionTreeMetadata(
    val numFeatures: Int,
    val numExamples: Long, // number of input examples
    val numClasses: Int, // for classification, labels run from 0 to n-1; for regression this value is 0
    val maxBins: Int,
    val featureArity: Map[Int, Int], // the categoricalFeatures map from earlier (feature index -> number of categories)
    val unorderedFeatures: Set[Int], // indices of the unordered features
    val numBins: Array[Int],
    val impurity: Impurity,
    val quantileStrategy: QuantileStrategy, // quantile calculation strategy, one of Sort, MinMax, ApproxHist (approximate histogram)
    val maxDepth: Int,
    val minInstancesPerNode: Int,
    val minInfoGain: Double,
    val numTrees: Int,
    val numFeaturesPerNode: Int) extends Serializable {
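This class holds a bunch of members, most of which are the parameters covered in Note 1.
It also has the following functions. They are simple, but important.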
def isUnordered(featureIndex: Int): Boolean = unorderedFeatures.contains(featureIndex) // whether this feature is unordered
def isClassification: Boolean = numClasses >= 2 // whether this is classification
def isMulticlass: Boolean = numClasses > 2 // whether this is multiclass classification
def isMulticlassWithCategoricalFeatures: Boolean = isMulticlass && (featureArity.size > 0) // whether this is multiclass classification with categorical features
def isCategorical(featureIndex: Int): Boolean = featureArity.contains(featureIndex) // whether this feature is categorical
def isContinuous(featureIndex: Int): Boolean = !featureArity.contains(featureIndex) // whether this feature is continuous
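def numSplits(featureIndex: Int): Int = if (isUnordered(featureIndex)) {
  numBins(featureIndex) >> 1
} else {
  numBins(featureIndex) - 1
} // number of splits for this feature: numBins - 1 for an ordered feature, half of numBins for an unordered one (unordered meaning categories such as a, b, c, d)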
def setNumSplits(featureIndex: Int, numSplits: Int) {
  require(isContinuous(featureIndex),
    s"Only number of bin for a continuous feature can be set.")
  numBins(featureIndex) = numSplits + 1
} // set the number of splits for a continuous feature
def subsamplingFeatures: Boolean = numFeatures != numFeaturesPerNode // whether feature subsampling is used
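To make the relationship between bins and splits concrete, here is a small standalone sketch. It is not Spark code: SketchMetadata, UnorderedSplits and SketchDemo are names made up for this note, and the class keeps only the handful of fields the predicates need. The sketch assumes the categories are distinct and that each split of an unordered feature owns 2 bins, which is why numSplits halves numBins in that branch. Enumerating the partitions of 4 categories into two non-empty sets gives 2^(4-1) - 1 = 7 splits.

```scala
// Simplified stand-in for the predicates above; hypothetical, not part of Spark.
final class SketchMetadata(
    val numClasses: Int,
    val featureArity: Map[Int, Int],  // categorical feature index -> number of categories
    val unorderedFeatures: Set[Int],
    val numBins: Array[Int]) {

  def isUnordered(f: Int): Boolean = unorderedFeatures.contains(f)
  def isCategorical(f: Int): Boolean = featureArity.contains(f)
  def isContinuous(f: Int): Boolean = !featureArity.contains(f)
  def isClassification: Boolean = numClasses >= 2
  def isMulticlass: Boolean = numClasses > 2

  // Unordered: every split owns 2 bins, so splits = bins / 2.
  // Ordered or continuous: n bins are separated by n - 1 thresholds.
  def numSplits(f: Int): Int =
    if (isUnordered(f)) numBins(f) >> 1 else numBins(f) - 1
}

object UnorderedSplits {
  // Enumerate every way to cut distinct categories into two disjoint, non-empty sets.
  // Pinning the first category to the left side avoids counting mirror images twice.
  def partitions[A](categories: Seq[A]): Seq[(Set[A], Set[A])] = {
    require(categories.nonEmpty, "need at least one category")
    val head = categories.head
    val rest = categories.tail
    (0 until (1 << rest.size)).flatMap { mask =>
      // Bit i of mask decides whether rest(i) joins head on the left side.
      val left = Set(head) ++ rest.zipWithIndex.collect {
        case (c, i) if (mask & (1 << i)) != 0 => c
      }
      val right = categories.toSet -- left
      // Drop the single case where everything landed on the left.
      if (right.isEmpty) None else Some((left, right))
    }
  }
}

object SketchDemo {
  def main(args: Array[String]): Unit = {
    val splits = UnorderedSplits.partitions(Seq("a", "b", "c", "d"))
    println(splits.size) // 7, i.e. 2^(4-1) - 1

    // Feature 0: continuous with 32 bins. Feature 1: unordered, 4 categories, 2 bins per split.
    val meta = new SketchMetadata(
      numClasses = 3,
      featureArity = Map(1 -> 4),
      unorderedFeatures = Set(1),
      numBins = Array(32, 2 * splits.size))

    println(meta.numSplits(0)) // 31
    println(meta.numSplits(1)) // 7
  }
}
```

The exponential growth of the partition count is also a hint at why only low-arity categorical features can reasonably be treated as unordered.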
The DecisionTreeMetadata companion object has three methods. The most important is the one we call:
def buildMetadata(
    input: RDD[LabeledPoint],
    strategy: Strategy,
    numTrees: Int,
    featureSubsetStrategy: String): DecisionTreeMetadata = {
Before going through it, let's first look at this helper:
/**
* Given the arity of a categorical feature (arity = number of categories),
* return the number of bins for the feature if it is to be treated as an unordered feature.
* There is 1 split for every partitioning of categories into 2 disjoint, non-empty sets;
* there are math.pow(2, arity - 1) - 1 such splits.
* Each split has 2 corresponding bins.
*/
def numUnorder