[Source Code Walkthrough] Logistic Regression

Abstract: This post walks through the logistic regression implementation in Spark ML. It covers five parts: LogisticCostFun, binaryUpdateInPlace, multinomialUpdateInPlace, LogisticAggregator, and train, where
(1) LogisticCostFun: computes the loss function and its gradient
(2) binaryUpdateInPlace: updates the gradient and loss for the binary model, one sample at a time
(3) multinomialUpdateInPlace: updates the gradient and loss for the multinomial model, one sample at a time
(4) LogisticAggregator: aggregates the per-partition results of the distributed LR computation
(5) train: drives the training
The conclusions are:
1) Spark only distributes the computation over the samples; the model parameters themselves are not distributed.
2) Adding more samples therefore poses no problem for training.
3) Adding more parameters does affect training, because Spark still broadcasts all of the LR coefficients to every node.
(A minimal usage sketch of the public estimator API is shown below for orientation.)
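
For orientation, here is a minimal usage sketch of the public LogisticRegression estimator; its fit() call is what eventually reaches the train() method analyzed below. The dataset, column names, and parameter values are illustrative assumptions, not taken from this post:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LRQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lr-demo").getOrCreate()

    // Toy dataset with the standard (label, features) columns.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)         // with elasticNetParam = 0 this is the L2 strength (regParamL2)
      .setElasticNetParam(0.0)  // pure L2
      .setFamily("auto")        // binomial vs multinomial chosen from the number of label values

    val model = lr.fit(training)  // internally calls train(), which builds LogisticCostFun
    println(s"coefficients = ${model.coefficients}, intercept = ${model.intercept}")
    spark.stop()
  }
}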

(1) LogisticCostFun

/**
 * LogisticCostFun implements Breeze's DiffFunction[T] for a multinomial (softmax) logistic loss
 * function, as used in multi-class classification (it is also used in binary logistic regression).
 * It returns the loss and gradient with L2 regularization at a particular point (coefficients).
 * It's used in Breeze's convex optimization routines.
 */
private class LogisticCostFun(
    instances: RDD[Instance],
    numClasses: Int,
    fitIntercept: Boolean,
    standardization: Boolean,
    bcFeaturesStd: Broadcast[Array[Double]],
    regParamL2: Double,
    multinomial: Boolean,
    aggregationDepth: Int) extends DiffFunction[BDV[Double]] {

  override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
    val coeffs = Vectors.fromBreeze(coefficients)
    val bcCoeffs = instances.context.broadcast(coeffs)
    val featuresStd = bcFeaturesStd.value
    val numFeatures = featuresStd.length
    val numCoefficientSets = if (multinomial) numClasses else 1
    val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures

    val logisticAggregator = {
      val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
      val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)

      instances.treeAggregate(
        new LogisticAggregator(bcCoeffs, bcFeaturesStd, numClasses, fitIntercept,
          multinomial)
      )(seqOp, combOp, aggregationDepth)
    }

    val totalGradientMatrix = logisticAggregator.gradient
    val coefMatrix = new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, coeffs.toArray)
    // regVal is the sum of coefficients squares excluding intercept for L2 regularization.
    val regVal = if (regParamL2 == 0.0) {
      0.0
    } else {
      var sum = 0.0
      coefMatrix.foreachActive { case (classIndex, featureIndex, value) =>
        // We do not apply regularization to the intercepts
        val isIntercept = fitIntercept && (featureIndex == numFeatures)
        if (!isIntercept) {
          // The following code will compute the loss of the regularization; also
          // the gradient of the regularization, and add back to totalGradientArray.
          sum += {
            if (standardization) {
              val gradValue = totalGradientMatrix(classIndex, featureIndex)  // current gradient entry at (classIndex, featureIndex)
              totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * value)  // the L2 gradient contribution is regParamL2 * value
              value * value  // the L2 loss contribution is value * value
            } else {
              if (featuresStd(featureIndex) != 0.0) {
                // If `standardization` is false, we still standardize the data
                // to improve the rate of convergence; as a result, we have to
                // perform this reverse standardization by penalizing each component
                // differently to get effectively the same objective function when
                // the training dataset is not standardized.
                val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex))  // divide the coefficient by the variance of this feature
                val gradValue = totalGradientMatrix(classIndex, featureIndex)
                totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * temp)
                value * temp  // the loss contribution is divided by the variance as well
              } else {  // if this feature's standard deviation is 0.0, no scaling is needed (skip it)
                0.0
              }
            }
          }
        }
      }
      0.5 * regParamL2 * sum  // total loss of the L2 regularization term
    }
    bcCoeffs.destroy(blocking = false)

    (logisticAggregator.loss + regVal, new BDV(totalGradientMatrix.toArray))  // return (loss, gradient)
  }
}
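
The subtle part of calculate is the standardization branch: training always runs in the standardized feature space, so when standardization = false the L2 penalty must be rescaled by 1 / std^2 per feature to recover the objective of the original, unstandardized problem. The standalone sketch below replays just this regVal / gradient bookkeeping on plain arrays for a single coefficient set; the names are made up for illustration and this is not part of the Spark source:

object L2RegSketch {
  /** Adds the L2 penalty to `grad` in place and returns the penalty's loss term.
    * Mirrors the coefMatrix.foreachActive loop above, flattened to one class. */
  def addL2(coef: Array[Double], grad: Array[Double], featuresStd: Array[Double],
            regParamL2: Double, standardization: Boolean, fitIntercept: Boolean): Double = {
    val numFeatures = featuresStd.length
    var sum = 0.0
    var j = 0
    while (j < coef.length) {
      val isIntercept = fitIntercept && j == numFeatures  // the intercept is never penalized
      if (!isIntercept) {
        if (standardization) {
          grad(j) += regParamL2 * coef(j)
          sum += coef(j) * coef(j)
        } else if (featuresStd(j) != 0.0) {
          // reverse the implicit standardization: penalize each component by 1 / std^2
          val temp = coef(j) / (featuresStd(j) * featuresStd(j))
          grad(j) += regParamL2 * temp
          sum += coef(j) * temp
        }
      }
      j += 1
    }
    0.5 * regParamL2 * sum
  }

  def main(args: Array[String]): Unit = {
    val coef = Array(1.0, -2.0, 0.5)   // last slot is the intercept
    val grad = Array(0.1, 0.2, 0.3)    // gradient already accumulated from the data term
    val std  = Array(2.0, 0.5)         // per-feature standard deviations
    val regVal = addL2(coef, grad, std, regParamL2 = 0.1,
                       standardization = false, fitIntercept = true)
    println(s"regVal = $regVal, grad = ${grad.mkString(", ")}")
  }
}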

(2) binaryUpdateInPlace

/** Update gradient and loss using binary loss function. */
  private def binaryUpdateInPlace(
      features: Vector,
      weight: Double,
      label: Double): Unit = {

    val localFeaturesStd = bcFeaturesStd.value
    val localCoefficients = bcCoefficients.value
    val localGradientArray = gradientSumArray
    val margin = - {
      var sum = 0.0
      features.foreachActive { (index, value) =>
        if (localFeaturesStd(index) != 0.0 && value != 0.0) {
          sum += localCoefficients(index) * value / localFeaturesStd(index)  // sum = x * W^{T}
        }
      }
      if (fitIntercept) sum += localCoefficients(numFeaturesPlusIntercept - 1) // sum = x * W^{T} + b
      sum
    }

    val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)  // weight is the instance weight of this sample

    features.foreachActive { (index, value) =>
      if (localFeaturesStd(index) != 0.0 && value != 0.0) {
        localGradientArray(index) += multiplier * value / localFeaturesStd(index)  // accumulate the gradient of this coefficient
      }
    }

    if (fitIntercept) {
      localGradientArray(numFeaturesPlusIntercept - 1) += multiplier  // accumulate the gradient of the intercept b
    }

    if (label > 0) {
      // in MLUtils.log1pExp(x): for x > 0, log(1 + exp(x)) = x + log(1 + exp(-x)), which avoids overflow
      lossSum += weight * MLUtils.log1pExp(margin)  // loss = log(1 + exp(margin))
    } else {
      lossSum += weight * (MLUtils.log1pExp(margin) - margin)  // loss = log(1 + exp(margin)) - margin
    }
  }
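
Putting the pieces together: margin = -(x·w + b), multiplier = weight · (σ(x·w + b) - y), and the two lossSum branches are the usual log-loss written in terms of the margin. The sketch below replays the same per-sample update on plain arrays, with no standardization and no Spark; all names are illustrative assumptions rather than the actual aggregator code:

object BinaryLRUpdateSketch {
  /** Numerically stable log(1 + exp(x)), same idea as MLUtils.log1pExp. */
  def log1pExp(x: Double): Double =
    if (x > 0) x + math.log1p(math.exp(-x)) else math.log1p(math.exp(x))

  /** One-sample gradient/loss update for binary logistic regression.
    * Returns the weighted loss contribution; `grad` is updated in place.
    * Layout: grad(numFeatures) holds the intercept slot when fitIntercept is true. */
  def binaryUpdate(features: Array[Double], weight: Double, label: Double,
                   coef: Array[Double], grad: Array[Double], fitIntercept: Boolean): Double = {
    val numFeatures = features.length

    // margin = -(w . x + b)
    var dot = 0.0
    var i = 0
    while (i < numFeatures) { dot += coef(i) * features(i); i += 1 }
    if (fitIntercept) dot += coef(numFeatures)
    val margin = -dot

    // multiplier = weight * (sigma(w . x + b) - y), because 1 / (1 + exp(margin)) = sigma(-margin)
    val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)

    i = 0
    while (i < numFeatures) { grad(i) += multiplier * features(i); i += 1 }
    if (fitIntercept) grad(numFeatures) += multiplier

    // loss = log(1 + exp(margin)) if y = 1, and log(1 + exp(margin)) - margin if y = 0
    if (label > 0) weight * log1pExp(margin) else weight * (log1pExp(margin) - margin)
  }
}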

(3) multinomialUpdateInPlace

Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a “pivot” and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces K sets of coefficients, while using a pivot class produces K - 1 sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes. When regularization is applied, multinomial (softmax) loss will produce a result different from binary loss since the positive and negative don’t share the coefficients while the binary regression shares the coefficients between positive and negative.
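
The "shared coefficients" remark can be made precise with a short, standard derivation (not quoted from this post). For K = 2, the softmax probability of the positive class depends only on the difference of the two coefficient vectors:

$$
P(y_i = 1 \mid \vec{x}_i)
  = \frac{e^{\vec{x}_i^{\,T}\vec{\beta}_1}}{e^{\vec{x}_i^{\,T}\vec{\beta}_0} + e^{\vec{x}_i^{\,T}\vec{\beta}_1}}
  = \frac{1}{1 + e^{-\vec{x}_i^{\,T}(\vec{\beta}_1 - \vec{\beta}_0)}}
  = \sigma\!\left(\vec{x}_i^{\,T}(\vec{\beta}_1 - \vec{\beta}_0)\right)
$$

so fixing the pivot $\vec{\beta}_0 = 0$ recovers the single coefficient vector of the binary model above. Without regularization the two parameterizations give the same probabilities, but the L2 penalty $\|\vec{\beta}_0\|^2 + \|\vec{\beta}_1\|^2$ is not determined by the difference $\vec{\beta}_1 - \vec{\beta}_0$ alone, which is why the regularized softmax and pivoted (binary) solutions generally differ.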

The following is a mathematical derivation for the multinomial (softmax) loss. The probability of the multinomial outcome $y_i$ taking on any of the $K$ possible outcomes is:

$$
P(y_i = k \mid \vec{x}_i, \beta) = \frac{e^{\vec{x}_i^{\,T}\vec{\beta}_k}}{\sum_{k'=0}^{K-1} e^{\vec{x}_i^{\,T}\vec{\beta}_{k'}}}, \qquad k = 0, 1, \ldots, K - 1.
$$
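
To make the softmax loss concrete, here is a standalone sketch of a per-sample multinomial gradient/loss update on plain arrays, including the max-margin shift commonly used for numerical stability. The class-major coefficient layout and all names are illustrative assumptions, not a verbatim copy of the Spark implementation:

object MultinomialLRUpdateSketch {
  /** One-sample gradient/loss update for multinomial (softmax) logistic regression.
    * coef is laid out class-major: coef(k * stride + j), with the last slot of each
    * class holding the intercept. Returns the weighted loss; grad is updated in place. */
  def multinomialUpdate(features: Array[Double], weight: Double, label: Int, numClasses: Int,
                        coef: Array[Double], grad: Array[Double], fitIntercept: Boolean): Double = {
    val numFeatures = features.length
    val stride = if (fitIntercept) numFeatures + 1 else numFeatures

    // margins(k) = x . beta_k (+ b_k)
    val margins = Array.tabulate(numClasses) { k =>
      var m = 0.0
      var j = 0
      while (j < numFeatures) { m += coef(k * stride + j) * features(j); j += 1 }
      if (fitIntercept) m += coef(k * stride + numFeatures)
      m
    }

    // Shift by the largest margin so the exponentials cannot overflow.
    val maxMargin = margins.max
    val expShifted = margins.map(m => math.exp(m - maxMargin))
    val sumExp = expShifted.sum

    // multiplier_k = weight * (P(y = k | x) - 1{label = k})
    var k = 0
    while (k < numClasses) {
      val multiplier = weight * (expShifted(k) / sumExp - (if (label == k) 1.0 else 0.0))
      var j = 0
      while (j < numFeatures) { grad(k * stride + j) += multiplier * features(j); j += 1 }
      if (fitIntercept) grad(k * stride + numFeatures) += multiplier
      k += 1
    }

    // loss = log(sum_k exp(margin_k)) - margin_label, evaluated in the shifted form
    weight * (math.log(sumExp) + maxMargin - margins(label))
  }
}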