Spark MLlib Algorithms

This article shows how to use Apache Spark's MLlib library for machine learning tasks, including classification, regression, and model evaluation, with worked examples of support vector machines (SVM), logistic regression, and linear regression.


Official documentation

Mathematical formulation

Loss functions

  1. hinge loss
  2. logistic loss
  3. squared loss
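From the MLlib linear-methods guide, for a binary label $y \in \{-1, +1\}$ and weight vector $\mathbf{w}$ (for the squared loss, $y$ is a real-valued target), these losses are:

$$\text{hinge: } L(\mathbf{w}; \mathbf{x}, y) = \max\{0,\; 1 - y\,\mathbf{w}^T \mathbf{x}\}$$
$$\text{logistic: } L(\mathbf{w}; \mathbf{x}, y) = \log\bigl(1 + \exp(-y\,\mathbf{w}^T \mathbf{x})\bigr)$$
$$\text{squared: } L(\mathbf{w}; \mathbf{x}, y) = \tfrac{1}{2}\,(\mathbf{w}^T \mathbf{x} - y)^2$$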

Regularizers

  1. L1
  2. L2
  3. elastic net
  4. zero (unregularized)
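Per the MLlib guide, these enter the training objective $f(\mathbf{w}) = \lambda\, R(\mathbf{w}) + \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{w}; \mathbf{x}_i, y_i)$, where $\lambda \ge 0$ is the regularization parameter, as:

$$R(\mathbf{w}) = 0 \;\text{(zero)}, \qquad R(\mathbf{w}) = \|\mathbf{w}\|_1 \;\text{(L1)}, \qquad R(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|_2^2 \;\text{(L2)}$$
$$R(\mathbf{w}) = \alpha\,\|\mathbf{w}\|_1 + (1-\alpha)\,\tfrac{1}{2}\|\mathbf{w}\|_2^2 \;\text{(elastic net, } 0 \le \alpha \le 1\text{)}$$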

Optimization

Spark trains its linear models with one of two optimization methods: SGD (stochastic gradient descent) and L-BFGS (a quasi-Newton method).

The LIBSVM data format

Label 1:value 2:value ….

Label: the class identifier, such as the 1 and -1 mentioned for train.model in the previous section. You can pick any values, e.g. -10, 0, 15. For regression, however, the label is the target value and must be the actual number.

Value: the data to train on; from the classification perspective these are the feature values. Entries are separated by spaces.

For example: -15 1:0.708 2:1056 3:-0.3333

Note that when a feature value is 0, the indices before the colons (call them feature indices) need not be contiguous. For example:
-15 1:0.708 3:-0.3333

indicates that the second feature is 0. From a programming standpoint, omitting zero entries reduces memory use and speeds up matrix inner products. Data produced in MATLAB is normally a regular matrix without indices, so it is best to write a small program to convert it.
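As a quick check of the format, a minimal sketch (assuming an existing SparkContext sc and the sample data file that ships with Spark):

import org.apache.spark.mllib.util.MLUtils

// Each line of a LIBSVM file becomes a LabeledPoint whose features are
// stored as a sparse vector; MLlib converts the 1-based LIBSVM indices
// to 0-based vector indices.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val first = examples.first()
println(s"label = ${first.label}, features = ${first.features}")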

Classification

Linear classification offers the linear support vector machine (SVM) and logistic regression (LR). The linear SVM supports only binary classification, while LR supports both binary and multiclass classification.

Linear Support Vector Machines (SVMs)

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "/Users/yuyin/Downloads/software/spark/spark-1.6.2-bin-hadoop1-scala2.11/data/mllib/sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold.
model.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model
model.save(sc, "target/tmp/scalaSVMWithSGDModel")
val sameModel = SVMModel.load(sc, "target/tmp/scalaSVMWithSGDModel")

Adding a regularizer

import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
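Here L1Updater replaces the default squared-L2 penalty with L1 regularization; other spark.mllib algorithms can be configured through their optimizer field in the same way.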

Logistic regression

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val accuracy = metrics.accuracy
println(s"Accuracy = $accuracy")

// Save and load model
model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
val sameModel = LogisticRegressionModel.load(sc,
  "target/tmp/scalaLogisticRegressionWithLBFGSModel")

Regression

Linear least squares, Lasso, and ridge regression

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
val sameModel = LinearRegressionModel.load(sc, "target/tmp/scalaLinearRegressionWithSGDModel")

RidgeRegressionWithSGD and LassoWithSGD can be used in the same way as LinearRegressionWithSGD; see the sketch below. spark.mllib implements a simple distributed version of stochastic gradient descent (SGD).
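A minimal sketch (reusing parsedData, numIterations, and stepSize from the example above; the regularization parameter value is illustrative):

import org.apache.spark.mllib.regression.{LassoWithSGD, RidgeRegressionWithSGD}

// Same call shape as LinearRegressionWithSGD.train, with an extra
// regularization parameter.
val regParam = 0.01
val ridgeModel = RidgeRegressionWithSGD.train(parsedData, numIterations, stepSize, regParam)
val lassoModel = LassoWithSGD.train(parsedData, numIterations, stepSize, regParam)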

The following algorithms are all implemented in Scala:
1. SVMWithSGD
2. LogisticRegressionWithLBFGS
3. LogisticRegressionWithSGD
4. LinearRegressionWithSGD
5. RidgeRegressionWithSGD
6. LassoWithSGD

Choosing a logistic regression implementation

The L-BFGS version supports both binary and multinomial logistic regression, while the SGD version supports only binary classification. The L-BFGS version does not support L1 regularization, but the SGD version does. When L1 regularization is not required, the L-BFGS version is strongly recommended, since it converges faster and more accurately than SGD by approximating the inverse Hessian with a quasi-Newton method.
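As a sketch of the SGD variant with L1 regularization (reusing the training set from the logistic regression example above; the iteration count and regParam are illustrative, and the zero-argument constructor applies to the older RDD-based API):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater

// Binary-only, but unlike the L-BFGS version it accepts an L1 updater.
val lrAlg = new LogisticRegressionWithSGD()
lrAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val lrModelL1 = lrAlg.run(training)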
Official documentation

Streaming linear regression

An online regression model, updated incrementally as new data arrives.
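The code below assumes an existing StreamingContext ssc and reads LabeledPoint text files from two directories given in args. A minimal way to create the context (the application name and one-second batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative setup: app name and batch interval are arbitrary choices.
val conf = new SparkConf().setAppName("StreamingLinearRegressionExample")
val ssc = new StreamingContext(conf, Seconds(1))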

import org.apache.spark.mllib.linalg.Vectors 
import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache() 
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

val numFeatures = 3 
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start() 
ssc.awaitTermination() 

Evaluation metrics

Official documentation

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")

// Split data into training (60%) and test (40%)
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// Clear the prediction threshold so the model will return probabilities
model.clearThreshold

// Compute raw scores on the test set
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Instantiate metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision by threshold
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}

// Recall by threshold
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}

// Precision-Recall Curve
val PRC = metrics.pr

// F-measure
val f1Score = metrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 0.5")
}

// AUPRC
val auPRC = metrics.areaUnderPR
println("Area under precision-recall curve = " + auPRC)

// Compute thresholds used in ROC and PR curves
val thresholds = precision.map(_._1)

// ROC Curve
val roc = metrics.roc

// AUROC
val auROC = metrics.areaUnderROC
println("Area under ROC = " + auROC)