Another article adapted from KDnuggets: Dmitry Petrov of Microsoft explains how to use Spark ML to work with data that exceeds the size of memory. Link to the original.
The focus here is on data far larger than a single machine's memory. Analysis like this used to be done on a distributed system (Hadoop, for example), while this article shows how to do it on a single machine with Spark. Since I did a fair amount of migration work myself, I'm counting this post as original.
The goal of the example in this article is to build a predictive model that predicts a post's tag from its title and body. To keep the code simple, the article combines these two fields into a single text column rather than handling them separately. (Translator's note: the words in the title obviously deserve a higher weight when predicting tags, so in real work the two columns should be treated separately.)
It is easy to see the value of such a model for a site like stackoverflow.com: as the user types a question, the site automatically suggests tags. We assume we want as many correct tags as possible, letting the user delete the irrelevant ones. Under that assumption, recall becomes the most important metric for judging the model.
The first step is to find such a dataset. The article uses the Stack Overflow Posts.xml file from the Stack Exchange data dump on archive.org: https://archive.org/details/stackexchange. The author also provides a smaller file for practice: https://www.dropbox.com/s/n2skgloqoadpa30/Posts.small.xml?dl=0. (Note that both sites are inaccessible from mainland China, so I have put the second, smaller file on a cloud drive for download: http://pan.baidu.com/s/1jGJFtQI. For the first file you will have to get over the wall yourself.)
The next task is to set up the Spark environment. The original example runs on Spark 1.5.1; I actually used Spark 1.5.2 with Hadoop 2.6. For the installation steps, see my other post: "hadoop集群的搭建脚本及构思(N):一个简化的Hadoop+Spark on Yarn集群快速搭建" (Hadoop cluster setup scripts and notes (N): a simplified quick build of a Hadoop + Spark on YARN cluster).
The original article is a series of Scala commands run inside spark-shell. Out of a software engineer's preference, I moved them into an Eclipse Scala project, and ran into quite a few pitfalls along the way. The code follows; every place that had to be changed is explained in a comment.
The project is then exported as a jar and run with spark-submit. I also hit one rather odd problem along the way, which has since been solved; see my other post.
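For reference, the submit command takes roughly the following form (the jar file name and the YARN master setting are my own assumptions, not something given in the original article):

spark-submit --class postsClassifier --master yarn-cluster postsClassifier.jar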
import scala.xml._ // must not be commented out: XML.loadString below depends on it
// Spark data manipulation libraries
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark._
// Spark machine learning libraries
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.Pipeline
object postsClassifier {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("BinaryClassifier"); //need to create SparkConf for SparkContext
val sc = new SparkContext(conf); //need to change the sc in shell to SparkContext in IDE
val fileName = "hdfs://Master1:9000/xml/Posts.xml"
val textFile = sc.textFile(fileName)
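// Drop the XML declaration and the <posts>/</posts> root tags so that each remaining line is a single <row .../> element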
val postsXml = textFile.map(_.trim).
filter(!_.startsWith("<?xml version=")).
filter(_ != "<posts>").
filter(_ != "</posts>")
val postsRDD = postsXml.map { s =>
val xml = XML.loadString(s)
val id = (xml \ "@Id").text
val tags = (xml \ "@Tags").text
val title = (xml \ "@Title").text
val body = (xml \ "@Body").text
val bodyPlain = ("<\\S+>".r).replaceAllIn(body, " ")
val text = (title + " " + bodyPlain).replaceAll("\n",
" ").replaceAll("( )+", " ");
Row(id, tags, text)
}
val schemaString = "Id Tags Text"
val schema = StructType(
schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
val sqlContext = new SQLContext(sc) //need to change the sqlContext in shell to SQLContext in IDE
val postsDf = sqlContext.createDataFrame(postsRDD, schema)
postsDf.show()
val targetTag = "java"
val myudf: (String => Double) = (str: String) =>
{if (str.contains(targetTag)) 1.0 else 0.0}
val sqlfunc = udf(myudf)
val postsLabeled = postsDf.withColumn("Label", sqlfunc(col("Tags")) )
val positive = postsLabeled.filter("Label > 0.0") // the spark-shell original used the 'Label column syntax here; in a compiled project pass the whole expression as a string
val negative = postsLabeled.filter("Label < 1.0") // same fix: enclose the whole expression in double quotes instead of the leading ' used in the original code
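// Sample 90% of each class (without replacement) for training; the remaining 10% is recovered below via a left-outer anti-join on Id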
val positiveTrain = positive.sample(false, 0.9)
val negativeTrain = negative.sample(false, 0.9)
val training = positiveTrain.unionAll(negativeTrain)
val negativeTrainTmp = negativeTrain
.withColumnRenamed("Label", "Flag").select("Id", "Flag") //need to enclose the whole express with "", not just a ' in one side as the original codes
val negativeTest = negative.join(negativeTrainTmp, negative("Id") === negativeTrainTmp("Id"), "leftouter") //need to change double == to triple ===
.filter("Flag is null")
.select(negative("Id"), negative("Tags"), negative("Text"), negative("Label")) //need to add dataframe name to all column names
val positiveTrainTmp = positiveTrain
.withColumnRenamed("Label", "Flag")
.select("Id", "Flag")
val positiveTest = positive.join( positiveTrainTmp, positive("Id") === positiveTrainTmp("Id"), "leftouter") //need to change double == to triple ===
.filter("Flag is null")
.select(positive("Id"), positive("Tags"), positive("Text"), positive("Label")) //need to add dataframe name to all column names
val testing = negativeTest.unionAll(positiveTest)
val numFeatures = 64000
val numEpochs = 30
val regParam = 0.02
val tokenizer = new Tokenizer().setInputCol("Text")
.setOutputCol("Words")
val hashingTF = new org.apache.spark.ml.feature.HashingTF()
.setNumFeatures(numFeatures)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("Features")
val lr = new LogisticRegression().setMaxIter(numEpochs)
.setRegParam(regParam).setFeaturesCol("Features")
.setLabelCol("Label").setRawPredictionCol("Score")
.setPredictionCol("Prediction")
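// Chain the three stages: tokenizer -> hashed term frequencies -> logistic regression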
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val testTitle =
"Easiest way to merge a release into one JAR file"
val testText = testTitle + """Is there a tool or script which easily merges a bunch of href="http:/en.wikipedia.org/wiki/JAR_%28file_format %29" JAR files into one JAR file? A bonus would be to easily set the main-file manifest and make it executable. I would like to run it with something like: As far as I can tell, it has no dependencies which indicates that it shouldn't be an easy single-file tool, but the downloaded ZIP file contains a lot of libraries."""
val testDF = sqlContext
.createDataFrame(Seq( (99.0, testText)))
.toDF("Label", "Text")
val result = model.transform(testDF)
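// Column 6 of the transformed row is the "Prediction" column (after Label, Text, Words, Features, Score and probability)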
val prediction = result.collect()(0)(6)
.asInstanceOf[Double]
printf("Prediction: "+ prediction)
val testingResult = model.transform(testing)
val testingResultScores = testingResult
.select("Prediction", "Label").rdd
.map(r => (r(0).asInstanceOf[Double], r(1)
.asInstanceOf[Double]))
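// BinaryClassificationMetrics expects (score, label) pairs; following the original article, the hard 0/1 prediction is used as the score here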
val bc =
new BinaryClassificationMetrics(testingResultScores)
val roc = bc.areaUnderROC
printf("Area under the ROC:" + roc)
}
}