Another article adapted from KDnuggets: Dmitry Petrov of Microsoft explains how to use Spark ML to work with data that exceeds the size of memory. Link to the original.
The focus here is on data far larger than a single machine's memory. Analysis like this used to be done on a distributed system (Hadoop, for example), while this article shows how to do it on a single machine with Spark. Since I did a fair amount of migration work myself, I'm counting this post as original.
The goal of the example in this article is to build a predictive model that predicts a post's tag from its title and body. To keep the code simple, the article combines these two fields into a single text column rather than handling them separately. (Translator's note: the words in the title obviously deserve a higher weight when predicting tags, so in real work the two columns should be treated separately.)
It is easy to see the value of such a model for a site like stackoverflow.com: as the user types a question, the site automatically suggests tags. We assume we want as many correct tags as possible, letting the user delete the irrelevant ones. Under that assumption, recall becomes the most important metric for judging the model.
The first step is to find such a dataset. The article uses the Stack Overflow Posts.xml file from the Stack Exchange data dump on archive.org: https://archive.org/details/stackexchange. The author also provides a smaller file for practice: https://www.dropbox.com/s/n2skgloqoadpa30/Posts.small.xml?dl=0. (Note that both sites are inaccessible from mainland China, so I have put the second, smaller file on a cloud drive for download: http://pan.baidu.com/s/1jGJFtQI. For the first file you will have to get over the wall yourself.)
The next task is to set up the Spark environment. The original example runs on Spark 1.5.1; I actually used Spark 1.5.2 with Hadoop 2.6. For the installation steps, see my other post: "hadoop集群的搭建脚本及构思(N):一个简化的Hadoop+Spark on Yarn集群快速搭建" (Hadoop cluster setup scripts and notes (N): a simplified quick build of a Hadoop + Spark on YARN cluster).
The original article is a series of Scala commands run inside spark-shell. Out of a software engineer's preference, I moved them into an Eclipse Scala project, and ran into quite a few pitfalls along the way. The code follows; every place that had to be changed is explained in a comment.
The project is then exported as a jar and run with spark-submit. I also hit one rather odd problem along the way, which has since been solved; see my other post.
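For reference, the submit command takes roughly the following form (the jar file name and the YARN master setting are my own assumptions, not something given in the original article):

spark-submit --class postsClassifier --master yarn-cluster postsClassifier.jar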
import scala.xml._ // must not be commented out: XML.loadString below depends on it
// Spark data manipulation libraries
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark._
// Spark machine learning libraries
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.Pipeline
object postsClassifier {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("BinaryClassifier"); //need to create SparkConf for SparkContext
val sc = new SparkContext(conf); //need to change the sc in shell to SparkContext in IDE
val fileName = "hdfs://Master1:9000/xml/Posts.xml"
val textFile = sc.textFile(fileName)
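// Drop the XML declaration and the <posts>/</posts> root tags so that each remaining line is a single <row .../> element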
val postsXml = textFile.map(_.trim).
filter(!_.startsWith("<?xml version=")).
filter(_ != "<posts>").
filter(_ != "</posts>")
val postsRDD = postsXml.map { s =>
val xml = XML.loadString(s)
val id = (xml \ "@Id").text
val tags = (xml \ "@Tags").text
val title = (xml \ "@Title").text
val body = (xml \ "@Body").text
val bodyPlain = ("<\\S+>".r).replaceAllIn(body, " ")
val text = (title + " " + bodyPlain).replaceAll("\n",
" ").replaceAll("( )+", " ");
Row(id, tags, text)
}
val schemaString = "Id Tags Text"
val schema = StructType(
schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
val sqlContext = new SQLContext(sc) //need to change the sqlContext in shell to SQLContext in IDE
val postsDf = sqlContext.createDataFrame(postsRDD, schema)
postsDf.show()
val targetTag = "java"
val myudf: (String => Double) = (str: String) =>
{if (str.contains(targetTag)) 1.0 else 0.0}
val sqlfunc = udf(myudf)
val postsLabeled = postsDf.withColumn("Label", sqlfunc(col("Tags")) )
val positive = postsLabeled.filter("Label > 0.0") // the spark-shell original used the 'Label column syntax here; in a compiled project pass the whole expression as a string
val negative = postsLabeled.filter("Label < 1.0") // same fix: enclose the whole expression in double quotes instead of the leading ' used in the original code
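// Sample 90% of each class (without replacement) for training; the remaining 10% is recovered below via a left-outer anti-join on Id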
val positiveTrain = positive.sample(false, 0.9)
val negativeTrain = negative.sample(false, 0.9)
val training = positiveTrain.unionAll(negativeTrain)
val negativeTrainTmp = negativeTrain
.withColumnRenamed("Label", "Flag").select("Id", "Flag") //need to enclose the whole express with "", not just a ' in one side as the original codes
val negativeTest = negative.join(negativeTrainTmp, negative("Id") === negativeTrainTmp("Id"), "leftouter") //need to change double == to triple ===
.filter("Flag is null")
.select(negative("Id"), negative("Tags"), negative("Text"), negative("Label")) //need to add dataframe name to all column names
val positiveTrainTmp = positiveTrain
.withColumnRenamed("Label", "Flag")
.select("Id", "Flag")
val positiveTest = positive.join( positiveTrainTmp, positive("Id") === positiveTrainTmp("Id"), "leftouter") //need to change double == to triple ===
.filter("Flag is null")
.select(positive("Id"), positive("Tags"), positive("Text"), positive("Label")) //need to add dataframe name to all column names
val testing = negativeTest.unionAll(positiveTest)
val numFeatures = 64000
val numEpochs = 30
val regParam = 0.02
val tokenizer = new Tokenizer().setInputCol("Text")
.setOutputCol("Words")
val hashingTF = new org.apache.spark.ml.feature.HashingTF()
.setNumFeatures(numFeatures)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("Features")
val lr = new LogisticRegression().setMaxIter(numEpochs)
.setRegParam(regParam).setFeaturesCol("Features")
.setLabelCol("Label").setRawPredictionCol("Score")
.setPredictionCol("Prediction")
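// Chain the three stages: tokenizer -> hashed term frequencies -> logistic regression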
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val testTitle =
"Easiest way to merge a release into one JAR file"
val testText = testTitle + """Is there a tool or script which easily merges a bunch of href="http:/en.wikipedia.org/wiki/JAR_%28file_format %29" JAR files into one JAR file? A bonus would be to easily set the main-file manifest and make it executable. I would like to run it with something like: As far as I can tell, it has no dependencies which indicates that it shouldn't be an easy single-file tool, but the downloaded ZIP file contains a lot of libraries."""
val testDF = sqlContext
.createDataFrame(Seq( (99.0, testText)))
.toDF("Label", "Text")
val result = model.transform(testDF)
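// Column 6 of the transformed row is the "Prediction" column (after Label, Text, Words, Features, Score and probability)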
val prediction = result.collect()(0)(6)
.asInstanceOf[Double]
printf("Prediction: "+ prediction)
val testingResult = model.transform(testing)
val testingResultScores = testingResult
.select("Prediction", "Label").rdd
.map(r => (r(0).asInstanceOf[Double], r(1)
.asInstanceOf[Double]))
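// BinaryClassificationMetrics expects (score, label) pairs; following the original article, the hard 0/1 prediction is used as the score here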
val bc =
new BinaryClassificationMetrics(testingResultScores)
val roc = bc.areaUnderROC
printf("Area under the ROC:" + roc)
}
}