Spark MLlib Spam Classification Example

License: CC 4.0 BY-SA

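This article works through the machine learning material in 《Spark快速大数据分析》 (Learning Spark). Using a self-prepared dataset, it applies the Spark MLlib library to an email classification task and tries both logistic regression and SVM.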

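This article puts the Spark machine learning material from 《Spark快速大数据分析》 into practice. Most of the code comes from that book's example code; the only differences are that I prepared my own data and actually ran it.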

Data for this article: https://download.youkuaiyun.com/download/wiborgite/10739730

The code is written in Scala and can be run directly in spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.classification.SVMWithSGD

val spam = sc.textFile("/tmp/spam.txt")
val normal = sc.textFile("/tmp/normal.txt")


// Create a HashingTF instance that maps email text to vectors of 10,000 features
val tf = new HashingTF(numFeatures = 10000)
// Each email is split into words, and each word is mapped to one feature
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))
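
To make the "each word is mapped to one feature" step more concrete, here is a minimal plain-Scala sketch of the hashing trick. It is only an illustration: `bucket` and `termFrequencies` are hypothetical helpers, not part of Spark, and the hash function Spark uses may differ by version, so the indices will generally not match those produced by `tf.transform`.

```scala
// Illustrative stand-in for the idea behind HashingTF; NOT Spark's implementation.

// Map a term to a bucket index in [0, numFeatures) using its hash code.
def bucket(term: String, numFeatures: Int): Int = {
  val raw = term.hashCode % numFeatures
  if (raw < 0) raw + numFeatures else raw // keep the index non-negative
}

// Count how many terms land in each bucket. Only non-empty buckets are stored,
// which mirrors the idea of a sparse term-frequency vector.
def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
  terms.groupBy(t => bucket(t, numFeatures)).map {
    case (index, ts) => index -> ts.size.toDouble
  }

// "cheap" appears twice, so its bucket gets a count of at least 2.0.
val tfSketch = termFrequencies("cheap cheap stuff".split(" ").toSeq, 10000)
```

Different words can collide in the same bucket. A larger `numFeatures` makes collisions rarer at the cost of longer (though still sparse) vectors.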


// Create LabeledPoint datasets for the positive (spam) and negative (normal email) examples
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Cache the training RDD because logistic regression is an iterative algorithm


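// Run logistic regression using the SGD algorithm
// (LogisticRegressionWithSGD is deprecated since Spark 2.0; newer versions may need LogisticRegressionWithLBFGS instead)
val lrModel = new LogisticRegressionWithSGD().run(trainingData)


// Train an SVM on the same data; a separate name keeps both models available
val svmModel = new SVMWithSGD().run(trainingData)


// Test both models on a positive (spam) and a negative (normal email) example
val posTest = tf.transform(
  "O M G GET cheap stuff by sending money to ...".split(" "))
val negTest = tf.transform(
  "Hi Dad, I started studying Spark the other ...".split(" "))
println("LR prediction for positive test example: " + lrModel.predict(posTest))
println("LR prediction for negative test example: " + lrModel.predict(negTest))
println("SVM prediction for positive test example: " + svmModel.predict(posTest))
println("SVM prediction for negative test example: " + svmModel.predict(negTest))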
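
Two hand-written sentences only give a rough impression of how the algorithms compare. One possible way to compare them more systematically is to hold out part of the data (for example with `RDD.randomSplit`), collect (prediction, label) pairs on the held-out part, and summarize them. The sketch below shows only the summarizing step in plain Scala. `BinaryMetrics` and `evaluate` are hypothetical names introduced here, and the sample pairs are made up for illustration; they are not results from the models above.

```scala
// Hypothetical helper for summarizing (prediction, label) pairs,
// where 1.0 means spam and 0.0 means normal email.
case class BinaryMetrics(accuracy: Double, precision: Double, recall: Double)

def evaluate(predictionAndLabels: Seq[(Double, Double)]): BinaryMetrics = {
  val tp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 1.0 } // spam caught
  val fp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 0.0 } // normal flagged as spam
  val fn = predictionAndLabels.count { case (p, l) => p == 0.0 && l == 1.0 } // spam missed
  val tn = predictionAndLabels.count { case (p, l) => p == 0.0 && l == 0.0 } // normal passed

  // Guard against division by zero when a denominator is empty.
  def ratio(num: Int, den: Int): Double = if (den == 0) 0.0 else num.toDouble / den

  BinaryMetrics(
    accuracy = ratio(tp + tn, predictionAndLabels.size),
    precision = ratio(tp, tp + fp),
    recall = ratio(tp, tp + fn)
  )
}

// Made-up pairs purely for illustration.
val samplePairs = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
val sampleMetrics = evaluate(samplePairs)
```

Running the same function on the pairs produced by each model would give numbers that can be compared side by side.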