MLlib is Spark's machine learning library. It is designed to run in parallel on a cluster, and its design philosophy is simple: represent the data as RDDs, then call the algorithms on those distributed datasets. You can think of it as a collection of functions that can be invoked on RDDs. MLlib contains only parallel algorithms that run well across a cluster, so it is suited to large datasets. If instead you need to train many machine learning models on many small datasets, it is better to run a single-node machine learning library on each node, for example by distributing those training tasks across the cluster with Spark's map operation.
# Python spam classifier
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features
tf = HashingTF(numFeatures=10000)
# Split each email into words, then map each word list to a feature vector
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Build LabeledPoint datasets for the positive (spam) and negative (normal) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # Logistic regression is iterative, so cache the training data

# Train a logistic regression model using SGD
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive (spam) and a negative (normal) example
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print(model.predict(posTest))
print(model.predict(negTest))
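HashingTF implements the hashing trick: each word is hashed to an index modulo the feature count, and the value at that index counts the word's occurrences. A minimal pure-Python sketch of this idea (no Spark required; the function name `hashing_tf` and the use of Python's built-in `hash` are illustrative assumptions, not MLlib's actual hash function):

```python
def hashing_tf(words, num_features=10000):
    """Map a list of words to a sparse term-frequency dict using the
    hashing trick: index = hash(word) mod num_features (illustrative only)."""
    vec = {}
    for word in words:
        idx = hash(word) % num_features
        # Accumulate the term frequency at the hashed index
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

features = hashing_tf("get cheap stuff get stuff".split(" "))
```

Because the index is derived from a hash rather than a learned vocabulary, no dictionary needs to be built or broadcast, which is what makes this transform cheap to apply in parallel on an RDD.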
Vectors
A dense vector stores a value for every dimension; a sparse vector stores only the nonzero entries.
# Python
from numpy import array
from pyspark.mllib.linalg import Vectors

# Create the dense vector <1.0, 2.0, 3.0>
denseVec1 = array([1.0, 2.0, 3.0])  # NumPy arrays can be passed directly to MLlib
denseVec2 = Vectors.dense([1.0, 2.0, 3.0])  # or use the Vectors class
# Create the sparse vector <1.0, 0.0, 2.0, 0.0>; supply the size (4) and the
# nonzero entries as either a dictionary or two lists of indices and values
sparseVec1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})
sparseVec2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
// Java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create the dense vector <1.0, 2.0, 3.0>
Vector denseVec1 = Vectors.dense(1.0, 2.0, 3.0);
Vector denseVec2 = Vectors.dense(new double[] {1.0, 2.0, 3.0});
// Create the sparse vector <1.0, 0.0, 2.0, 0.0>; supply the size (4) plus
// the indices and values of the nonzero entries
Vector sparseVec1 = Vectors.sparse(4, new int[]{0, 2}, new double[]{1.0, 2.0});
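The relationship between the two representations can be made concrete with a small sketch: a sparse vector is just a size plus parallel arrays of indices and values, and expanding it fills the remaining positions with zeros (the helper name `sparse_to_dense` is an illustrative assumption, not an MLlib API):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse vector (size, indices, values) into a dense list,
    mirroring how Vectors.sparse(4, [0, 2], [1.0, 2.0]) corresponds to
    Vectors.dense([1.0, 0.0, 2.0, 0.0])."""
    dense = [0.0] * size          # all dimensions default to zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the stored nonzero entries
    return dense

sparse_to_dense(4, [0, 2], [1.0, 2.0])  # -> [1.0, 0.0, 2.0, 0.0]
```

For high-dimensional feature vectors such as the 10,000-feature HashingTF output above, where most entries are zero, the sparse form saves both memory and computation.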