MLlib is Spark's machine learning library. It is designed to run in parallel on a cluster, and its design philosophy is simple: represent the data as RDDs, then call the algorithms on those distributed datasets. You can think of it as a collection of functions that can be invoked on RDDs. MLlib contains only parallel algorithms that run well across a cluster, so it is suited to large datasets. If instead you need to train many machine learning models on many small datasets, it is better to run a single-node machine learning library on each node, for example by distributing those training tasks across the cluster with Spark's map operation.
# Python spam classifier
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features
tf = HashingTF(numFeatures=10000)
# Split each email into words, then map each word list to a feature vector
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Build LabeledPoint datasets for the positive (spam) and negative (normal) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # Logistic regression is iterative, so cache the training data

# Train a logistic regression model using SGD
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive (spam) and a negative (normal) example
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print(model.predict(posTest))
print(model.predict(negTest))
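HashingTF implements the hashing trick: each word is hashed to an index modulo the feature count, and the value at that index counts the word's occurrences. A minimal pure-Python sketch of this idea (no Spark required; the function name `hashing_tf` and the use of Python's built-in `hash` are illustrative assumptions, not MLlib's actual hash function):

```python
def hashing_tf(words, num_features=10000):
    """Map a list of words to a sparse term-frequency dict using the
    hashing trick: index = hash(word) mod num_features (illustrative only)."""
    vec = {}
    for word in words:
        idx = hash(word) % num_features
        # Accumulate the term frequency at the hashed index
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

features = hashing_tf("get cheap stuff get stuff".split(" "))
```

Because the index is derived from a hash rather than a learned vocabulary, no dictionary needs to be built or broadcast, which is what makes this transform cheap to apply in parallel on an RDD.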
Vectors
A dense vector stores a value for every dimension; a sparse vector stores only the nonzero entries.
# Python
from numpy import array
from pyspark.mllib.linalg import Vectors

# Create the dense vector <1.0, 2.0, 3.0>
denseVec1 = array([1.0, 2.0, 3.0])  # NumPy arrays can be passed directly to MLlib
denseVec2 = Vectors.dense([1.0, 2.0, 3.0])  # or use the Vectors class
# Create the sparse vector <1.0, 0.0, 2.0, 0.0>; supply the size (4) and the
# nonzero entries as either a dictionary or two lists of indices and values
sparseVec1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})
sparseVec2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
// Java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create the dense vector <1.0, 2.0, 3.0>
Vector denseVec1 = Vectors.dense(1.0, 2.0, 3.0);
Vector denseVec2 = Vectors.dense(new double[] {1.0, 2.0, 3.0});
// Create the sparse vector <1.0, 0.0, 2.0, 0.0>; supply the size (4) plus
// the indices and values of the nonzero entries
Vector sparseVec1 = Vectors.sparse(4, new int[]{0, 2}, new double[]{1.0, 2.0});
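The relationship between the two representations can be made concrete with a small sketch: a sparse vector is just a size plus parallel arrays of indices and values, and expanding it fills the remaining positions with zeros (the helper name `sparse_to_dense` is an illustrative assumption, not an MLlib API):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse vector (size, indices, values) into a dense list,
    mirroring how Vectors.sparse(4, [0, 2], [1.0, 2.0]) corresponds to
    Vectors.dense([1.0, 0.0, 2.0, 0.0])."""
    dense = [0.0] * size          # all dimensions default to zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the stored nonzero entries
    return dense

sparse_to_dense(4, [0, 2], [1.0, 2.0])  # -> [1.0, 0.0, 2.0, 0.0]
```

For high-dimensional feature vectors such as the 10,000-feature HashingTF output above, where most entries are zero, the sparse form saves both memory and computation.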