My company has recently been working on a recommender system, so I spent some time studying the topic. It is powerful stuff, but most companies don't really need one, and the data requirements are high (at least ~200,000 ratings, otherwise the results aren't very accurate).
Below are my notes on the movie recommendation example that ships with Spark:
Algorithm used: ALS (Alternating Least Squares)
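For context (my own summary, not part of the example): ALS factorizes the sparse rating matrix into user factors $x_u$ and movie factors $y_i$, alternately solving a regularized least-squares problem for one side while the other is held fixed. The objective is commonly written as

$$\min_{X,Y} \sum_{(u,i)\ \mathrm{observed}} \left(r_{ui} - x_u^{\top} y_i\right)^2 + \lambda\left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

where $\lambda$ is the regularization parameter passed to ALS.train below.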
Movie list (movie ID, title, genres)
User ratings list (user ID, movie ID, rating, timestamp)
Test data to generate recommendations from
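All three files use the MovieLens "::"-delimited format. A hypothetical sample (the values are illustrative; note the test file uses user ID 0, which is what the prediction step below assumes):

movies.dat    1::Toy Story (1995)::Animation|Children's|Comedy
ratings.dat   1::1193::5::978300760          (userID::movieID::rating::timestamp)
test.dat      0::1193::4::978300760          (same layout; user 0 is "me")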
step1: Load the data
// Imports needed for the snippets below to compile as one script
import java.io.File
import scala.io.Source
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

Logger.getLogger("org.apache.spark").setLevel(Level.WARN) // set the log level
Logger.getLogger("org.apache.eclipse.jetty.server").setLevel(Level.OFF)
val sparkConf = new SparkConf().setAppName("MovieLensALS").setMaster("local[5]").set("spark.executor.memory", "2g")
val sc = new SparkContext(sparkConf)
// Load the test data, keep only ratings greater than 0, and convert to Rating(user, movie, rating)
val myRatings = loadRatings("F:/datafile/test.dat")
val myRatingsRDD = sc.parallelize(myRatings, 1)
def loadRatings(path: String): Seq[Rating] = {
  val lines = Source.fromFile(path).getLines()
  // Materialize to a Seq so the single-pass iterator is safe to test and return
  val ratings = lines.map { line =>
    val fields = line.split("::")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
  }.filter(_.rating > 0.0).toSeq
  if (ratings.isEmpty) {
    sys.error("No ratings provided.")
  } else {
    ratings
  }
}
// Load the user ratings list and the movie list
val ratings = sc.textFile(new File("F:/datafile/ratings.dat").toString).map { line =>
  val fields = line.split("::")
  // key = timestamp mod 10, used below to split the data into buckets
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
val movies = sc.textFile(new File("F:/datafile/movies.dat").toString).map { line =>
  val fields = line.split("::")
  (fields(0).toInt, fields(1)) // (movieID, title)
}.collect().toMap // small enough to collect to the driver as a Map
step2: Inspect the data
User ratings list:
val numRatings = ratings.count() // total number of ratings
val numUsers = ratings.map(_._2.user).distinct().count() // number of distinct users
val numMovies = ratings.map(_._2.product).distinct().count() // number of distinct movies
println("Got " + numRatings + " ratings from "
+ numUsers + " users on " + numMovies + " movies.")
step3: Split the dataset
Split the ratings by the timestamp-mod-10 key into three parts: training (60%, keys 0-5), validation (20%, keys 6-7), and test (20%, keys 8-9), and cache each part.
val numPartitions = 4 // number of partitions
val training = ratings.filter(x => x._1 < 6).values.union(myRatingsRDD).repartition(numPartitions).persist() // the 60% slice, unioned with my own ratings
val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8).values.repartition(numPartitions).persist() // validation set
val test = ratings.filter(x => x._1 >= 8).values.persist() // test set
val numTraining = training.count() // training set size
val numValidation = validation.count() // validation set size
val numTest = test.count() // test set size
println("Training: " + numTraining + " validation: " + numValidation + " test: " + numTest)
step4: Configure parameters and train
val ranks = List(8, 12) // number of latent factors
val lambdas = List(0.1, 10.0) // regularization parameter
val numIters = List(10, 20) // ALS iterations
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
// Root mean squared error (RMSE)
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  // Use the model to predict a rating for every (user, movie) pair in the data
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  // Join predictions with the actual ratings, keyed by (user, movie)
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating))).values
  // Root mean squared error between predicted and actual ratings
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
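In formula terms, with $\hat{r}_{ui}$ the predicted and $r_{ui}$ the actual rating over the $n$ joined pairs, the function computes

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{(u,i)}\left(\hat{r}_{ui} - r_{ui}\right)^2}$$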
// Grid-search for the best model
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  println("RMSE (validation) = " + validationRmse + " for the model trained with rank = " + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
  // Keep the model with the smallest validation RMSE (i.e., the best one)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}
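The full Spark example also evaluates the selected model on the held-out test set; a minimal sketch reusing computeRmse from above (it assumes the grid search left bestModel non-empty):

// Measure generalization on data the grid search never saw
val testRmse = computeRmse(bestModel.get, test, numTest)
println("The best model was trained with rank = " + bestRank + ", lambda = " + bestLambda
  + ", numIter = " + bestNumIter + ", and its test RMSE is " + testRmse + ".")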
step5: Recommend movies
Exclude the movies already rated in my test data, then have the model recommend the top 10.
val myRatedMovieIds = myRatings.map(_.product).toSet
val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq) // drop movies I have already rated
val recommendations = bestModel.get
  .predict(candidates.map((0, _))) // predict user 0's rating for every candidate movie
  .collect()
  .sortBy(-_.rating)
  .take(10)
// One weak point here: user 0 is hard-coded, so every run recommends for the same user; different users should get different recommendations (see the sketch after the printing loop below)
var i = 1
println("Movies recommended for you:")
recommendations.foreach { r =>
println("%2d".format(i) + ": " + movies(r.product))
i += 1
} // print the recommended movies
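To address the point above, here is a minimal sketch of per-user recommendations. It relies on MatrixFactorizationModel.recommendProducts(user, num), which returns the top num predicted Ratings for a user present in the training data; the user ID passed in is a placeholder to replace with a real ID from ratings.dat:

// Recommend up to 10 unseen movies for one specific user
def recommendFor(model: MatrixFactorizationModel, targetUserId: Int): Unit = {
  // Movies this user has already rated, so we can exclude them
  val alreadyRated = ratings.values.filter(_.user == targetUserId).map(_.product).collect().toSet
  val top = model.recommendProducts(targetUserId, 20) // over-fetch, then drop already-seen movies
    .filter(r => !alreadyRated.contains(r.product))
    .take(10)
  println("Movies recommended for user " + targetUserId + ":")
  top.zipWithIndex.foreach { case (r, idx) =>
    println("%2d".format(idx + 1) + ": " + movies(r.product))
  }
}
recommendFor(bestModel.get, 1) // 1 is a hypothetical user ID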
Overall the example is well written and packs in a lot. Its shortcoming, as noted above: it should recommend movies you would like based on the movies you personally have watched, per user.