Straight to the data formats, the problems, and the code. We'll go through it in Java first, then do the same exercises again in Scala.
I. Data preparation and format overview
1. Ratings data. Each line is UserId::MovieId::Rating::Timestamp. File: ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
2. Movie data. Each line is MovieID::Title::Genres. File: movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
3. User data. Each line is UserId::Gender::Age::Occupation::Zip-code. File: users.dat. (A small parsing sketch follows the samples below.)
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
6::F::50::9::55117
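All three files share the same "::"-delimited layout, so a single String.split is enough to pull the fields apart (in Java, split takes a regex, but "::" contains no regex metacharacters and can be passed as-is). Below is a minimal parsing sketch for a ratings line; the Rating class is purely illustrative and does not appear in the solution code that follows.

// Hypothetical helper, for illustration only: parse one ratings line
// such as "1::1193::5::978300760" into its four fields.
final class Rating {
    final String userId;
    final String movieId;
    final int rating;
    final long timestamp;

    Rating(String line) {
        String[] f = line.split("::");
        this.userId = f[0];
        this.movieId = f[1];
        this.rating = Integer.parseInt(f[2]);
        this.timestamp = Long.parseLong(f[3]);
    }
}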
II. Problems to solve
1. Print the titles and average ratings of the ten highest-rated movies.
Solution code (tested):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.List;

/**
 * A first taste of Spark:
 * 1. The ten highest-rated movies
 * 2. Their average ratings
 * 3. Together with the movie titles
 */
public class Spark_Analyzer {
    public static void main(String[] args) {
        String basicPath = "spark-demo/src/main/resources/";
        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("local");
        sparkConf.setAppName("Spark_Analyzer");
        // workaround so the local driver passes Spark's minimum-memory check
        sparkConf.set("spark.testing.memory", "2147480000");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        javaSparkContext.setLogLevel("trace");

        // users.dat is only needed for the later problems; it is loaded here for completeness
        JavaRDD<String> userRdd = javaSparkContext.textFile(basicPath + "users.dat");
        JavaRDD<String> moviesRdd = javaSparkContext.textFile(basicPath + "movies.dat", 3);
        JavaRDD<String> ratingsRdd = javaSparkContext.textFile(basicPath + "ratings.dat", 3);

        // (movieId, title), cached because it is reused in the join below
        JavaPairRDD<String, String> movieCache = moviesRdd
                .mapToPair(line -> new Tuple2<>(line.split("::")[0], line.split("::")[1]))
                .cache();

        // Map every rating line to (movieId, (score, 1))
        JavaPairRDD<String, Tuple2<Integer, Integer>> ratingPairs = ratingsRdd.mapToPair(line -> {
            String[] split = line.split("::");
            return new Tuple2<>(split[1], new Tuple2<>(Integer.parseInt(split[2]), 1));
        });
        // Per movie: add up the scores and the counts -> (movieId, (ratingSum, ratingCount))
        JavaPairRDD<String, Tuple2<Integer, Integer>> ratingSums =
                ratingPairs.reduceByKey((t1, t2) -> new Tuple2<>(t1._1() + t2._1(), t1._2() + t2._2()));

        // (movieId, averageRating)
        JavaPairRDD<String, Double> avgRatings = ratingSums
                .mapToPair(t -> new Tuple2<>(t._1(), (double) t._2()._1() / t._2()._2()));

        // Bring in the titles, key by average rating, sort descending and take the top ten
        JavaPairRDD<Double, String> ratingToTitle = avgRatings
                .join(movieCache)                                        // (movieId, (avgRating, title))
                .mapToPair(t -> new Tuple2<>(t._2()._1(), t._2()._2())); // (avgRating, title)
        List<Tuple2<Double, String>> top10 = ratingToTitle.sortByKey(false).take(10);

        for (Tuple2<Double, String> t : top10) {
            System.out.println(t._2() + " : " + t._1());
        }

        javaSparkContext.stop();
    }
}
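A note on the aggregation above: mapping each rating to (score, 1) and adding both fields with reduceByKey yields the per-movie sum and count in a single pass, without collecting all of a movie's ratings in memory the way groupByKey would. In practice, a movie with a single 5-star vote averages 5.0 and can crowd out genuinely popular titles, so it is common to require a minimum number of votes before ranking. Here is a small sketch of that filter, reusing the ratingSums pair RDD from the program above; the 50-vote threshold is an arbitrary illustration, not something from the original exercise.

// Illustrative refinement, not part of the original solution: keep only movies
// with at least 50 ratings before averaging, so rarely rated titles do not
// dominate the top ten. Reuses ratingSums = (movieId, (ratingSum, ratingCount)).
JavaPairRDD<String, Double> popularAvgRatings = ratingSums
        .filter(t -> t._2()._2() >= 50)
        .mapToPair(t -> new Tuple2<>(t._1(), (double) t._2()._1() / t._2()._2()));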
