- Course: 尚硅谷 Big Data Project Tutorial (Hands-On E-commerce Recommendation System)
- Companion materials and VM image for the 尚硅谷 e-commerce recommendation system
  Link: https://pan.baidu.com/s/1iSMqV2wPkEfIsO1FrkxRNQ?pwd=1996
  Access code: 1996
- 1. System Architecture Design (E-commerce Recommendation System)
- 2. Environment Setup (E-commerce Recommendation System)
- 3. Creating the Project and Initializing Business Data (E-commerce Recommendation System)
- 4. Building the Offline Recommendation Service (E-commerce Recommendation System)
- 5. Building the Real-Time Recommendation Service (E-commerce Recommendation System)
- 6. Handling the Cold-Start Problem (E-commerce Recommendation System)
- 7. Content-Based Similarity Recommendation and Item-Based Collaborative Filtering
- 8. 尚硅谷 E-commerce Recommendation System Preview
I. Offline Recommendation Service
The offline recommendation service works over all of the users' historical data, periodically running the configured offline statistics and recommendation algorithms and persisting the results. Within a given period the computed results stay fixed; how often they change depends on how often the jobs are scheduled.
The offline service mainly computes metrics that can be precomputed, providing data support for real-time computation and front-end business responses.
It consists of three parts: statistics-based recommendation, collaborative filtering based on a latent factor model, and similarity recommendation based on content and on Item-CF. This chapter covers the first two; content-based and Item-CF recommendation are similar in overall structure and implementation, and we will cover them in detail later.
II. Offline Statistics Service
1. Overall Framework of the Statistics Service
The statistics service computes:
- Historically popular products
- Recently popular products
- High-quality (top-rated) products
Under recommender, create a new sub-module StatisticsRecommender. Its pom.xml only needs the spark, scala, and mongodb related dependencies:
<dependencies>
    <!-- Spark dependencies -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
    </dependency>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
    </dependency>
    <!-- MongoDB driver -->
    <!-- Used to connect to MongoDB from code -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>casbah-core_2.11</artifactId>
        <version>${casbah.version}</version>
    </dependency>
    <!-- Connector between Spark and MongoDB -->
    <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.11</artifactId>
        <version>${mongodb-spark.version}</version>
    </dependency>
</dependencies>
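The `${casbah.version}` and `${mongodb-spark.version}` properties are expected to be defined in the parent pom. A minimal sketch, assuming the versions commonly used with this Spark/Scala 2.11 setup (adjust to match your own parent pom):

```xml
<properties>
    <casbah.version>3.1.1</casbah.version>
    <mongodb-spark.version>2.0.0</mongodb-spark.version>
</properties>
```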
Under the resources folder add log4j.properties, then under src/main/scala create the Scala singleton object com.atguigu.statistics.StatisticsRecommender.
package com.atguigu.statistics

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.text.SimpleDateFormat
import java.util.Date

/*
 * Rating dataset, e.g.:
 * 4867       user ID
 * 457976     product ID
 * 5.0        score
 * 1395676800 timestamp
 */
// Case class for a rating record
case class Rating(userId: Int, productId: Int, score: Double, timestamp: Int)

/*
 * MongoDB connection configuration
 * uri: MongoDB connection URI
 * db:  database to operate on
 */
// Case class wrapping the MongoDB configuration
case class MongoConfig(uri: String, db: String)

object StatisticsRecommender {

  // Name of the collection to read from
  val MONGODB_RATING_COLLECTION = "Rating"

  // Names of the statistics collections to write to
  val RATE_MORE_PRODUCTS = "RateMoreProducts"                  // historically popular products
  val RATE_MORE_RECENTLY_PRODUCTS = "RateMoreRecentlyProducts" // recently popular products
  val AVERAGE_PRODUCTS = "AverageProducts"                     // average product scores

  def main(args: Array[String]): Unit = {
    // Configuration
    val config: Map[String, String] = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )

    // Build the SparkSession entry point
    val conf: SparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("StatisticsRecommender")
    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._

    // Declare an implicit MongoDB configuration object
    implicit val mongoConfig: MongoConfig = MongoConfig(config("mongo.uri"), config("mongo.db"))

    // Load the Rating data from MongoDB, map it to the case class, then convert to a DataFrame
    val ratingDF: DataFrame = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[Rating]
      .toDF()

    // Register a temporary view over the ratings
    ratingDF.createOrReplaceTempView("ratings")

    // TODO: run the different statistics with Spark SQL

    // 1. Historically popular products: count ratings per product, yielding (productId, count).
    // Read the rating data with Spark SQL, count the number of ratings per product, sort in
    // descending order, and write the result to the RateMoreProducts collection in MongoDB.
    val rateMoreProductsDF: DataFrame = spark.sql("select productId, count(productId) as count from ratings group by productId order by count desc")
    storeDFInMongoDB(rateMoreProductsDF, RATE_MORE_PRODUCTS)

    // 2. Recently popular products: convert the timestamp to yyyyMM format and count ratings
    // per month, yielding (productId, count, yearmonth).
    // Create a date formatter
    val simpleDateFormat: SimpleDateFormat = new SimpleDateFormat("yyyyMM")
    // Register a UDF that converts a timestamp to year-month format
    // (seconds are converted to milliseconds first, then formatted)
    spark.udf.register("changeDate", (x: Int) => simpleDateFormat.format(new Date(x * 1000L)).toInt)
    // Convert the timestamps in the original Rating data to year-month format
    val ratingOfYearMonth = spark.sql("select productId, score, changeDate(timestamp) as yearmonth from ratings")
    // Register the new dataset as a view
    ratingOfYearMonth.createOrReplaceTempView("ratingOfMonth")
    // Group by yearmonth and productId, then sort by yearmonth and count in descending order
    val rateMoreRecentlyProducts = spark.sql("select productId, count(productId) as count, yearmonth from ratingOfMonth group by yearmonth, productId order by yearmonth desc, count desc")
    // Save the DataFrame to MongoDB
    storeDFInMongoDB(rateMoreRecentlyProducts, RATE_MORE_RECENTLY_PRODUCTS)

    // 3. High-quality products: average score per product, yielding (productId, avg)
    val averageProductsDF: DataFrame = spark.sql("select productId, avg(score) as avg from ratings group by productId order by avg desc")
    storeDFInMongoDB(averageProductsDF, AVERAGE_PRODUCTS)

    spark.stop()
  }

  def storeDFInMongoDB(df: DataFrame, collection_name: String)(implicit mongoConfig: MongoConfig): Unit = {
    df.write
      .option("uri", mongoConfig.uri)
      .option("collection", collection_name)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()
  }
}
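The only non-SQL logic above is the changeDate UDF, and its seconds-to-yyyyMM conversion can be checked in isolation without Spark. A minimal sketch in plain Scala; the ChangeDateDemo object and the explicit UTC time zone are additions here (pinning the zone makes the result reproducible, whereas the tutorial code uses the JVM default zone):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object ChangeDateDemo {
  // Same conversion as the changeDate UDF: seconds -> milliseconds -> yyyyMM as an Int
  def changeDate(x: Int): Int = {
    val fmt = new SimpleDateFormat("yyyyMM")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.format(new Date(x * 1000L)).toInt
  }

  def main(args: Array[String]): Unit = {
    // 1395676800 seconds = 2014-03-24T16:00:00Z, the sample timestamp from the Rating comment
    println(changeDate(1395676800)) // prints 201403
  }
}
```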
After the job finishes, check the results in MongoDB:
C:\Users\admin>mongo
MongoDB shell version v4.2.21
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("68bc0cf1-5482-40e4-b7ab-b9517874077b") }
MongoDB server version: 4.2.21
Server has startup warnings:
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten]
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten]
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** WARNING: This server is bound to localhost.
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** Remote systems will be unable to connect to this server.
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** Start the server with --bind_ip <address> to specify which IP
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** addresses it should serve responses from, or with --bind_ip_all to
2022-08-03T00:52:25.812+0800 I CONTROL [initandlisten] ** bind to all interfaces. If this behavior is desired, start the
2022-08-03T00:52:25.813+0800 I CONTROL [initandlisten] ** server with --bind_ip 127.0.0.1 to disable this warning.
2022-08-03T00:52:25.813+0800 I CONTROL [initandlisten]
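The startup banner above only confirms the connection; to actually inspect the statistics the job wrote, query the three output collections from the same shell. A sketch, assuming the database name recommender from the configuration above (the result documents will vary with your data, so none are shown):

```
> use recommender
switched to db recommender
> show collections
> db.RateMoreProducts.find().limit(3)
> db.RateMoreRecentlyProducts.find().limit(3)
> db.AverageProducts.find().limit(3)
```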