Spark UDF Examples in Detail

Requirements:

(1) Count the number of purchases per product in the order data
– i.e., how many times each product has been bought

In Hive: GROUP BY with COUNT
In Spark:
scala> val orders = sql("select * from badou.orders")
scala> val products = sql("select * from badou.products")
scala> val priors = sql("select * from badou.priors")
scala> priors.groupBy("product_id").count()
res6: org.apache.spark.sql.DataFrame = [product_id: string, count: bigint]
Method 1:
scala> res6.show(10)
Method 2:
scala> priors.groupBy("product_id").count().show(10)
Method 3:
scala> val proCnt = priors.groupBy("product_id").count()
proCnt.show(10)
show(): displays 20 rows by default
show(10): displays the specified number of rows
show(1, false): first argument is the number of rows; false disables truncation, so long strings are shown in full
Method 4:
scala> priors.groupByKey(row => row.getString(1)).count().show(10)
Grouping by multiple columns: groupBy("age", "name")
groupBy is the more commonly used of the two APIs
In groupByKey, row is one row of data
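The grouping-and-counting above can be sketched with plain Scala collections (a toy stand-in for the priors table; the order and product IDs below are made up for illustration):

```scala
// Toy stand-in for priors rows: (order_id, product_id).
val rows = Seq(("o1", "p1"), ("o2", "p1"), ("o3", "p2"))

// Equivalent of priors.groupBy("product_id").count():
// group the rows by product_id, then count each group's rows.
val proCnt: Map[String, Int] =
  rows.groupBy { case (_, productId) => productId }
      .map { case (productId, grp) => (productId, grp.size) }

println(proCnt("p1"))  // 2
```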
Cache:
scala> val proCnt = priors.groupBy("product_id").count().cache   // nothing executes yet; the DataFrame is only marked for caching
scala> proCnt.show(10)   // runs the original plan and materializes the result in memory
proCnt.show(10)          // reads directly from memory
scala> proCnt.unpersist  // removes the cached data from memory
proCnt.show(10)          // recomputes from the original plan
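Spark's cache is lazy: marking a DataFrame as cached runs nothing by itself; the first action both executes the plan and fills the cache, and later actions reuse it. A rough plain-Scala analogy (not Spark itself) is a lazy val, which is likewise computed once, on first use:

```scala
var computeCount = 0  // tracks how many times the "expensive plan" actually runs

// Analogy for .cache: defining the value executes nothing yet.
lazy val proCnt: Int = { computeCount += 1; 42 }

assert(computeCount == 0)  // marked, but not yet computed
val first  = proCnt        // first "action": triggers the computation
val second = proCnt        // second "action": reuses the memoized value
assert(computeCount == 1)
```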
(2) Count how many times each product was reordered (purchased again)
Group by product_id and take sum(reordered)
Scenario: the higher a product's repurchase rate (such products are typically consumables: tissues, shampoo, etc.), the more likely it is to be bought again next time
Prediction: for users who bought these products, which products are they most likely to buy next?
filter: filters the rows of a dataset; interchangeable with where
Note that column comparison uses ===, not =:
scala> orders.filter(col("eval_set") === "test").show(5)
scala> orders.where(col("eval_set") === "test").show(5)
scala> orders.filter(col("eval_set") === "test").filter(col("order_dow") === "1").show(10)
select: column-based projection
selectExpr: accepts string expressions, so you can write SQL fragments directly
scala> orders.select("*").show(10)
scala> orders.select(col("order_id"), col("order_number")).show(10)
Method 1:
scala> priors.selectExpr("product_id", "cast(reordered as int)")
.filter(col("reordered") === 1)
.groupBy("product_id").count()
Method 2:
priors.selectExpr("product_id", "cast(reordered as int)").filter(col("reordered") === 1).groupBy("product_id").sum()
Method 3:
scala> priors.selectExpr("product_id", "cast(reordered as int)")
.groupBy("product_id").sum("reordered")
Method 4:
scala> priors.selectExpr("product_id", "cast(reordered as int)")
.groupBy("product_id").agg(sum("reordered"))
agg is normally used together with groupBy. The difference from calling sum directly is that one agg call can compute several aggregates at once: sum, avg, max, min, etc.
priors.selectExpr("product_id", "cast(reordered as int)").groupBy("product_id").agg(sum("reordered"), avg("reordered")).show(5)
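Computing several aggregates per group in one pass, as agg(sum(...), avg(...)) does, can be sketched over plain collections (toy (product_id, reordered) pairs, made up for illustration):

```scala
// Toy stand-in for the casted priors rows: (product_id, reordered).
val rows = Seq(("p1", 1), ("p1", 0), ("p1", 1), ("p2", 1))

// One grouped pass producing both sum(reordered) and avg(reordered).
val stats: Map[String, (Int, Double)] =
  rows.groupBy(_._1).map { case (pid, grp) =>
    val s = grp.map(_._2).sum
    (pid, (s, s.toDouble / grp.size))
  }
```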
(3) Building on the above, compute the repurchase rate: avg("reordered")
Formula: number of repurchases / total number of purchases => sum / count
Column renaming: Hive uses AS; Spark uses withColumnRenamed
scala> priors.selectExpr("product_id", "cast(reordered as int)").groupBy("product_id")
.agg(sum("reordered"), avg("reordered"))
.withColumnRenamed("sum(reordered)", "sum_re")
.show(5)
a. Repurchase count per product
val productSumRe = priors.selectExpr("product_id", "cast(reordered as int)").groupBy("product_id").agg(sum("reordered"), avg("reordered")).withColumnRenamed("sum(reordered)", "sum_re")
b. Total purchase count per product
val proCnt = priors.groupBy("product_id").count()
Method 1: Scala (the avg("reordered") column is already the rate)
priors.selectExpr("product_id", "cast(reordered as int)").groupBy("product_id").agg(sum("reordered"), avg("reordered")).show(5)
Method 2: SQL
scala> val jCnt = proCnt.join(productSumRe, "product_id")
jCnt.selectExpr("*", "sum_re / count as mean_re").show(5)
Method 3: UDF
import org.apache.spark.sql.functions._
scala> val avg_udf = udf((sm: Long, cnt: Long) => sm.toDouble / cnt.toDouble)
scala> jCnt.withColumn("mean_re", avg_udf(col("sum_re"), col("count"))).show(5)
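Putting the last steps together outside Spark, as a plain-Scala sketch (toy per-product maps with made-up values): join the two tables on product_id, then apply the same ratio the UDF computes:

```scala
// Toy stand-ins for proCnt (total purchases) and productSumRe (repurchases).
val counts = Map("p1" -> 10L, "p2" -> 4L)
val sumRe  = Map("p1" -> 3L,  "p2" -> 1L)

// Same function the UDF wraps: repurchases / total purchases.
val avgUdf: (Long, Long) => Double = (sm, cnt) => sm.toDouble / cnt.toDouble

// Inner join on product_id, then compute the mean_re value.
val meanRe: Map[String, Double] =
  counts.keySet.intersect(sumRe.keySet).map { pid =>
    (pid, avgUdf(sumRe(pid), counts(pid)))
  }.toMap

println(meanRe("p1"))  // 0.3
```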