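1. Recommendations produced by org.apache.spark.ml.recommendation.ALS come back sorted by score, but without a rank column. To study how recommendation success relates to rank position, you have to add one yourself with Row_Number, as follows: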
// Load the saved ALS output and flatten it to one row per (user, recommended item)
val recDF = spark.sqlContext.read.load(savePathMl)
  .selectExpr("id", "explode(recommendations) as rec")
  .selectExpr("id as uId", "rec.itemId", "rec.rating as rec_rating")
recDF.createOrReplaceTempView("recommend")
// Number each user's recommendations from the highest score down
spark.sql("select uId,itemId,rec_rating,Row_Number() OVER (partition by uId order by rec_rating desc) as rank from recommend")
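If rows with the same score should share a rank, consider rank() or dense_rank() in place of row_number(). The differences are:
rank() gives tied rows the same rank and then skips the positions the tie occupied, e.g. 1 2 2 4 5
dense_rank() is much like rank(), but leaves no gap after a tie, e.g. 1 2 2 3 4
row_number() ignores ties: rows with equal scores still get consecutive numbers, e.g. 1 2 3 4 5 (which of the tied rows comes first is not guaranteed)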
Reference: https://blog.youkuaiyun.com/zz_xiaohuli_zz/article/details/87472176
Note: partition by colName is optional. If it is omitted, the whole result set is ranked as a single partition.
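To make the three behaviours above concrete, here is a minimal sketch in plain Scala (no Spark required) that reproduces the example sequences. The object name `RankingDemo`, its method names, and the sample scores are all made up for illustration; the input is assumed to be sorted in descending order already, as the ALS output is.

```scala
object RankingDemo {
  // row_number(): consecutive positions, ties are ignored.
  def rowNumber(sortedScores: Seq[Double]): Seq[Int] =
    sortedScores.indices.map(_ + 1)

  // rank(): every row gets the position of the first row with the same score,
  // so the positions taken up by a tie are skipped afterwards.
  def rank(sortedScores: Seq[Double]): Seq[Int] =
    sortedScores.map(s => sortedScores.indexOf(s) + 1)

  // dense_rank(): position of the score among the distinct scores, so no gaps.
  def denseRank(sortedScores: Seq[Double]): Seq[Int] = {
    val distinctScores = sortedScores.distinct
    sortedScores.map(s => distinctScores.indexOf(s) + 1)
  }

  def main(args: Array[String]): Unit = {
    // Made-up scores with one tie in second place.
    val scores = Seq(0.9, 0.8, 0.8, 0.7, 0.6)
    println(rowNumber(scores).mkString(" ")) // 1 2 3 4 5
    println(rank(scores).mkString(" "))      // 1 2 2 4 5
    println(denseRank(scores).mkString(" ")) // 1 2 2 3 4
  }
}
```

This only mimics the semantics on a single in-memory list. In Spark the same choice is made by swapping the function name in front of the OVER clause.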
2. Expanding a comma-separated DataFrame field, info, into one row per value:
val genresArr = (s:String) => s.split(",")
spark.udf.register("getGenresArr",genresArr(_:String))
select uId,explode(getGenresArr(info)) as info from table
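As an alternative to registering a UDF, Spark's built-in `split` function can produce the array that `explode` consumes. The sketch below uses the DataFrame API with a local session; the object name `ExplodeDemo`, the helper `explodeInfo`, and the sample rows are made up for illustration, while the column names `uId` and `info` follow the query above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object ExplodeDemo {
  // One output row per comma-separated value found in `info`.
  // Note that the second argument of split is a regular expression.
  def explodeInfo(df: DataFrame): DataFrame =
    df.select(col("uId"), explode(split(col("info"), ",")).as("info"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, for illustration only.
    val df = Seq((1, "rock,pop"), (2, "jazz")).toDF("uId", "info")
    explodeInfo(df).show()

    spark.stop()
  }
}
```

The SQL equivalent would be `select uId, explode(split(info, ',')) as info from table`, which avoids the UDF registration step.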
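3. The song feature collection built by collect_list is an Array column, but the UDF parameter must be declared as Seq (declaring it as Array raises an error, because Spark passes array columns to Scala UDFs as a Seq):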
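import breeze.linalg.DenseMatrix // assumption: DenseMatrix here is Breeze's

// Averages the songs' feature vectors; assumes every vector has 10 elements.
def getArtistFeature(songsFeature: Seq[Seq[Float]]): Array[Float] = {
  // 1 x 10 accumulator, initialised to zero
  var artFeature: DenseMatrix[Float] = DenseMatrix(Array(0F, 0F, 0F, 0F, 0F, 0F, 0F, 0F, 0F, 0F))
  songsFeature.foreach(songFeature => artFeature += DenseMatrix(songFeature))
  // Divide the element-wise sum by the number of songs
  (artFeature * (1F / songsFeature.length)).toArray
}
spark.udf.register("getArtistFeature", getArtistFeature(_: Seq[Seq[Float]]))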
val artDF=spark.sql(
"""
|select artistId as id,getArtistFeature(collect_list(features)) as features
|from musicInfo t1
|group by artistId
""".stripMargin)
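The averaging step itself does not depend on a matrix library or on a fixed vector length. Below is a minimal sketch of the same element-wise mean in plain Scala, assuming only that all feature vectors in a group have the same length. The names `FeatureMean` and `meanFeature` are hypothetical and not part of the original code.

```scala
object FeatureMean {
  // Element-wise mean of equally sized feature vectors.
  // The parameter is declared as Seq so the function stays usable as a Spark UDF.
  def meanFeature(songsFeature: Seq[Seq[Float]]): Array[Float] = {
    require(songsFeature.nonEmpty, "need at least one feature vector")
    val dim = songsFeature.head.length
    val sum = new Array[Float](dim)
    for (feature <- songsFeature) {
      require(feature.length == dim, "all feature vectors must have the same length")
      var i = 0
      while (i < dim) {
        sum(i) += feature(i)
        i += 1
      }
    }
    // Divide each accumulated component by the number of vectors.
    sum.map(_ / songsFeature.length)
  }

  def main(args: Array[String]): Unit = {
    // Made-up two-dimensional features for two songs.
    val mean = meanFeature(Seq(Seq(1F, 2F), Seq(3F, 4F)))
    println(mean.mkString(", "))
  }
}
```

A function like this could be registered in the same way as getArtistFeature. Since collect_list only emits groups that contain at least one row, the empty-input check mainly guards direct calls from other code.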