The previous post ended at:
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
(userInBlocks, userOutBlocks) is exactly the inBlock/outBlock pair that the previous post arrived at. Let's keep going.
// materialize blockRatings and user blocks
userOutBlocks.count() // materializes the RDD; the count is the number of user out-blocks, normally 10 unless no user id lands in some block
// Swap the two halves of each blockRatings entry. swappedBlockRatings is a key-value RDD:
// the key is (itemBlockId, userBlockId), i.e. item block id first, and the value is
// RatingBlock(itemIds, userIds, localRatings), where itemIds is the array of item ids,
// userIds the array of user ids, and localRatings the array of ratings.
val swappedBlockRatings = blockRatings.map {
  case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
    ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
}
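To see the swap in isolation, here is a tiny plain-Scala sketch (a Seq of tuples stands in for the RDD, and the simplified RatingBlock below is illustrative, not Spark's generic one):
case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

// One entry keyed by (userBlockId, itemBlockId); the map flips the key and
// reorders the payload to match, so items become the "src" side.
val blockRatings = Seq(
  ((0, 1), RatingBlock(Array(10, 11), Array(20, 21), Array(4.0f, 5.0f)))
)
val swapped = blockRatings.map {
  case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
    ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
}
println(swapped.head._1) // (1,0)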
// Run makeBlocks with items as the source dimension; this is exactly the same as the
// user-oriented makeBlocks covered in part two.
val (itemInBlocks, itemOutBlocks) =
  makeBlocks("item", swappedBlockRatings, itemPart, userPart, intermediateRDDStorageLevel)
// materialize item blocks
itemOutBlocks.count() // the return value is not assigned; count() is called only to materialize itemOutBlocks so that just this RDD has to be kept
val seedGen = new XORShiftRandom(seed) // random number generator
Let's first look at the initialize method.
/**
 * Initializes factors randomly given the in-link blocks.
 *
 * @param inBlocks in-link blocks
 * @param rank rank of the factorization
 * @return initialized factor blocks
 */
private def initialize[ID](
    inBlocks: RDD[(Int, InBlock[ID])],
    rank: Int,
    seed: Long): RDD[(Int, FactorBlock)] = {
  // Choose a unit vector uniformly at random from the unit sphere, but from the
  // "first quadrant" where all elements are nonnegative. This can be done by choosing
  // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
  // This appears to create factorizations that have a slightly better reconstruction
  // (<1%) compared to picking elements uniformly at random in [0,1].
  inBlocks.map { case (srcBlockId, inBlock) => // operate on each in-block
    val random = new XORShiftRandom(byteswap64(seed ^ srcBlockId)) // random number generator
    val factors = Array.fill(inBlock.srcIds.length) { // one factor array per user id in the block
      val factor = Array.fill(rank)(random.nextGaussian().toFloat) // rank-length array of Gaussian-distributed values
      val nrm = blas.snrm2(rank, factor, 1) // the Euclidean norm of factor
      blas.sscal(rank, 1.0f / nrm, factor, 1) // scale factor to unit length
      factor
    }
    (srcBlockId, factors) // factors holds, for every user in the block, a rank-length unit vector
  }
}
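To make the initialization concrete, here is a minimal self-contained sketch in plain Scala (java.util.Random stands in for XORShiftRandom, the norm is computed by hand instead of via BLAS, and randomUnitFactor is an illustrative name). It takes the absolute value exactly as the comment describes:
import java.util.Random

// Draw |N(0,1)| entries, then scale the vector to unit length: a random unit
// vector restricted to the nonnegative "first quadrant" of the sphere.
def randomUnitFactor(rank: Int, random: Random): Array[Float] = {
  val factor = Array.fill(rank)(math.abs(random.nextGaussian()).toFloat)
  val nrm = math.sqrt(factor.map(x => x.toDouble * x).sum).toFloat
  factor.map(_ / nrm)
}

val f = randomUnitFactor(10, new Random(42L))
println(f.map(x => x * x).sum) // ≈ 1.0: the factor lies on the unit sphere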
var userFactors = initialize(userInBlocks, rank, seedGen.nextLong()) // build the user and item factors as just described
var itemFactors = initialize(itemInBlocks, rank, seedGen.nextLong())
var previousCheckpointFile: Option[String] = None // no checkpoint file yet
val shouldCheckpoint: Int => Boolean = (iter) =>
  sc.checkpointDir.isDefined && checkpointInterval != -1 && (iter % checkpointInterval == 0) // checkpoint only when a checkpoint directory is set, checkpointInterval is not -1, and the iteration number is a multiple of the interval; with the default interval this means every 10 iterations
val deletePreviousCheckpointFile: () => Unit = () => // delete the previous checkpoint file, if any
  previousCheckpointFile.foreach { file =>
    try {
      FileSystem.get(sc.hadoopConfiguration).delete(new Path(file), true)
    } catch {
      case e: IOException =>
        logWarning(s"Cannot delete checkpoint file $file:", e)
    }
  }
if (implicitPrefs) { // implicit preference
  for (iter <- 1 to maxIter) { // iterate up to maxIter times
    userFactors.setName(s"userFactors-$iter").persist(intermediateRDDStorageLevel) // cache userFactors
    val previousItemFactors = itemFactors // keep a reference to the current itemFactors
    itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
      userLocalIndexEncoder, implicitPrefs, alpha, solver) // finally the real computation; let's first step into computeFactors
/**
 * Compute dst factors by constructing and solving least square problems.
 *
 * @param srcFactorBlocks src factors (here: the user factors)
 * @param srcOutBlocks src out-blocks (here: userOutBlocks)
 * @param dstInBlocks dst in-blocks (here: itemInBlocks)
 * @param rank rank of the factorization
 * @param regParam regularization constant
 * @param srcEncoder encoder for src local indices (here: the user local-index encoder)
 * @param implicitPrefs whether to use implicit preference
 * @param alpha the alpha constant in the implicit preference formulation
 * @param solver solver for the least squares problems
 *
 * @return dst factors (here: the item factors matching the given user factors)
 */
// Taking user factors as the source: srcFactorBlocks stores, per user block id, a rank-length
// array for every user in that block; srcOutBlocks stores, per user block id, the arrays of
// local user indices linked to each item block; dstInBlocks stores, per item block, the array
// of unique item ids, the cumulative-count pointers into the user arrays, the encoded user
// indices, and the ratings those users gave the items.
private def computeFactors[ID](
    srcFactorBlocks: RDD[(Int, FactorBlock)],
    srcOutBlocks: RDD[(Int, OutBlock)],
    dstInBlocks: RDD[(Int, InBlock[ID])],
    rank: Int,
    regParam: Double,
    srcEncoder: LocalIndexEncoder,
    implicitPrefs: Boolean = false,
    alpha: Double = 1.0,
    solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)] = {
  val numSrcBlocks = srcFactorBlocks.partitions.length // number of partitions of srcFactorBlocks
  val YtY = if (implicitPrefs) Some(computeYtY(srcFactorBlocks, rank)) else None // for implicit preference, compute the Gramian matrix; let's look at computeYtY
/**
 * Computes the Gramian matrix of user or item factors, which is only used in implicit preference.
 * Caching of the input factors is handled in [[ALS#train]].
 */
private def computeYtY(factorBlocks: RDD[(Int, FactorBlock)], rank: Int): NormalEquation = { // again taking the user factors as the input, this returns a normal equation
  factorBlocks.values.aggregate(new NormalEquation(rank))( // take only the values of factorBlocks and aggregate with a fresh NormalEquation as the zero value: each user's factor is treated as a column vector, multiplied by its own transpose, and the resulting matrices are summed; NormalEquation stores this symmetric matrix as its packed upper triangle
    seqOp = (ne, factors) => {
      factors.foreach(ne.add(_, 0.0))
      ne
    },
    combOp = (ne1, ne2) => ne1.merge(ne2))
}
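To pin down what this aggregation computes, here is a small plain-Scala sketch (no Spark; dense row-major storage instead of NormalEquation's packed upper triangle; gramian is an illustrative name) that builds Y^T Y as a sum of outer products f * f^T:
// Sketch: accumulate the Gramian of a set of factors as a sum of outer products.
def gramian(factors: Seq[Array[Float]], rank: Int): Array[Double] = {
  val ata = new Array[Double](rank * rank) // dense rank x rank matrix, row-major
  for (f <- factors; i <- 0 until rank; j <- 0 until rank) {
    ata(i * rank + j) += f(i).toDouble * f(j).toDouble
  }
  ata
}

val factors = Seq(Array(1f, 0f), Array(0.6f, 0.8f))
val yty = gramian(factors, 2)
println(yty.mkString(", ")) // ≈ 1.36, 0.48, 0.48, 0.64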
Back to computeFactors:
  val srcOut = srcOutBlocks.join(srcFactorBlocks).flatMap { // join srcOutBlocks with srcFactorBlocks; the key is a user block id, and the value pairs the out-block (local user index arrays per item block) with the user factor arrays
    case (srcBlockId, (srcOutBlock, srcFactors)) =>
      srcOutBlock.view.zipWithIndex.map { case (activeIndices, dstBlockId) => // the item block id is recovered from the position within the out-block
        (dstBlockId, (srcBlockId, activeIndices.map(idx => srcFactors(idx)))) // look up the user factors via the local user indices
      }
  } // result: the key is an item block id; the value is (user block id, array of the user factors that this item block needs from that user block)
  val merged = srcOut.groupByKey(new ALSPartitioner(dstInBlocks.partitions.length)) // group the pairs above into one collection per item block
  dstInBlocks.join(merged).mapValues { // join dstInBlocks with merged and process the values
    case (InBlock(dstIds, srcPtrs, srcEncodedIndices, ratings), srcFactors) => // quite an involved data structure
      val sortedSrcFactors = new Array[FactorBlock](numSrcBlocks) // an array to hold the incoming factor blocks
      srcFactors.foreach { case (srcBlockId, factors) =>
        sortedSrcFactors(srcBlockId) = factors // store the groupByKey result as an array; sortedSrcFactors ends up as a 3-D array indexed by user block id, then local user index, then factor component
      }
      val dstFactors = new Array[Array[Float]](dstIds.length) // one factor array per item id
      var j = 0
      val ls = new NormalEquation(rank) // the normal equation
      while (j < dstIds.length) { // iterate over the item ids
        ls.reset() // clear the normal equation
        if (implicitPrefs) { // with implicit preference
          ls.merge(YtY.get) // fold YtY into ls
        }
        var i = srcPtrs(j) // start index into the user arrays for item j
        var numExplicits = 0 // number of users linked to this item, i.e. the number of ratings
        while (i < srcPtrs(j + 1)) { // srcPtrs(j + 1) is the end index into the user arrays for item j
          val encoded = srcEncodedIndices(i) // encoded user index for item j
          val blockId = srcEncoder.blockId(encoded) // the user's block id, decoded from it
          val localIndex = srcEncoder.localIndex(encoded) // the user's local index, decoded from it
          val srcFactor = sortedSrcFactors(blockId)(localIndex) // look up the user's factor array via block id and local index
          val rating = ratings(i) // this user's rating of item j
          if (implicitPrefs) { // implicit preference: infer how much the user likes the item from behavior; rating encodes the frequency of the behavior and 1 + alpha * rating is the confidence; the paper reports alpha around 40 works well, consistent with the comment below
            // Extension to the original paper to handle b < 0. confidence is a function of |b|
            // instead so that it is never negative. c1 is confidence - 1.0.
            val c1 = alpha * math.abs(rating) // use the absolute value of the rating
            // For rating <= 0, the corresponding preference is 0. So the term below is only added
            // for rating > 0. Because YtY is already added, we need to adjust the scaling here.
            if (rating > 0) { // positive rating
              numExplicits += 1 // one more rating counted
              ls.add(srcFactor, (c1 + 1.0) / c1, c1) // adds c1 * srcFactor * srcFactor^T to the matrix A already in ls, and (c1 + 1) * srcFactor to atb; c1 + 1 is the confidence
            }
          } else { // the default, explicit preference
            ls.add(srcFactor, rating) // adds srcFactor * srcFactor^T to A and, if the rating is nonzero, rating * srcFactor to atb; note the rating may be negative here
            numExplicits += 1 // one more rating counted
          }
          i += 1
        }
        // Weight lambda by the number of explicit ratings based on the ALS-WR paper.
        dstFactors(j) = solver.solve(ls, numExplicits * regParam) // solve for item j's factor vector; numExplicits * regParam is first added to the diagonal of the matrix
        j += 1
      }
      dstFactors // the mapValues as a whole yields, per item block id, the factor vectors of all its items
  }
}
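One detail worth isolating is the encoder used above: it packs a block id and a local index into a single Int so the in-blocks can store one int array instead of two. A simplified sketch of the idea (the fixed bit width here is illustrative; Spark's LocalIndexEncoder derives it from the number of blocks):
// Sketch: high bits hold the block id, low bits the local index.
class SimpleIndexEncoder(localIndexBits: Int) {
  private val mask = (1 << localIndexBits) - 1
  def encode(blockId: Int, localIndex: Int): Int = (blockId << localIndexBits) | localIndex
  def blockId(encoded: Int): Int = encoded >>> localIndexBits
  def localIndex(encoded: Int): Int = encoded & mask
}

val enc = new SimpleIndexEncoder(localIndexBits = 24)
val e = enc.encode(3, 12345)
println((enc.blockId(e), enc.localIndex(e))) // (3,12345)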
Back to the train method:
    previousItemFactors.unpersist() // the new itemFactors has been computed, so release previousItemFactors
    itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel) // name and cache the new itemFactors
    // TODO: Generalize PeriodicGraphCheckpointer and use it here.
    if (shouldCheckpoint(iter)) { // checkpoint periodically
      itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    }
    val previousUserFactors = userFactors // same treatment for the user factors: keep the current ones first
    userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam, // compute the factor vector of every user in every user block
      itemLocalIndexEncoder, implicitPrefs, alpha, solver)
    if (shouldCheckpoint(iter)) { // periodically delete the old checkpoint file and record the new one
      deletePreviousCheckpointFile()
      previousCheckpointFile = itemFactors.getCheckpointFile
    }
    previousUserFactors.unpersist() // release previousUserFactors; note that at every step the factors are cached for reuse and then released, with periodic checkpointing applied to itemFactors only
  }
} else { // explicit preference
  for (iter <- 0 until maxIter) { // iterate
    itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam, // compute the per-block item factors exactly as in the implicit case, only without passing implicitPrefs and alpha
      userLocalIndexEncoder, solver = solver)
    if (shouldCheckpoint(iter)) { // checkpoint periodically
      itemFactors.checkpoint()
      itemFactors.count() // checkpoint item factors and cut lineage; then delete the old checkpoint file and record the new one
      deletePreviousCheckpointFile()
      previousCheckpointFile = itemFactors.getCheckpointFile
    }
    userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
      itemLocalIndexEncoder, solver = solver) // compute the per-block user factors exactly as in the implicit case
  }
}
val userIdAndFactors = userInBlocks // recall what userInBlocks stores: per user block, the array of unique user ids, the cumulative-count pointers into the item arrays, the encoded item indices, and the users' ratings of the items
  .mapValues(_.srcIds) // keep only the array of unique user ids
  .join(userFactors) // userFactors stores, per user block id, every user's factor vector
  .mapPartitions({ items => // work partition by partition
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors) // zip each user id with its factor vector
    }
    // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
    // and userFactors.
  }, preservesPartitioning = true)
  .setName("userFactors") // name and cache; the final structure keeps the original partitioning, each partition holding (user id, factor vector) tuples
  .persist(finalRDDStorageLevel)
val itemIdAndFactors = itemInBlocks // symmetrically, the final structure keeps the original partitioning, each partition holding (item id, factor vector) tuples
  .mapValues(_.srcIds)
  .join(itemFactors)
  .mapPartitions({ items =>
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors)
    }
  }, preservesPartitioning = true)
  .setName("itemFactors")
  .persist(finalRDDStorageLevel)
if (finalRDDStorageLevel != StorageLevel.NONE) { // finally, release the intermediate results and cut the lineage of the userIdAndFactors and itemIdAndFactors RDDs
  userIdAndFactors.count()
  itemFactors.unpersist()
  itemIdAndFactors.count()
  userInBlocks.unpersist()
  userOutBlocks.unpersist()
  itemInBlocks.unpersist()
  itemOutBlocks.unpersist()
  blockRatings.unpersist()
}
(userIdAndFactors, itemIdAndFactors) // return the pair: each is an RDD of (user or item id, factor vector) tuples
}
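With the two returned RDDs in hand, a predicted rating for a (user, item) pair is just the dot product of the matching factor vectors. A minimal sketch (plain Scala; predict is an illustrative helper, not part of the Spark API):
// Sketch: predict a rating as the dot product of a user factor and an item factor.
def predict(userFactor: Array[Float], itemFactor: Array[Float]): Float = {
  var dot = 0f
  var k = 0
  while (k < userFactor.length) {
    dot += userFactor(k) * itemFactor(k)
    k += 1
  }
  dot
}

val u = Array(0.6f, 0.8f)
val v = Array(0.5f, 0.5f)
println(predict(u, v)) // 0.7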
That wraps up the train method; back to the main ALS class:
val (userFactors, itemFactors) = ALS.train(ratings, rank = $(rank),
  numUserBlocks = $(numUserBlocks), numItemBlocks = $(numItemBlocks),
  maxIter = $(maxIter), regParam = $(regParam), implicitPrefs = $(implicitPrefs),
  alpha = $(alpha), nonnegative = $(nonnegative),
  checkpointInterval = $(checkpointInterval), seed = $(seed)) // returns (userFactors, itemFactors)
val userDF = userFactors.toDF("id", "features") // convert userFactors and itemFactors to DataFrames
val itemDF = itemFactors.toDF("id", "features")
val model = new ALSModel(uid, $(rank), userDF, itemDF).setParent(this) // the ALS model packages the uid, the rank of the factorization, and the userDF and itemDF DataFrames; this is our final result
copyValues(model)
}
override def transformSchema(schema: StructType): StructType = { // override: validate and transform the schema
  validateAndTransformSchema(schema)
}
override def copy(extra: ParamMap): ALS = defaultCopy(extra) // override the copy method
That completes our walk through the recommender. Let's nail down a few final points.
1. What exactly is rank? Suppose there are 5 users and 5 items, so the rating matrix is 5x5. With rank = 3, the matrix is factored into a 5x3 matrix and a 3x5 matrix whose product approximates the original, much like principal component analysis (see the formulas after point 3).
2. How does ALS run? It alternately optimizes userFactors and itemFactors via computeFactors. Inside computeFactors, for every item in every block, the factor vectors of all the users who rated that item are gathered into one normal equation per item, and solving it yields that item's factor vector (the closed form is given after point 3). Iterating the alternation settles the final userFactors and itemFactors.
3. What are explicit and implicit preferences? We usually work with explicit preferences, or recast the real problem into explicit form. Explicit means the user gave the item a concrete rating. Implicit has only two states: the user either interacted with the item (1) or did not (0), and a function of the interaction frequency expresses the confidence that the preference really is 1. For details see Collaborative Filtering for Implicit Feedback Datasets, which is also the paper the Spark ML recommender follows (formulas below).
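Putting the three points above into formulas (standard formulations from the ALS literature, consistent with the code above but not copied from the Spark source). For point 1, with a rating matrix $R \in \mathbb{R}^{m \times n}$ and rank $k$:

$$R \approx U V^\top, \qquad U \in \mathbb{R}^{m \times k}, \ V \in \mathbb{R}^{n \times k}$$

For point 2, holding the user factors fixed, the normal equation solved for item $i$ (with $Y_i$ the stacked factors of the users who rated it, $\mathbf{r}_i$ their ratings, and $n_i$ the rating count, i.e. numExplicits) has the closed form

$$\mathbf{v}_i = \left(Y_i^\top Y_i + \lambda\, n_i I\right)^{-1} Y_i^\top \mathbf{r}_i,$$

matching the numExplicits * regParam term added to the diagonal in solver.solve. For point 3, the implicit formulation replaces ratings with a preference $p_{ui}$ and a confidence $c_{ui}$ (including the $\lvert r \rvert$ extension noted in the code comments):

$$p_{ui} = \mathbf{1}[r_{ui} > 0], \qquad c_{ui} = 1 + \alpha \lvert r_{ui} \rvert, \qquad \min_{U,V} \sum_{u,i} c_{ui}\left(p_{ui} - \mathbf{u}_u^\top \mathbf{v}_i\right)^2 + \lambda\left(\sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2\right)$$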
This post walked through the implementation details of the ALS (Alternating Least Squares) recommender: initializing the user and item factor matrices, computing the factor vectors, and handling explicit versus implicit feedback.