Spark ML Recommendation Source Code Notes, Part 3

This post digs into the implementation details of the Alternating Least Squares (ALS) recommender: how the user and item factor matrices are initialized, how the factor vectors are computed, and how explicit and implicit feedback are handled.

The previous post left off at:

    val (userInBlocks, userOutBlocks) =
      makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)

(userInBlocks, userOutBlocks) is exactly the inBlock/outBlock pair we derived at the end of the previous post. Let's continue:

    // materialize blockRatings and user blocks
    userOutBlocks.count() // count the out-blocks; usually 10 partitions, unless no user id falls into some block
    // Swap the key components of blockRatings. swappedBlockRatings is keyed by
    // (itemBlockId, userBlockId), and each value is a RatingBlock(itemIds, userIds, localRatings):
    // itemIds is the item id array, userIds the user id array, localRatings the rating array.
    val swappedBlockRatings = blockRatings.map {
      case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
        ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
    }
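
To make the swap concrete, here is a minimal sketch in plain Scala of what the map above does to a single entry. SimpleRatingBlock is a hypothetical stand-in for ALS's private RatingBlock, used only for illustration:

    // Hypothetical stand-in for ALS's private RatingBlock case class.
    case class SimpleRatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

    // One blockRatings entry: users 10 and 11 (user block 0) rated items 20 and 21 (item block 1).
    val entry = ((0, 1), SimpleRatingBlock(Array(10, 11), Array(20, 21), Array(4.0f, 5.0f)))
    val swapped = entry match {
      case ((userBlockId, itemBlockId), SimpleRatingBlock(userIds, itemIds, rs)) =>
        // Swap both the block-id key and the id arrays, so items become the src dimension.
        ((itemBlockId, userBlockId), SimpleRatingBlock(itemIds, userIds, rs))
    }
    // swapped._1 == (1, 0); the ratings array itself is untouched.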

    // Run makeBlocks again with items as the "src" dimension; this is exactly the same
    // procedure as the user-side makeBlocks covered in part 2.
    val (itemInBlocks, itemOutBlocks) =
      makeBlocks("item", swappedBlockRatings, itemPart, userPart, intermediateRDDStorageLevel)

    // materialize item blocks
    itemOutBlocks.count() // the result is not assigned; count() exists only to materialize the RDD so that itemOutBlocks alone needs to be kept

    val seedGen = new XORShiftRandom(seed) // random number generator for the factor seeds

Let's first look at the initialize method:

  /**
   * Initializes factors randomly given the in-link blocks.
   *
   * @param inBlocks in-link blocks
   * @param rank rank of the factor matrices
   * @return initialized factor blocks
   */
  private def initialize[ID](
      inBlocks: RDD[(Int, InBlock[ID])],
      rank: Int,
      seed: Long): RDD[(Int, FactorBlock)] = {
    // Choose a unit vector uniformly at random from the unit sphere, but from the
    // "first quadrant" where all elements are nonnegative. This can be done by choosing
    // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
    // This appears to create factorizations that have a slightly better reconstruction
    // (<1%) compared to picking elements uniformly at random in [0,1].
    inBlocks.map { case (srcBlockId, inBlock) => // operate on the in-blocks passed in
      val random = new XORShiftRandom(byteswap64(seed ^ srcBlockId)) // per-block random number generator
      val factors = Array.fill(inBlock.srcIds.length) { // one factor vector per user id in this block
        val factor = Array.fill(rank)(random.nextGaussian().toFloat) // rank samples from a Gaussian
        val nrm = blas.snrm2(rank, factor, 1) // Euclidean norm of factor
        blas.sscal(rank, 1.0f / nrm, factor, 1) // scale factor to unit length
        factor
      }
      (srcBlockId, factors) // factors holds a length-rank unit vector for each user in the block
    }
  }
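
Stripped of the RDD machinery, the per-user initialization is just "sample a Gaussian vector and scale it to unit length" (note the code shown normalizes the raw samples directly, without the absolute-value step the comment describes). A minimal sketch, assuming scala.util.Random in place of XORShiftRandom and plain math in place of the BLAS calls:

    import scala.util.Random

    def randomUnitFactor(rank: Int, random: Random): Array[Float] = {
      // rank samples from Normal(0, 1)
      val factor = Array.fill(rank)(random.nextGaussian().toFloat)
      // Euclidean norm (what blas.snrm2 computes), then scale to length 1 (blas.sscal)
      val nrm = math.sqrt(factor.map(x => x.toDouble * x).sum).toFloat
      factor.map(_ / nrm)
    }

    val f = randomUnitFactor(10, new Random(42L))
    // f.map(x => x * x).sum is approximately 1.0f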
    var userFactors = initialize(userInBlocks, rank, seedGen.nextLong()) // build the user and item factors as just described
    var itemFactors = initialize(itemInBlocks, rank, seedGen.nextLong())
    var previousCheckpointFile: Option[String] = None // no checkpoint file yet
    // Checkpoint only when a checkpoint directory is configured, the interval is not -1,
    // and the iteration count is a multiple of the interval (every 10 iterations by default).
    val shouldCheckpoint: Int => Boolean = (iter) =>
      sc.checkpointDir.isDefined && checkpointInterval != -1 && (iter % checkpointInterval == 0)
    val deletePreviousCheckpointFile: () => Unit = () => // delete the previous checkpoint file, if any
      previousCheckpointFile.foreach { file =>
        try {
          FileSystem.get(sc.hadoopConfiguration).delete(new Path(file), true)
        } catch {
          case e: IOException =>
            logWarning(s"Cannot delete checkpoint file $file:", e)
        }
      }

    if (implicitPrefs) { // implicit feedback
      for (iter <- 1 to maxIter) { // iterate maxIter times
        userFactors.setName(s"userFactors-$iter").persist(intermediateRDDStorageLevel) // cache userFactors
        val previousItemFactors = itemFactors // keep a handle on the current itemFactors
        itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
          userLocalIndexEncoder, implicitPrefs, alpha, solver) // the actual computation at last; let's look at computeFactors first

 /**
   * Compute dst factors by constructing and solving least squares problems.
   *
   * @param srcFactorBlocks src factors (userFactors in this call)
   * @param srcOutBlocks src out-blocks (userOutBlocks)
   * @param dstInBlocks dst in-blocks (itemInBlocks)
   * @param rank rank of the factor matrices
   * @param regParam regularization constant
   * @param srcEncoder encoder for src (user) local indices
   * @param implicitPrefs whether to use implicit preference
   * @param alpha the alpha constant in the implicit preference formulation
   * @param solver solver for least squares problems
   *
   * @return dst factors (the item factors corresponding to the user factors)
   */
  // Taking users as the src side: srcFactorBlocks maps each user block id to a length-rank
  // factor array per user; srcOutBlocks maps each user block id to, for every linked item
  // block, the local indices of the users that block needs; dstInBlocks maps each item block
  // id to its unique item ids, the cumulative rating-count pointers per item, the encoded
  // user indices, and the item-user ratings.
  private def computeFactors[ID](
      srcFactorBlocks: RDD[(Int, FactorBlock)],
      srcOutBlocks: RDD[(Int, OutBlock)],
      dstInBlocks: RDD[(Int, InBlock[ID])],
      rank: Int,
      regParam: Double,
      srcEncoder: LocalIndexEncoder,
      implicitPrefs: Boolean = false,
      alpha: Double = 1.0,
      solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)] = {
    val numSrcBlocks = srcFactorBlocks.partitions.length // number of partitions of srcFactorBlocks
    val YtY = if (implicitPrefs) Some(computeYtY(srcFactorBlocks, rank)) else None // for implicit feedback, precompute the Gramian matrix; let's look at computeYtY

  /**
   * Computes the Gramian matrix of user or item factors, which is only used in implicit preference.
   * Caching of the input factors is handled in [[ALS#train]].
   */
  // Taking user factors as the input again: this returns a NormalEquation. Only the values of
  // factorBlocks are used. Starting from a fresh NormalEquation, each user's factor vector is
  // multiplied (as a column vector) with its own transpose and the rank-1 results are summed;
  // NormalEquation stores the symmetric result as a packed upper-triangular array.
  private def computeYtY(factorBlocks: RDD[(Int, FactorBlock)], rank: Int): NormalEquation = {
    factorBlocks.values.aggregate(new NormalEquation(rank))(
      seqOp = (ne, factors) => {
        factors.foreach(ne.add(_, 0.0))
        ne
      },
      combOp = (ne1, ne2) => ne1.merge(ne2))
  }
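
In matrix terms: if Y is the n x rank factor matrix with one user's factor vector per row, computeYtY accumulates the rank x rank Gramian Y^T * Y as a sum of rank-1 outer products y * y^T. A minimal dense sketch of the same computation (unlike NormalEquation, which keeps only the packed upper triangle):

    // Dense Gramian: ata(i)(j) = sum over all factor vectors y of y(i) * y(j).
    def gramian(factors: Array[Array[Float]], rank: Int): Array[Array[Double]] = {
      val ata = Array.ofDim[Double](rank, rank)
      for (y <- factors; i <- 0 until rank; j <- 0 until rank) {
        ata(i)(j) += y(i).toDouble * y(j)
      }
      ata
    }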

Back in computeFactors:
    // Join srcOutBlocks with srcFactorBlocks. Each output record is keyed by the destination
    // (item) block id; the value pairs the source (user) block id with the factor vectors of
    // exactly those users that the item block needs.
    val srcOut = srcOutBlocks.join(srcFactorBlocks).flatMap {
      case (srcBlockId, (srcOutBlock, srcFactors)) =>
        srcOutBlock.view.zipWithIndex.map { case (activeIndices, dstBlockId) => // dstBlockId is the item block id
          (dstBlockId, (srcBlockId, activeIndices.map(idx => srcFactors(idx)))) // select the needed user factor vectors by local index
        }
    }
    val merged = srcOut.groupByKey(new ALSPartitioner(dstInBlocks.partitions.length)) // group the (srcBlockId, factors) pairs per destination block
    dstInBlocks.join(merged).mapValues { // join dstInBlocks with merged and process the values
      case (InBlock(dstIds, srcPtrs, srcEncodedIndices, ratings), srcFactors) => // the data structure here is quite involved
        val sortedSrcFactors = new Array[FactorBlock](numSrcBlocks) // one slot per src (user) block
        srcFactors.foreach { case (srcBlockId, factors) =>
          sortedSrcFactors(srcBlockId) = factors // index the grouped result by block id, so that sortedSrcFactors(blockId)(localIndex) is one user's factor vector
        }
        val dstFactors = new Array[Array[Float]](dstIds.length) // the output: one factor vector per item
        var j = 0
        val ls = new NormalEquation(rank) // the normal equation, reused across items
        while (j < dstIds.length) { // iterate over the item ids
          ls.reset() // clear the normal equation
          if (implicitPrefs) { // for implicit feedback
            ls.merge(YtY.get) // start from the precomputed YtY
          }
          var i = srcPtrs(j) // start index of item j's ratings
          var numExplicits = 0 // number of users linked to this item, i.e. its rating count
          while (i < srcPtrs(j + 1)) { // up to the end index of item j's ratings
            val encoded = srcEncodedIndices(i) // the rating user's encoded index
            val blockId = srcEncoder.blockId(encoded) // the rating user's block id
            val localIndex = srcEncoder.localIndex(encoded) // the rating user's local index within that block
            val srcFactor = sortedSrcFactors(blockId)(localIndex) // look up the user's factor vector by block id and local index
            val rating = ratings(i) // this user's rating of item j
            if (implicitPrefs) {
              // Implicit feedback infers preference from user behavior: rating is an interaction
              // count, and 1 + alpha * rating is the confidence in the inferred preference (the
              // paper reports alpha = 40 working well), matching the comments below.
              // Extension to the original paper to handle b < 0. confidence is a function of |b|
              // instead so that it is never negative. c1 is confidence - 1.0.
              val c1 = alpha * math.abs(rating) // use the nonnegative rating
              // For rating <= 0, the corresponding preference is 0. So the term below is only added
              // for rating > 0. Because YtY is already added, we need to adjust the scaling here.
              if (rating > 0) { // positive rating
                numExplicits += 1 // one more rating counted
                ls.add(srcFactor, (c1 + 1.0) / c1, c1) // adds c1 * y * y^T to the matrix A and (c1 + 1) * y to atb; c1 + 1 is the confidence
              }
            } else { // default: explicit feedback
              ls.add(srcFactor, rating) // adds y * y^T to A and, if rating is nonzero, rating * y to atb; note rating may be negative here
              numExplicits += 1 // one more rating counted
            }
            i += 1
          }
          // Weight lambda by the number of explicit ratings based on the ALS-WR paper.
          dstFactors(j) = solver.solve(ls, numExplicits * regParam) // solve for item j's factor vector; numExplicits * regParam is first added to A's diagonal
          j += 1
        }
        dstFactors // the final result maps each item block id to the factor vectors of its items
    }
  }
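
To see what one solver.solve call amounts to in the explicit case, here is a minimal dense sketch: accumulate A as the sum of y * y^T and b as the sum of rating * y over the users who rated the item, add numExplicits * regParam to A's diagonal (the ALS-WR weighting), and solve A x = b. Spark actually solves this with CholeskySolver (or NNLSSolver when nonnegative is set); plain Gaussian elimination is used below only to keep the sketch self-contained:

    // Sketch: solve one item's factor vector from its raters' factor vectors and ratings.
    def solveItemFactor(
        raterFactors: Array[Array[Double]], // factor vectors of the users who rated this item
        ratings: Array[Double],
        regParam: Double): Array[Double] = {
      val rank = raterFactors.head.length
      val a = Array.ofDim[Double](rank, rank) // A = sum of y * y^T
      val b = new Array[Double](rank)         // b = sum of rating * y
      for ((y, r) <- raterFactors.zip(ratings); i <- 0 until rank) {
        for (j <- 0 until rank) a(i)(j) += y(i) * y(j)
        b(i) += r * y(i)
      }
      // Weight lambda by the number of ratings (ALS-WR) and add it to the diagonal.
      for (i <- 0 until rank) a(i)(i) += raterFactors.length * regParam
      // Gaussian elimination without pivoting; safe here because A is positive definite.
      for (k <- 0 until rank; i <- k + 1 until rank) {
        val m = a(i)(k) / a(k)(k)
        for (j <- k until rank) a(i)(j) -= m * a(k)(j)
        b(i) -= m * b(k)
      }
      // Back substitution.
      val x = new Array[Double](rank)
      for (i <- rank - 1 to 0 by -1) {
        x(i) = (b(i) - (i + 1 until rank).map(j => a(i)(j) * x(j)).sum) / a(i)(i)
      }
      x
    }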

Back in the train method:

        previousItemFactors.unpersist() // the new itemFactors has been computed, so release the previous one
        itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel) // name and cache the new itemFactors
        // TODO: Generalize PeriodicGraphCheckpointer and use it here.
        if (shouldCheckpoint(iter)) { // checkpoint periodically
          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
        }
        val previousUserFactors = userFactors // same treatment for the user factors: keep the current ones
        userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
          itemLocalIndexEncoder, implicitPrefs, alpha, solver) // compute each user block's factor vectors
        if (shouldCheckpoint(iter)) { // periodically delete the old checkpoint file and record the new one
          deletePreviousCheckpointFile()
          previousCheckpointFile = itemFactors.getCheckpointFile
        }
        previousUserFactors.unpersist() // release previousUserFactors; at every step the factors are cached, reused, then released, with periodic checkpoints taken of itemFactors only
      }

    } else { // explicit feedback
      for (iter <- 0 until maxIter) { // iterate
        itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
          userLocalIndexEncoder, solver = solver) // compute each block's item factors exactly as in the implicit case, just without passing implicitPrefs and alpha
        if (shouldCheckpoint(iter)) { // checkpoint periodically
          itemFactors.checkpoint()
          itemFactors.count() // checkpoint item factors and cut lineage; then replace the old checkpoint file with the new one
          deletePreviousCheckpointFile()
          previousCheckpointFile = itemFactors.getCheckpointFile
        }
        userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
          itemLocalIndexEncoder, solver = solver) // compute each block's user factors the same way
      }
    }

    // Recall what userInBlocks stores: for each user block, the unique user ids, the cumulative
    // rating-count pointers per user, the encoded item indices, and the user-item ratings.
    val userIdAndFactors = userInBlocks
      .mapValues(_.srcIds) // take just the unique user id array
      .join(userFactors) // userFactors maps each user block id to that block's factor vectors
      .mapPartitions({ items => // per partition
        items.flatMap { case (_, (ids, factors)) =>
          ids.view.zip(factors) // pair each user id with its factor vector
        }
      // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
      // and userFactors.
      }, preservesPartitioning = true)
      .setName("userFactors") // name and cache; the original partitioning is preserved and each partition holds (user id, factor vector) tuples
      .persist(finalRDDStorageLevel)
    val itemIdAndFactors = itemInBlocks // likewise: original partitioning preserved, each partition holds (item id, factor vector) tuples
      .mapValues(_.srcIds)
      .join(itemFactors)
      .mapPartitions({ items =>
        items.flatMap { case (_, (ids, factors)) =>
          ids.view.zip(factors)
        }
      }, preservesPartitioning = true)
      .setName("itemFactors")
      .persist(finalRDDStorageLevel)
    if (finalRDDStorageLevel != StorageLevel.NONE) { // finally release the intermediate results; the count() calls materialize userIdAndFactors and itemIdAndFactors so their lineages are cut
      userIdAndFactors.count()
      itemFactors.unpersist()
      itemIdAndFactors.count()
      userInBlocks.unpersist()
      userOutBlocks.unpersist()
      itemInBlocks.unpersist()
      itemOutBlocks.unpersist()
      blockRatings.unpersist()
    }
    (userIdAndFactors, itemIdAndFactors) // return an RDD of (user id, factor vector) tuples and an RDD of (item id, factor vector) tuples
  }
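
With these two RDDs in hand, predicting a rating is just the dot product of a user's factor vector and an item's factor vector, which is what ALSModel computes at transform time. A minimal sketch:

    // Predicted rating for (user, item) = u . v
    def predict(userFactor: Array[Float], itemFactor: Array[Float]): Float = {
      var dot = 0.0f
      var i = 0
      while (i < userFactor.length) {
        dot += userFactor(i) * itemFactor(i)
        i += 1
      }
      dot
    }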

That wraps up the train method. Now back to the main ALS class:

    val (userFactors, itemFactors) = ALS.train(ratings, rank = $(rank),
      numUserBlocks = $(numUserBlocks), numItemBlocks = $(numItemBlocks),
      maxIter = $(maxIter), regParam = $(regParam), implicitPrefs = $(implicitPrefs),
      alpha = $(alpha), nonnegative = $(nonnegative),
      checkpointInterval = $(checkpointInterval), seed = $(seed)) // returns (userFactors, itemFactors)
    val userDF = userFactors.toDF("id", "features") // convert userFactors and itemFactors to DataFrames
    val itemDF = itemFactors.toDF("id", "features")
    val model = new ALSModel(uid, $(rank), userDF, itemDF).setParent(this) // the ALS model wraps uid, the factor rank, and the userDF and itemDF DataFrames; this is the final result
    copyValues(model)
  }


  override def transformSchema(schema: StructType): StructType = { // override: validate and transform the schema
    validateAndTransformSchema(schema)
  }


  override def copy(extra: ParamMap): ALS = defaultCopy(extra) // override the copy method

That concludes the walkthrough of the recommender. A few points worth spelling out:

1. What exactly is rank? Suppose there are 5 users and 5 items, so the rating matrix is 5x5. With rank = 3, it is factored into a 5x3 matrix and a 3x5 matrix whose product approximates the original, similar in spirit to principal component analysis.
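
A quick sketch of point 1: with rank = 3, every entry of the reconstructed 5x5 matrix is a 3-term dot product between a user row and an item row (random values here, just to show the shapes):

    // Rank-3 factors for 5 users and 5 items.
    val u = Array.fill(5, 3)(scala.util.Random.nextDouble()) // 5 x 3 user factors
    val v = Array.fill(5, 3)(scala.util.Random.nextDouble()) // 5 x 3 item factors
    // Reconstruct the 5 x 5 approximation: R(i, j) = u_i . v_j
    val approx = Array.tabulate(5, 5) { (i, j) =>
      (0 until 3).map(k => u(i)(k) * v(j)(k)).sum
    }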

2. How does ALS run? It alternately optimizes userFactors and itemFactors via computeFactors. Within each block, the factor vectors of all users who rated a given item are accumulated into one normal equation, which is solved to produce that item's factor vector (and vice versa on the user side); iterating this alternation eventually settles userFactors and itemFactors.

3. What are explicit and implicit feedback? We usually work with explicit feedback, or convert the real problem into it: users give items explicit ratings. Implicit feedback has only two states per user-item pair, interaction or no interaction (1 or 0), and a function of the interaction frequency expresses the confidence that the preference is 1. For details see Collaborative Filtering for Implicit Feedback Datasets, the paper Spark ML's recommender follows.
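
In the notation of that paper, point 3 reduces to a preference p and a confidence c derived from the raw interaction count r: p = 1 if r > 0, else p = 0, and c = 1 + alpha * r. A minimal sketch; note that Spark's c1 = alpha * |rating| in computeFactors is exactly confidence - 1, as its source comment says:

    // The implicit-feedback transformation from Collaborative Filtering for Implicit Feedback Datasets.
    def preference(rating: Double): Double = if (rating > 0) 1.0 else 0.0
    def confidence(rating: Double, alpha: Double): Double = 1.0 + alpha * math.abs(rating)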
