Graduation Project, Week 3 (Pregel in GraphX and Shortest Path)

myta0424

License: CC 4.0 BY-SA

Categories: Spark GraphX, Graduation Project. Tags: weekly report, pregel, GraphX, shortest path

Article link: https://blog.youkuaiyun.com/u011033990/article/details/49849167

Pregel in GraphX

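Last week I noticed that ShortestPaths.scala, which ships with the GraphX lib, is built on the Pregel framework, so this week I started by reading how Pregel is implemented on GraphX. While reading the code I also went through part of the GraphX API.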

  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    // Initialize the program with initialMsg. This produces a new graph with the same structure as the original one; only the vertex attributes change.
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
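    /*
    mapReduceTriplets() takes a map function (sendMsg) and a reduce function (mergeMsg). The map function is invoked on every edge and returns an iterator of (VertexId, A) pairs, which are the messages to send to the edge's source or destination vertex. The reduce function merges the messages addressed to the same vertex.
    */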
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
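      /* innerJoin takes two arguments: the RDD to join with and a function applied to each matching pair. In ShortestPaths this means that, for every vertex V that has a merged message, vprog compares the distance V already holds with the distance in the message and keeps the smaller one as the new value.
      */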
      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
      prevG = g
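      /* Like innerJoin, outerJoinVertices takes two arguments. The map function checks whether a new value was produced for the vertex: if so it returns the new value, otherwise it returns the old one.
      */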
      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
      // NOTE: I still do not understand when cache() should be used
      g.cache()

      val oldMessages = messages
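      /* This call to mapReduceTriplets takes an extra argument, Some((newVerts, activeDirection)). It selects the vertices that are still active after this iteration (those in newVerts) and then tests every edge: taking activeDirection = Either as an example, sendMsg runs on an edge only if it is an in-edge or out-edge of some active vertex. */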
      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
      activeMessages = messages.count()

      logInfo("Pregel finished iteration " + i)

      // Unpersist intermediate results to free memory
      oldMessages.unpersist(blocking=false)
      newVerts.unpersist(blocking=false)
      prevG.unpersistVertices(blocking=false)
      prevG.edges.unpersist(blocking=false)
      i += 1
    }
    g
}
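
To make the loop above easier to follow, here is a minimal single-machine sketch of the same control flow in plain Scala. It does not use Spark: `MiniPregel`, `Triplet`, `run` and `MiniPregelDemo` are names made up for this illustration, maps and sequences stand in for RDDs, and only the `EdgeDirection.Either` case is modelled. The demo computes hop distances to a landmark in the spirit of ShortestPaths, but with a single Int per vertex instead of the library's map of landmarks.

```scala
// Hypothetical single-machine sketch of the GraphX-style Pregel loop (no Spark).
object MiniPregel {
  type VertexId = Long

  // Stand-in for GraphX's EdgeTriplet: an edge plus both endpoint attributes.
  final case class Triplet[VD](srcId: VertexId, dstId: VertexId, srcAttr: VD, dstAttr: VD)

  def run[VD, A](
      vertices: Map[VertexId, VD],
      edges: Seq[(VertexId, VertexId)],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: Triplet[VD] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, VD] = {

    // Run sendMsg on the given edges and merge the messages per receiving vertex.
    def exchange(vs: Map[VertexId, VD], es: Seq[(VertexId, VertexId)]): Map[VertexId, A] =
      es.iterator
        .flatMap { case (s, d) => sendMsg(Triplet(s, d, vs(s), vs(d))) }
        .foldLeft(Map.empty[VertexId, A]) { case (acc, (id, m)) =>
          acc.updated(id, acc.get(id).fold(m)(mergeMsg(_, m)))
        }

    // Every vertex first processes initialMsg.
    var vs = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    var messages = exchange(vs, edges)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Only vertices that received a message run vprog (the innerJoin step).
      val newVerts = messages.collect {
        case (id, m) if vs.contains(id) => id -> vprog(id, vs(id), m)
      }
      vs = vs ++ newVerts // the outerJoinVertices step
      // Only edges touching an updated vertex are rescanned (EdgeDirection.Either).
      val activeEdges = edges.filter { case (s, d) => newVerts.contains(s) || newVerts.contains(d) }
      messages = exchange(vs, activeEdges)
      i += 1
    }
    vs
  }
}

object MiniPregelDemo {
  import MiniPregel._
  val Inf = Int.MaxValue
  // Edges 1 -> 2 -> 3 -> 4 plus a shortcut 1 -> 3; vertex 4 is the landmark, 5 is isolated.
  val edges = Seq((1L, 2L), (2L, 3L), (3L, 4L), (1L, 3L))
  val init: Map[VertexId, Int] = Map(1L -> Inf, 2L -> Inf, 3L -> Inf, 4L -> 0, 5L -> Inf)

  // Messages travel from dst to src, as in ShortestPaths, and only when they shorten src's distance.
  val dist: Map[VertexId, Int] = run(init, edges, Inf)(
    (_, attr, msg) => math.min(attr, msg),
    t => if (t.dstAttr != Inf && t.dstAttr + 1 < t.srcAttr) Iterator((t.srcId, t.dstAttr + 1))
         else Iterator.empty,
    (a, b) => math.min(a, b))
}
```

In this sketch the vertex program never builds messages itself; that is done per edge by `sendMsg`, which is the structural point of the GraphX variant.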

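Analysis of the ShortestPaths.scala algorithm (unweighted edges)

Designing algorithms for a distributed environment really is hard, and some of them are not intuitive even to read. ShortestPaths in the lib is an example: reading the code gave me a basic understanding of how the Pregel framework runs and the code itself is mostly clear to me, but the algorithm still needs extra time to digest.
Let's look at what the program actually does. I could not find any material on this, so everything below is my own analysis and may not be fully rigorous.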

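Correctness analysis

Ignoring the limit on the number of iterations, the program terminates when activeMessages is 0. This means (see weekly report 2) that the condition

edge.srcAttr != addMaps(newAttr, edge.srcAttr)

no longer holds on any edge, i.e. no shorter distance is left to propagate. Suppose that at this point the route a -> b -> c -> … j -> k recorded for a is not optimal. Then there is a shorter route a -> b1 -> c1 -> … j1 -> k, so the distance from b1 to k is smaller than the distance from b to k. Assuming b1 already holds its correct distance (which follows by induction on the length of the route), the edge a -> b1 would send a message to a and update a's state, so activeMessages would not be 0, a contradiction.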
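So the termination condition guarantees correctness. The next question is whether the algorithm terminates at all. It fails to terminate only if every iteration produces new messages, i.e. every iteration updates the state of some vertex. Each update of a vertex lowers its distance to the target vertex by at least 1 (when the target is unreachable no update happens, so that case can be ignored). The largest possible distance is a positive integer smaller than $|V|$, so each vertex can be updated at most $|V|$ times before its distance can no longer decrease. With $|V|$ vertices, the number of iterations is therefore at most $|V| \times |V|$, so the algorithm always terminates.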
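
The counting argument can be written out as a short derivation. This is only a sketch for a single target vertex; $U(v)$, the number of times vertex $v$ changes its recorded distance, is a symbol introduced here for illustration and does not appear in the GraphX code.

```latex
\begin{aligned}
U(v) &\le |V|
  && \text{(recorded distances are integers in } \{0, \dots, |V|-1\} \text{ and strictly decrease)} \\
\sum_{v \in V} U(v) &\le |V| \times |V| \\
\#\text{iterations} &\le \sum_{v \in V} U(v) \le |V| \times |V|
  && \text{(every iteration with a message updates at least one vertex)}
\end{aligned}
```

The bound is loose. Because the supersteps are synchronous and the edges are unweighted, the propagation appears to behave like a breadth-first search, so the actual number of iterations should be closer to the longest shortest-path length than to $|V| \times |V|$.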

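Efficiency analysis

Analyzing the efficiency of the algorithm feels hard. It depends directly on the number of iterations and on the amount of communication in each iteration. The key line of code is: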

messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()

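activeDirection decides whether sendMsg runs on a given edge (which affects the cost of the current iteration), while the count of messages decides whether another iteration is needed (which affects the number of iterations). In a distributed environment the efficiency also depends on the number of partitions, so for now my view is that it can only be analyzed from experimental results; a theoretical analysis is very difficult.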
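
As a small experiment in that spirit, the sketch below counts how many edges are scanned (that is, how often the send function would run) with and without restricting the scan to edges that touch an updated vertex. It is a single-machine toy without Spark or partitions; `ActiveSetCost`, `run` and `path` are hypothetical names, and the filter only imitates the `EdgeDirection.Either` case.

```scala
// Hypothetical experiment (no Spark): count edge scans per run for hop-distance
// propagation, with and without active-edge filtering.
object ActiveSetCost {
  val Inf = Int.MaxValue

  // Returns (number of loop iterations, total number of edges scanned).
  def run(n: Int, edges: Seq[(Int, Int)], landmark: Int, useActiveSet: Boolean): (Int, Long) = {
    val dist = Array.fill(n)(Inf)
    dist(landmark) = 0
    var scanned = 0L
    var steps = 0

    // Scan the given edges; a message goes from dst to src if it shortens src's distance.
    // Messages for the same vertex are merged by taking the minimum.
    def exchange(es: Seq[(Int, Int)]): Map[Int, Int] = {
      scanned += es.size
      es.collect { case (s, d) if dist(d) != Inf && dist(d) + 1 < dist(s) => s -> (dist(d) + 1) }
        .groupBy(_._1)
        .map { case (v, ms) => v -> ms.map(_._2).min }
    }

    var messages = exchange(edges)
    while (messages.nonEmpty) {
      messages.foreach { case (v, m) => dist(v) = math.min(dist(v), m) }
      val active = messages.keySet
      val next =
        if (useActiveSet) edges.filter { case (s, d) => active(s) || active(d) }
        else edges
      messages = exchange(next)
      steps += 1
    }
    (steps, scanned)
  }

  // A directed path 0 -> 1 -> ... -> n-1; use n - 1 as the landmark.
  def path(n: Int): Seq[(Int, Int)] = (0 until n - 1).map(i => (i, i + 1))
}
```

On a 5-vertex path with the landmark at the far end, `ActiveSetCost.run(5, ActiveSetCost.path(5), 4, useActiveSet = true)` gives 4 iterations and 11 edge scans, while the unfiltered variant needs the same 4 iterations but 20 scans. The iteration count is driven by the messages, the per-iteration cost by the filter.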

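Architecture of Pregel in GraphX

An article by 明风 (Mingfeng) points out that the Pregel implementation on GraphX differs from standard Pregel and borrows from GAS. The original text reads: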

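The Pregel interface in GraphX does not strictly follow the Pregel model; it is a Pregel model improved with ideas from GAS. It is defined as follows:

The biggest difference between this mrTriplets-based Pregel model and standard Pregel is that its second parameter list takes three function arguments and does not take a messageList. It does not iterate over messages on a single vertex; instead, the messages received by a vertex's multiple ghost replicas are aggregated and sent to the master replica, and vprog is then used to update the vertex value. Receiving and sending messages are parallelized automatically, so there is no need to worry about super nodes.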

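I had not noticed this before, probably because I did not understand either Pregel or GAS systems well, so I went back and read the paper Pregel: A System for Large-Scale Graph Processing.

Here is the standard Pregel implementation of ShortestPath: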

class ShortestPathVertex
    : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(),
                      mindist + iter.GetValue());
    }
    VoteToHalt();
  }
};

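Pregel's design philosophy is "think like a vertex", and the algorithm shows it: what it defines is essentially how each vertex receives, processes and sends messages in each superstep. In Pregel, messages are therefore produced by the computation at the vertex. In every iteration a vertex receives the messages sent to it in the previous superstep, computes on them, and then produces and sends a new batch of messages. Pregel in GraphX does not work this way.
Section 4.1 of GraphX: A Resilient Distributed Graph System on Spark contains the passage quoted below, which I did not notice on my first read of the paper (in truth, I did not understand it). In GraphX, message construction has been lifted out of the vertex-program. The paper is somewhat vague about the benefit; my current understanding is that it makes pruning easier (activeDirection), although Pregel can also prune with a combiner.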

Unlike the original Pregel API in which the vertex program is passed the set of neighbors and returns a list of messages, our implementation learns from the observations made by [6] and lifts the message construction out of the vertex-program. Messages are computed using a message generation function which takes an edge containing the source and target attributes and returns a message or void indicating the absence of a message. By lifting the message construction out the vertex-program, we are able to achieve a more efficient execution than the original Pregel framework and leverage the vertex-cut representation. Moreover, message construction for a single vertex can be distributed over the cluster, moved to the receiving machine, and executed in an efficient order.

Reference [6] cited there is the paper Powergraph: Distributed Graph-Parallel Computation on Natural Graphs, so that naturally becomes next week's topic.
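
To make the contrast concrete, here is a minimal single-machine sketch of the original, vertex-centric style in Scala. It is not the Pregel API: `VertexCentricSssp`, `compute` and `run` are hypothetical names, vertices are numbered 0 until n, and halting is modelled by only waking vertices that have incoming messages. The point to notice is that `compute` both folds the whole inbox and builds the outgoing messages, which is exactly the part GraphX moves into a per-edge message function.

```scala
// Hypothetical single-machine sketch of the vertex-centric Pregel model.
object VertexCentricSssp {
  val Inf = Int.MaxValue

  // The vertex sees its full inbox and its out-edges (target, weight).
  // It returns its new value and the messages it decides to send itself.
  def compute(isSource: Boolean, value: Int, inbox: Seq[Int],
              outEdges: Seq[(Int, Int)]): (Int, Seq[(Int, Int)]) = {
    val minDist = math.min(if (isSource) 0 else Inf, inbox.foldLeft(Inf)(math.min))
    if (minDist < value) (minDist, outEdges.map { case (t, w) => t -> (minDist + w) })
    else (value, Seq.empty)
  }

  // adj maps a vertex to its out-edges; all targets are assumed to be in 0 until n.
  def run(n: Int, adj: Map[Int, Seq[(Int, Int)]], source: Int): Vector[Int] = {
    var values = Vector.fill(n)(Inf)
    // Superstep 0: every vertex is active with an empty inbox.
    var inboxes: Map[Int, Seq[Int]] = (0 until n).map(_ -> Seq.empty[Int]).toMap
    while (inboxes.nonEmpty) {
      val results = inboxes.map { case (v, in) =>
        v -> compute(v == source, values(v), in, adj.getOrElse(v, Seq.empty))
      }
      values = results.foldLeft(values) { case (vs, (v, (nv, _))) => vs.updated(v, nv) }
      // Every vertex halts after compute; only vertices with new messages wake up again.
      inboxes = results.values.flatMap(_._2).toSeq
        .groupBy(_._1)
        .map { case (t, ms) => t -> ms.map(_._2) }
    }
    values
  }
}
```

For example, `VertexCentricSssp.run(5, Map(0 -> Seq((1, 4), (2, 1)), 2 -> Seq((1, 2)), 1 -> Seq((3, 1))), 0)` propagates distances from vertex 0 and leaves the unreachable vertex 4 at `Int.MaxValue`.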

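Summary

The GraphX lib provides many sample algorithms, such as ShortestPaths, ConnectedComponents and PageRank, and many of them are built on Pregel. However, the Pregel implementation in GraphX differs from standard Pregel and borrows from Powergraph to some extent. Next week I will read the other implementations in the lib and focus on studying Powergraph.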