Algorithm Overview
PathPlanning is an algorithm implemented on top of the Pregel mechanism in Spark GraphX. For background on the Pregel mechanism, see the earlier article 【大数据分析】基于Graphx的shortestpath源码解析. Within a limited number of iterations, PathPlanning computes as many as possible of the paths from a start node S to a target node T in the graph.
Algorithm Walkthrough
Data Preparation
Code that creates the sample data:
import org.apache.spark.graphx.Edge

// sc is an existing SparkContext (e.g. the one provided by spark-shell).
val myVertices = sc.makeRDD(Array(
  (1L, "Dave"),
  (2L, "Faith"),
  (3L, "Harvey"),
  (4L, "Bob"),
  (5L, "Alice"),
  (6L, "Charlie"),
  (7L, "George"),
  (8L, "Ivy")
))
val myEdges = sc.makeRDD(Array(
  Edge(7L, 1L, "friend"),
  Edge(7L, 2L, "sister"),
  Edge(7L, 6L, "friend"),
  Edge(1L, 4L, "friend"),
  Edge(4L, 1L, "brother"),
  Edge(3L, 2L, "boss"),
  Edge(2L, 3L, "client"),
  Edge(2L, 4L, "client"),
  Edge(1L, 5L, "client"),
  Edge(4L, 5L, "coworker"),
  Edge(3L, 8L, "coworker"),
  Edge(5L, 8L, "father"),
  Edge(4L, 8L, "colleague")
))
A diagram of the sample data (in this example S is 1L and T is 8L) is shown below:

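Below is a minimal sketch (not part of the original post) of how the sample graph could be assembled and the algorithm invoked; it assumes the PathPlanning object defined later in this article is available and uses an arbitrary budget of 10 iterations.

import org.apache.spark.graphx.Graph

// Build the property graph from the sample RDDs above.
val myGraph = Graph(myVertices, myEdges)
// Compute paths from S = 1L to T = 8L within at most 10 Pregel supersteps.
val result = PathPlanning.run(myGraph, source = 1L, target = 8L, maxIterations = 10)
// Each vertex attribute maps an in-neighbour's id to the paths that reach this vertex.
result.vertices.collect().foreach(println)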
Questions to Consider for a Pregel-Based Algorithm
An algorithm built on the Pregel mechanism generally needs to answer a few questions:
(1) How should the vertex attribute be structured?
(2) How are the vertex attributes initialized?
(3) How are all vertices activated before the first iteration?
(4) How are messages passed (how vertex state changes, the direction of message passing, and how messages are updated)?
(5) How are multiple incoming messages combined (merge)?
(6) How is the final received message combined with the current vertex attribute (vertex_program)?
Defining the Vertex Attribute Type
The code defines two case classes and a MsgType type alias:
/**
* @desc A path instance: the length of the path and the path itself (a list of vertex ids)
*/
case class PathInstance(l: Double, p: List[VertexId])
/**
* @param dstId Used to store the target node T's ID
* @param pathInstances An ArrayBuffer of path instances, each path being a list of vertex ids
*/
case class MsgValue(dstId: VertexId, pathInstances: ArrayBuffer[PathInstance])
type MsgType = Map[VertexId, MsgValue]
(1) PathInstance represents a path instance: a path p together with its corresponding length l.
(2) MsgValue contains two fields, dstId and pathInstances. dstId is the target node T that the algorithm computes paths towards, and pathInstances is the list of path instances.
(3) MsgType is the attribute type of every vertex; it is a Map whose keys are VertexIds. Each key is an in-neighbour of the current node B, i.e. the A of a triplet A → B. In other words, node B stores A's vertex id as the key, together with all of A's paths extended with B's own id as the value (see the sketch after this list).
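As an illustration (hypothetical values, assuming S = 1L, T = 8L, and that PathPlanning._ and scala.collection.mutable.ArrayBuffer are imported), the attribute of node B = 8L after a few supersteps might look like this, with one entry per in-neighbour that has reached it:

// Key 4L holds the paths that arrive at 8L via 4L; key 5L holds those that arrive via 5L.
val attrOf8: MsgType = Map(
  4L -> MsgValue(dstId = 8L, pathInstances = ArrayBuffer(
    PathInstance(2.0, List(1L, 4L, 8L)))),
  5L -> MsgValue(dstId = 8L, pathInstances = ArrayBuffer(
    PathInstance(2.0, List(1L, 5L, 8L)),
    PathInstance(3.0, List(1L, 4L, 5L, 8L))))
)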
Initializing the Vertex Attributes
val PGraph = graph.mapVertices { (vid, attr) =>
  if (vid == source) {
    val dstId = target
    val pi = PathInstance(l = 0, p = List(vid))
    val pis = new ArrayBuffer[PathInstance]()
    pis += pi
    val msgValue = MsgValue(dstId = dstId, pathInstances = pis)
    makeMsg(vid -> msgValue)
  } else {
    makeMsg()
  }
}
The result is shown in the figure below.

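For reference, a sketch of what mapVertices produces (assuming S = 1L and T = 8L, imports as above): the source holds a single zero-length path to itself, and every other vertex starts with an empty Map.

// Initial attribute of the source vertex 1L:
val initialSourceAttr: MsgType =
  Map(1L -> MsgValue(dstId = 8L, pathInstances = ArrayBuffer(PathInstance(0.0, List(1L)))))
// Initial attribute of every other vertex:
val initialOtherAttr: MsgType = Map()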
Activating All Vertices at Initialization

Activating all vertices requires an initial message; here it is a Map containing a single element:
val initialMessage = makeMsg(-1L -> null)
When all vertices are activated, the vertexProgram method is triggered directly, and initialMessage is passed in as the msg parameter.
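A small sketch of this first superstep (not from the original post): because the initial message carries the sentinel key -1L, vertexProgram returns the attribute unchanged, so the only effect is that every vertex becomes active.

// vertexProgram detects the sentinel key -1L and returns the attribute as-is.
val attrBefore: MsgType = Map()                             // e.g. a non-source vertex
val attrAfter = PathPlanning.vertexProgram(2L, attrBefore, Map(-1L -> null))
assert(attrAfter == attrBefore)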
Message Passing
Message passing is controlled by sendMsg. Writing every triplet as A → B, the message-passing logic works as follows (a concrete example follows the four steps below):
1. Filtering triplets by active/inactive state
Select the triplets A → B in which A or B is active.
2. Building the message
Collect the path instances stored under the different keys of A, add the edge weight to each path length, and append B to the end of each path.
3. Deciding whether a message is actually sent
For a triplet A → B in which A or B is active, no message is sent in any of the following cases:
(1) A's Map has no elements.
(2) A path in the message built in step 2 contains a repeated node.
(3) A is the target node T that we are computing paths to.
(4) The message built in step 2 is identical to what B already stores under A's key.
For (1) and (3), see the figure below: when 2, 3, 4, 5, 6 or 7 acts as A, no message is sent to the corresponding B (their Maps are empty); when 8, which is T, acts as A, no message is sent either.

For (2) and (4), see the figure below: 4 → 1 would produce a path with a repeated node, so it falls under (2); 1 → 4 and 1 → 5 fall under (4).

4. Direction of message passing
Overall, messages flow from A to B: the message built in step 2 (if step 3 allows it) is sent to B.
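A concrete sketch of steps 2 and 4 for the triplet 1L → 4L on the first superstep, assuming S = 1L, T = 8L and the fixed edge weight of 1 used in the code:

// A = 1L currently holds only the zero-length path to itself.
val srcAttrOf1: MsgType =
  Map(1L -> MsgValue(dstId = 8L, pathInstances = ArrayBuffer(PathInstance(0.0, List(1L)))))
// Step 2 adds the edge weight to each length and appends B = 4L to each path;
// step 4 sends the result to B, keyed by A's id:
val msgSentTo4: MsgType =
  Map(1L -> MsgValue(dstId = 8L, pathInstances = ArrayBuffer(PathInstance(1.0, List(1L, 4L)))))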
Merging Messages
Given two triplets 1 → 4 and 2 → 4, both send their message to vertex 4. Before vertex 4 receives them, the two messages must be merged; mergeMsg is responsible for combining them. Here mergeMsg simply unions the two Maps, because their keys are distinct and each message contains exactly one element.
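A sketch of the merge with hypothetical message contents (chosen only for illustration): vertex 4L receives one message keyed by 1L and another keyed by 2L in the same superstep.

val msgFrom1: MsgType = Map(1L -> MsgValue(8L, ArrayBuffer(PathInstance(1.0, List(1L, 4L)))))
val msgFrom2: MsgType = Map(2L -> MsgValue(8L, ArrayBuffer(PathInstance(1.0, List(2L, 4L)))))
// mergeMsg is a plain Map union; the keys differ, so both entries survive.
val merged: MsgType = msgFrom1 ++ msgFrom2   // Map(1L -> ..., 2L -> ...)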
Combining the Message with the Vertex Attribute
vertexProgram combines the merged message with the attribute of the vertex that receives it. Given the current attribute attr and the received message msg, the key sets of both are unioned; for a key present in both, the entry from msg replaces the one in attr, while keys that appear only in attr keep their old entries.
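A sketch of this combination at vertex 8L (assuming S = 1L, T = 8L): the entry under a key present in the message is replaced, while a key present only in the old attribute is kept.

val oldAttr: MsgType = Map(
  4L -> MsgValue(8L, ArrayBuffer(PathInstance(2.0, List(1L, 4L, 8L)))),
  5L -> MsgValue(8L, ArrayBuffer(PathInstance(2.0, List(1L, 5L, 8L)))))
val newMsg: MsgType = Map(
  5L -> MsgValue(8L, ArrayBuffer(PathInstance(2.0, List(1L, 5L, 8L)),
                                 PathInstance(3.0, List(1L, 4L, 5L, 8L)))))
val updated = PathPlanning.vertexProgram(8L, oldAttr, newMsg)
// 'updated' keeps key 4L from oldAttr and takes the richer entry for key 5L from newMsg.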
The Iteration Process to Completion

Complete Code
package com.edata.bigdata.algorithm.networks.approximation

import org.apache.spark.graphx.{EdgeTriplet, Graph, Pregel, VertexId}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

/**
 * @Description: Given a source node S and a target node T, calculate the possible paths
 *               and the corresponding lengths from S to T.
 *               The result is only an estimate; as the iterations continue, more and more
 *               correct results are computed.
 * @Author: Alan Sword
 * @Date 10:42
 * @Version 1.0
 */
object PathPlanning extends Serializable {

  /**
   * @desc A path instance: the length of the path and the path itself (a list of vertex ids)
   */
  case class PathInstance(l: Double, p: List[VertexId])

  /**
   * @param dstId         Used to store the target node T's ID
   * @param pathInstances An ArrayBuffer of path instances, each path being a list of vertex ids
   */
  case class MsgValue(dstId: VertexId, pathInstances: ArrayBuffer[PathInstance])

  type MsgType = Map[VertexId, MsgValue]

  /**
   * @param x zero or more 'key -> value' elements.
   * @return a 'MsgType' Map.
   * @Description Builds a vertex attribute (or message) from the given pairs, where each key is a neighbor of the current node
   */
  private def makeMsg(x: (VertexId, MsgValue)*) = Map(x: _*)

  /**
   * @param edge an edge triplet (A->B).
   * @return the new MsgValue built from A's attribute.
   * @Description Uses A's attribute to build the message content sent towards B
   */
  private def updateMsgValue(edge: EdgeTriplet[MsgType, _]): MsgValue = {
    val edgeDstId = edge.dstId
    // The edge weight is fixed at 1 here.
    val edgeWeight = 1
    // Every entry in A's attribute carries the same target id, so taking the first one is enough.
    val dstId = edge.srcAttr.values.map(data => data.dstId).reduce((x, y) => x)
    // Collect the path instances stored under all of A's keys, add the edge weight to each
    // length and append B's id to each path.
    val pathInstances = edge.srcAttr.values.map(data => data.pathInstances).reduce((x, y) => x ++ y).map(pi => {
      val l = pi.l + edgeWeight
      val p = pi.p :+ edgeDstId
      PathInstance(l, p)
    }).distinct
    MsgValue(dstId, pathInstances)
  }

  /**
   * @param msg1 a message from a neighbor.
   * @param msg2 a message from another neighbor.
   * @return a 'MsgType' Map that is made from a combination of msg1 and msg2.
   * @Description Merges two messages by a plain Map union; their keys never collide.
   */
  private def mergeMsg(msg1: MsgType, msg2: MsgType): MsgType = {
    msg1 ++ msg2
  }

  /**
   * @param vid  The vertex id of the node that receives the message.
   * @param attr The current attribute of that node.
   * @param msg  The message after 'mergeMsg'.
   * @return the updated attribute.
   * @Description Updates the attribute of the receiving node.
   */
  def vertexProgram(vid: VertexId, attr: MsgType, msg: MsgType): MsgType = {
    if (msg.keySet.contains(-1L)) {
      // The initial message carries the sentinel key -1L: leave the attribute unchanged.
      attr
    } else {
      // Union of both key sets; for shared keys the value from msg wins.
      val attr_msg = (attr.keySet ++ msg.keySet).map {
        k => k -> msg.getOrElse(k, attr.getOrElse(k, null))
      }.toMap
      attr_msg
    }
  }

  /**
   * @param edge an edge triplet (A->B).
   * @return an iterator over the messages to send, possibly empty.
   * @Description Sends messages in 'Iterator[(VertexId, MsgType)]' format from node to node.
   */
  private def sendMsg(edge: EdgeTriplet[MsgType, _]): Iterator[(VertexId, MsgType)] = {
    // Rule (1): A's Map has no elements, so there is nothing to propagate.
    if (edge.srcAttr.isEmpty) return Iterator.empty
    val msg_value_new = updateMsgValue(edge)
    val path_instances_new = msg_value_new.pathInstances
    // Rule (2): drop the message if any constructed path contains a repeated node.
    if (path_instances_new.exists(p => p.p.distinct.length < p.p.length)) return Iterator.empty
    val srcId = edge.srcId
    val msg_value_dst = edge.dstAttr.getOrElse(srcId, MsgValue(dstId = 0, pathInstances = new ArrayBuffer[PathInstance]()))
    // Rule (3): A is the target node T (B's stored entry for A carries A's own id as dstId).
    if (msg_value_dst.dstId == srcId) return Iterator.empty
    val path_instances_dst = msg_value_dst.pathInstances
    // Rule (4): the constructed message is identical to what B already stores for A.
    if (path_instances_new.containsSlice(path_instances_dst) && path_instances_dst.containsSlice(path_instances_new))
      return Iterator.empty
    val msg = makeMsg(edge.srcId -> msg_value_new)
    Iterator((edge.dstId, msg))
  }

  /**
   * @param graph         The graph that needs to be calculated
   * @param source        The starting point S
   * @param target        The ending point T
   * @param maxIterations The maximum number of Pregel supersteps
   * @tparam VD the original vertex attribute type
   * @tparam ED the edge attribute type
   * @Description Initializes the vertex attributes and runs Pregel for at most maxIterations supersteps.
   */
  def run[VD, ED: ClassTag](graph: Graph[VD, ED], source: VertexId, target: VertexId, maxIterations: Int): Graph[MsgType, ED] = {
    val PGraph = graph.mapVertices { (vid, attr) =>
      if (vid == source) {
        // The source starts with a single zero-length path to itself, keyed by its own id.
        val dstId = target
        val pi = PathInstance(l = 0, p = List(vid))
        val pis = new ArrayBuffer[PathInstance]()
        pis += pi
        val msgValue = MsgValue(dstId = dstId, pathInstances = pis)
        makeMsg(vid -> msgValue)
      } else {
        // Every other vertex starts with an empty Map.
        makeMsg()
      }
    }
    // The sentinel key -1L marks the initial message that activates all vertices.
    val initialMessage = makeMsg(-1L -> null)
    Pregel(PGraph, initialMessage, maxIterations = maxIterations)(vertexProgram, sendMsg, mergeMsg)
  }
}
The output of the run is as follows:
(6,Map(7 -> MsgValue(7,ArrayBuffer(PathInstance(1.0,List(7, 6))))))
(2,Map(7 -> MsgValue(7,ArrayBuffer(PathInstance(1.0,List(7, 2)))), 3 -> MsgValue(7,ArrayBuffer())))
(8,Map(3 -> MsgValue(7,ArrayBuffer(PathInstance(3.0,List(7, 2, 3, 8)))), 5 -> MsgValue(7,ArrayBuffer(PathInstance(4.0,List(7, 1, 4, 5, 8)), PathInstance(4.0,List(7, 2, 4, 5, 8)))), 4 -> MsgValue(7,ArrayBuffer(PathInstance(3.0,List(7, 1, 4, 8)), PathInstance(3.0,List(7, 2, 4, 8))))))
(7,Map(7 -> MsgValue(7,ArrayBuffer(PathInstance(0.0,List(7))))))
(4,Map(1 -> MsgValue(7,ArrayBuffer(PathInstance(2.0,List(7, 1, 4)))), 2 -> MsgValue(7,ArrayBuffer(PathInstance(2.0,List(7, 2, 4))))))
(3,Map(2 -> MsgValue(7,ArrayBuffer(PathInstance(2.0,List(7, 2, 3))))))
(1,Map(7 -> MsgValue(7,ArrayBuffer(PathInstance(1.0,List(7, 1))))))
(5,Map(4 -> MsgValue(7,ArrayBuffer(PathInstance(3.0,List(7, 1, 4, 5)), PathInstance(3.0,List(7, 2, 4, 5))))))
This article introduced the PathPlanning algorithm based on Spark GraphX. Using the Pregel mechanism, the algorithm finds, within a limited number of iterations, paths from a start node S to a target node T in the graph. It walked through data preparation, the definition of vertex attributes, initialization, message passing and merging, and provided the complete code.