Spark GraphX Basics

Spark GraphX

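  • Basic concepts of a graph

A graph is a network-shaped data structure made up of a set of vertices and a set of relationships between those vertices (edges). It is usually written as the pair Graph = (V, E).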
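
To make the pair notation concrete, here is a minimal plain-Scala sketch of Graph = (V, E). It does not use the GraphX API; the names `PairGraphSketch`, `SimpleGraph`, `isWellFormed`, and `successors` are made up for illustration, and storing the edges in a `Set` means parallel edges are not represented.

```scala
// Plain-Scala illustration of Graph = (V, E); not the GraphX API.
object PairGraphSketch {
  // V: vertex ids; E: directed edges as (source, destination) pairs.
  final case class SimpleGraph(vertices: Set[Long], edges: Set[(Long, Long)]) {
    // Every edge endpoint must be a member of the vertex set.
    def isWellFormed: Boolean =
      edges.forall { case (src, dst) => vertices(src) && vertices(dst) }

    // Vertices reachable from v by following one outgoing edge.
    def successors(v: Long): Set[Long] =
      edges.collect { case (src, dst) if src == v => dst }
  }

  // Hypothetical three-vertex example.
  val example: SimpleGraph =
    SimpleGraph(Set(1L, 2L, 3L), Set((1L, 2L), (2L, 3L), (1L, 3L)))
}
```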

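  • GraphX features

  • [1] Works in memory, so data can be reused and read quickly

  • [2] Unifies the graph view and the table view through the resilient distributed property graph

  • [3] Integrates seamlessly with Spark Streaming, Spark SQL, and Spark MLlib

  • GraphX core abstraction

    Resilient Distributed Property Graph
    A directed multigraph in which both vertices and edges carry properties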
    One physical copy of the data, two views
    Every operation on the Graph view is ultimately translated into RDD operations on its associated Table view
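
The "two views" idea can be illustrated without Spark. The sketch below is only an analogy in plain Scala collections: a vertex table and an edge table stand in for the Table view, and a triplet list derived from them stands in for the Graph view. `TwoViewsSketch`, `EdgeRow`, and `Triplet` are hypothetical names, not GraphX classes.

```scala
// Plain-Scala analogy for "one copy of the data, two views"; not the GraphX API.
object TwoViewsSketch {
  final case class EdgeRow(srcId: Long, dstId: Long, attr: Int)
  final case class Triplet(srcAttr: String, attr: Int, dstAttr: String)

  // Table view: one table of vertices, one table of edges.
  val vertexTable: Map[Long, String] = Map(1L -> "Alice", 2L -> "Bob", 4L -> "David")
  val edgeTable: Seq[EdgeRow] = Seq(EdgeRow(2L, 1L, 7), EdgeRow(2L, 4L, 2), EdgeRow(4L, 1L, 1))

  // Graph view: each edge joined with the attributes of both endpoints.
  // It is computed from the same two tables, so no data is duplicated.
  def triplets: Seq[Triplet] =
    edgeTable.map(e => Triplet(vertexTable(e.srcId), e.attr, vertexTable(e.dstId)))
}
```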

  • GraphX API

class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD, ED]]
}
// Import the GraphX API
import org.apache.spark.graphx._
// Assumes a SparkContext named sc, as provided by spark-shell
// Create the vertex RDD: (vertex id, (name, age))
val vertices=sc.parallelize(Array(
        (1L, ("Alice", 28)),
        (2L, ("Bob", 27)),
        (3L, ("Charlie", 65)),
        (4L, ("David", 42)),
        (5L, ("Ed", 55)),
        (6L, ("Fran", 50))
      ))
// Create the edge RDD: Edge(source id, destination id, edge attribute)
val edges=sc.parallelize(Array(
        Edge(2L, 1L, 7),
        Edge(2L, 4L, 2),
        Edge(3L, 2L, 4),
        Edge(3L, 6L, 3),
        Edge(4L, 1L, 1),
        Edge(5L, 2L, 2),
        Edge(5L, 3L, 8),
        Edge(5L, 6L, 3)
      ))
// Create the Graph object
val graph=Graph(vertices,edges)
// Collect the graph's vertices
graph.vertices.collect
// Collect the graph's edges
graph.edges.collect
class Graph[VD, ED] {
  val numEdges: Long   // number of edges
  val numVertices: Long   // number of vertices
  val inDegrees: VertexRDD[Int]  // in-degree of each vertex
  val outDegrees: VertexRDD[Int]  // out-degree of each vertex
  val degrees: VertexRDD[Int]  // in-degree + out-degree
}
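
As a rough illustration of what the degree properties count, the following plain-Scala sketch computes in-degrees and out-degrees for the edge list used above (edge attributes omitted). It is not the GraphX implementation, and `DegreeSketch` is a made-up name. Vertices without incoming edges simply do not appear in the in-degree result, which is the reason a fallback of 0 is needed when such a result is joined back onto the vertices.

```scala
// Plain-Scala sketch of in-degree and out-degree; not the GraphX implementation.
object DegreeSketch {
  // (source id, destination id) pairs from the example graph above.
  val edges: Seq[(Long, Long)] = Seq(
    (2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L),
    (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L))

  // In-degree: number of edges that end at each vertex.
  def inDegrees(es: Seq[(Long, Long)]): Map[Long, Int] =
    es.groupBy(_._2).map { case (v, incoming) => (v, incoming.size) }

  // Out-degree: number of edges that start at each vertex.
  def outDegrees(es: Seq[(Long, Long)]): Map[Long, Int] =
    es.groupBy(_._1).map { case (v, outgoing) => (v, outgoing.size) }
}
```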
  • Graph operators

  • Property operators, similar to the map operation on an RDD
class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}

val t1_graph = graph.mapVertices { case(vertexId, (name, age)) => (vertexId, name) }
val t2_graph = graph.mapVertices { (vertexId, attr) => (vertexId, attr._1) }
val t3_graph = graph.mapEdges(e => e.attr*7.0)

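  • Structural operators
class Graph[VD, ED] {
  def reverse: Graph[VD, ED]  // reverses the direction of every edge
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),   // builds the subgraph whose vertices and edges satisfy the predicates
               vpred: (VertexId, VD) => Boolean = ((v, d) => true)): Graph[VD, ED]
}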
  
val t1_graph = graph.reverse
val t2_graph = graph.subgraph(vpred=(id,attr)=>attr._2<65)  //attr:(name,age)
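
What the vertex predicate does to the edges is easy to miss: an edge survives only if both of its endpoints survive. The plain-Scala sketch below mimics that behavior on the example data, with the same `age < 65` predicate as above. It is an illustration rather than the GraphX implementation, and `SubgraphSketch` is a hypothetical name.

```scala
// Plain-Scala sketch of subgraph with a vertex predicate; not the GraphX implementation.
object SubgraphSketch {
  type Attr = (String, Int) // (name, age)

  val vertices: Map[Long, Attr] = Map(
    (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
    (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)))

  // (source id, destination id, attribute)
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 7), (2L, 4L, 2), (3L, 2L, 4), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 2), (5L, 3L, 8), (5L, 6L, 3))

  def subgraph(vs: Map[Long, Attr], es: Seq[(Long, Long, Int)])(
      vpred: (Long, Attr) => Boolean): (Map[Long, Attr], Seq[(Long, Long, Int)]) = {
    // Keep the vertices that satisfy the predicate.
    val kept = vs.filter { case (id, attr) => vpred(id, attr) }
    // Keep only the edges whose two endpoints were both kept.
    val keptEdges = es.filter { case (src, dst, _) => kept.contains(src) && kept.contains(dst) }
    (kept, keptEdges)
  }

  // Charlie (65) is dropped, and so is every edge that touches vertex 3.
  val under65: (Map[Long, Attr], Seq[(Long, Long, Int)]) =
    subgraph(vertices, edges)((_, attr) => attr._2 < 65)
}
```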

  • Join operators: load data from external RDDs and use it to modify vertex properties
 class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}

val comps= sc.parallelize(Array((1L, "kgc.cn"), (2L, "berkeley.edu"), (3L, "apache.org")))
val c_graph = graph.joinVertices(comps)((id,v,c)=>(v._1+"@"+c,v._2))
c_graph.vertices.foreach(println)

=> Output
(4,(David,42))
(1,(Alice@kgc.cn,28))
(6,(Fran,50))
(3,(Charlie@apache.org,65))
(5,(Ed,55))
(2,(Bob@berkeley.edu,27))

// When a vertex has no match in the RDD, the joined value is None
val c_graph = graph.outerJoinVertices(comps)((id,v,c)=>(v._1+"@"+c,v._2))
c_graph.vertices.foreach(println)

=> Output
(4,(David@None,42))
(1,(Alice@Some(kgc.cn),28))
(6,(Fran@None,50))
(3,(Charlie@Some(apache.org),65))
(5,(Ed@None,55))
(2,(Bob@Some(berkeley.edu),27))
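
The difference between the two joins, and the `Some(...)`/`None` text in the second output, comes down to how the optional value is handled. The plain-Scala sketch below imitates both joins on maps and unwraps the `Option` before building the string. It is an analogy rather than the GraphX implementation; `JoinSketch`, `join`, and `outerJoin` are made-up names, and only three of the example vertices are used.

```scala
// Plain-Scala analogy for joinVertices and outerJoinVertices; not the GraphX API.
object JoinSketch {
  type Attr = (String, Int) // (name, age)

  val vertices: Map[Long, Attr] =
    Map((1L, ("Alice", 28)), (2L, ("Bob", 27)), (4L, ("David", 42)))
  val comps: Map[Long, String] = Map((1L, "kgc.cn"), (2L, "berkeley.edu"))

  // Like joinVertices: unmatched vertices keep their old attribute,
  // and the attribute type cannot change.
  def join[U](vs: Map[Long, Attr], table: Map[Long, U])(
      f: (Long, Attr, U) => Attr): Map[Long, Attr] =
    vs.map { case (id, v) => (id, table.get(id).map(u => f(id, v, u)).getOrElse(v)) }

  // Like outerJoinVertices: the function runs on every vertex and
  // receives an Option, so it may also change the attribute type.
  def outerJoin[U, VD2](vs: Map[Long, Attr], table: Map[Long, U])(
      f: (Long, Attr, Option[U]) => VD2): Map[Long, VD2] =
    vs.map { case (id, v) => (id, f(id, v, table.get(id))) }

  val joined: Map[Long, Attr] =
    join(vertices, comps)((_, v, c) => (v._1 + "@" + c, v._2))

  // Unwrapping the Option avoids "Some(...)" and "None" in the result.
  val outer: Map[Long, Attr] =
    outerJoin(vertices, comps)((_, v, c) => (c.map(d => v._1 + "@" + d).getOrElse(v._1), v._2))
}
```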
  • Applying the GraphX API

[Counting each user's followers]

case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
// Replace the vertex attribute with a User record
val initialUserGraph= graph.mapVertices{ 
     case (id, (name, age)) => User(name, age, 0, 0) 
}
// Store each vertex's in-degree and out-degree in its attribute
val userGraph = initialUserGraph.outerJoinVertices(initialUserGraph.inDegrees) {
     case (id, u, inDegOpt) => User(u.name, u.age, inDegOpt.getOrElse(0), u.outDeg)
}.outerJoinVertices(initialUserGraph.outDegrees) {
    case (id, u, outDegOpt) => User(u.name, u.age, u.inDeg, outDegOpt.getOrElse(0))
}
// A vertex's in-degree is its number of followers
for ((id, property) <- userGraph.vertices.collect) 
   println(s"User $id is ${property.name} and is liked by ${property.inDeg} people.")
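  • PageRank (PR) algorithm

    Evaluates the quality and quantity of the links pointing to a web page to produce a relative score of the page's importance and authority. Google reported this score on a scale of 0 to 10; the values returned by GraphX's pageRank are raw ranks and are not rescaled to that range.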
class Graph[VD, ED] {
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
}
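// tol: the error tolerated at convergence; the smaller it is, the more precise the result. It decides when the iteration stops
// resetProb: the random reset probability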
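
To show how `tol` and `resetProb` enter the computation, here is a simplified power-iteration sketch in plain Scala. It is an assumption-laden illustration, not the algorithm GraphX runs internally: every rank starts at 1.0, each vertex splits its rank evenly over its outgoing edges, and iteration stops once no rank changes by more than `tol`. The `maxIter` guard and the name `PageRankSketch` are additions for this sketch.

```scala
// Simplified PageRank by power iteration; a sketch, not GraphX's implementation.
object PageRankSketch {
  // Vertex ids and (source id, destination id) pairs from the example graph.
  val vertices: Set[Long] = Set(1L, 2L, 3L, 4L, 5L, 6L)
  val edges: Seq[(Long, Long)] = Seq(
    (2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L),
    (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L))

  // Assumes every edge endpoint is a member of vs.
  def pageRank(vs: Set[Long], es: Seq[(Long, Long)], tol: Double,
               resetProb: Double = 0.15, maxIter: Int = 100): Map[Long, Double] = {
    val outDeg: Map[Long, Int] = es.groupBy(_._1).map { case (v, out) => (v, out.size) }
    var ranks: Map[Long, Double] = vs.map(v => (v, 1.0)).toMap
    var delta = Double.PositiveInfinity
    var iter = 0
    while (delta > tol && iter < maxIter) {
      // Each source sends rank / outDegree along every outgoing edge.
      val incoming: Map[Long, Double] = es
        .map { case (src, dst) => (dst, ranks(src) / outDeg(src)) }
        .groupBy(_._1)
        .map { case (v, contribs) => (v, contribs.map(_._2).sum) }
      // resetProb is the share of rank every vertex receives unconditionally.
      val next: Map[Long, Double] =
        vs.map(v => (v, resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))).toMap
      // tol bounds the largest per-vertex change allowed at convergence.
      delta = vs.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

In this sketch a vertex with no incoming edges, such as vertex 5, ends up with a rank equal to `resetProb`.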

  • Pregel

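  • A framework proposed by Google for large-scale distributed graph computation, such as graph traversal (BFS), single-source shortest paths (SSSP), and PageRank

  • A Pregel computation consists of a series of iterations called supersteps

  • The Pregel iteration

     Each vertex receives the inbound messages sent in the previous superstep
     The vertex computes its new property value
     The vertex sends messages to adjacent vertices, to be delivered in the next superstep
     The iteration ends when no messages remain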
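
The four steps above can be sketched as a loop over plain Scala collections. This is a single-machine illustration under simplifying assumptions, not the GraphX implementation: `PregelSketch`, `SimpleEdge`, and `run` are hypothetical names, `sendMsg` receives the edge together with the two endpoint values instead of a triplet object, and it is re-run on every edge that touches a vertex which received a message in the previous superstep.

```scala
// Single-machine sketch of the Pregel superstep loop; not the GraphX implementation.
object PregelSketch {
  final case class SimpleEdge[ED](srcId: Long, dstId: Long, attr: ED)

  def run[VD, ED, A](vertices: Map[Long, VD], edges: Seq[SimpleEdge[ED]],
                     initialMsg: A, maxIterations: Int = Int.MaxValue)(
      vprog: (Long, VD, A) => VD,
      sendMsg: (SimpleEdge[ED], VD, VD) => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {
    // Before the first superstep every vertex receives the initial message.
    var state: Map[Long, VD] =
      vertices.map { case (id, attr) => (id, vprog(id, attr, initialMsg)) }
    var active: Set[Long] = state.keySet
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      // Send messages along edges that touch an active vertex,
      // then merge the messages addressed to the same vertex.
      val inbox: Map[Long, A] = edges.iterator
        .filter(e => active(e.srcId) || active(e.dstId))
        .flatMap(e => sendMsg(e, state(e.srcId), state(e.dstId)))
        .toSeq
        .groupBy(_._1)
        .map { case (id, msgs) => (id, msgs.map(_._2).reduce(mergeMsg)) }
      // Each receiving vertex computes its new value.
      state = state ++ inbox.map { case (id, msg) => (id, vprog(id, state(id), msg)) }
      // Only the receivers stay active; an empty inbox ends the loop.
      active = inbox.keySet
      i += 1
    }
    state
  }

  // Hypothetical three-vertex example: shortest distances from vertex 0.
  val shortest: Map[Long, Double] = run(
    Map(0L -> 0.0, 1L -> Double.PositiveInfinity, 2L -> Double.PositiveInfinity),
    Seq(SimpleEdge(0L, 1L, 5.0), SimpleEdge(0L, 2L, 1.0), SimpleEdge(2L, 1L, 2.0)),
    Double.PositiveInfinity)(
    (_, dist, msg) => math.min(dist, msg),
    (e, srcDist, dstDist) =>
      if (srcDist + e.attr < dstDist) Iterator((e.dstId, srcDist + e.attr)) else Iterator.empty,
    (a, b) => math.min(a, b))
}
```

In the small example the direct edge 0 → 1 costs 5.0, but the second superstep finds the cheaper route through vertex 2, and the third superstep produces no messages, so the loop stops.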
    
  • Pregel API

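class Graph[VD, ED] {
    def pregel[A](initialMsg: A, maxIterations: Int = Int.MaxValue,
                  activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId,A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
}

// initialMsg: the initial message sent to every vertex before "superstep 0"
// maxIterations: the maximum number of iterations to run
// activeDirection: which edges of a vertex that received a message have sendMsg run on them (the default is EdgeDirection.Either)
// vprog: user-defined function run on each vertex that receives a message; it computes the new vertex value
// sendMsg: user-defined function that decides which messages are sent in the next iteration and to which vertices
// mergeMsg: user-defined function that merges multiple messages arriving at a vertex before vprog runs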

  • Computing single-source shortest paths with Pregel

    Find the shortest path from vertex 0 to every other vertex (SSSP)
// Create the vertex RDD
val vertices = sc.parallelize(Array((0L,""),(1L,""),(2L,""),(3L,""),(4L,""),(5L,"")))
// Create the edge RDD; the edge attribute is the distance
val relationships = sc.parallelize(Array(Edge(0L, 1L, 100.0), Edge(0L, 2L, 30.0), Edge(0L, 4L, 10.0),
Edge(2L, 1L, 60.0), Edge(2L, 3L, 60.0),  Edge(3L, 1L, 10.0),Edge(4L, 3L, 50.0)))
val g = Graph(vertices,relationships)
val sourceId: VertexId = 0
val initialGraph = g.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
	(id, dist, newDist) => math.min(dist, newDist),
	triplet => {
		if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
		Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
		} else {
		Iterator.empty
		}},
	(a, b) => math.min(a, b)
)
 sssp.vertices.collect().foreach(println)
=> Output
	(4,10.0)
	(0,0.0)
	(1,70.0)
	(3,60.0)
	(5,Infinity)
	(2,30.0)