Spark GraphX
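A graph is a network-like data structure made up of a set of vertices and a set of relationships between those vertices (edges). It is usually written as the pair Graph = (V, E).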
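- GraphX features
  - [1] Keeps data in memory so that it can be reused and read quickly
  - [2] Unifies the graph view and the table view through the resilient distributed property graph (Property Graph)
  - [3] Integrates seamlessly with Spark Streaming, Spark SQL, and Spark MLlib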
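- GraphX core abstraction
  - Resilient Distributed Property Graph
  - A directed multigraph in which both vertices and edges carry attributes
  - One physical copy of the data, two views (Graph and Table)
  - Every operation on the Graph view is ultimately translated into RDD operations on the associated Table view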
class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD, ED]]
}
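The triplets view is easier to picture with a small model. The sketch below is plain Scala, not GraphX: `SimpleEdge`, `SimpleTriplet`, and `tripletsOf` are hypothetical names introduced only for illustration. It mimics how each edge is paired with the attributes of its source and destination vertices, and it does not reflect how GraphX stores or partitions the data.

```scala
// Hypothetical stand-ins for Edge and EdgeTriplet; these are NOT GraphX types.
final case class SimpleEdge[ED](srcId: Long, dstId: Long, attr: ED)
final case class SimpleTriplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

// Table view: one lookup table of vertices and one list of edges.
// Graph view: each edge joined with the attributes of both of its endpoints.
def tripletsOf[VD, ED](vertices: Map[Long, VD], edges: Seq[SimpleEdge[ED]]): Seq[SimpleTriplet[VD, ED]] =
  edges.map(e => SimpleTriplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr))

val demoVertices = Map(1L -> "Alice", 2L -> "Bob", 4L -> "David")
val demoEdges = Seq(SimpleEdge(2L, 1L, 7), SimpleEdge(2L, 4L, 2))
val demoTriplets = tripletsOf(demoVertices, demoEdges)
```

Both views come from the same two collections, which is the sense in which a single stored copy serves the graph view and the table view.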
// Import GraphX (the snippets below assume an existing SparkContext named sc, such as the one spark-shell provides)
import org.apache.spark.graphx._
// Create the vertex RDD: (vertex id, (name, age))
val vertices = sc.parallelize(Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
))
// Create the edge RDD: Edge(source id, destination id, attribute)
val edges = sc.parallelize(Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
))
// Build the Graph object
val graph = Graph(vertices, edges)
// Collect the graph's vertices
graph.vertices.collect
// Collect the graph's edges
graph.edges.collect
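class Graph[VD, ED] {
  val numEdges: Long              // number of edges
  val numVertices: Long           // number of vertices
  val inDegrees: VertexRDD[Int]   // in-degree of each vertex
  val outDegrees: VertexRDD[Int]  // out-degree of each vertex
  val degrees: VertexRDD[Int]     // in-degree + out-degree of each vertex
}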
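To make the degree properties concrete, here is a plain-Scala sketch (no Spark involved) that counts degrees over the same eight edges used in the example graph. The names `edgePairs` and `countBy` are hypothetical helpers. One thing it shows: a vertex with no incoming edges simply does not appear in the in-degree result, which is also how GraphX's `inDegrees` behaves.

```scala
// (source id, destination id) pairs copied from the example graph above.
val edgePairs: Seq[(Long, Long)] = Seq(
  (2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L),
  (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L))

// Count how many times each vertex id occurs in a list of ids.
def countBy(ids: Seq[Long]): Map[Long, Int] =
  ids.groupBy(identity).map { case (id, occurrences) => id -> occurrences.size }

val numEdgesSketch = edgePairs.size                         // number of edges
val inDeg = countBy(edgePairs.map(_._2))                    // edges arriving at each vertex
val outDeg = countBy(edgePairs.map(_._1))                   // edges leaving each vertex
val totalDeg = countBy(edgePairs.flatMap { case (s, d) => Seq(s, d) }) // in + out
```

Vertex 5 (Ed) has three outgoing edges and no incoming ones, so it is missing from `inDeg`. That is why code that needs a degree for every vertex usually combines the degree table with an outer join and a default of 0.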
class Graph[VD, ED] {
def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}
val t1_graph = graph.mapVertices { case (vertexId, (name, age)) => (vertexId, name) }
val t2_graph = graph.mapVertices { (vertexId, attr) => (vertexId, attr._1) }
val t3_graph = graph.mapEdges(e => e.attr * 7.0)
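`mapTriplets` is listed in the signature above but not demonstrated. As a rough illustration, the plain-Scala sketch below models its idea: the new edge attribute may depend on the attributes of both endpoints, while the set of edges stays unchanged. `AttrEdge` and `mapTripletsSketch` are hypothetical names, and the function takes the three attributes directly instead of a GraphX `EdgeTriplet`.

```scala
// Hypothetical edge type for this sketch; NOT org.apache.spark.graphx.Edge.
final case class AttrEdge[ED](srcId: Long, dstId: Long, attr: ED)

// Rewrite every edge attribute from (source attribute, old edge attribute, destination attribute).
def mapTripletsSketch[VD, ED, ED2](vertices: Map[Long, VD], edges: Seq[AttrEdge[ED]])(
    f: (VD, ED, VD) => ED2): Seq[AttrEdge[ED2]] =
  edges.map(e => AttrEdge(e.srcId, e.dstId, f(vertices(e.srcId), e.attr, vertices(e.dstId))))

val people = Map((1L, ("Alice", 28)), (2L, ("Bob", 27)), (4L, ("David", 42)))
val likes = Seq(AttrEdge(2L, 1L, 7), AttrEdge(2L, 4L, 2))
// Replace each edge attribute with the age gap between its two endpoints.
val ageGaps = mapTripletsSketch(people, likes)((src, _, dst) => math.abs(src._2 - dst._2))
```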
- Structural operators
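class Graph[VD, ED] {
  def reverse: Graph[VD, ED]  // reverse the direction of every edge
  // build the subgraph whose vertices satisfy vpred and whose edges satisfy epred;
  // both predicates default to "accept everything", so either one can be omitted
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean = (x => true),
               vpred: (VertexId, VD) => Boolean = ((v, d) => true)): Graph[VD, ED]
}
val t1_graph = graph.reverse
val t2_graph = graph.subgraph(vpred = (id, attr) => attr._2 < 65)  // attr: (name, age)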
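The effect of a vertex predicate on the edges is worth spelling out: an edge survives only if it passes the edge predicate and both of its endpoints survive the vertex predicate. The plain-Scala sketch below models that rule on the example data. `Link` and `subgraphSketch` are hypothetical names, and the edge predicate here sees only the edge, not a full triplet as in GraphX.

```scala
// Hypothetical edge type for this sketch; NOT a GraphX type.
final case class Link(srcId: Long, dstId: Long, attr: Int)

def subgraphSketch[VD](vertices: Map[Long, VD], edges: Seq[Link])(
    epred: Link => Boolean,
    vpred: (Long, VD) => Boolean): (Map[Long, VD], Seq[Link]) = {
  // Keep the vertices that satisfy the vertex predicate.
  val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
  // Keep the edges that satisfy the edge predicate and still have both endpoints.
  val keptEdges = edges.filter { e =>
    epred(e) && keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId)
  }
  (keptVertices, keptEdges)
}

val allUsers = Map(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
  (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)))
val allLinks = Seq(
  Link(2L, 1L, 7), Link(2L, 4L, 2), Link(3L, 2L, 4), Link(3L, 6L, 3),
  Link(4L, 1L, 1), Link(5L, 2L, 2), Link(5L, 3L, 8), Link(5L, 6L, 3))

// Same condition as the example above: keep users younger than 65.
val under65 = subgraphSketch(allUsers, allLinks)(_ => true, (_, attr) => attr._2 < 65)
```

Charlie (vertex 3) is exactly 65, so he is dropped, and the three edges that touch vertex 3 disappear with him even though the edge predicate accepts everything.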
- Join operators: load data from external RDDs and use it to modify vertex attributes
class Graph[VD, ED] {
def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD): Graph[VD, ED]
def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
: Graph[VD2, ED]
}
val comps= sc.parallelize(Array((1L, "kgc.cn"), (2L, "berkeley.edu"), (3L, "apache.org")))
val c_graph = graph.joinVertices(comps)((id,v,c)=>(v._1+"@"+c,v._2))
c_graph.vertices.foreach(println)
=> Output
(4,(David,42))
(1,(Alice@kgc.cn,28))
(6,(Fran,50))
(3,(Charlie@apache.org,65))
(5,(Ed,55))
(2,(Bob@berkeley.edu,27))
// With outerJoinVertices, a vertex that has no match in the RDD receives None
val c_graph = graph.outerJoinVertices(comps)((id,v,c)=>(v._1+"@"+c,v._2))
c_graph.vertices.foreach(println)
=> Output
(4,(David@None,42))
(1,(Alice@Some(kgc.cn),28))
(6,(Fran@None,50))
(3,(Charlie@Some(apache.org),65))
(5,(Ed@None,55))
(2,(Bob@Some(berkeley.edu),27))
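The two outputs above differ because `joinVertices` only touches matched vertices, while `outerJoinVertices` calls the function for every vertex and passes an `Option`. The plain-Scala sketch below models both behaviors with ordinary maps. All names in it are hypothetical, and the `"unknown"` fallback is an assumption chosen for illustration, not something the original example does.

```scala
val userAttrs: Map[Long, (String, Int)] =
  Map((1L, ("Alice", 28)), (2L, ("Bob", 27)), (4L, ("David", 42)))
val companies: Map[Long, String] = Map(1L -> "kgc.cn", 2L -> "berkeley.edu")

// joinVertices-like: matched vertices are rewritten, unmatched ones keep their attribute.
val joinedUsers: Map[Long, (String, Int)] = userAttrs.map { case (id, (name, age)) =>
  val updated = companies.get(id).map(c => (name + "@" + c, age)).getOrElse((name, age))
  (id, updated)
}

// outerJoinVertices-like: every vertex is visited and the match arrives as an Option.
// Unwrapping it with getOrElse avoids strings such as "David@None" or "Alice@Some(kgc.cn)".
val outerJoinedUsers: Map[Long, (String, Int)] = userAttrs.map { case (id, (name, age)) =>
  val company: Option[String] = companies.get(id)
  (id, (name + "@" + company.getOrElse("unknown"), age))
}
```

In real GraphX code the same `getOrElse` (or a pattern match on `Some`/`None`) would go inside the function passed to `outerJoinVertices`.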
[Computing each user's follower count]
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
// Replace each vertex attribute with a User whose degrees start at 0
val initialUserGraph = graph.mapVertices {
  case (id, (name, age)) => User(name, age, 0, 0)
}
// Store each vertex's in-degree and out-degree in its attribute
val userGraph = initialUserGraph.outerJoinVertices(initialUserGraph.inDegrees) {
  case (id, u, inDegOpt) => User(u.name, u.age, inDegOpt.getOrElse(0), u.outDeg)
}.outerJoinVertices(initialUserGraph.outDegrees) {
  case (id, u, outDegOpt) => User(u.name, u.age, u.inDeg, outDegOpt.getOrElse(0))
}
// A vertex's in-degree is its number of followers
for ((id, property) <- userGraph.vertices.collect)
  println(s"User $id is ${property.name} and is liked by ${property.inDeg} people.")
class Graph[VD, ED] {
def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
}
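// tol: the error tolerated at convergence; the smaller it is, the more precise the result. It decides when the iteration stops
// resetProb: the random reset probability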
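To show how `tol` and `resetProb` interact, here is a textbook-style PageRank iteration in plain Scala. It is a sketch under simplifying assumptions, not GraphX's implementation: `pageRankSketch` is a hypothetical name, every rank starts at 1.0, and the loop stops once no rank changes by more than `tol` between two rounds.

```scala
def pageRankSketch(edges: Seq[(Long, Long)], tol: Double, resetProb: Double = 0.15): Map[Long, Double] = {
  val vertexIds = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDegree = edges.groupBy(_._1).map { case (id, es) => id -> es.size }
  var ranks: Map[Long, Double] = vertexIds.map(_ -> 1.0).toMap
  var maxDelta = Double.MaxValue
  while (maxDelta > tol) {
    // Each vertex splits its current rank evenly over its outgoing edges.
    val incoming = edges
      .map { case (s, d) => d -> ranks(s) / outDegree(s) }
      .groupBy(_._1)
      .map { case (id, contribs) => id -> contribs.map(_._2).sum }
    // New rank = reset share + damped sum of the incoming contributions.
    val updated = vertexIds.map { id =>
      id -> (resetProb + (1.0 - resetProb) * incoming.getOrElse(id, 0.0))
    }.toMap
    // tol decides when to stop: the largest change of any rank in this round.
    maxDelta = vertexIds.map(id => math.abs(updated(id) - ranks(id))).max
    ranks = updated
  }
  ranks
}

// The eight edges of the example graph, as (source id, destination id).
val pageRankDemo = pageRankSketch(
  Seq((2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L), (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L)),
  tol = 0.0001)
```

A vertex with no incoming edges, such as vertex 5, ends up with exactly `resetProb` as its rank, while vertices that collect many incoming edges rank higher.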
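- Pregel
  - A framework proposed by Google for large-scale distributed graph computation (graph traversal (BFS), single-source shortest path (SSSP), PageRank computation)
  - A Pregel computation consists of a series of iterations called supersteps
- Pregel iteration process
  - Each vertex receives the inbound messages sent in the previous superstep
  - It computes a new value for its attribute
  - It sends messages to adjacent vertices for the next superstep
  - The iteration ends when no messages remain
- Pregel API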
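class Graph[VD, ED] {
  def pregel[A](initialMsg: A,
                maxIterations: Int = Int.MaxValue,
                activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
}
// initialMsg: the initial message sent to every vertex before "superstep 0"
// maxIterations: the maximum number of iterations to run
// activeDirection: which edges of a vertex that received a message in the previous round run sendMsg (the default in GraphOps.pregel is EdgeDirection.Either)
// vprog: user-defined vertex program, run on a vertex that receives a message to compute its new attribute
// sendMsg: user-defined function that decides which messages are sent in the next iteration and to which vertices
// mergeMsg: user-defined function that merges multiple messages arriving at the same vertex before vprog runs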
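The superstep loop and the three user functions described above can be modeled in a few lines of plain Scala. This is a sketch for intuition, not GraphX's implementation: `WEdge` and `pregelSketch` are hypothetical names, `sendMsg` receives the edge and both endpoint attributes instead of an `EdgeTriplet`, and only the outgoing edges of vertices that received a message in the previous round are considered (a simplification of `activeDirection`).

```scala
// Hypothetical weighted edge for this sketch; NOT a GraphX type.
final case class WEdge(src: Long, dst: Long, weight: Double)

def pregelSketch[VD, A](
    vertices: Map[Long, VD],
    edges: Seq[WEdge],
    initialMsg: A,
    maxIterations: Int = Int.MaxValue)(
    vprog: (Long, VD, A) => VD,
    sendMsg: (WEdge, VD, VD) => Iterator[(Long, A)],
    mergeMsg: (A, A) => A): Map[Long, VD] = {
  // Before the first superstep, every vertex receives the initial message.
  var state = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
  var active: Set[Long] = state.keySet
  var iteration = 0
  while (active.nonEmpty && iteration < maxIterations) {
    // Send along edges leaving active vertices, then merge messages per destination.
    val inbox: Map[Long, A] = edges
      .filter(e => active.contains(e.src))
      .flatMap(e => sendMsg(e, state(e.src), state(e.dst)))
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }
    // Run the vertex program only on vertices that received a message.
    state = state.map { case (id, attr) =>
      id -> inbox.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
    }
    active = inbox.keySet // with an empty inbox there is nothing left to do
    iteration += 1
  }
  state
}

// Single-source shortest paths from vertex 0 on a small weighted graph.
val ssspEdges = Seq(
  WEdge(0L, 1L, 100.0), WEdge(0L, 2L, 30.0), WEdge(0L, 4L, 10.0),
  WEdge(2L, 1L, 60.0), WEdge(2L, 3L, 60.0), WEdge(3L, 1L, 10.0), WEdge(4L, 3L, 50.0))
val ssspStart: Map[Long, Double] =
  (0L to 5L).map(id => id -> (if (id == 0L) 0.0 else Double.PositiveInfinity)).toMap
val ssspResult = pregelSketch(ssspStart, ssspEdges, Double.PositiveInfinity)(
  (_, dist, newDist) => math.min(dist, newDist),
  (e, srcDist, dstDist) =>
    if (srcDist + e.weight < dstDist) Iterator((e.dst, srcDist + e.weight)) else Iterator.empty,
  (a, b) => math.min(a, b))
```

The loop stops on its own once a round produces no messages, which for shortest paths means no distance can be improved any further. Vertex 5 has no edges at all, so its distance stays at infinity.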
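// Create the vertex RDD
val vertices = sc.parallelize(Array((0L, ""), (1L, ""), (2L, ""), (3L, ""), (4L, ""), (5L, "")))
// Create the edge RDD; the edge attribute is the distance between the two vertices
val relationships = sc.parallelize(Array(
  Edge(0L, 1L, 100.0), Edge(0L, 2L, 30.0), Edge(0L, 4L, 10.0),
  Edge(2L, 1L, 60.0), Edge(2L, 3L, 60.0), Edge(3L, 1L, 10.0), Edge(4L, 3L, 50.0)))
val g = Graph(vertices, relationships)
// Source vertex of the single-source shortest path computation
val sourceId: VertexId = 0
// Distance 0 for the source, infinity for every other vertex
val initialGraph = g.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // vprog: keep the shorter of the current distance and the received one
  (id, dist, newDist) => math.min(dist, newDist),
  // sendMsg: offer the destination a shorter distance if this edge provides one
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  // mergeMsg: of several offered distances, keep the smallest
  (a, b) => math.min(a, b)
)
sssp.vertices.collect().foreach(println)
=> Output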
(4,10.0)
(0,0.0)
(1,70.0)
(3,60.0)
(5,Infinity)
(2,30.0)