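This is my original work; please credit the source when reposting. My QQ: 530422429. Corrections and discussion are welcome.
Purpose: to show by example how to add an application to Giraph. Using the WCC (Weakly Connected Components) algorithm, this post describes how to add a subclass of Vertex, define custom input and output formats, and use a Combiner.
Background: the Giraph source ships with a WCC implementation, the class org.apache.giraph.examples.ConnectedComponentsVertex. Its code is as follows: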
package org.apache.giraph.examples;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Implementation of the HCC algorithm that identifies connected components and
 * assigns each vertex its "component identifier" (the smallest vertex id
 * in the component)
 *
 * The idea behind the algorithm is very simple: propagate the smallest
 * vertex id along the edges to all vertices of a connected component. The
 * number of supersteps necessary is equal to the length of the maximum
 * diameter of all components + 1
 *
 * The original Hadoop-based variant of this algorithm was proposed by Kang,
 * Charalampos, Tsourakakis and Faloutsos in
 * "PEGASUS: Mining Peta-Scale Graphs", 2010
 *
 * http://www.cs.cmu.edu/~ukang/papers/PegasusKAIS.pdf
 */
@Algorithm(
    name = "Connected components",
    description = "Finds connected components of the graph"
)
public class ConnectedComponentsVertex extends Vertex<IntWritable,
    IntWritable, NullWritable, IntWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<IntWritable> messages) throws IOException {
    int currentComponent = getValue().get();

    // First superstep is special, because we can simply look at the neighbors
    if (getSuperstep() == 0) {
      for (Edge<IntWritable, NullWritable> edge : getEdges()) {
        int neighbor = edge.getTargetVertexId().get();
        if (neighbor < currentComponent) {
          currentComponent = neighbor;
        }
      }
      // Only need to send value if it is not the own id
      if (currentComponent != getValue().get()) {
        setValue(new IntWritable(currentComponent));
        for (Edge<IntWritable, NullWritable> edge : getEdges()) {
          IntWritable neighbor = edge.getTargetVertexId();
          if (neighbor.get() > currentComponent) {
            sendMessage(neighbor, getValue());
          }
        }
      }
      voteToHalt();
      return;
    }

    boolean changed = false;
    // did we get a smaller id ?
    for (IntWritable message : messages) {
      int candidateComponent = message.get();
      if (candidateComponent < currentComponent) {
        currentComponent = candidateComponent;
        changed = true;
      }
    }

    // propagate new component id to the neighbors
    if (changed) {
      setValue(new IntWritable(currentComponent));
      sendMessageToAllEdges(getValue());
    }
    voteToHalt();
  }
}
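Analysis: compute() treats superstep 0 as a special case. Each vertex first finds the smallest vertex ID among itself and its neighbors. If that minimum is smaller than its own ID, the vertex adopts it as its value and sends it to every neighbor whose ID is larger than the minimum. In each later superstep, the vertex finds the smallest value among the messages it received. If that value is smaller than its own, it sets its value to it and sends it to all neighbors; otherwise it neither updates its value nor sends any messages. At the end of every superstep the vertex calls voteToHalt() and becomes inactive until a new message arrives.
Why add another WCC: to write the simplest (unoptimized) WCC code. In the bundled WCC the types I, V and M are all IntWritable, which cannot address graphs with more than ten billion vertices, since a 32-bit int tops out at 2,147,483,647. Below they are changed to LongWritable, which requires custom input and output formats. A Combiner is added as well. The steps are as follows:
1. First define the input format by adding the class org.apache.giraph.examples.LongLongNullTextInputFormat. The types I, V and E are LongWritable, LongWritable and NullWritable (meaning edges carry no weight). The input graph is an adjacency list, one vertex per line, with fields separated by \t. The source is as follows: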
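To make the superstep-0 special case concrete, here is a minimal plain-Java sketch of what a single vertex decides in that step. It does not use the Giraph API; the class name SuperstepZeroSketch and its two methods are my own illustrative names, and vertex ids are plain ints.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperstepZeroSketch {
    /** Smallest id among the vertex itself and its neighbors. */
    static int initialComponent(int vertexId, int[] neighbors) {
        int current = vertexId;
        for (int neighbor : neighbors) {
            if (neighbor < current) {
                current = neighbor;
            }
        }
        return current;
    }

    /** Neighbors that would receive a message in superstep 0. */
    static List<Integer> targets(int vertexId, int[] neighbors) {
        int current = initialComponent(vertexId, neighbors);
        List<Integer> out = new ArrayList<>();
        // A vertex whose own id is already the minimum sends nothing.
        if (current != vertexId) {
            for (int neighbor : neighbors) {
                // The neighbor holding the minimum already knows it.
                if (neighbor > current) {
                    out.add(neighbor);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] neighbors = {2, 7, 9};
        System.out.println(initialComponent(5, neighbors)); // 2
        System.out.println(targets(5, neighbors));          // [7, 9]
        System.out.println(targets(1, neighbors));          // []
    }
}
```

Vertex 5 learns the id 2 just by looking at its edges, so it only has to tell 7 and 9. Vertex 1 stays silent, which presumably relies on its neighbors having an edge back to 1 so that they can discover the id the same way.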
package org.apache.giraph.examples;

import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.giraph.conf.ImmutableClassesGiraphConfigurable;
import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.google.common.collect.Lists;

/**
 * Input format for unweighted graphs with long ids and long vertex values
 */
public class LongLongNullTextInputFormat
    extends TextVertexInputFormat<LongWritable, LongWritable, NullWritable>
    implements ImmutableClassesGiraphConfigurable<LongWritable, LongWritable,
    NullWritable, Writable> {
  /** Configuration. */
  private ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> conf;

  @Override
  public TextVertexReader createVertexReader(InputSplit split,
      TaskAttemptContext context)
    throws IOException {
    return new LongLongNullLongVertexReader();
  }

  @Override
  public void setConf(ImmutableClassesGiraphConfiguration<LongWritable,
      LongWritable, NullWritable, Writable> configuration) {
    this.conf = configuration;
  }

  @Override
  public ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> getConf() {
    return conf;
  }

  /**
   * Vertex reader associated with
   * {@link LongLongNullTextInputFormat}.
   */
  public class LongLongNullLongVertexReader extends
      TextVertexInputFormat<LongWritable, LongWritable,
      NullWritable>.TextVertexReader {
    /** Separator of the vertex and neighbors */
    private final Pattern separator = Pattern.compile("\t");

    @Override
    public Vertex<LongWritable, LongWritable, NullWritable, ?>
    getCurrentVertex() throws IOException, InterruptedException {
      Vertex<LongWritable, LongWritable, NullWritable, ?>
          vertex = conf.createVertex();

      // tokens[0] is the vertex id, the remaining tokens are neighbor ids
      String[] tokens =
          separator.split(getRecordReader().getCurrentValue().toString());
      List<Edge<LongWritable, NullWritable>> edges =
          Lists.newArrayListWithCapacity(tokens.length - 1);
      for (int n = 1; n < tokens.length; n++) {
        edges.add(EdgeFactory.create(
            new LongWritable(Long.parseLong(tokens[n])),
            NullWritable.get()));
      }

      LongWritable vertexId = new LongWritable(Long.parseLong(tokens[0]));
      vertex.initialize(vertexId, new LongWritable(), edges);

      return vertex;
    }

    @Override
    public boolean nextVertex() throws IOException, InterruptedException {
      return getRecordReader().nextKeyValue();
    }
  }
}
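The line format read by getCurrentVertex() can be tried out without a cluster. The following plain-Java sketch repeats only the tokenizing step; AdjacencyLineSketch and the sample line are made up for illustration, and Hadoop and Giraph types are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AdjacencyLineSketch {
    // Same separator as the vertex reader: a single tab.
    private static final Pattern SEPARATOR = Pattern.compile("\t");

    /** First field of the line: the vertex id. */
    static long vertexId(String line) {
        return Long.parseLong(SEPARATOR.split(line)[0]);
    }

    /** Remaining fields of the line: the neighbor ids. */
    static List<Long> neighbors(String line) {
        String[] tokens = SEPARATOR.split(line);
        List<Long> result = new ArrayList<>();
        for (int n = 1; n < tokens.length; n++) {
            result.add(Long.parseLong(tokens[n]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input line; the last id does not fit in an int.
        String line = "1\t2\t3\t10000000000";
        System.out.println(vertexId(line));  // 1
        System.out.println(neighbors(line)); // [2, 3, 10000000000]
        // A vertex without neighbors is a line holding only its id.
        System.out.println(neighbors("5"));  // []
    }
}
```

As in the reader above, an empty line or a non-numeric field makes Long.parseLong throw a NumberFormatException, so the input file is assumed to be clean.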
2. Define the output format by adding the class org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat, which writes only the vertex ID and value. The source is as follows:
package org.apache.giraph.examples;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * Output format for vertices with a long as id, a long as value and
 * null edges
 */
public class VertexWithLongValueNullEdgeTextOutputFormat extends
    TextVertexOutputFormat<LongWritable, LongWritable, NullWritable> {
  @Override
  public TextVertexWriter createVertexWriter(TaskAttemptContext context)
    throws IOException, InterruptedException {
    return new VertexWithLongValueWriter();
  }

  /**
   * Vertex writer used with
   * {@link VertexWithLongValueNullEdgeTextOutputFormat}.
   */
  public class VertexWithLongValueWriter extends TextVertexWriter {
    @Override
    public void writeVertex(
        Vertex<LongWritable, LongWritable, NullWritable, ?> vertex)
      throws IOException, InterruptedException {
      // One line per vertex: <vertex id>\t<component id>
      StringBuilder output = new StringBuilder();
      output.append(vertex.getId().get());
      output.append('\t');
      output.append(vertex.getValue().get());
      getRecordWriter().write(new Text(output.toString()), null);
    }
  }
}
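Since each output line is just the vertex id, a tab and the component id, the result is easy to post-process. As a small sketch of one way to read it back, the code below counts the vertices in each component; ComponentSizesSketch and the sample lines are hypothetical and not output from a real run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ComponentSizesSketch {
    /** Maps each component id to the number of vertices carrying it. */
    static Map<Long, Integer> componentSizes(Iterable<String> lines) {
        Map<Long, Integer> sizes = new TreeMap<>();
        for (String line : lines) {
            // parts[0] is the vertex id, parts[1] is its component id.
            String[] parts = line.split("\t");
            long component = Long.parseLong(parts[1]);
            sizes.merge(component, 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical lines in the format written by writeVertex().
        List<String> lines =
            Arrays.asList("1\t1", "2\t1", "3\t1", "7\t7", "8\t7");
        System.out.println(componentSizes(lines)); // {1=3, 7=2}
    }
}
```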
3. Extend the Vertex class by adding org.apache.giraph.examples.WeaklyConnectedComponentsVertex and override compute() to implement the WCC algorithm. The source is as follows:
package org.apache.giraph.examples;

import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Weakly Connected Components Algorithm
 *
 * @author baisong
 *
 */
public class WeaklyConnectedComponentsVertex extends Vertex<LongWritable,
    LongWritable, NullWritable, LongWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<LongWritable> messages) throws IOException {
    // Superstep 0: every vertex starts with its own id as component id
    if (getSuperstep() == 0) {
      setValue(getId());
    }

    // Smallest id among the current value and all received messages
    long minValue = getValue().get();
    for (LongWritable msg : messages) {
      if (msg.get() < minValue) {
        minValue = msg.get();
      }
    }

    // Always send in superstep 0; later only when the value got smaller
    if (getSuperstep() == 0 || minValue < getValue().get()) {
      setValue(new LongWritable(minValue));
      sendMessageToAllEdges(new LongWritable(minValue));
    }
    voteToHalt();
  }
}
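To see how this compute() behaves over several supersteps, the logic can be simulated in plain Java on a tiny graph. This is only a sketch under my own assumptions: WccSimulationSketch, run() and sampleGraph() are illustrative names, every vertex must appear as a key of the adjacency map, and edges are listed in both directions, since ids only travel along outgoing edges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WccSimulationSketch {
    /** Runs synchronous min-id propagation until no messages are left. */
    static Map<Long, Long> run(Map<Long, long[]> adjacency) {
        Map<Long, Long> value = new TreeMap<>();
        Map<Long, List<Long>> inbox = new HashMap<>();
        for (long superstep = 0; superstep == 0 || !inbox.isEmpty(); superstep++) {
            Map<Long, List<Long>> outbox = new HashMap<>();
            for (Map.Entry<Long, long[]> entry : adjacency.entrySet()) {
                long id = entry.getKey();
                List<Long> messages = inbox.get(id);
                if (superstep > 0 && messages == null) {
                    continue; // halted vertex with no incoming messages
                }
                if (superstep == 0) {
                    value.put(id, id); // start with the own id
                }
                long min = value.get(id);
                if (messages != null) {
                    for (long message : messages) {
                        min = Math.min(min, message);
                    }
                }
                if (superstep == 0 || min < value.get(id)) {
                    value.put(id, min);
                    for (long target : entry.getValue()) {
                        outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(min);
                    }
                }
            }
            inbox = outbox; // messages become visible in the next superstep
        }
        return value;
    }

    /** Hypothetical graph: path 1-2-3, pair 7-8, isolated vertex 9. */
    static Map<Long, long[]> sampleGraph() {
        Map<Long, long[]> graph = new HashMap<>();
        graph.put(1L, new long[] {2});
        graph.put(2L, new long[] {1, 3});
        graph.put(3L, new long[] {2});
        graph.put(7L, new long[] {8});
        graph.put(8L, new long[] {7});
        graph.put(9L, new long[] {});
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(run(sampleGraph())); // {1=1, 2=1, 3=1, 7=7, 8=7, 9=9}
    }
}
```

Every vertex ends up labeled with the smallest id in its component. If the input listed each edge in one direction only, a small id could fail to reach vertices that point at it but are not pointed at by it.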
4. Define the Combiner by adding the class org.apache.giraph.combiner.MinimumLongCombiner. The source is as follows:
package org.apache.giraph.combiner;

import org.apache.hadoop.io.LongWritable;

/**
 * {@link Combiner} that finds the minimum {@link LongWritable}
 */
public class MinimumLongCombiner
    extends Combiner<LongWritable, LongWritable> {
  @Override
  public void combine(LongWritable vertexIndex, LongWritable originalMessage,
      LongWritable messageToCombine) {
    if (originalMessage.get() > messageToCombine.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  @Override
  public LongWritable createInitialMessage() {
    return new LongWritable(Long.MAX_VALUE);
  }
}
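The effect of this Combiner is that all messages addressed to one vertex are folded into a single message holding the smallest id. The plain-Java sketch below shows that folding on primitive longs; MinCombinerSketch is an illustrative name of mine, not a Giraph class.

```java
public class MinCombinerSketch {
    /** Mirrors combine(): the accumulator keeps the smaller value. */
    static long combine(long originalMessage, long messageToCombine) {
        return Math.min(originalMessage, messageToCombine);
    }

    /** Folds all messages for one target vertex into a single value. */
    static long combineAll(long[] messages) {
        // Same starting value as createInitialMessage().
        long combined = Long.MAX_VALUE;
        for (long message : messages) {
            combined = combine(combined, message);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Three messages for the same vertex collapse into one.
        System.out.println(combineAll(new long[] {5, 3, 9})); // 3
    }
}
```

Because WCC only ever uses the minimum of the incoming ids, replacing several messages by their minimum does not change the result; it only reduces how many messages have to be stored and transferred.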
5. The code is now complete. All of the new class files need to be put into giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar. The commands to run it are as follows:
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-1 -w 2
# With the Combiner
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-2 -c org.apache.giraph.combiner.MinimumLongCombiner -w 2
The end!