Adding an Application to Giraph: the Weakly Connected Components Algorithm

License: CC 4.0 BY-SA

This is my original work; please credit the source when reposting. My QQ: 530422429. Corrections and discussion are welcome.

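Purpose: show by example how to add an application to Giraph. Using the WCC (Weakly Connected Components) algorithm, this post describes how to add a subclass of Vertex, define custom input and output formats, and use a Combiner.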

Background: the Giraph source code already ships with a WCC algorithm, in the class org.apache.giraph.examples.ConnectedComponentsVertex. Its code is as follows:

package org.apache.giraph.examples;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Implementation of the HCC algorithm that identifies connected components and
 * assigns each vertex its "component identifier" (the smallest vertex id
 * in the component)
 *
 * The idea behind the algorithm is very simple: propagate the smallest
 * vertex id along the edges to all vertices of a connected component. The
 * number of supersteps necessary is equal to the length of the maximum
 * diameter of all components + 1
 *
 * The original Hadoop-based variant of this algorithm was proposed by Kang,
 * Charalampos, Tsourakakis and Faloutsos in
 * "PEGASUS: Mining Peta-Scale Graphs", 2010
 *
 * http://www.cs.cmu.edu/~ukang/papers/PegasusKAIS.pdf
 */
@Algorithm(
    name = "Connected components",
    description = "Finds connected components of the graph"
)
public class ConnectedComponentsVertex extends Vertex<IntWritable,
    IntWritable, NullWritable, IntWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<IntWritable> messages) throws IOException {
    int currentComponent = getValue().get();

    // First superstep is special, because we can simply look at the neighbors
    if (getSuperstep() == 0) {
      for (Edge<IntWritable, NullWritable> edge : getEdges()) {
        int neighbor = edge.getTargetVertexId().get();
        if (neighbor < currentComponent) {
          currentComponent = neighbor;
        }
      }
      // Only need to send value if it is not the own id
      if (currentComponent != getValue().get()) {
        setValue(new IntWritable(currentComponent));
        for (Edge<IntWritable, NullWritable> edge : getEdges()) {
          IntWritable neighbor = edge.getTargetVertexId();
          if (neighbor.get() > currentComponent) {
            sendMessage(neighbor, getValue());
          }
        }
      }
      voteToHalt();
      return;
    }

    boolean changed = false;
    // did we get a smaller id ?
    for (IntWritable message : messages) {
      int candidateComponent = message.get();
      if (candidateComponent < currentComponent) {
        currentComponent = candidateComponent;
        changed = true;
      }
    }

    // propagate new component id to the neighbors
    if (changed) {
      setValue(new IntWritable(currentComponent));
      sendMessageToAllEdges(getValue());
    }
    voteToHalt();
  }
}

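Analysis: compute() optimizes superstep 0. Each vertex first finds the smallest vertex ID among itself and its adjacent vertices. If that minimum is smaller than its own ID, the vertex adopts it as its value and sends it to every adjacent vertex whose ID is larger than the minimum. In each later superstep, a vertex first finds the smallest value among the messages it received. If that minimum is smaller than its own value, it sets its value to the minimum and sends it to all adjacent vertices; otherwise it neither updates its value nor sends any messages. At the end of every superstep the vertex calls voteToHalt() and becomes inactive.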
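
The superstep-0 shortcut can be illustrated outside Giraph. The following is a minimal sketch in plain Java, assuming a vertex is described only by its id and an array of neighbor ids. The class SuperstepZeroSketch and its two methods are hypothetical names invented for this illustration; they mirror the first branch of the bundled compute() but are not part of Giraph.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Giraph: mirrors superstep 0 of the bundled vertex. */
class SuperstepZeroSketch {
  /** Smallest id among the vertex itself and its neighbors. */
  static int initialComponent(int vertexId, int[] neighbors) {
    int component = vertexId;
    for (int neighbor : neighbors) {
      if (neighbor < component) {
        component = neighbor;
      }
    }
    return component;
  }

  /** Neighbors that would receive a message in superstep 0. */
  static List<Integer> messageTargets(int vertexId, int[] neighbors) {
    int component = initialComponent(vertexId, neighbors);
    List<Integer> targets = new ArrayList<>();
    // A vertex that already holds the smallest id sends nothing.
    if (component != vertexId) {
      for (int neighbor : neighbors) {
        // Only neighbors with a larger id can be improved by the message.
        if (neighbor > component) {
          targets.add(neighbor);
        }
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    int[] neighbors = {2, 7, 9};
    System.out.println(initialComponent(5, neighbors));
    System.out.println(messageTargets(5, neighbors));
  }
}
```

For vertex 5 with neighbors 2, 7, and 9, the component becomes 2 and only vertices 7 and 9 are messaged. Vertex 2 is skipped because it already has that value.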

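Why add another WCC: to write the simplest (unoptimized) WCC code. In the bundled WCC, the types I, V, and M are all IntWritable, which cannot hold the vertex IDs of graphs with more than ten billion vertices (a Java int tops out at 2,147,483,647). Below they are changed to LongWritable, which requires custom input and output formats. A Combiner is added as well. The steps are as follows: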

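1. First, define a custom input format by adding the class org.apache.giraph.examples.LongLongNullTextInputFormat. The types I, V, and E are LongWritable, LongWritable, and NullWritable (meaning edges carry no weight), in that order. The graph is read as adjacency lists, one vertex per line, with fields separated by \t. The source is as follows: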

package org.apache.giraph.examples;

import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.giraph.conf.ImmutableClassesGiraphConfigurable;
import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.google.common.collect.Lists;

/**
 * Input format for unweighted graphs with long ids and long vertex values
 */
public class LongLongNullTextInputFormat
    extends TextVertexInputFormat<LongWritable, LongWritable, NullWritable>
    implements ImmutableClassesGiraphConfigurable<LongWritable, LongWritable,
    NullWritable, Writable> {
  /** Configuration. */
  private ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> conf;

  @Override
  public TextVertexReader createVertexReader(InputSplit split,
                                             TaskAttemptContext context)
    throws IOException {
    return new LongLongNullLongVertexReader();
  }

  @Override
  public void setConf(ImmutableClassesGiraphConfiguration<LongWritable,
      LongWritable, NullWritable, Writable> configuration) {
    this.conf = configuration;
  }

  @Override
  public ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> getConf() {
    return conf;
  }

  /**
   * Vertex reader associated with
   * {@link LongLongNullTextInputFormat}.
   */
  public class LongLongNullLongVertexReader extends
      TextVertexInputFormat<LongWritable, LongWritable,
          NullWritable>.TextVertexReader {
    /** Separator of the vertex and neighbors */
    private final Pattern separator = Pattern.compile("\t");

    @Override
    public Vertex<LongWritable, LongWritable, NullWritable, ?>
    getCurrentVertex() throws IOException, InterruptedException {
      Vertex<LongWritable, LongWritable, NullWritable, ?>
          vertex = conf.createVertex();

      String[] tokens =
          separator.split(getRecordReader().getCurrentValue().toString());
      List<Edge<LongWritable, NullWritable>> edges =
          Lists.newArrayListWithCapacity(tokens.length - 1);
      for (int n = 1; n < tokens.length; n++) {
        edges.add(EdgeFactory.create(
            new LongWritable(Long.parseLong(tokens[n])),
            NullWritable.get()));
      }

      LongWritable vertexId = new LongWritable(Long.parseLong(tokens[0]));
      vertex.initialize(vertexId, new LongWritable(), edges);

      return vertex;
    }

    @Override
    public boolean nextVertex() throws IOException, InterruptedException {
      return getRecordReader().nextKeyValue();
    }
  }
}
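
To show what the reader expects on each input line, here is a minimal sketch in plain Java that applies the same tab split and Long.parseLong calls to a sample line. The class AdjacencyLineDemo and the sample line are assumptions made for illustration and do not depend on Giraph. The sample deliberately uses a vertex id of ten billion, which would not fit in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical demo, not part of Giraph: parses one adjacency-list line. */
class AdjacencyLineDemo {
  /** Same separator as the vertex reader: a single tab. */
  private static final Pattern SEPARATOR = Pattern.compile("\t");

  /** The first token is the vertex id. */
  static long vertexId(String line) {
    return Long.parseLong(SEPARATOR.split(line)[0]);
  }

  /** All remaining tokens are target vertex ids of unweighted edges. */
  static List<Long> neighbors(String line) {
    String[] tokens = SEPARATOR.split(line);
    List<Long> result = new ArrayList<>(tokens.length - 1);
    for (int n = 1; n < tokens.length; n++) {
      result.add(Long.parseLong(tokens[n]));
    }
    return result;
  }

  public static void main(String[] args) {
    String line = "10000000000\t2\t3";
    System.out.println(vertexId(line) + " -> " + neighbors(line));
  }
}
```

A line that contains only a vertex id yields an empty neighbor list, which matches the reader creating a vertex with no edges.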

2. Define a custom output format by adding the class org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat, which writes only the vertex ID and value. The source is as follows:

package org.apache.giraph.examples;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * Output format for vertices with a long as id, a long as value and
 * null edges
 */
public class VertexWithLongValueNullEdgeTextOutputFormat extends
    TextVertexOutputFormat<LongWritable, LongWritable, NullWritable> {
  @Override
  public TextVertexWriter createVertexWriter(TaskAttemptContext context)
    throws IOException, InterruptedException {
    return new VertexWithLongValueWriter();
  }

  /**
   * Vertex writer used with
   * {@link VertexWithLongValueNullEdgeTextOutputFormat}.
   */
  public class VertexWithLongValueWriter extends TextVertexWriter {
    @Override
    public void writeVertex(
        Vertex<LongWritable, LongWritable, NullWritable, ?> vertex)
      throws IOException, InterruptedException {
      StringBuilder output = new StringBuilder();
      output.append(vertex.getId().get());
      output.append('\t');
      output.append(vertex.getValue().get());
      getRecordWriter().write(new Text(output.toString()), null);
    }
  }
}

3. Extend the Vertex class by adding org.apache.giraph.examples.WeaklyConnectedComponentsVertex and override compute() to implement the WCC algorithm. The source is as follows:

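package org.apache.giraph.examples;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;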

/**
 * Weakly Connected Components Algorithm
 * 
 * @author baisong
 *
 */
public class WeaklyConnectedComponentsVertex extends Vertex<LongWritable,
    LongWritable, NullWritable, LongWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<LongWritable> messages) throws IOException {
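    // In superstep 0 every vertex starts with its own id as component id.
    if (getSuperstep() == 0) {
      setValue(new LongWritable(getId().get()));
    }
    // Find the smallest id among the current value and incoming messages.
    long minValue = getValue().get();
    for (LongWritable msg : messages) {
      if (msg.get() < minValue) {
        minValue = msg.get();
      }
    }
    // Always broadcast in superstep 0; afterwards only if the value shrank.
    if (getSuperstep() == 0 || minValue < getValue().get()) {
      setValue(new LongWritable(minValue));
      sendMessageToAllEdges(new LongWritable(minValue));
    }
    voteToHalt();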
  }
}
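
To see how this unoptimized compute() converges, the supersteps can be simulated in plain Java without a cluster. The sketch below is only an illustration under stated assumptions: the class WccSimulation, its methods, and the sample graph are hypothetical, every vertex appears as a key in the map, and each edge is listed in both directions so that propagation along out-edges reaches the whole weakly connected component.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-alone simulation; not part of Giraph. */
class WccSimulation {
  /**
   * Runs min-id propagation in synchronous supersteps.
   * graph maps a vertex id to the ids of its out-neighbors;
   * the result maps each vertex id to its component id.
   */
  static Map<Long, Long> run(Map<Long, long[]> graph) {
    Map<Long, Long> value = new TreeMap<>();
    // Messages delivered at the start of the current superstep.
    Map<Long, List<Long>> inbox = new HashMap<>();
    for (long superstep = 0; superstep == 0 || !inbox.isEmpty(); superstep++) {
      Map<Long, List<Long>> outbox = new HashMap<>();
      for (Map.Entry<Long, long[]> vertex : graph.entrySet()) {
        long id = vertex.getKey();
        List<Long> messages = inbox.getOrDefault(id, new ArrayList<>());
        // After superstep 0 every vertex has voted to halt, so only
        // vertices with incoming messages are computed again.
        if (superstep > 0 && messages.isEmpty()) {
          continue;
        }
        if (superstep == 0) {
          value.put(id, id);
        }
        long minValue = value.get(id);
        for (long msg : messages) {
          minValue = Math.min(minValue, msg);
        }
        if (superstep == 0 || minValue < value.get(id)) {
          value.put(id, minValue);
          for (long target : vertex.getValue()) {
            outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(minValue);
          }
        }
      }
      inbox = outbox;
    }
    return value;
  }

  /** Sample graph (an assumption): components {1, 2, 3}, {4, 5} and {6}. */
  static Map<Long, long[]> sampleGraph() {
    Map<Long, long[]> graph = new HashMap<>();
    graph.put(1L, new long[] {2L});
    graph.put(2L, new long[] {1L, 3L});
    graph.put(3L, new long[] {2L});
    graph.put(4L, new long[] {5L});
    graph.put(5L, new long[] {4L});
    graph.put(6L, new long[] {});
    return graph;
  }

  public static void main(String[] args) {
    System.out.println(run(sampleGraph()));
  }
}
```

On the sample graph the simulation prints {1=1, 2=1, 3=1, 4=4, 5=4, 6=6}: every vertex ends up with the smallest id in its component, and the loop stops once a superstep produces no messages.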

4. Define a custom Combiner by adding the class org.apache.giraph.combiner.MinimumLongCombiner. The source is as follows:

package org.apache.giraph.combiner;

import org.apache.hadoop.io.LongWritable;

/**
 * {@link Combiner} that finds the minimum {@link LongWritable}
 */
public class MinimumLongCombiner
    extends Combiner<LongWritable, LongWritable> {
  @Override
  public void combine(LongWritable vertexIndex, LongWritable originalMessage,
      LongWritable messageToCombine) {
    if (originalMessage.get() > messageToCombine.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  @Override
  public LongWritable createInitialMessage() {
    return new LongWritable(Long.MAX_VALUE);
  }
}
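
The effect of a minimum combiner can be sketched in plain Java: messages addressed to the same target vertex are folded into one message carrying the smallest value, so less data has to be sent and stored. The class MinCombinerDemo and the (target, value) pair layout are hypothetical and chosen only for this illustration; Long.MAX_VALUE plays the same role as the value returned by createInitialMessage().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo, not part of Giraph: folds messages per target vertex. */
class MinCombinerDemo {
  /**
   * Each element of messages is a pair {targetVertexId, value}.
   * Returns one combined (minimum) value per target vertex.
   */
  static Map<Long, Long> combine(long[][] messages) {
    Map<Long, Long> combined = new HashMap<>();
    for (long[] message : messages) {
      long target = message[0];
      long payload = message[1];
      // Start from the neutral element of min, as the combiner does.
      long current = combined.getOrDefault(target, Long.MAX_VALUE);
      combined.put(target, Math.min(current, payload));
    }
    return combined;
  }

  public static void main(String[] args) {
    long[][] messages = {{7L, 5L}, {7L, 2L}, {7L, 9L}, {3L, 4L}};
    Map<Long, Long> combined = combine(messages);
    System.out.println(combined.get(7L));
    System.out.println(combined.get(3L));
  }
}
```

Here the three messages for vertex 7 collapse into the single value 2, while vertex 3 keeps its only message, 4. Because compute() uses only the minimum of its incoming messages, combining them this way does not change the result.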

5. That completes the code changes. Add all of the new and modified class files to giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar. Run the job as follows:
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-1 -w 2

# With the Combiner
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-2 -c org.apache.giraph.combiner.MinimumLongCombiner -w 2

Done!