Nutch 1.7 Study Notes (1): org.apache.nutch.crawl and ToolRunner


License: CC 4.0 BY-SA


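This post looks at the role ToolRunner plays in Nutch: it is the entry point of the framework, parsing the command-line arguments and then calling the Tool interface to run a specific task. It walks through the configuration classes, the interfaces involved, and how they interact in practice.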

To understand ToolRunner, start with this piece of Nutch code:

public static void main(String args[]) throws Exception {
  Configuration conf = NutchConfiguration.create();
  int res = ToolRunner.run(conf, new Crawl(), args);
  System.exit(res);
}

ToolRunner is effectively the entry point of Nutch as a whole. To see how it works, first look at the following pieces, all of which come from Hadoop:

The Configurable interface:

public interface Configurable {
  void setConf(Configuration conf);
  Configuration getConf();
}

This interface has only two methods, setConf and getConf.

Configured implements the Configurable interface:

public class Configured implements Configurable {
  private Configuration conf;

  public Configured() {
    this(null);
  }

  public Configured(Configuration conf) {
    setConf(conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

Next comes the main character, Tool. It is an interface that extends Configurable and adds a single method, run():

public interface Tool extends Configurable {
  int run(String [] args) throws Exception;
}
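To make the relationship between these three types concrete, here is a minimal sketch of a tool built the same way. It does not use Hadoop: `Conf` is a simplified stand-in for Hadoop's `Configuration`, the `Configurable`, `Configured` and `Tool` types are local copies of the shapes shown above, and `GreetTool` and the `greeting` key are hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the Hadoop types; not the real classes.
class ToolPatternSketch {

    // Stand-in for org.apache.hadoop.conf.Configuration: a plain key/value store.
    static class Conf {
        private final Map<String, String> values = new HashMap<>();

        void set(String key, String value) {
            values.put(key, value);
        }

        String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    interface Configurable {
        void setConf(Conf conf);
        Conf getConf();
    }

    // Base class that only stores the configuration, like Configured above.
    static class Configured implements Configurable {
        private Conf conf;

        public Configured() {
            this(null);
        }

        public Configured(Conf conf) {
            setConf(conf);
        }

        public void setConf(Conf conf) {
            this.conf = conf;
        }

        public Conf getConf() {
            return conf;
        }
    }

    interface Tool extends Configurable {
        int run(String[] args) throws Exception;
    }

    // Hypothetical tool: inherits conf handling from Configured, adds only run().
    static class GreetTool extends Configured implements Tool {
        String message(String name) {
            return getConf().get("greeting", "hello") + ", " + name;
        }

        @Override
        public int run(String[] args) {
            if (args.length == 0) {
                System.err.println("usage: GreetTool <name>");
                return 1; // non-zero exit code signals failure
            }
            System.out.println(message(args[0]));
            return 0;
        }
    }

    public static void main(String[] args) throws Exception {
        Conf conf = new Conf();
        conf.set("greeting", "hi");

        Tool tool = new GreetTool();
        tool.setConf(conf); // the configuration must be set before run() reads it
        System.exit(tool.run(new String[] {"nutch"}));
    }
}
```

The point of the split is that a concrete tool writes only `run()`; storing and exposing the configuration comes for free from the base class.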

Now look at ToolRunner:

public class ToolRunner {
  public static int run(Configuration conf, Tool tool, String[] args)
      throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }

    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // set the configuration back, so that Tool can configure itself
    tool.setConf(conf);
    // get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);   // note this line
  }
}

The last thing ToolRunner does is call the Tool's run() method.
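The step that is easiest to miss is the split between generic options and tool arguments. The following is a rough sketch of that idea only, not Hadoop's GenericOptionsParser: it understands just the `-D key=value` form (the real parser also accepts options such as `-conf`, `-fs` and `-files`), and it assumes that parsing stops at the first argument that is not a generic option. The class name `GenericOptionsSketch` and the sample command line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the "strip generic options first" step.
class GenericOptionsSketch {

    // Holds both halves of the split command line.
    static final class Parsed {
        final Map<String, String> properties = new HashMap<>();
        final List<String> remaining = new ArrayList<>();
    }

    static Parsed parse(String[] args) {
        Parsed parsed = new Parsed();
        int i = 0;

        // Consume leading "-D key=value" pairs into the property map.
        while (i + 1 < args.length
                && "-D".equals(args[i])
                && args[i + 1].contains("=")) {
            String[] keyValue = args[i + 1].split("=", 2);
            parsed.properties.put(keyValue[0], keyValue[1]);
            i += 2;
        }

        // Everything after the generic options belongs to the tool itself.
        for (; i < args.length; i++) {
            parsed.remaining.add(args[i]);
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Example command line; the values are arbitrary.
        String[] commandLine = {"-D", "http.agent.name=demo", "urls", "-depth", "3"};
        Parsed parsed = parse(commandLine);

        System.out.println("properties: " + parsed.properties);
        System.out.println("tool args:  " + parsed.remaining);
    }
}
```

In the real ToolRunner the properties end up in the Configuration that is passed to `tool.setConf(conf)`, and the leftover arguments are what `tool.run(toolArgs)` receives.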

With that in mind, the Nutch source becomes easy to follow:

public class Crawl extends Configured implements Tool

This is the declaration of Crawl: it extends Configured and implements the Tool interface.

Now look again at this line in the main method:

int res = ToolRunner.run(conf, new Crawl(), args);

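Compare it with the ToolRunner code above. Inside ToolRunner.run, a GenericOptionsParser (a class, not a method) first handles the generic Hadoop options. The arguments it leaves over are stored in toolArgs and then passed to the tool's run method. In Nutch, Crawl is the Tool implementation, so once new Crawl() is handed to ToolRunner, the remaining arguments end up in Crawl's own run() method, and that method is where Nutch actually starts its work: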

public int run(String[] args) throws Exception {
  ...
}
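The body of Crawl's run() is left out above, and the sketch below is not that code. It only illustrates what a run() method typically does with the arguments ToolRunner hands it: walk the array, pick out its own flags, and return an exit code. The class name `RunArgsSketch`, the `-dir` and `-depth` flags as parsed here, and the default values are all assumptions chosen for the example.

```java
// Hypothetical example of tool-side argument handling; not Nutch source code.
class RunArgsSketch {

    // Values collected from the tool's own arguments (defaults are made up).
    static final class Options {
        String seedDir;               // first positional argument
        String outputDir = "crawl-out";
        int depth = 5;
    }

    static Options parse(String[] args) {
        Options options = new Options();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i]) && i + 1 < args.length) {
                options.outputDir = args[++i];          // flag followed by its value
            } else if ("-depth".equals(args[i]) && i + 1 < args.length) {
                options.depth = Integer.parseInt(args[++i]);
            } else {
                options.seedDir = args[i];              // anything else is positional
            }
        }
        return options;
    }

    // Same shape as Tool.run: arguments in, exit code out.
    static int run(String[] args) {
        Options options = parse(args);
        if (options.seedDir == null) {
            System.err.println("usage: <seedDir> [-dir d] [-depth n]");
            return -1;
        }
        System.out.println("seeds=" + options.seedDir
                + " out=" + options.outputDir
                + " depth=" + options.depth);
        return 0;
    }

    public static void main(String[] args) {
        System.exit(run(new String[] {"urls", "-dir", "out", "-depth", "3"}));
    }
}
```

Because the generic Hadoop options have already been removed by the time run() is called, the method only has to understand its own flags.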