Analysis of the Nutch Plugin Mechanism

Latest recommended article published on 2015-07-02 12:27:00

eckoqzhang

http://blog.youkuaiyun.com/ruizema/article/details/6679220

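Introduction

The plugin mechanism sits at the core of Nutch's functionality. The central features of the search engine, including page parsing (parse), page scoring (scoring), URL filtering (urlFilter), and tokenization (analyzer), are all implemented as plugins. The plugin mechanism has the following advantages: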

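Extensibility

The plugin mechanism lets anyone extend Nutch, and the barrier to entry is low: a developer only needs to implement a specific interface to add a feature.

Flexibility

Because Nutch is so easy to extend, almost anyone can write a plugin for it, so the pool of available plugins keeps growing. No Nutch-based application will install all of them; each application picks the plugins it needs or finds useful. The plugin mechanism makes this kind of customization flexible: plugins are installed on demand, and installing one only requires editing a configuration file.

Maintainability

Each developer can focus on their own domain. Core developers only need to provide well-described interfaces, the "sockets", for the core engine. Plugin developers can concentrate on the extension itself without worrying about how the whole system works; they only care about the data exchanged between the plugin and the socket. Neither group has to know the other's implementation details or how the system is integrated, which keeps the code structure simple, robust, and easy to maintain.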

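These are goals every good software system should pursue, so the Nutch plugin mechanism is worth a close look. That is the purpose of this article.

This article mainly analyzes the relationships among the classes in the org.apache.nutch.plugin package and the related configuration files, walks through how plugins are loaded, and finally, if possible, examines the design ideas behind the plugin system from a design-pattern point of view (this is my ultimate goal).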

Main Text

I. Basic Concepts

This part follows WhichTechnicalConceptsAreBehindTheNutchPluginSystem and introduces the basic concepts of the plugin system.

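Extension Point

An extension point is the socket that a plugin plugs into; third-party developers add functionality through it. An extension point defines an interface, and an extension provides its functionality by implementing that interface. One extension point may have several different extensions.

Extension

An extension is matched to an extension point and is a third party's implementation of it. An extension must implement the interface defined by the extension point and return data in the format the extension point defines.

Plugin

A plugin is a collection of one or more extension implementations that together deliver one specific feature. A plugin consists of one or more extensions, the plugins and libraries it depends on, and the libraries it publishes itself (the jar that holds the plugin's implementation). A plugin can be started and stopped.

Plugin manifest

Every plugin must have a plugin manifest, which is the plugin.xml file in the plugin's directory. It holds the plugin's metadata and basic information: the extension points implemented by the plugin's extensions, the classes that implement them, the libraries the plugin depends on, and the other plugins it depends on. The plugin.xml file is analyzed later.

Plugin repository

The plugin repository maintains the information about all plugins for the whole lifetime of a Nutch run and is the core of the plugin system. At startup, the repository reads all plugin manifests, decides which plugins the system should load, and registers each plugin's extension points together with their extensions.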
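
To make the vocabulary concrete, here is a minimal sketch of how the concepts relate. It does not use the Nutch API: UrlFilter, NoQueryFilter, HttpOnlyFilter, ToyRepository, and ConceptDemo are hypothetical names invented for illustration. The interface plays the role of an extension point, the two classes are extensions, and the map-based registry stands in for the plugin repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Extension point (hypothetical): an interface defined by the core. */
interface UrlFilter {
    /** Returns the URL to keep it, or null to drop it. */
    String filter(String url);
}

/** Extension 1: drops URLs that carry a query string. */
class NoQueryFilter implements UrlFilter {
    public String filter(String url) {
        return url.contains("?") ? null : url;
    }
}

/** Extension 2: keeps only http and https URLs. */
class HttpOnlyFilter implements UrlFilter {
    public String filter(String url) {
        return url.startsWith("http://") || url.startsWith("https://") ? url : null;
    }
}

/** Toy repository: maps an extension point ID to the extensions registered for it. */
class ToyRepository {
    private final Map<String, List<Object>> extensionsByPoint = new HashMap<>();

    void register(String pointId, Object extension) {
        extensionsByPoint.computeIfAbsent(pointId, k -> new ArrayList<>()).add(extension);
    }

    List<Object> extensionsFor(String pointId) {
        return extensionsByPoint.getOrDefault(pointId, new ArrayList<>());
    }
}

class ConceptDemo {
    /** Builds a repository in which one "plugin" contributes two extensions. */
    static ToyRepository defaultRepository() {
        ToyRepository repo = new ToyRepository();
        repo.register("UrlFilter", new HttpOnlyFilter());
        repo.register("UrlFilter", new NoQueryFilter());
        return repo;
    }

    /** Runs a URL through every registered UrlFilter; null means a filter dropped it. */
    static String applyFilters(ToyRepository repo, String url) {
        for (Object ext : repo.extensionsFor("UrlFilter")) {
            if (url == null) {
                break;
            }
            url = ((UrlFilter) ext).filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        ToyRepository repo = defaultRepository();
        System.out.println(applyFilters(repo, "http://example.org/a"));
        System.out.println(applyFilters(repo, "http://example.org/a?x=1"));
    }
}
```

The core code only knows the UrlFilter interface and the point ID; which extensions actually run is decided by what has been registered.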

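II. Internal Structure of a Plugin

As described above, a plugin is a collection of extensions that implements one specific feature. Its structure is as follows:

Figure 1. Internal structure of a plugin

A plugin has the following attributes:

runtime: the jars needed to run the plugin and the jars it exports
requires: the other plugins this plugin depends on
extension-point: the extension points the plugin declares
extension: the concrete implementation of an extension point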

Taking the plugin from WritingPluginExample-1.2 as an example, the manifest below is annotated against these attributes. This plugin implements three extension points.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Attributes of <plugin>: id is the plugin ID, name is the plugin name,
     version is the plugin version, provider-name is the plugin provider. -->
<plugin
   id="recommended"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <!-- runtime: the jars this plugin needs and the jars it exports -->
   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <!-- recommended.jar is required to run the plugin; it is the plugin's own implementation -->
      <library name="recommended.jar">
         <!-- what the jar exports -->
         <export name="*"/>
      </library>
   </runtime>

   <!-- A plugin can have several extension elements and so implement several
        extensions. Each extension element describes one implementation of an
        extension point: id is the extension ID, name is the extension name, and
        point is the ID of the extension point it implements. On the nested
        implementation element, id is the implementation ID and class is the
        implementing class. -->

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
        any recommended meta tags -->
   <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecommendedParser"
                      class="org.apache.nutch.parse.recommended.RecommendedParser"/>
   </extension>
   <!-- TheRecommendedIndexer extends the IndexingFilter in order to add the contents
        of the recommended meta tags (as found by the RecommendedParser) to the lucene
        index. -->
   <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
              name="Recommended identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="RecommendedIndexer"
                      class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
   </extension>
   <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
        search for the user's query against the recommended fields.  In order to get
        add this to the list of filters that gets run by default, you have to use
        "fields=DEFAULT". -->
   <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
              name="Recommended Search Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="RecommendedQueryFilter"
                      class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
        <!-- a parameter (attribute) of the extension -->
        <parameter name="fields" value="recommended"/>
      </implementation>
   </extension>
   <!-- requires: some plugins also list the other plugins they depend on -->
   <requires>
        <import plugin="nutch-extensionpoints"/>
   </requires>
</plugin>
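
As a rough illustration of what reading such a manifest involves, the following sketch uses the JDK's built-in DOM parser to list each extension's target extension point and implementing class. This is not Nutch's own manifest parser; ManifestReader and the inline SAMPLE string (a trimmed copy of the manifest above) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ManifestReader {
    /** Trimmed copy of the manifest above, kept inline so the sketch runs on its own. */
    static final String SAMPLE =
        "<plugin id=\"recommended\" name=\"Recommended Parser/Filter\" version=\"0.0.1\">"
      + "<extension id=\"org.apache.nutch.parse.recommended.recommendedfilter\""
      + " point=\"org.apache.nutch.parse.HtmlParseFilter\">"
      + "<implementation id=\"RecommendedParser\""
      + " class=\"org.apache.nutch.parse.recommended.RecommendedParser\"/>"
      + "</extension>"
      + "<extension id=\"org.apache.nutch.parse.recommended.recommendedindexer\""
      + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
      + "<implementation id=\"RecommendedIndexer\""
      + " class=\"org.apache.nutch.parse.recommended.RecommendedIndexer\"/>"
      + "</extension>"
      + "</plugin>";

    /** Returns extension point ID -> implementing class for every extension element. */
    static Map<String, String> readExtensions(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList extensions = doc.getElementsByTagName("extension");
        for (int i = 0; i < extensions.getLength(); i++) {
            Element extension = (Element) extensions.item(i);
            // Each extension in the sample carries exactly one implementation element.
            Element impl = (Element) extension.getElementsByTagName("implementation").item(0);
            result.put(extension.getAttribute("point"), impl.getAttribute("class"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, String> e : readExtensions(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The point attribute is what ties an extension to an extension point, and the class attribute is what the class loader is later asked to load.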

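The internal structure is the part a plugin developer needs to care about: the developer implements the plugin and describes its attributes, and at startup the Nutch plugin mechanism parses the plugin.xml manifest in order to load the corresponding extension classes.

III. Architecture of the Nutch Plugin System

This part analyzes the architecture of the Nutch plugin mechanism in detail, that is, how it maintains plugins, extension points, and extensions, and how it loads plugins.

Corresponding to the basic concepts in Part I, the org.apache.nutch.plugin package contains the following classes:

ExtensionPoint

ExtensionPoint has the following fields: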

public class ExtensionPoint {
  private String ftId;
  private String fName;
  private String fSchema;
  private ArrayList<Extension> fExtensions;
  …
}

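ExtensionPoint provides meta information for extensions: the extension point ID, the extension point name, fSchema (the XML schema of the extension point; I do not understand this field, since it was empty in every extension loading run I traced), and fExtensions, the set of extensions that implement this extension point. For example: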

<!-- id: extension point ID; name: extension point name -->
<extension-point
      id="org.apache.nutch.net.URLNormalizer"
      name="Nutch URL Normalizer"/>

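Extension

Fields of the Extension class:

private PluginDescriptor fDescriptor; // descriptor of the plugin this extension belongs to
private String fId; // extension ID
private String fTargetPoint; // ID of the extension point this extension implements
private String fClazz; // name of the class that implements this extension
private HashMap<String, String> fAttributes; // attributes of the extension
private Configuration conf; // configuration
private PluginRepository pluginRepository; // plugin repository, called when an extension instance is requested so that the extension is loaded and its instance returned

These fields can be compared against the description of the extension element in the plugin.xml manifest above.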

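Plugin

Plugin is essentially a placeholder base class. It has two fields

private PluginDescriptor fDescriptor; // describes this plugin
protected Configuration conf; // system configuration

and the startUp and shutDown methods, both of which are empty. The Nutch wiki mentions that a plugin may extend Plugin and implement startUp and shutDown to control its activation and termination, but this is not required. A plugin that has lifecycle-related interactions to maintain, such as a database connection, should include a class that extends Plugin and overrides these two methods.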

PluginDescriptor

PluginDescriptor holds the basic information about a plugin: its location (fPluginPath), the plugin class name (fPluginClass), the plugin name (fName), the plugin ID (fPluginId), the version (fVersion), the provider (fProviderName), the plugins it depends on (fDependencies), the libraries it exports (fExportedLibs), the extensions it implements (fExtensions), and its class loader (fClassLoader).

The fields of PluginDescriptor correspond to the attributes in a plugin's plugin.xml manifest:

private String fPluginPath; // location of the plugin
private String fPluginClass = Plugin.class.getName(); // plugin class name, Plugin by default
private String fPluginId; // plugin ID
private String fVersion; // plugin version
private String fName; // plugin name
private String fProviderName; // plugin provider
private HashMap fMessages = new HashMap();
private ArrayList<ExtensionPoint> fExtensionPoints = new ArrayList<ExtensionPoint>();
private ArrayList<String> fDependencies = new ArrayList<String>();
private ArrayList<URL> fExportedLibs = new ArrayList<URL>();
private ArrayList<URL> fNotExportedLibs = new ArrayList<URL>();
private ArrayList<Extension> fExtensions = new ArrayList<Extension>(); // extensions implemented by the plugin
private PluginClassLoader fClassLoader;
public static final Log LOG = LogFactory.getLog(PluginDescriptor.class);
private Configuration fConf;

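PluginRepository

The plugin repository class holds the registration information of all plugins. At startup, PluginRepository parses every plugin manifest and builds the repository, creating one PluginDescriptor instance per plugin. The PluginDescriptor contains all the meta information of its plugin. When a plugin is needed, it is loaded from this information and a plugin instance is created.

private static final WeakHashMap<Configuration, PluginRepository> CACHE = new WeakHashMap<Configuration, PluginRepository>(); // cache of repositories, keyed by Configuration
private boolean auto; // whether a plugin that is filtered out (not loaded) but required by another plugin is activated automatically; defaults to true
private List<PluginDescriptor> fRegisteredPlugins; // registered plugins
private HashMap<String, ExtensionPoint> fExtensionPoints; // all extension point IDs and their extension points
private HashMap<String, Plugin> fActivatedPlugins; // names and objects of the plugins already activated or loaded
private Configuration conf;
public static final Log LOG = LogFactory.getLog(PluginRepository.class);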

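The relationships among these classes are shown below:

Figure 2. Class relationships

As the figure shows, at startup PluginRepository loads all plugin-related configuration, including nutch-default.xml and each plugin's plugin.xml, and obtains the following settings: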

1. plugin.folders: the directories where plugins are located; the default is the plugins directory.

<property>
    <name>plugin.folders</name>
    <value>plugins</value>
    <description>Directories where nutch plugins are located.  Each
    element may be a relative or absolute path.  If absolute, it is used
    as is.  If relative, it is searched for on the classpath.
    </description>
</property>

2. plugin.auto-activation: whether a plugin that is filtered out (that is, not loaded) but required by another plugin is activated automatically; defaults to true.

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automaticaly
  activated if they are needed by some actived plugins.
  </description>
</property>

3. plugin.includes: the list of plugin names to include, written as a regular expression.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)
    |query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|
    urlnormalizer-(pass|regex|basic)
  </value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

4. plugin.excludes: the list of plugin names to exclude, written as a regular expression.

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.  
  </description>
</property>
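
The effect of the include and exclude settings can be sketched with plain java.util.regex. This is an illustration of the rule stated in the property descriptions, not a copy of Nutch's code: PluginFilterSketch and select are hypothetical names, and the sketch assumes that each expression must match the whole plugin directory name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PluginFilterSketch {
    /**
     * Keeps a plugin ID only if it fully matches the include expression
     * and does not fully match the exclude expression (assumed semantics).
     */
    static List<String> select(String includes, String excludes, List<String> pluginIds) {
        Pattern in = Pattern.compile(includes);
        Pattern ex = Pattern.compile(excludes);
        List<String> selected = new ArrayList<>();
        for (String id : pluginIds) {
            if (!in.matcher(id).matches()) {
                continue; // not named by plugin.includes
            }
            if (ex.matcher(id).matches()) {
                continue; // explicitly named by plugin.excludes
            }
            selected.add(id);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "protocol-http", "protocol-httpclient", "parse-html", "parse-pdf", "urlfilter-regex");
        // An empty exclude expression matches only the empty string, so it excludes nothing here.
        System.out.println(select("protocol-http|urlfilter-regex|parse-(text|html|js)", "", found));
    }
}
```

Because the match is against the whole name, protocol-httpclient is not picked up by the protocol-http alternative; it has to be listed explicitly.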

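Based on these settings, the repository locates each plugin's plugin.xml, creates a PluginDescriptor instance for each plugin to record its basic information, and registers every extension point together with the extensions that implement it. At this stage no plugin has actually been loaded into memory; this step only prepares for loading at run time. When the system reaches a feature that needs, for example, URLNormalizer (an extension point interface defined by the system), it asks PluginRepository for the extension point, gets all extensions that implement it, and obtains an instance of each extension, loading it in the process.

Note that a plugin may contain several extension points and extensions, but at run time plugins are loaded per extension. That is, when an extension is needed, the system gets the ExtensionPoint and loads all extensions registered for it, and before loading each extension it first loads the plugin that the extension belongs to. Because a plugin may contain extensions for several extension points, each plugin is loaded only once while the system runs.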
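
The "load the owning plugin first, but only once" behavior amounts to a small cache keyed by plugin ID. The sketch below shows the idea with stand-in classes; LazyPlugin and ActivationCache are hypothetical and only mimic the role that the fActivatedPlugins map plays in PluginRepository.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a plugin object; counts how often it has been started. */
class LazyPlugin {
    int startUpCalls = 0;

    void startUp() {
        startUpCalls++;
    }
}

/** Stand-in for the repository's bookkeeping of activated plugins. */
class ActivationCache {
    private final Map<String, LazyPlugin> activated = new HashMap<>();

    /** Creates and starts the plugin on first request; later requests reuse it. */
    synchronized LazyPlugin getPluginInstance(String pluginId) {
        LazyPlugin plugin = activated.get(pluginId);
        if (plugin == null) {
            plugin = new LazyPlugin();
            plugin.startUp();
            activated.put(pluginId, plugin);
        }
        return plugin;
    }

    synchronized int activatedCount() {
        return activated.size();
    }

    public static void main(String[] args) {
        ActivationCache cache = new ActivationCache();
        // Two extensions of the same plugin both trigger a lookup of their owner.
        LazyPlugin first = cache.getPluginInstance("recommended");
        LazyPlugin second = cache.getPluginInstance("recommended");
        System.out.println(first == second);
        System.out.println(first.startUpCalls);
    }
}
```

Whichever extension is requested first pays the activation cost; every other extension of the same plugin finds the plugin already in the map.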

The plugin loading process is shown below:

Figure 3. Plugin loading sequence

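The sequence diagram shows that, to load a plugin, the actor has to call each of the associated objects directly along the way, and what it finally gets is the object that implements the plugin. The detailed process is as follows:

First, the configuration is loaded when the repository is obtained through PluginRepository.get(conf). The configuration covers the plugin directories and plugin configuration such as plugin.properties. The repository then loads each plugin's plugin.xml according to this configuration and, based on plugin.xml, the classes the plugin depends on.
When the actor needs to load the plugins for an extension point, it can:
1. Use the name of the extension point to get the extension point instance, an ExtensionPoint object, from PluginRepository;
2. Call getExtensions on the ExtensionPoint object, which returns the list of extensions implementing this extension point (Extension[]);
3. For each Extension, call its getExtensionInstance() method to get the actual implementation instance, which is returned as an Object;
4. Cast the Object to the actual class or interface type as appropriate, then call its methods, for example a helloworld method.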
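
The four steps can be traced in a self-contained sketch that mirrors the shape of the calls without depending on Nutch. Everything here is a stand-in: Greeter, EnglishGreeter, SketchExtension, SketchExtensionPoint, SketchRepository, and LookupDemo are hypothetical names, and the reflection call only approximates what getExtensionInstance() does in the real class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical extension point interface. */
interface Greeter {
    String helloWorld();
}

/** Hypothetical extension implementing the extension point. */
class EnglishGreeter implements Greeter {
    public String helloWorld() {
        return "hello world";
    }
}

/** Stand-in for Extension: knows only the implementing class name. */
class SketchExtension {
    private final String className;

    SketchExtension(String className) {
        this.className = className;
    }

    /** Instantiates the implementation reflectively and returns it as Object. */
    Object getExtensionInstance() throws ReflectiveOperationException {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}

/** Stand-in for ExtensionPoint: holds the extensions registered for one point. */
class SketchExtensionPoint {
    private final List<SketchExtension> extensions = new ArrayList<>();

    void addExtension(SketchExtension extension) {
        extensions.add(extension);
    }

    SketchExtension[] getExtensions() {
        return extensions.toArray(new SketchExtension[0]);
    }
}

/** Stand-in for PluginRepository: looks up extension points by ID. */
class SketchRepository {
    private final Map<String, SketchExtensionPoint> points = new HashMap<>();

    void addExtension(String pointId, SketchExtension extension) {
        points.computeIfAbsent(pointId, k -> new SketchExtensionPoint()).addExtension(extension);
    }

    SketchExtensionPoint getExtensionPoint(String pointId) {
        return points.get(pointId);
    }
}

class LookupDemo {
    static List<String> greetAll(SketchRepository repo) throws ReflectiveOperationException {
        List<String> out = new ArrayList<>();
        SketchExtensionPoint point = repo.getExtensionPoint(Greeter.class.getName()); // step 1
        for (SketchExtension extension : point.getExtensions()) {                     // step 2
            Object instance = extension.getExtensionInstance();                       // step 3
            out.add(((Greeter) instance).helloWorld());                               // step 4
        }
        return out;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        SketchRepository repo = new SketchRepository();
        repo.addExtension(Greeter.class.getName(), new SketchExtension(EnglishGreeter.class.getName()));
        System.out.println(greetAll(repo));
    }
}
```

The caller never names EnglishGreeter; it only knows the extension point ID and the interface it casts to.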

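Plugin class loading mechanism

Nutch uses a class called PluginClassLoader to load plugins dynamically. Each Plugin has its own ClassLoader, which loads all the classes the plugin needs, including the plugin's own extension classes, local libraries, and the plugins it depends on. PluginClassLoader extends URLClassLoader and adds no other behavior. It is initialized as follows: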

public PluginClassLoader getClassLoader() {
  if (fClassLoader != null)
    return fClassLoader;
  ArrayList<URL> arrayList = new ArrayList<URL>();
  arrayList.addAll(fExportedLibs);
  arrayList.addAll(fNotExportedLibs);
  arrayList.addAll(getDependencyLibs());
  File file = new File(getPluginPath());
  try {
    for (File file2 : file.listFiles()) {
      if (file2.getAbsolutePath().endsWith("properties"))
        arrayList.add(file2.getParentFile().toURL());
    }
  } catch (MalformedURLException e) {
    LOG.debug(getPluginId() + " " + e.toString());
  }
  URL[] urls = arrayList.toArray(new URL[arrayList.size()]);
  fClassLoader = new PluginClassLoader(urls, PluginDescriptor.class
      .getClassLoader());
  return fClassLoader;
}
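
The second constructor argument above is the parent class loader, which is what lets plugin code see core classes while keeping each plugin's own jars separate. The sketch below shows the same construction with a plain URLClassLoader; LoaderSketch and newLoader are hypothetical names, and an empty temporary directory stands in for a plugin folder.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LoaderSketch {
    /** Builds a loader over the given directories that delegates to the parent first. */
    static URLClassLoader newLoader(List<File> dirs, ClassLoader parent) throws Exception {
        List<URL> urls = new ArrayList<>();
        for (File dir : dirs) {
            urls.add(dir.toURI().toURL());
        }
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }

    public static void main(String[] args) throws Exception {
        // An empty temporary directory plays the role of a plugin folder.
        File dir = Files.createTempDirectory("plugin-sketch").toFile();
        try (URLClassLoader loader =
                 newLoader(Collections.singletonList(dir), LoaderSketch.class.getClassLoader())) {
            // Classes the parent chain can see are resolved by delegation.
            System.out.println(loader.loadClass("java.lang.String") == String.class);
            try {
                // Neither the parent chain nor the empty directory has this class.
                loader.loadClass("no.such.PluginClass");
            } catch (ClassNotFoundException e) {
                System.out.println("not found: " + e.getMessage());
            }
        }
    }
}
```

A class that exists only in a plugin's jars is visible through that plugin's loader and not through the parent, which is why the extension lookup has to go through the descriptor's loader.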

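This method is implemented in the PluginDescriptor class. When an extension is loaded, the following code first loads the plugin that the extension belongs to (implemented in PluginRepository.getPluginInstance()):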

PluginClassLoader loader = pDescriptor.getClassLoader();
Class pluginClass = loader.loadClass(pDescriptor.getPluginClass());
Constructor constructor = pluginClass.getConstructor(new Class[] {
    PluginDescriptor.class, Configuration.class });
Plugin plugin = (Plugin) constructor.newInstance(new Object[] {
    pDescriptor, this.conf });
plugin.startUp();
fActivatedPlugins.put(pDescriptor.getPluginId(), plugin);
return plugin;

The extension itself is then loaded as follows (implemented in Extension.getExtensionInstance()):

PluginClassLoader loader = fDescriptor.getClassLoader();
Class extensionClazz = loader.loadClass(getClazz()); // load the extension class
// lazy loading of Plugin in case there is no instance of the plugin
// already.
this.pluginRepository.getPluginInstance(getDescriptor()); // load the plugin, making sure it is activated
Object object = extensionClazz.newInstance(); // create the extension instance
if (object instanceof Configurable) {
  ((Configurable) object).setConf(this.conf);
}
return object;

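If you are not familiar with URLClassLoader, see the appendix.

IV. Design Patterns

This is the ultimate goal; I will complete this part once I have finished studying design patterns.

Summary

Nutch is an excellent open-source search engine framework, and its plugin mechanism is well worth studying. The plugin mechanism gives the system extensibility, flexibility, and maintainability, lets the developers of each part focus on their own domain without worrying about how the system is integrated, and greatly improves development efficiency.

References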

http://wiki.apache.org/nutch/WhyNutchHasAPluginSystem

http://www.ibm.com/developerworks/cn/java/j-lo-nutchplugin/?S_TACT=105AGX52&S_CMP=tec-csdn

http://wiki.apache.org/nutch/WritingPluginExample-1.2

http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem

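Appendix

Dynamically loading classes with URLClassLoader

Dynamic class loading is usually done with Class.forName(). That method, however, can only load classes the program already references, and only by fully qualified name, for example java.lang.String; it cannot load a standalone .class file or a class from a .jar that the program does not reference. With URLClassLoader you can load a standalone .class file directly, and each time the class is reloaded through a new loader and instantiated, you get its latest version. This is similar to JSP: after you change and save a JSP in Eclipse, refreshing the page shows the latest result without restarting the server.

URLClassLoader is a class in the java.net package. Its constructor takes an array of URLs. Suppose we have a compiled class file at C:\URLClass\testClass.class that contains a method named test; we want to load this class dynamically with URLClassLoader and run the test method.

Java code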

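import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// The URL points at the directory that contains testClass.class.
File xFile = new File("C:/URLClass");
URL xUrl = xFile.toURI().toURL(); // File.toURL() is deprecated
URLClassLoader classLoader = new URLClassLoader(new URL[] { xUrl });
Class<?> xClass = classLoader.loadClass("testClass");
Object xObject = xClass.getDeclaredConstructor().newInstance();
Method xMethod = xClass.getDeclaredMethod("test");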