The only documentation for the Arachnid Web Spider Framework is a single English document, reproduced below.
Arachnid Web Spider Framework
Description
Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed. Two example spider applications are included to illustrate how to use the framework.
Warning
WARNING: A Web spider may put a large load on a server and a network. You may wish to do this by design - for instance when load testing YOUR server, using YOUR hosts and YOUR network. DO NOT use this software to place an excessive load on someone else's host and network resources without explicit permission!!
Author
This software was written by Robert Platt.
Use
- Build an Arachnid.jar file using build.xml and Ant. You can also build documentation using the 'docs' target.
- Add the jar file to your CLASSPATH
- Arachnid is an abstract base class that uses the "visitor" pattern. It has a "traverse()" method that walks through a Web site. For each (valid) page in the site it calls the abstract method handleLink(). You need to derive a sub-class from Arachnid and define a handleLink() method. This will be called for each and every valid page in the Web site. A PageInfo object is passed to handleLink(). The PageInfo object contains useful information about the Web page. Four other methods must be defined:
  - handleBadLink() - for processing an invalid URL
  - handleNonHTMLlink() - for processing links to non-HTML resources
  - handleExternalLink() - for processing links that are outside the Web site
  - handleBadIO() - in the event of an I/O problem while attempting to process a Web page
- Compile your application and run it.
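The callback structure described in the steps above can be sketched without the library itself. The following is a minimal, self-contained stand-in, not Arachnid's actual code: `MiniSpider`, its in-memory `pages` map, and the simplified `String`-based hook signatures are all assumptions made for illustration (the real hooks take `PageInfo` and `java.net.URL` arguments, and there are five of them). It only shows how a `traverse()` method in an abstract base class might hand each link to exactly one subclass hook.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MiniSpiderDemo {

    // Hypothetical stand-in for the Arachnid base class: it walks an in-memory
    // "site" (page URL -> outgoing links) instead of fetching pages over HTTP.
    static abstract class MiniSpider {
        private final String base;
        private final Map<String, List<String>> pages;

        MiniSpider(String base, Map<String, List<String>> pages) {
            this.base = base;
            this.pages = pages;
        }

        // Breadth-first walk from the base URL; every link goes to one hook.
        void traverse() {
            Deque<String> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            queue.add(base);
            seen.add(base);
            while (!queue.isEmpty()) {
                String url = queue.remove();
                if (!url.startsWith(base)) {
                    handleExternalLink(url);      // outside the site: not followed
                } else if (!pages.containsKey(url)) {
                    handleBadLink(url);           // inside the site but unknown
                } else {
                    handleLink(url);              // valid page: visit, then queue its links
                    for (String next : pages.get(url)) {
                        if (seen.add(next)) queue.add(next);
                    }
                }
            }
        }

        protected abstract void handleLink(String url);
        protected abstract void handleBadLink(String url);
        protected abstract void handleExternalLink(String url);
    }

    // Runs a recording subclass over a small fake site and returns the hook calls in order.
    static List<String> crawl() {
        Map<String, List<String>> site = Map.of(
            "http://a.test/", List.of("http://a.test/x", "http://b.test/", "http://a.test/missing"),
            "http://a.test/x", List.of("http://a.test/"));
        List<String> log = new ArrayList<>();
        MiniSpider spider = new MiniSpider("http://a.test/", site) {
            protected void handleLink(String url) { log.add("page " + url); }
            protected void handleBadLink(String url) { log.add("bad " + url); }
            protected void handleExternalLink(String url) { log.add("external " + url); }
        };
        spider.traverse();
        return log;
    }

    public static void main(String[] args) {
        crawl().forEach(System.out::println);
    }
}
```

The subclass never drives the traversal itself. It only fills in the hooks, which is the same division of labour the list above describes for Arachnid.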
Example
The following code uses Arachnid to generate a (very simplistic) site map for a Web site.
import java.io.*;
import java.net.*;
import java.util.*;
import bplatt.spider.*;
public class SimpleSiteMapGen {
    private String site;
    private final static String header = "<html></html><head></head>\n";
    private final static String trailer = "";

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("java SimpleSiteMapGen <url>");
            System.exit(-1);
        }
        SimpleSiteMapGen s = new SimpleSiteMapGen(args[0]);
        s.generate();
    }

    public SimpleSiteMapGen(String site) {
        this.site = site;
    }

    // Create the spider, then print header, one line per page, and trailer.
    public void generate() {
        MySpider spider = null;
        try {
            spider = new MySpider(site);
        } catch (MalformedURLException e) {
            System.err.println(e);
            System.err.println("Invalid URL: " + site);
            return;
        }
        System.out.println(header);
        spider.traverse();
        System.out.println(trailer);
    }
}

class MySpider extends Arachnid {
    public MySpider(String base) throws MalformedURLException {
        super(base);
    }

    // Called for every valid page; pages without a URL or title are skipped.
    protected void handleLink(PageInfo p) {
        String link = p.getUrl().toString();
        String title = p.getTitle();
        if (link == null || title == null || link.length() == 0 || title.length() == 0)
            return;
        else
            System.out.println("" + title + "");
    }

    // The remaining hooks are required by Arachnid but left empty here.
    protected void handleBadLink(URL url, URL parent, PageInfo p) { }
    protected void handleBadIO(URL url, URL parent) { }
    protected void handleNonHTMLlink(URL url, URL parent, PageInfo p) { }
    protected void handleExternalLink(URL url, URL parent) { }
}
Availability
The Arachnid Web Spider framework is available via SourceForge. Follow this link to obtain the source code. If you don't already have a Java Virtual Machine, you can obtain one from Sun Microsystems.
License
The Arachnid Web Spider framework is licensed under the GNU Public License. See GPL.txt for details. If you are unable or unwilling to abide by the terms of this license, please remove this code from your machine.
Support
The Arachnid Web Spider framework is distributed AS IS, with NO SUPPORT.
So the only way to learn how to use it is to study the three examples that ship with the program.
Let's start with SimpleSiteMapGen.java.
This file implements one simple feature: extracting the title of each page it visits.
The main function is very short:
public static void main(String[] args) {
    // if (args.length != 1) {
    //     System.err.println("java SimpleSiteMapGen <url>");
    //     System.exit(-1);
    // }
    String url = "http://www.hnict.net";
    SimpleSiteMapGen s = new SimpleSiteMapGen(url);
    s.generate();
}
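This is a version I modified slightly: I commented out the argument check and hard-coded the URL so that it can be run directly.
The generate method of SimpleSiteMapGen is as follows: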
public void generate() {
    MySpider spider = null;
    try {
        spider = new MySpider(site);
    } catch (MalformedURLException e) {
        System.err.println(e);
        System.err.println("Invalid URL: " + site);
        return;
    }
    System.out.println(header);
    spider.traverse();
    System.out.println(trailer);
}
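It does two main things: it instantiates a MySpider object and then calls that object's traverse method.
The MySpider class extends the abstract class Arachnid (Arachnid.java), which provides the core functionality of the whole crawler engine. MySpider also implements the handleLink() method, which simply calls PageInfo's getUrl() and getTitle() to obtain the page's URL and title.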
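To make the handleLink() logic easier to experiment with outside the crawler, here is a small self-contained sketch. `PageStub` is a hypothetical stand-in for Arachnid's PageInfo that exposes only the two getters mentioned above, and `formatEntry` is an invented helper. The `<li><a>` markup is an assumption about one reasonable site-map format, not something taken from the library. Unlike the original method, this version checks getUrl() for null before calling toString() on it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class SiteMapEntryDemo {

    // Hypothetical stand-in for PageInfo: only getUrl() and getTitle().
    static final class PageStub {
        private final URL url;
        private final String title;

        PageStub(URL url, String title) {
            this.url = url;
            this.title = title;
        }

        // Convenience factory so callers need not handle the checked exception.
        static PageStub of(String url, String title) {
            try {
                return new PageStub(URI.create(url).toURL(), title);
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Invalid URL: " + url, e);
            }
        }

        URL getUrl() { return url; }
        String getTitle() { return title; }
    }

    // Same skip conditions as handleLink(), but returns the entry instead of
    // printing it; returns null when the page has no usable URL or title.
    static String formatEntry(PageStub p) {
        if (p.getUrl() == null || p.getTitle() == null) return null;
        String link = p.getUrl().toString();
        String title = p.getTitle();
        if (link.isEmpty() || title.isEmpty()) return null;
        // Assumed output format; the title is not HTML-escaped in this sketch.
        return "<li><a href=\"" + link + "\">" + title + "</a></li>";
    }

    public static void main(String[] args) {
        PageStub page = PageStub.of("http://example.com/index.html", "Home");
        System.out.println(formatEntry(page));
    }
}
```

Returning a string rather than printing keeps the formatting step testable on its own. In a real subclass, the body of handleLink() would print or collect the returned value.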