Arachnid Web Spider Framework (1)

The only documentation for the Arachnid Web Spider Framework is in English; it is reproduced below.

Arachnid Web Spider Framework

Description

Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed. Two example spider applications are included to illustrate how to use the framework.

Warning

WARNING: A Web spider may put a large load on a server and a network. You may wish to do this by design - for instance when load testing YOUR server, using YOUR hosts and YOUR network. DO NOT use this software to place an excessive load on someone else's host and network resources without explicit permission!

Author

This software was written by Robert Platt.

Use

  • Build an Arachnid.jar file using build.xml and Ant. You can also build documentation using the 'docs' target.
  • Add the jar file to your CLASSPATH.
  • Arachnid is an abstract base class that uses the "visitor" pattern. It has a "traverse()" method that walks through a Web site. For each (valid) page in the site it calls the abstract method handleLink(). You need to derive a sub-class from Arachnid and define a handleLink() method. This will be called for each and every valid page in the Web site. A PageInfo object is passed to handleLink(). The PageInfo object contains useful information about the Web page. Four other methods must be defined:

     

    • handleBadLink() - for processing an invalid URL
    • handleNonHTMLlink() - for processing links to non-HTML resources
    • handleExternalLink() - for processing links that are outside the Web site
    • handleBadIO() - in the event of an I/O problem while attempting to process a Web page

     

    Instantiate your sub-class and call traverse().
  • Compile your application and run it.
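The callback structure described in the steps above can be illustrated without the Arachnid jar. The following is a minimal sketch, not Arachnid's actual implementation: MiniSpider, RecordingSpider, and the in-memory site map are hypothetical names that stand in for the real crawler and for real HTTP fetching. Only the handler names mirror the list above, and the handlers for non-HTML links and I/O errors are omitted for brevity.

```java
import java.util.*;

// Hypothetical stand-in for a crawler base class; NOT the real Arachnid API.
abstract class MiniSpider {
    private final String base;                     // site prefix, e.g. "http://example.test/"
    private final Map<String, List<String>> site;  // page URL -> outgoing links (a fake "web")

    MiniSpider(String base, Map<String, List<String>> site) {
        this.base = base;
        this.site = site;
    }

    // Callbacks a subclass must supply, named after the handlers listed above.
    protected abstract void handleLink(String url);
    protected abstract void handleBadLink(String url, String parent);
    protected abstract void handleExternalLink(String url, String parent);

    // Breadth-first walk starting at the base page; each URL is processed once.
    public void traverse() {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(base);
        queue.add(base);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            handleLink(page);
            for (String link : site.getOrDefault(page, List.of())) {
                if (!seen.add(link)) continue;  // skip URLs already encountered
                if (!link.startsWith(base)) handleExternalLink(link, page);
                else if (!site.containsKey(link)) handleBadLink(link, page);
                else queue.add(link);
            }
        }
    }
}

// Concrete spider that simply records what each callback received.
class RecordingSpider extends MiniSpider {
    final List<String> pages = new ArrayList<>();
    final List<String> bad = new ArrayList<>();
    final List<String> external = new ArrayList<>();

    RecordingSpider(String base, Map<String, List<String>> site) { super(base, site); }

    @Override protected void handleLink(String url) { pages.add(url); }
    @Override protected void handleBadLink(String url, String parent) { bad.add(url); }
    @Override protected void handleExternalLink(String url, String parent) { external.add(url); }
}
```

The base class owns the traversal and the subclass only decides what to do with each page, which is the same division of labour the steps above describe for Arachnid and its sub-classes.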

     

Example

The following code uses Arachnid to generate a (very simplistic) site map for a Web site.


import java.io.*;
import java.net.*;
import java.util.*;
import bplatt.spider.*;

public class SimpleSiteMapGen {
  private String site;
  // NOTE: the HTML inside these two literals was stripped when this page was
  // rendered; the markup below is an assumed minimal reconstruction.
  private final static String header = "<html><head></head><body><ul>";
  private final static String trailer = "</ul></body></html>";

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("java SimpleSiteMapGen <url>");
      System.exit(-1);
    }
    SimpleSiteMapGen s = new SimpleSiteMapGen(args[0]);
    s.generate();
  }

  public SimpleSiteMapGen(String site) { this.site = site; }

  public void generate() {
    MySpider spider = null;
    try { spider = new MySpider(site); }
    catch(MalformedURLException e) {
      System.err.println(e);
      System.err.println("Invalid URL: "+site);
      return;
    }
    System.out.println(header);
    spider.traverse();
    System.out.println(trailer);
  }
}

class MySpider extends Arachnid {
  public MySpider(String base) throws MalformedURLException { super(base); }

  // Called once for every valid page; prints one site-map entry.
  protected void handleLink(PageInfo p) {
    String link = p.getUrl().toString();
    String title = p.getTitle();
    if (link == null || title == null || link.length() == 0 || title.length() == 0) return;
    // Assumed reconstruction: the anchor markup around the title was also stripped.
    System.out.println("<li><a href=\""+link+"\">"+title+"</a></li>");
  }

  // The remaining callbacks are intentionally empty in this example.
  protected void handleBadLink(URL url, URL parent, PageInfo p) { }
  protected void handleBadIO(URL url, URL parent) { }
  protected void handleNonHTMLlink(URL url, URL parent, PageInfo p) { }
  protected void handleExternalLink(URL url, URL parent) { }
}

Availability

The Arachnid Web Spider framework is available via SourceForge. Follow this link to obtain the source code. If you don't already have a Java Virtual Machine, you can obtain one from Sun Microsystems.

License

The Arachnid Web Spider framework is licensed under the GNU General Public License. See GPL.txt for details. If you are unable or unwilling to abide by the terms of this license, please remove this code from your machine.

Support

The Arachnid Web Spider framework is distributed AS IS, with NO SUPPORT.


Beyond that, the only way to learn how to use it is to study the three examples that ship with the program.

Start with SimpleSiteMapGen.java.
This file implements one simple feature: extracting the title of each page it visits.
The main function is very short:
public static void main(String[] args) {
//      if (args.length != 1) {
//              System.err.println("java SimpleSiteMapGen <url>");
//              System.exit(-1);
//      }
        String url = "http://www.hnict.net";
        SimpleSiteMapGen s = new SimpleSiteMapGen(url);
        s.generate();
}
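This is a version I modified slightly: I commented out the argument check and hard-coded the URL so the program can be run directly.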
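Rather than commenting out the argument check, one alternative is to fall back to a default URL when no argument is supplied. A minimal sketch follows; StartUrl and resolve are hypothetical names and are not part of Arachnid or of the original example.

```java
// Hypothetical helper, not part of Arachnid: chooses the start URL for the spider.
class StartUrl {
    // Returns args[0] when exactly one argument is supplied, otherwise the default.
    static String resolve(String[] args, String defaultUrl) {
        return (args != null && args.length == 1) ? args[0] : defaultUrl;
    }
}
```

With such a helper, main could pass StartUrl.resolve(args, "http://www.hnict.net") to the SimpleSiteMapGen constructor, keeping the command-line behaviour while still running with no arguments.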
The generate method of SimpleSiteMapGen is as follows:
public void generate() {
                MySpider spider = null;
                try { spider = new MySpider(site); }
                catch(MalformedURLException e) {
                        System.err.println(e);
                        System.err.println("Invalid URL: "+site);
                        return;
                }
                System.out.println(header);
                spider.traverse();
                System.out.println(trailer);
        }
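It does two main things: it instantiates a MySpider object and then calls MySpider's traverse method.
The MySpider class extends the abstract class Arachnid (Arachnid.java), which holds the core functionality of the whole crawler engine. MySpider also implements the
handleLink() method, which calls PageInfo's getTitle() and getUrl() to obtain the page's title and URL.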
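The guard and the output step inside handleLink() can be sketched in isolation. The markup in the example's print statement was lost when the page was rendered, so the list-item format used here is an assumption, and SiteMapEntry, format, and escape are hypothetical names rather than anything provided by Arachnid.

```java
// Hypothetical helper, not part of Arachnid: builds one site-map line from a URL and a title.
class SiteMapEntry {
    // Returns null when either value is missing, matching the early return in handleLink().
    static String format(String link, String title) {
        if (link == null || title == null || link.isEmpty() || title.isEmpty()) return null;
        // Assumed output format: one HTML list item holding a hyperlink.
        return "<li><a href=\"" + escape(link) + "\">" + escape(title) + "</a></li>";
    }

    // Minimal HTML escaping so a title containing markup cannot break the page.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Escaping is not in the original example; it is added here because page titles are untrusted input once they are written back into HTML.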
