Parameter configuration for the WebCollector crawler (proxy, resumable crawling, etc.)


License: CC 4.0 BY-SA


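This article uses BreadthCrawler as an example to walk through the configuration of a WebCollector crawler, including setting the CrawlPath that stores crawl records and enabling resumable crawling, so that a crawler run under the same CrawlPath continues from its earlier runs.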

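BreadthCrawler is one of the most commonly used crawlers in WebCollector. It relies on the file system to store crawl information. Taking BreadthCrawler as the example, WebCollector's crawl configuration looks like this: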

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;
import java.net.InetSocketAddress;
import java.net.Proxy;


public class MyCrawler extends BreadthCrawler{

    /* Define your own processing in the visit method */
    @Override
    public void visit(Page page) {
        System.out.println("URL:"+page.getUrl());
        System.out.println("Content-Type:"+page.getResponse().getContentType());
        System.out.println("Code:"+page.getResponse().getCode());
        System.out.println("-----------------------------");
    }
    
    public static void main(String[] args) throws Exception{
        MyCrawler crawler=new MyCrawler();
        
        /* Configure the crawl of the Hefei University of Technology website */
        crawler.addSeed("http://www.hfut.edu.cn/ch/");
        crawler.addRegex("http://.*hfut\\.edu\\.cn/.*");
        
        /* Set the folder that stores the crawl records */
        crawler.setCrawlPath("crawl_hfut");
        
        /* Set the number of threads */
        crawler.setThreads(50);
        
        /* Set whether the crawler runs in resumable mode (continues a previous crawl) */
        crawler.setResumable(false);
        
        /* Set the proxy server */
        Proxy proxy=new Proxy(Proxy.Type.HTTP, new InetSocketAddress("14.18.16.67",80));
        crawler.setProxy(proxy);
        
        /* Set the User-Agent */
        crawler.setUseragent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0");
        
        /* Set the Cookie */
        crawler.setCookie("......");
        
        /* Crawl to a depth of 5 */
        crawler.start(5);
        crawler.start(5);
    }
  
}
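
To get a feel for which URLs the pattern passed to addRegex would let through, the pattern can be tried on its own with java.util.regex. This is only a rough sketch: it assumes the rule is applied as a full match against each URL, and the class UrlRuleDemo and the helper accepted are hypothetical names used for illustration, not part of WebCollector.

```java
import java.util.regex.Pattern;

public class UrlRuleDemo {
    // Same pattern string as the one passed to addRegex in the listing above.
    static final Pattern RULE = Pattern.compile("http://.*hfut\\.edu\\.cn/.*");

    // Hypothetical helper: true when the whole URL matches the pattern
    // (assumption: the crawler checks rules with a full match).
    public static boolean accepted(String url) {
        return RULE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.hfut.edu.cn/ch/",        // seed URL, matches
            "http://news.hfut.edu.cn/show.html", // any subdomain prefix matches
            "https://www.hfut.edu.cn/ch/",       // https does not match the http:// prefix
            "http://www.hfut.edu.cn"             // no slash after the host, does not match
        };
        for (String url : urls) {
            System.out.println(accepted(url) + "  " + url);
        }
    }
}
```

Under that assumption, pages served over https or URLs without a slash after the host would be filtered out, so the pattern may need adjusting for sites that redirect to https.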

A note on setCrawlPath: it is specific to BreadthCrawler and sets the folder in which crawl records are stored. If it is not specified, a folder named crawl is used by default.

In resumable mode, every run of the same crawler must use the same CrawlPath, because the crawl records are stored under that CrawlPath.
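
The reason the path has to stay the same can be illustrated with a toy model that uses only the Java standard library. This is not how WebCollector stores its records: the class ResumeSketch, the file name visited.txt, and both helper methods are hypothetical, and the sketch only shows the general idea that a later run can see earlier records only if it reads from the folder the earlier run wrote to.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class ResumeSketch {
    // Hypothetical record file name; WebCollector's real on-disk layout differs.
    private static final String RECORD_FILE = "visited.txt";

    // Load the URLs recorded under crawlPath; empty when nothing was stored there.
    public static Set<String> loadVisited(Path crawlPath) throws IOException {
        Path record = crawlPath.resolve(RECORD_FILE);
        Set<String> visited = new LinkedHashSet<>();
        if (Files.exists(record)) {
            visited.addAll(Files.readAllLines(record, StandardCharsets.UTF_8));
        }
        return visited;
    }

    // Append one visited URL to the record file under crawlPath.
    public static void markVisited(Path crawlPath, String url) throws IOException {
        Files.createDirectories(crawlPath);
        Files.write(crawlPath.resolve(RECORD_FILE),
                Collections.singletonList(url),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path samePath = Files.createTempDirectory("crawl_hfut");
        Path otherPath = Files.createTempDirectory("crawl_other");

        // "First run": record a visited page under samePath.
        markVisited(samePath, "http://www.hfut.edu.cn/ch/");

        // "Second run" with the same path sees the earlier record.
        System.out.println("same path:  " + loadVisited(samePath));
        // A run with a different path starts with no history.
        System.out.println("other path: " + loadVisited(otherPath));
    }
}
```

In the toy model, the run that reuses the folder can skip the page it already recorded, while the run pointed at a different folder starts from scratch, which mirrors why a resumable crawl should keep its CrawlPath fixed.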