Gecco crawler: multiple HtmlBeans matching the same matchUrl

The two crawler HtmlBeans are as follows.

The first HtmlBean fetches the novel's chapter content:

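import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Html;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

/**
 * Fetches the content of one novel chapter.
 */
@Gecco(
		matchUrl="http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html",
		pipelines="xybwPipeline"
		)
public class XYBW implements HtmlBean{
	private static final long serialVersionUID = 2833184596055251729L;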

	@RequestParameter
	private Long index;
	
	@Text
	@HtmlField(cssPath=".read_m > h1:nth-child(2) > a:nth-child(1)")
	private String bookName;
	@Text
	@HtmlField(cssPath=".ydleft > h2:nth-child(2)")
	private String chapterName;
	@Html
	@HtmlField(cssPath=".yd_text2")
	private String content;
	
	
	
	public Long getIndex() {
		return index;
	}
	public void setIndex(Long index) {
		this.index = index;
	}
	public String getBookName() {
		return bookName;
	}
	public void setBookName(String bookName) {
		this.bookName = bookName;
	}
	public String getChapterName() {
		return chapterName;
	}
	public void setChapterName(String chapterName) {
		this.chapterName = chapterName;
	}
	public String getContent() {
		return content;
	}
	public void setContent(String content) {
		if (content != null && !content.isEmpty()) {
			content = content.replaceAll(" ", "");
			content = content.replaceAll(" ", "");
			content = content.replaceAll("<br/>", "");
			content = content.replaceAll("<br>", "");
			content = content.replaceAll("\\n{2}", "\n");
			this.content = content;
		}else{
			this.content = "";			
		}
	}
}

The second HtmlBean fetches the novel's table of contents:

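import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Href;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

/**
 * Fetches the novel's table of contents.
 */
@Gecco(
		matchUrl="http://www.xs2345.com/read/18/18914/0.html",
		pipelines="xybwIndexPipeline"
		)
public class XYBWIndex implements HtmlBean{
	private static final long serialVersionUID = 6065963771104230481L;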

	@Text
	@HtmlField(cssPath=".ml_title > h1:nth-child(1)")
	private String bookName;
	
	@Text
	@HtmlField(cssPath=".ml_main > dl > dd > a")
	private List<String> chapterNameList;
	
	@Href(click=true)
	@HtmlField(cssPath=".ml_main > dl > dd > a")
	private List<String> chapterList;
	
	public String getBookName() {
		return bookName;
	}
	public void setBookName(String bookName) {
		this.bookName = bookName;
	}
	public List<String> getChapterNameList() {
		return chapterNameList;
	}
	public void setChapterNameList(List<String> chapterNameList) {
		this.chapterNameList = chapterNameList;
	}
	public List<String> getChapterList() {
		return chapterList;
	}
	public void setChapterList(List<String> chapterList) {
		this.chapterList = chapterList;
	}
	
}

Each bean also needs its corresponding Pipeline; those are omitted here.

Starting the crawl:

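// Seed request: the table-of-contents page, decoded as GBK
HttpRequest request_xybw = new HttpGetRequest();
request_xybw.setUrl("http://www.xs2345.com/read/18/18914/0.html");
request_xybw.setCharset("gbk");

GeccoEngine.create()
		.classpath("com.xfire")   // package scanned for @Gecco beans
		.start(request_xybw)      // seed request
		.thread(1)
		.interval(1000)           // delay between requests, in milliseconds
		.mobile(false)
		.start();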

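Analysis:

The problem first appeared with the following settings.

XYBW's matchUrl:

matchUrl="http://www.xs2345.com/read/18/18914/{index}.html"
XYBWIndex's matchUrl:
matchUrl="http://www.xs2345.com/read/18/18914/0.html"
At run time, the seed URL

http://www.xs2345.com/read/18/18914/0.html

was matched first by XYBW's pattern

http://www.xs2345.com/read/18/18914/{index}.html

and once that first HtmlBean had been matched, the spider run ended.

As a result, the HtmlBean that was meant to fetch the table of contents was never processed.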
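The overlap can be reproduced outside Gecco with a minimal Java sketch. It assumes, purely for illustration, that a {name} placeholder expands to ([^/]+) and that beans are tried in registration order with the first hit winning. FirstMatchDemo, toRegex, matches and firstMatch are hypothetical names, not Gecco's API or its real matching code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL-template matcher; not Gecco's actual code.
class FirstMatchDemo {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Assumption: each {name} placeholder matches one or more non-slash characters.
    static Pattern toRegex(String template) {
        StringBuilder sb = new StringBuilder("^");
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote the literal text so '.' and similar characters are not regex operators.
            sb.append(Pattern.quote(template.substring(last, m.start())));
            sb.append("([^/]+)");
            last = m.end();
        }
        sb.append(Pattern.quote(template.substring(last)));
        return Pattern.compile(sb.append("$").toString());
    }

    static boolean matches(String template, String url) {
        return toRegex(template).matcher(url).matches();
    }

    // Returns the name of the first registered bean whose template matches, or null.
    static String firstMatch(Map<String, String> beans, String url) {
        for (Map.Entry<String, String> e : beans.entrySet()) {
            if (matches(e.getValue(), url)) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> beans = new LinkedHashMap<>();
        beans.put("XYBW", "http://www.xs2345.com/read/18/18914/{index}.html");
        beans.put("XYBWIndex", "http://www.xs2345.com/read/18/18914/0.html");
        String url = "http://www.xs2345.com/read/18/18914/0.html";
        // Both templates match this URL, but only the first registered one is returned.
        System.out.println(firstMatch(beans, url));
    }
}
```

With XYBW registered first, the lookup returns XYBW for the 0.html URL even though XYBWIndex matches it exactly, which is the behaviour described above.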

Changing XYBW's matchUrl to the following solved the problem:

matchUrl="http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html"
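As a side check on the bracket expression in that pattern, the sketch below shows how plain java.util.regex reads [^0{1}] on its own: inside square brackets the braces are literal characters, so the class excludes the four characters 0, {, 1 and }. How Gecco itself treats the {1} inside the brackets when it expands placeholders is not verified here, so the effective pattern inside Gecco may differ. CharClassDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Illustrates standard Java regex semantics only; not Gecco's matchUrl handling.
class CharClassDemo {
    // Inside [...], '{' and '}' are literals, so this is "any one char except 0, {, 1, }".
    static final Pattern NOT_ZERO_CLASS = Pattern.compile("[^0{1}]");

    static boolean matchesClass(String s) {
        return NOT_ZERO_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "1", "2", "{", "23"}) {
            System.out.println(s + " -> " + matchesClass(s));
        }
    }
}
```

Under these plain-regex rules the class matches exactly one character, so a multi-digit chapter number can only be accepted through the {index} alternative.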
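However, I think a better solution is to process every HtmlBean that matches: instead of having the Spider fetch a single match, change it to fetch an array of all matches.

// match the SpiderBean
currSpiderBeanClass = engine.getSpiderBeanFactory().matchSpider(request);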
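The idea of returning every match can be sketched in plain Java as follows. This is only an illustration of the proposal, not a patch to Gecco: AllMatchesDemo, register and matchAll are hypothetical names, and the URL patterns are written as ordinary regular expressions rather than Gecco templates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical registry; the names here are illustrative and not part of Gecco.
class AllMatchesDemo {
    // bean name -> compiled URL pattern, kept in registration order
    private final Map<String, Pattern> beans = new LinkedHashMap<>();

    void register(String beanName, String urlRegex) {
        beans.put(beanName, Pattern.compile(urlRegex));
    }

    // Collects every bean whose pattern matches, instead of stopping at the first.
    List<String> matchAll(String url) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : beans.entrySet()) {
            if (e.getValue().matcher(url).matches()) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        AllMatchesDemo registry = new AllMatchesDemo();
        registry.register("XYBW", "http://www\\.xs2345\\.com/read/18/18914/([^/]+)\\.html");
        registry.register("XYBWIndex", "http://www\\.xs2345\\.com/read/18/18914/0\\.html");
        // A spider built this way would render and run the pipelines of each matched bean.
        for (String bean : registry.matchAll("http://www.xs2345.com/read/18/18914/0.html")) {
            System.out.println("process " + bean);
        }
    }
}
```

With this approach the overlapping patterns no longer need to be made mutually exclusive, at the cost of the chapter bean also being rendered for the index page.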

