1. Add the dependency
<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>1.0.8</version>
</dependency>
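2. Create the crawler class
- Implementing the HtmlBean interface marks this crawler as one that parses HTML pages (Gecco also supports parsing JSON).
- The @Gecco annotation tells the crawler which URL pattern to match (matchUrl) and which bean-processing classes to run after content extraction (pipelines). The processing classes follow the pipes-and-filters pattern, so more than one can be defined.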
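To make the pipes-and-filters idea a little more concrete, here is a minimal plain-Java sketch that does not depend on Gecco. The names PipelineSketch, PageBean, and Handler are hypothetical and are not part of Gecco's API; the sketch only shows how several processing classes can each receive the same extracted bean in turn.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {

    /** Hypothetical stand-in for an extracted bean; not a Gecco type. */
    static class PageBean {
        final String title;

        PageBean(String title) {
            this.title = title;
        }
    }

    /** Hypothetical handler interface: each stage receives the same bean. */
    interface Handler {
        void process(PageBean bean);
    }

    /** Runs every registered handler on one bean, in registration order. */
    static List<String> run(PageBean bean) {
        List<String> log = new ArrayList<>();
        List<Handler> handlers = new ArrayList<>();
        // First stage: pretend to print the bean to the console.
        handlers.add(b -> log.add("console:" + b.title));
        // Second stage: pretend to store something derived from the bean.
        handlers.add(b -> log.add("store:" + b.title.length()));
        for (Handler handler : handlers) {
            handler.process(bean);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(new PageBean("Gecco demo")));
    }
}
```

In Gecco itself the processing classes are referenced by name through the pipelines attribute of @Gecco; the sketch above only mirrors the general shape of that arrangement.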
import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;
import java.util.List;
/**
 * @author lianjc
 * @Date: 2018/11/19 09:54
 * @Description:
 */
@Gecco(matchUrl = "https://blog.youkuaiyun.com/u013396133/article/details/84255590",pipelines = "testPip