The Scrapy framework offers two kinds of spiders: the Spider class and the CrawlSpider class. This case study uses CrawlSpider to crawl the whole site.
CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages in the start_urls list, whereas CrawlSpider defines a set of rules (Rule) that provide a convenient mechanism for following links: it extracts links from the pages it crawls and keeps crawling them.
Create a CrawlSpider from the template:
scrapy genspider -t crawl spider_name www.xxxx.com
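Running this command generates a spider skeleton roughly like the one below (the exact boilerplate varies slightly between Scrapy versions; the class and spider names simply follow the name and domain passed on the command line):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SpiderNameSpider(CrawlSpider):
    name = 'spider_name'
    allowed_domains = ['www.xxxx.com']
    start_urls = ['http://www.xxxx.com/']

    # one Rule per type of link to follow; parse_item is the generator's default callback name
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item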
LinkExtractors: the purpose of a Link Extractor is to extract links; it is used through its extract_links() method and provides filters so that only links matching certain criteria (for example regular expressions) are extracted. The filters are configured through the following constructor parameters:
- allow (a regular expression, or a list of them) – a URL must match this regular expression (or list) to be extracted. If not given (or empty), all links are matched.
- deny (a regular expression, or a list of them) – URLs matching this regular expression (or list) are excluded, i.e. not extracted. It takes precedence over the allow parameter. If not given (or None), no links are excluded.
- allow_domains (str or list) – a single value or a list of strings containing the domains from which links will be extracted.
- deny_domains (str or list) – a single value or a list of strings containing the domains from which links will not be extracted.
- deny_extensions (list) – a list of file extensions to ignore when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors module.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) defining the regions inside the response from which links should be extracted. If given, only the text selected by those XPaths is scanned for links. See the example below.
- tags (str or list) – a tag or list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – a list of attributes to look for when extracting links (only for the tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – whether to canonicalize each extracted URL (using scrapy.utils.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to the extracted links.
- process_value (callable) – see the process_value argument of the BaseSgmlLinkExtractor constructor.
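As a quick illustration of these filters, the sketch below builds a LinkExtractor by hand and calls extract_links() on a response. The allow/deny patterns, domain, and XPath are illustrative values, not taken from the case study:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=r'jobs/\d+\.html',                          # keep only job-detail URLs
    deny=r'passport/login',                           # drop login URLs, overrides allow
    allow_domains=['www.lagou.com'],                  # stay on this domain
    restrict_xpaths='//div[@id="s_position_list"]',   # only scan this region (hypothetical id)
)
links = link_extractor.extract_links(response)        # returns a list of Link objects
for link in links:
    print(link.url, link.text)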
Rules: rules contains one or more Rule objects, each of which defines a particular behaviour for crawling the site. If multiple rules match the same link, the first one, according to the order in which they are defined in this attribute, will be used.
- callback: a callable, or the name of a spider method, called for each link extracted by the link_extractor; the callback receives a response as its first argument. Note: when writing crawl spider rules, avoid using parse as the callback, because CrawlSpider uses the parse method itself to implement its logic; if parse is overridden, the crawl spider will break.
- follow: a boolean specifying whether links should be followed from responses extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
- process_links: the name of a spider method that will be called with the list of links extracted by the link_extractor; it is mainly used for filtering links.
- process_request: the name of a spider method that will be called for every request extracted by this rule (used to filter requests); see the sketch after this list.
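The case study below only uses callback and follow, so here is a hedged sketch of the process_links hook as well; the spider name, URL patterns, and the filtering condition are made up for illustration:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://www.example.com/']

    rules = (
        # no callback: follow defaults to True, so these pages are only used to discover more links
        Rule(LinkExtractor(allow=r'zhaopin/')),
        # explicit callback plus follow=True; extracted links are first passed through filter_links
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_job',
             follow=True, process_links='filter_links'),
    )

    def filter_links(self, links):
        # process_links hook: receives the extracted Link objects and returns the ones to keep
        return [link for link in links if 'gongsi' not in link.url]

    def parse_job(self, response):
        pass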
The following shows a case study that crawls lagou.com:
spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from LaGouSpider.items import LagouJobItemLoader, LagouspiderItem
import datetime
from LaGouSpider.utils.common import get_md5
class LagouSpider(CrawlSpider):
name = 'lagou'
allowed_domains = ['www.lagou.com']
start_urls = ['https://www.lagou.com/']
headers = {
"HOST": "www.lagou.com",
"Referer": "https://www.lagou.com",
'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
}
custom_settings = {
"COOKIES_ENABLED": False,
"DOWNLOAD_DELAY": 1,
'DEFAULT_REQUEST_HEADERS': {
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Connection': 'keep-alive',
'Cookie': 'JSESSIONID=ABAAABAAAFCAAEGBC99154D1A744BD8AD12BA0DEE80F320; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; _ga=GA1.2.1111395267.1516570248; _gid=GA1.2.1409769975.1516570248; user_trace_token=20180122053048-58e2991f-fef2-11e7-b2dc-525400f775ce; PRE_UTM=; LGUID=20180122053048-58e29cd9-fef2-11e7-b2dc-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; X_HTTP_TOKEN=7e9c503b9a29e06e6d130f153c562827; _gat=1; LGSID=20180122055709-0762fae6-fef6-11e7-b2e0-525400f775ce; PRE_HOST=github.com; PRE_SITE=https%3A%2F%2Fgithub.com%2Fconghuaicai%2Fscrapy-spider-templetes; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4060662.html; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516569758,1516570249,1516570359,1516571830; _putrc=88264D20130653A0; login=true; unick=%E7%94%B0%E5%B2%A9; gate_login_token=3426bce7c3aa91eec701c73101f84e2c7ca7b33483e39ba5; LGRID=20180122060053-8c9fb52e-fef6-11e7-a59f-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516572053; TG-TRACK-CODE=index_navigation; SEARCH_ID=a39c9c98259643d085e917c740303cc7',
'Host': 'www.lagou.com',
'Origin': 'https://www.lagou.com',
'Referer': 'https://www.lagou.com/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
}
rules = (
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_job', follow=True),
)
def parse_job(self, response):
        # parse a job-posting page on lagou.com
item_loader = LagouJobItemLoader(item=LagouspiderItem(), response=response)
item_loader.add_css("title", ".job-name::attr(title)")
item_loader.add_value("url", response.url)
item_loader.add_value("url_object_id", get_md5(response.url))
item_loader.add_css("salary", ".job_request .salary::text")
item_loader.add_xpath("job_city", "//*[@class='job_request']/p/span[2]/text()")
item_loader.add_xpath("work_years", "//*[@class='job_request']/p/span[3]/text()")
item_loader.add_xpath("degree_need", "//*[@class='job_request']/p/span[4]/text()")
item_loader.add_xpath("job_type", "//*[@class='job_request']/p/span[5]/text()")
item_loader.add_css("tags", '.position-label li::text')
item_loader.add_css("publish_time", ".publish_time::text")
item_loader.add_css("job_advantage", ".job-advantage p::text")
item_loader.add_css("job_desc", ".job_bt div")
item_loader.add_css("job_address", ".work_addr")
item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
item_loader.add_css("company_url", "#job_company dt a::attr(href)")
item_loader.add_value("crawl_time", datetime.datetime.now())
job_item = item_loader.load_item()
return job_item
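The get_md5 helper imported from LaGouSpider.utils.common is not shown in the source. A minimal sketch consistent with how it is used here (hashing the URL into a fixed-length id), assuming the author's implementation is a plain MD5 digest:

import hashlib

def get_md5(url):
    # turn a (possibly unicode) URL into a fixed-length hex digest usable as a unique key
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()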
items.py
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from LaGouSpider.settings import SQL_DATETIME_FORMAT
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader whose default output processor takes the first extracted value
default_output_processor = TakeFirst()
def remove_splash(value):
    # remove slashes from the extracted value
return value.replace("/","")
def handle_jobaddr(value):
addr_list = value.split("\n")
addr_list = [item.strip() for item in addr_list if item.strip()!="查看地图"]
return "".join(addr_list)
class LagouspiderItem(scrapy.Item):
title = scrapy.Field()
url = scrapy.Field()
url_object_id = scrapy.Field()
salary = scrapy.Field()
job_city = scrapy.Field(
input_processor=MapCompose(remove_splash),
)
work_years = scrapy.Field(
input_processor=MapCompose(remove_splash),
)
degree_need = scrapy.Field(
input_processor=MapCompose(remove_splash),
)
job_type = scrapy.Field()
publish_time = scrapy.Field()
job_advantage = scrapy.Field()
job_desc = scrapy.Field()
job_address = scrapy.Field(
input_processor=MapCompose(remove_tags, handle_jobaddr),
)
company_name = scrapy.Field()
company_url = scrapy.Field()
tags = scrapy.Field(
input_processor=Join(",")
)
crawl_time = scrapy.Field()
def get_insert_sql(self):
insert_sql = """
insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
job_type, publish_time, job_advantage, job_desc, job_address, company_name, company_url,
tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
"""
params = (
self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
self["work_years"], self["degree_need"], self["job_type"],
self["publish_time"], self["job_advantage"], self["job_desc"],
self["job_address"], self["company_name"], self["company_url"],
self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
)
return insert_sql, params
pipelines.py
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors
class LagouspiderPipeline(object):
def process_item(self, item, spider):
return item
class MysqlTwistedPipeline(object):
def __init__(self, dbpool):
self.dbpool = dbpool
@classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)
    def process_item(self, item, spider):
        # use Twisted's adbapi to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle errors from the async insert
        return item
    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)
    def do_insert(self, cursor, item):
        # run the actual insert:
        # build the item-specific SQL statement and write it to MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
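For the pipeline to run, it has to be registered in settings.py, and the MYSQL_* values and SQL_DATETIME_FORMAT read by the code above must be defined there. A minimal sketch, assuming the project package is LaGouSpider and the pipeline lives in pipelines.py; the credentials and date format are placeholders:

# settings.py
ITEM_PIPELINES = {
    'LaGouSpider.pipelines.MysqlTwistedPipeline': 300,
}

MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "lagou"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"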