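The initial Request objects are produced by start_requests(), which calls make_requests_from_url() for each URL in start_urls. To change the Request objects used to begin crawling a site, override start_requests(). The override must return an iterable, typically a generator, and Scrapy calls it only once: when the spider is opened for crawling and no specific URLs have been given. For example, if you need to log in to a site with a POST request at startup, you could write: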
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # FormRequest sends the form data as a POST request
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
make_requests_from_url() takes a URL and returns a Request object. By default, that Request uses parse() as its callback and has dont_filter set to True, so it bypasses the duplicate filter.