Continuing from the previous notes. In the last installment we implemented crawling the data and exporting it to a file. This time we get into the real work: simulating browser requests to cope with Lagou's anti-crawling strategy, crawling the second-level pages, and extracting concrete data such as job title and salary.
What we crawled last time were the category entries. When we browse the site ourselves we likewise click a category to reach a second-level page with the job listings, and the links we crawled last time are exactly the links we would click, so we already have them:

Now we click Java to enter its second-level page. Suppose we want to extract the following information:
Disguising the crawler with a cookie to cope with the anti-crawling strategy
First, we reach the second-level page with the following code:

yield scrapy.Request(url=jobUrl, callback=self.parse_url)
callback is the callback function; we will implement this method below. But one thing needs to be taken care of first: getting past Lagou's anti-crawler mechanism. We do that by setting a cookie, so let's first see how to obtain one.
On the Java second-level page, press F12 to open the developer tools, then press F5 to refresh the page. As shown in the figure, four requests are sent, and judging by its name the first one should be the request we are after.

Copy the cookie out of the request headers into a text editor first; we need to process it into key-value pairs before putting it into Scrapy. The complete code so far:
# -*- coding: utf-8 -*-
import scrapy
from First.items import FirstItem

class SecondSpider(scrapy.Spider):
    name = 'second'
    allowed_domains = []
    start_urls = ['https://www.lagou.com/']

    cookie = {
        "JSESSIONID": "ABAAABAAAGGABCB090F51A04758BF627C5C4146A091E618",
        "_ga": "GA1.2.1916147411.1516780498",
        "_gid": "GA1.2.405028378.1516780498",
        "Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6": "1516780498",
        "user_trace_token": "20180124155458-df9f65bb-00db-11e8-88b4-525400f775ce",
        "LGUID": "20180124155458-df9f6ba5-00db-11e8-88b4-525400f775ce",
        "X_HTTP_TOKEN": "98a7e947b9cfd07b7373a2d849b3789c",
        "index_location_city": "%E5%85%A8%E5%9B%BD",
        "TG-TRACK-CODE": "index_navigation",
        "LGSID": "20180124175810-15b62bef-00ed-11e8-8e1a-525400f775ce",
        "PRE_UTM": "",
        "PRE_HOST": "",
        "PRE_SITE": "https%3A%2F%2Fwww.lagou.com%2F",
        "PRE_LAND": "https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2FJava%2F%3FlabelWords%3Dlabel",
        "_gat": "1",
        "SEARCH_ID": "27bbda4b75b04ff6bbb01d84b48d76c8",
        "Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6": "1516788742",
        "LGRID": "20180124181222-1160a244-00ef-11e8-a947-5254005c3644"
    }

    def parse(self, response):
        for item in response.xpath('//div[@class="menu_box"]/div/dl/dd/a'):
            jobClass = item.xpath
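Turning the raw Cookie header into the dict above by hand is tedious. The header copied from the developer tools is a single "k1=v1; k2=v2; …" string, and a small helper can split it for us. This is my own sketch, not part of the tutorial's code, and the sample string below uses only a few shortened values for illustration:

```python
def cookie_str_to_dict(raw):
    """Split a raw Cookie header string into the dict Scrapy expects."""
    cookie = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # split only on the first '=' because values may themselves contain '='
        key, _, value = pair.partition('=')
        cookie[key] = value
    return cookie

# shortened sample; paste your full Cookie header here instead
raw = "JSESSIONID=ABAA123; _ga=GA1.2.1916147411.1516780498; index_location_city=%E5%85%A8%E5%9B%BD"
print(cookie_str_to_dict(raw))
```

The resulting dict can be assigned directly to the spider's cookie attribute.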