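Goal: crawl the details of the first x questions on the Zhihu home page, plus summaries of the answers within a specified range for each question.
Powered by:
- Python 3.6
- Scrapy 1.4
- json
- pymysql
Project repository: https://github.com/Dengqlbq/ZhiHuSpider.git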
Step 1: Background
This post focuses on the code implementation; the reasoning behind the code is covered in a separate post.
Design notes: http://blog.youkuaiyun.com/sinat_34200786/article/details/78568894
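Step 2: Simulated login
Zhihu does not return any content to a client that is not logged in, so the first thing to do is simulate a login.
Main steps:
- Get the xsrf token and the captcha image
- Fill in the captcha and submit the login form
- Check whether the login succeeded
Getting the xsrf token and the captcha image: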
# Spider methods; the module needs `import re` and `import scrapy`.
def start_requests(self):
    yield scrapy.Request('https://www.zhihu.com/', callback=self.login_zhihu)

def login_zhihu(self, response):
    """ Get the xsrf token and the captcha image """
    # The xsrf token must be sent both as a header and as a form field
    xsrf = re.findall(r'name="_xsrf" value="(.*?)"/>', response.text)[0]
    self.headers['X-Xsrftoken'] = xsrf
    self.post_data['_xsrf'] = xsrf

    # The page embeds a timestamp that is reused as the captcha's `r` parameter
    times = re.findall(r'<script type="text/json" class="json-inline" data-n'
                       r'ame="ga_vars">{"user_created":0,"now":(\d+),', response.text)[0]
    captcha_url = 'https://www.zhihu.com/' + 'captcha.gif?r=' + times + '&type=login&lang=cn'

    yield scrapy.Request(captcha_url, headers=self.headers,
                         meta={'post_data': self.post_data},
                         callback=self.veri_captcha)
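The extraction step can be tried on its own, without Scrapy or a live request. The following is a minimal sketch that runs the same two regular expressions used in `login_zhihu` against a made-up HTML fragment. `SAMPLE_HTML`, `extract_login_tokens`, and `build_captcha_url` are hypothetical names introduced for illustration and are not part of the project. The real Zhihu page may differ from the sample. Unlike indexing the `re.findall` result with `[0]`, this version raises a descriptive error when a pattern no longer matches, which makes layout changes easier to diagnose.

```python
import re

# Made-up HTML fragment for illustration only; the real page may look different.
SAMPLE_HTML = (
    '<input type="hidden" name="_xsrf" value="abc123"/>'
    '<script type="text/json" class="json-inline" data-name="ga_vars">'
    '{"user_created":0,"now":1510000000000,"other":"x"}</script>'
)

# Same patterns as in login_zhihu, with the literal brace escaped for clarity.
XSRF_RE = re.compile(r'name="_xsrf" value="(.*?)"/>')
NOW_RE = re.compile(
    r'<script type="text/json" class="json-inline" data-name="ga_vars">'
    r'\{"user_created":0,"now":(\d+),'
)


def extract_login_tokens(html):
    """Return (xsrf, timestamp) as strings, or raise ValueError if either is missing."""
    xsrf_match = XSRF_RE.search(html)
    now_match = NOW_RE.search(html)
    if xsrf_match is None or now_match is None:
        raise ValueError('xsrf token or timestamp not found in page')
    return xsrf_match.group(1), now_match.group(1)


def build_captcha_url(timestamp):
    """Build the captcha URL from the timestamp, as login_zhihu does."""
    return 'https://www.zhihu.com/captcha.gif?r=' + timestamp + '&type=login&lang=cn'


xsrf, timestamp = extract_login_tokens(SAMPLE_HTML)
captcha_url = build_captcha_url(timestamp)
```

In a spider, the same helper could be called with `response.text` and the `ValueError` caught to log a warning instead of crashing on an `IndexError`.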
Filling in the captcha and submitting the login form:
def veri_captcha(self, response):
    """ Enter the captcha information and log in """
    # Save the captcha image to disk so it can be inspected by hand
    with open('captcha.jpg', 'wb') as f:
        f.write(response.body)

    print('If only one character is upside down, enter 0 for the second position')
    loca1 = input('input the loca 1:')
    loca2 = input('input the loca 2:')
    captcha = self.location(int(loca1), int(loca2))

    # Reuse the form data passed along from login_zhihu via meta
    self.post_data = response.meta.get('post_data', {})
    self.post_data['captcha'] = captcha
    post_url = 'https://www.zhihu.com/login/email'

    yield scrapy.FormRequest(post_url, formdata=self.post_data, headers=self.headers,
                             callback=self.login_success)
def location</