Scrapy Study Notes (3): Crawling Zhihu Homepage Questions and Answers


License: CC 4.0 BY-SA

Tags:

#python #scrapy-crawler

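Goal: crawl the details of the first x questions on the Zhihu homepage, plus summaries of a specified range of answers to each question.

Powered by: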

Python 3.6
Scrapy 1.4
json
pymysql

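Project repository: https://github.com/Dengqlbq/ZhiHuSpider.git

Step 1: Introduction

This post focuses on the code itself; the reasoning behind the design is described in a separate post.
Design notes: http://blog.youkuaiyun.com/sinat_34200786/article/details/78568894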

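Step 2: Simulated login

Zhihu returns no content to clients that are not logged in, so the first task is to simulate a login.
Main steps:

Fetch the xsrf token and the captcha image
Fill in the captcha and submit the login form
Check whether the login succeeded

Fetching the xsrf token and the captcha image: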

import re

import scrapy

# The methods below belong to the spider class.
def start_requests(self):
    yield scrapy.Request('https://www.zhihu.com/', callback=self.login_zhihu)

def login_zhihu(self, response):
    """Fetch the xsrf token and the captcha image."""
    xsrf = re.findall(r'name="_xsrf" value="(.*?)"/>', response.text)[0]
    self.headers['X-Xsrftoken'] = xsrf
    self.post_data['_xsrf'] = xsrf

    # The captcha URL is built from the "now" value in the page's ga_vars JSON.
    times = re.findall(r'<script type="text/json" class="json-inline" data-n'
                       r'ame="ga_vars">{"user_created":0,"now":(\d+),', response.text)[0]
    captcha_url = 'https://www.zhihu.com/' + 'captcha.gif?r=' + times + '&type=login&lang=cn'

    # Pass the form data along so the next callback can add the captcha to it.
    yield scrapy.Request(captcha_url, headers=self.headers,
                         meta={'post_data': self.post_data},
                         callback=self.veri_captcha)
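
The two regular expressions can be tried outside Scrapy before running the spider. The following is only a minimal sketch: SAMPLE_HTML is a made-up fragment shaped like the markup the patterns expect (the real page may differ), and extract_login_params is a hypothetical helper, not part of the project. Unlike indexing findall(...)[0], it returns None when a pattern is missing instead of raising IndexError.

```python
import re

# Hypothetical sample markup; the real Zhihu page may look different.
SAMPLE_HTML = (
    '<input type="hidden" name="_xsrf" value="abc123"/>'
    '<script type="text/json" class="json-inline" data-name="ga_vars">'
    '{"user_created":0,"now":1510000000000,"abtest":{}}</script>'
)

# Same patterns as in login_zhihu, with the literal brace escaped.
XSRF_RE = re.compile(r'name="_xsrf" value="(.*?)"/>')
NOW_RE = re.compile(r'data-name="ga_vars">\{"user_created":0,"now":(\d+),')


def extract_login_params(html):
    """Return (xsrf, captcha_url), or None if either pattern is missing."""
    xsrf_match = XSRF_RE.search(html)
    now_match = NOW_RE.search(html)
    if xsrf_match is None or now_match is None:
        return None
    captcha_url = ('https://www.zhihu.com/captcha.gif?r=' + now_match.group(1)
                   + '&type=login&lang=cn')
    return xsrf_match.group(1), captcha_url


if __name__ == '__main__':
    print(extract_login_params(SAMPLE_HTML))
```

Returning None lets the caller log a clear message when the page layout changes, rather than failing inside the callback.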


Filling in the captcha and submitting the login form:

def veri_captcha(self, response):
    """Enter the captcha information and log in."""
    # Save the captcha image to disk so it can be inspected by hand.
    with open('captcha.jpg', 'wb') as f:
        f.write(response.body)

    print('If only one character is upside down, enter 0 for the second position')
    loca1 = input('input the loca 1:')
    loca2 = input('input the loca 2:')
    captcha = self.location(int(loca1), int(loca2))

    # Recover the form data passed through meta and add the captcha answer.
    self.post_data = response.meta.get('post_data', {})
    self.post_data['captcha'] = captcha
    post_url = 'https://www.zhihu.com/login/email'

    yield scrapy.FormRequest(post_url, formdata=self.post_data, headers=self.headers,
                             callback=self.login_success)

def location</