Python Web Scraping (from requests to Scrapy)

Last modified: 2022-03-18 19:06:46

First published: 2021-10-14 22:33:07

Article link: https://blog.youkuaiyun.com/peachhhh/article/details/120774251

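This article walks through Python web scraping from the basics to advanced topics. It starts with the requests module, including UA spoofing, then covers focused crawlers, CAPTCHA recognition, proxies, and asynchronous crawlers. It goes on to the selenium module: iframe handling, action chains, and headless browsers. It ends with the Scrapy framework: its core components, persistent data storage, site-wide crawling, and middleware, with code examples throughout.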

Web crawlers

Author: Ychhh_

Preliminaries

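Types of crawlers

General-purpose crawler:
- A core component of a crawling system; it fetches whole pages
Focused crawler:
- Built on top of the general-purpose crawler
- Extracts only a specific part of each fetched page
Incremental crawler:
- Detects updates to a site's data, so that only new or changed data is fetched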
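
The update check behind an incremental crawler can be sketched with a content fingerprint. This is only an illustration under stated assumptions: the class `IncrementalFilter`, the helper `fingerprint`, and the example URL are hypothetical names, not part of the original article, and a real crawler would persist the fingerprints (for example in a database) instead of keeping them in memory.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable digest of a page body (hypothetical helper)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalFilter:
    """Remembers the last fingerprint seen for each URL (in memory only)."""

    def __init__(self):
        self._seen = {}  # url -> fingerprint of the last content processed

    def is_new_or_changed(self, url: str, content: str) -> bool:
        digest = fingerprint(content)
        if self._seen.get(url) == digest:
            return False  # same content as last time: nothing to do
        self._seen[url] = digest
        return True


page_filter = IncrementalFilter()
first_visit = page_filter.is_new_or_changed("https://example.com/news", "v1")
second_visit = page_filter.is_new_or_changed("https://example.com/news", "v1")
after_update = page_filter.is_new_or_changed("https://example.com/news", "v2")
print(first_visit, second_visit, after_update)
```

Only pages for which `is_new_or_changed` returns True would be parsed and stored again.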

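Anti-scraping mechanisms

Anti-scraping mechanism: a website can adopt policies or technical measures that keep crawlers from harvesting its data.
Counter-measures: the crawler works around those measures in order to obtain the data.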

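The requests module

What requests does

It sends HTTP requests the way a browser would.

Ways to read the response:
- response.text: the body decoded as a string
- response.json(): the body parsed from JSON into Python objects
- response.content: the raw body as bytes, used for binary data such as images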
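
The three accessors can be compared on a single response. To stay runnable without internet access, this sketch serves a small JSON body from a throwaway local server; `DemoHandler` and the JSON payload are assumptions made for the demonstration, not something from the original article.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that always answers with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"title": "demo", "pages": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local call
response = session.get("http://127.0.0.1:%d/" % server.server_port, timeout=5)

text_body = response.text     # str, decoded with response.encoding
json_body = response.json()   # Python objects parsed from the JSON body
raw_body = response.content   # bytes, exactly as received

server.shutdown()
server.server_close()

print(type(text_body).__name__, type(json_body).__name__, type(raw_body).__name__)
```

For an image or any other binary download, only `response.content` is appropriate, since decoding the bytes as text would corrupt them.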

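UA spoofing (working around an anti-scraping check)

If a site detects that a request comes from the requests library rather than from a browser, it may refuse the request. Sending a browser's User-Agent header makes the request look like ordinary browser traffic.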
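
A quick way to see what a site sees is to build the request without sending it. The sketch below compares the default User-Agent that requests attaches with a spoofed one; the URL `https://www.example.com/` is a placeholder chosen for illustration, and no network traffic is generated.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}

# The value requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

session = requests.Session()
# prepare_request merges the session defaults with per-request headers
# and returns the request exactly as it would go on the wire.
plain = session.prepare_request(requests.Request("GET", "https://www.example.com/"))
spoofed = session.prepare_request(
    requests.Request("GET", "https://www.example.com/", headers=headers)
)

print("default:", plain.headers["User-Agent"])
print("spoofed:", spoofed.headers["User-Agent"])
```

The default value begins with `python-requests/`, which is what a server-side check can match on; passing `headers=headers` to `requests.get` replaces it in the same way.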

Focused crawlers

Data parsing approaches

- Regular expressions
- bs4
- xpath
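
The article gives no regular-expression example, so here is a minimal sketch of that approach using the standard `re` module. The HTML fragment and the pattern are made up for illustration; real pages need a pattern written against their actual markup.

```python
import re

# Hypothetical fragment standing in for a downloaded page.
html = '''
<div class="thumb"><img src="/pic/1.jpg" alt="one"></div>
<div class="thumb"><img src="/pic/2.jpg" alt="two"></div>
'''

# re.S lets "." match newlines; ".*?" is non-greedy, so each match stops
# at the first closing quote instead of running on to the last one.
pattern = re.compile(r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>', re.S)
img_srcs = pattern.findall(html)

print(img_srcs)
```

Regular expressions work on the raw text and know nothing about tag structure, which is why bs4 and xpath are usually easier to maintain for nested markup.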

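bs4

General principle of data parsing:
1. Locate the tag
2. Extract the data held in the tag's text or attributes

How bs4 parses data:

 1. Instantiate a BeautifulSoup object and load the page source into it
 2. Call the object's attributes and methods to locate tags and extract data

Locating tags:
- soup.tagName: returns the first tag with that name
- soup.find():
  1. find(tagName): equivalent to soup.tagName
  2. find(tagName, class_ / id / attr=...): locate by attribute
- soup.find_all(): returns every matching tag as a list; it accepts the same attribute filters
- soup.select():
  1. Tag selectors (CSS selector syntax, so .class and #id also work)
  2. Hierarchy selectors:
  - parent > child (exactly one level)
  - a space between selectors spans any number of levels
- Attention: find() and select() do not return the same kind of object: find() returns a single tag (or None), while select() returns a list of tags
Getting the content of a tag:
- tag.text: all text inside the tag, including text in nested tags
- tag.string: the text only when the tag contains a single string; None when it has several children
- tag.get_text(): same result as tag.text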
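
The locating and extraction calls listed above can be tried on a small inline document. The HTML below is invented for the demonstration, and the sketch uses the standard-library `html.parser` backend so that it runs without lxml (the article's own sample uses `'lxml'`).

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment for illustration only.
html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/1.html">Chapter 1</a></li>
    <li><a href="/book/2.html">Chapter 2</a></li>
  </ul>
</div>
<div id="intro"><p>Hello <b>world</b></p></div>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                      # first <a> in the document
same_link = soup.find("a")               # equivalent to soup.a
intro = soup.find("div", id="intro")     # locate by attribute
all_links = soup.find_all("a")           # list of every <a>

child_links = soup.select(".book-mulu > ul > li > a")  # one level per ">"
desc_links = soup.select(".book-mulu a")               # space: any depth

titles = [a.get_text() for a in desc_links]
hrefs = [a["href"] for a in all_links]   # attribute access by key

intro_text = intro.get_text()   # text of the tag and everything nested in it
intro_string = intro.p.string   # None: <p> holds a string and a <b> tag

print(titles, hrefs, repr(intro_text), intro_string)
```

Note that `find` hands back one tag whose attributes can be read directly, while the result of `select` or `find_all` has to be indexed or iterated first.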

Code example (scraping Romance of the Three Kingdoms)

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":

    url = "https://www.shicimingju.com/book/sanguoyanyi.html"

    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }

    response = requests.get(url=url,headers=headers)
    # response.encoding comes from the charset in the response headers; for a text response
    # that declares none it falls back to ISO-8859-1. response.apparent_encoding is guessed
    # from the page content, so assigning it to response.encoding usually avoids garbled text.
    response.encoding = response.apparent_encoding

    html = response.text

    soup = BeautifulSoup(html,'lxml')
    # Every link in the table of contents is one chapter title
    muluList = soup.select(".book-mulu a")
    muluRecord = []
    for mulu in muluList:
        muluRecord.append(mulu.text)
    # Chapter pages are numbered 1.html, 2.html, ...
    dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
    for i,title in enumerate(muluRecord):
        dataUrl = dataTotalUrl%(i + 1)
        response = requests.get(url=dataUrl,headers=headers)
        response.encoding = response.apparent_encoding
        dataHtml = response.text

        dataSoup = BeautifulSoup(dataHtml,'lxml')

        data = dataSoup.find("div",class_="chapter_content").text
        # Paragraphs are indented with two full-width spaces; turn each indent into a line break
        data = data.replace("　　","\n")
        # Output directory on the author's machine; it must exist before the script runs
        path = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo\\" + title + ".txt"
        with open(path,'w',encoding="utf-8") as fp:
            fp.write(data)
        print("Chapter %d downloaded"%(i + 1))

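xpath

How xpath parsing works:
1. Instantiate an etree object and load the page source into it
2. Call the object's xpath method with an xpath expression to locate tags and extract data
Loading the source:
1. Load HTML source from a local file into etree
  - etree.parse(filepath) (pass etree.HTMLParser() as the second argument when the file is not well-formed XML)
2. Load source fetched from the internet into etree
  - etree.HTML(text)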
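
Both loading paths can be checked side by side. In this sketch the HTML string and the temporary file `demo.html` are stand-ins invented for the demonstration; the explicit `etree.HTMLParser()` is passed because `etree.parse` otherwise expects well-formed XML.

```python
import os
import tempfile

from lxml import etree

html_text = "<html><head><title>Demo page</title></head><body><p>hello</p></body></html>"

# Case 1: the source is already in memory (for example response.text).
tree_from_text = etree.HTML(html_text)
title_from_text = tree_from_text.xpath("//title/text()")

# Case 2: the source lives in a local file.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "demo.html")
    with open(file_path, "w", encoding="utf-8") as fp:
        fp.write(html_text)
    tree_from_file = etree.parse(file_path, etree.HTMLParser())
    title_from_file = tree_from_file.xpath("//title/text()")

print(title_from_text, title_from_file)
```

Either way the result supports the same `xpath` method, so the rest of the parsing code does not depend on where the source came from.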
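Using xpath:
- Absolute path: /xx/xx/x
- Abbreviated path: //xx (matches at any depth)
- Attribute location: //tagName[@attr=""]
- Child location: //tagName[@attr=""]/xx
- Index location: //tagName[@attr]//p[pos] (positions start at 1)
- Getting text: //tagName/text()
- Getting an attribute: //tagName/@attr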
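
Each expression form above can be exercised on a small document. The markup here is invented for illustration (the `clearfix` list mimics the structure the image example relies on), so the paths are assumptions about a hypothetical page rather than any real site.

```python
from lxml import etree

# Hypothetical page used only to exercise the expressions.
html = """
<html><body>
  <ul class="clearfix">
    <li><a href="/a.html"><img src="/img/a.jpg" alt="A"></a></li>
    <li><a href="/b.html"><img src="/img/b.jpg" alt="B"></a></li>
  </ul>
  <div class="song"><p>first</p><p>second</p></div>
</body></html>
"""

tree = etree.HTML(html)

absolute = tree.xpath("/html/body/div/p/text()")           # absolute path from the root
anywhere = tree.xpath("//p/text()")                        # abbreviated path, any depth
by_attr = tree.xpath('//div[@class="song"]')               # attribute location
second_p = tree.xpath('//div[@class="song"]/p[2]/text()')  # index location, 1-based
srcs = tree.xpath('//ul[@class="clearfix"]//img/@src')     # attribute values

print(absolute, second_p, srcs)
```

`xpath` always returns a list, even for a single match, so a lone result is read with `[0]`.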

Code example (scraping 4K images)

from lxml import etree
import requests

if __name__ == "__main__":
    # Pages 2 and later follow this pattern; the first page has no index suffix
    url = "https://pic.netbian.com/4kdongman/index_%d.html"

    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    pageNum = 2

    for page in range(pageNum):
        if page == 0:
            new_url = "https://pic.netbian.com/4kdongman/"
        else:
            new_url = url % (page + 1)

        response = requests.get(url=new_url, headers=headers)

        html_code = response.text

        tree = etree.HTML(html_code)

        # src attribute of every thumbnail in the image list
        urlList = tree.xpath("//ul[@class=\"clearfix\"]//img/@src")

        urlHead = "https://pic.netbian.com"
        # Output directory on the author's machine; it must exist before the script runs
        path = r"C:\Users\Y_ch\Desktop\spider_test\data\pic\4K\\"
        # The src values are site-relative, so prepend the host
        picUrlList = [urlHead + eachUrl for eachUrl in urlList]

        for index, picUrl in enumerate(picUrlList):
            picReq = requests.get(url=picUrl, headers=headers)
            pic = picReq.content  # raw bytes of the image

            picPath = path + str(page)+ "." +str(index) + ".jpg"
            with open(picPath, 'wb') as fp:
                fp.write(pic)
            print("Page %d, image %d downloaded" % ((page + 1),index + 1))