Learning Python Scrapy: Detailed Notes and Walkthrough for Scraping 2K Aesthetic Wallpapers

Latest recommended article published on 2022-03-25 23:50:16

Sound_of_ Silence

License: CC 4.0 BY-SA

Categories: Python, Web Scraping

Article link: https://blog.youkuaiyun.com/weixin_44521703/article/details/98240853

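This article shares the detailed steps and lessons learned from scraping images with Scrapy, from creating the spider to downloading the images, including key steps such as rule setup, image field configuration, and customizing the download pipeline.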

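Learning Scrapy: scraping images

I went around in circles when scraping images with Scrapy before realizing how big a detour I had taken. Luckily I found my way out (though it may be a bit early to brag~), so here are my notes while they are fresh:

Choosing the options when creating a spider:
- The first is the ordinary way: `scrapy genspider xxx xx.com`. This creates a basic spider. For a beginner it is well worth using as practice for understanding page structure, but it is fairly bare-bones;
- The second is the rule-based spider (my own name for it, experts please don't laugh): `scrapy genspider -t crawl xx xxx.com`. A spider created this way comes with a link extractor built in, which saves rewriting a good deal of code; Scrapy does this work for us because it is all very routine. The generated template automatically imports the link extractor and rule classes. You will also notice a new `rules` attribute, which is where we write the rules we need. Note that the rules only cover link extraction on start_urls and the pages at the same level; for lower-level pages that you request from your own callback, you have to write the extraction yourself. For example: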
```
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WeimeiSpider(CrawlSpider):
    name = 'weimei'
    allowed_domains = ['www.netbian.com']
    start_urls = ['http://www.netbian.com/weimei/']

    rules = (
        # Pagination on the listing pages: no callback needed, but follow=True.
        # `allow` is a regex matched against each link's URL; write only the
        # URL pattern itself, not a (.*?) capture with surrounding context.
        Rule(LinkExtractor(allow=r'/weimei/index_\d+\.htm'), follow=True),
        # Listing page -> detail page: needs a callback, no need to follow.
        Rule(LinkExtractor(allow=r'/desk/\d+\.htm'), callback='parse_item'),
    )
```
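To see why the `allow` patterns above should contain only the URL pattern, the following sketch applies the same two patterns with Python's `re.search`, which is, as far as I know, how LinkExtractor tests each candidate URL. The sample URLs and the helper name `classify` are assumptions made up for illustration, not taken from the site.

```python
import re

# The two patterns from the rules above.
PAGINATION = re.compile(r'/weimei/index_\d+\.htm')
DETAIL = re.compile(r'/desk/\d+\.htm')


def classify(url):
    """Return which rule (if any) would pick up this URL."""
    if PAGINATION.search(url):
        return 'pagination'
    if DETAIL.search(url):
        return 'detail'
    return None


# Hypothetical sample URLs for illustration only.
samples = [
    'http://www.netbian.com/weimei/index_2.htm',
    'http://www.netbian.com/desk/12345.htm',
    'http://www.netbian.com/desk/12345-1920x1080.htm',
    'http://www.netbian.com/weimei/',
]
for url in samples:
    print(url, '->', classify(url))
```

A URL that matches neither pattern is simply not extracted by the rules, which is why pages one level below the detail page have to be requested from a callback.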
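Writing the spider body:
- Scraping images mainly involves two fields, `image_urls` and `images`. By default, ImagesPipeline reads the URLs to download from `image_urls` and writes the download results into `images`. In this project I use `images` to hold the picture title (used as the file name) and `image_urls` to hold the picture URL. If you also want to store files in separate folders, you can define a field of your own such as `image_paths` (Scrapy ignores this field).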
```
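import scrapy


class WallpaperItem(scrapy.Item):
    image_urls = scrapy.Field()   # URL of the picture to download
    images = scrapy.Field()       # picture title, used as the file name
    image_paths = scrapy.Field()  # optional custom field; Scrapy ignores it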
```
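- When scraping images, note that unless you rewrite the whole image download process yourself, the setting names and method names must match the ones Scrapy provides exactly, down to the last character (this is where I got burned the worst, the price of teaching myself without a guide). For example, in settings:
  IMAGES_STORE = 'D:\\imageDownload'
  IMAGES_URLS_FIELD = 'image_urls'
  
  Neither of these two setting names can be off by a single character; to see why, read the source.
- Next comes the main source code. I am scraping aesthetic wallpapers here, and the points to note are:
  - `allow` in a rule takes the URL pattern directly, not a `(.*?)` capture as I first assumed; do not include the text to the left and right of the link, or the pattern will not match;
  - paging through the listing pages and finding detail pages from the listing pages are written in `rules`, but the detail page itself needs a parser that you build yourself, which `rules` does not handle. I tripped on this one too, sob~~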
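Writing the pipelines:
- Writing the pipeline really just means using Scrapy's image download pipeline.
- To be able to download images you need Pillow (`pip install Pillow`), though in my case it seemed to be already installed by the time Scrapy was set up.
- To enable the image pipeline, just turn on `scrapy.pipelines.images.ImagesPipeline` in `ITEM_PIPELINES` in settings. This uses the default download behavior entirely: the pictures download, but the file names are hash values, which look like a mess.
- To fix that, the usual approach is to write a class that inherits from ImagesPipeline, so that you keep most of its functionality and only change the specific part. Here, what we actually need to override is the storage method `file_path(self, request, response=None, info=None)`.
- It is worth mentioning that the last step of file_path, `filename = '{0}.jpg'.format(file_name)`, is where the directory layout is decided. To store into separate folders, just change it to `filename = '{0}/{1}.jpg'.format(directory_name, file_name)`. Of course, in that case the directory name has to be passed in first, and the corresponding field also has to be declared in items.py.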
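As a rough illustration of the naming difference described above: the stock pipeline derives the file name from a SHA-1 hash of the image URL, while a custom file_path can build it from a title and an optional folder. The sketch below reproduces both with the standard library only; `default_path`, `custom_path`, and the sample URL are hypothetical names for illustration and are not part of Scrapy.

```python
import hashlib


def default_path(url):
    """Mimic the hash-based name used by the stock image pipeline."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/{0}.jpg'.format(guid)


def custom_path(file_name, directory_name=None):
    """Build a readable path, optionally inside a sub-folder."""
    if directory_name:
        return '{0}/{1}.jpg'.format(directory_name, file_name)
    return '{0}.jpg'.format(file_name)


url = 'http://example.com/pic/sunset.jpg'  # made-up URL
print(default_path(url))
print(custom_path('sunset'))
print(custom_path('sunset', 'weimei'))
```

In both cases the returned path is relative to the folder configured in IMAGES_STORE.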

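In summary:

The contents of pipelines.py are as follows:

```
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class WallpaperPipeline(object):  # default template pipeline, no longer used
    def process_item(self, item, spider):
        return item


class PicsDownloadPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # ImagesPipeline downloads whatever requests this method yields.
        # Here image_urls holds a single URL string; the item is passed along
        # in meta so that file_path() can read the title.
        if item.get('image_urls'):
            yield scrapy.Request(item['image_urls'], meta={'item': item})

    # After the downloads finish, the results are passed to item_completed()
    # as a list of 2-tuples: (success, image_info_or_failure).
    # Honestly, if you don't know how to change it, don't override it; only
    # override it when you really need to and fully understand it. So I simply
    # commented out what I had written.
    # (If you restore it, also add: from scrapy.exceptions import DropItem)
    # def item_completed(self, results, item, info):  # check download success
    #     image_paths = [x['path'] for ok, x in results if ok]
    #     if not image_paths:
    #         raise DropItem("Item contains no images")
    #     item['image_paths'] = image_paths
    #     return item

    # The keyword-only `item` argument keeps this compatible with newer Scrapy.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Where the picture is stored: defines the file name and path.
        name = request.meta['item']['images']
        filename = 'full/{0}.jpg'.format(name)
        return filename
```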
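For readers who do want to override item_completed, it may help to see the shape of `results` first. The sketch below uses a hand-written list in the same (success, info) form and extracts the paths of the successful downloads. The sample values and the helper `successful_paths` are invented for illustration, and in a real pipeline the second element of a failed entry is a failure object rather than a string.

```python
def successful_paths(results):
    """Collect the stored paths of downloads that succeeded."""
    return [info['path'] for ok, info in results if ok]


# Invented sample in the (success, info_or_failure) shape.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg', 'checksum': 'abc'}),
    (False, 'download failed'),
    (True, {'url': 'http://example.com/b.jpg', 'path': 'full/b.jpg', 'checksum': 'def'}),
]

paths = successful_paths(results)
print(paths)
```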

The contents of weimei.py (the main spider file) are as follows:

```
# -*- coding: utf-8 -*-
import logging
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import WallpaperItem

logger = logging.getLogger(__name__)


class WeimeiSpider(CrawlSpider):
    name = 'weimei'
    allowed_domains = ['www.netbian.com']
    start_urls = ['http://www.netbian.com/weimei/']

    rules = (
        # Pagination on the listing pages: no callback needed, follow=True.
        # `allow` is a regex matched against the link URL itself.
        Rule(LinkExtractor(allow=r'/weimei/index_\d+\.htm'), follow=True),
        # Listing page -> detail page: needs a callback, no need to follow.
        Rule(LinkExtractor(allow=r'/desk/\d+\.htm'), callback='parse_item'),
    )

    def parse_item(self, response):  # detail page: find the lower-level page URL
        sub_url = response.xpath('//*[@id="main"]/div[2]/div/p/a/@href').extract_first()
        if sub_url is not None:
            yield scrapy.Request(response.urljoin(sub_url), callback=self.parse_detail)

    def parse_detail(self, response):  # lower-level page: get picture URL and title
        pattern = r'<title>(.*?)高清大图预览\d{3,4}x\d{3,4}_唯美壁纸下载_彼岸桌面</title>'
        titles = re.findall(pattern, response.body.decode('gbk', errors='replace'))
        image_url = response.xpath(
            '//*[@id="main"]/table/tr/td/a/img/@src').extract_first(default=None)
        if not titles or image_url is None:
            # Log and skip instead of silently swallowing every exception.
            logger.warning('No title or image URL found on %s', response.url)
            return
        item = WallpaperItem()
        # Strip whitespace and commas so the title works as a file name.
        item['images'] = re.sub(r'\s|,|，', '', titles[0])
        item['image_urls'] = image_url
        yield item
```
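The title handling in parse_detail can be tried out on its own. This sketch runs the same regular expression and clean-up on a made-up `<title>` string (the sample HTML and the function name `extract_title` are assumptions, not real page content), returning None when the pattern does not match instead of raising.

```python
import re

TITLE_PATTERN = re.compile(
    r'<title>(.*?)高清大图预览\d{3,4}x\d{3,4}_唯美壁纸下载_彼岸桌面</title>')


def extract_title(html):
    """Return the cleaned picture title, or None if the page has none."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    # Drop whitespace and both ASCII and full-width commas.
    return re.sub(r'\s|,|，', '', match.group(1))


# Made-up sample title for illustration only.
sample = '<title>唯美 森林,小路高清大图预览1920x1080_唯美壁纸下载_彼岸桌面</title>'
print(extract_title(sample))
print(extract_title('<title>404</title>'))
```

Checking for a missing match explicitly makes it easier to notice when the site changes its title format, compared with catching every exception.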

The key parts of settings.py are as follows:

It is best to set a delay and not crawl too hard; being considerate to the site is also good for you.

```
BOT_NAME = 'wallpaper'

SPIDER_MODULES = ['wallpaper.spiders']
NEWSPIDER_MODULE = 'wallpaper.spiders'
IMAGES_STORE = 'D:\\imageDownload'
IMAGES_URLS_FIELD = 'image_urls'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
ITEM_PIPELINES = {'wallpaper.pipelines.PicsDownloadPipeline': 1, }
DOWNLOAD_DELAY = 1
```
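Beyond a fixed DOWNLOAD_DELAY, Scrapy has a few other built-in settings aimed at polite crawling. The fragment below is one possible combination to add to settings.py; the specific values are my own assumptions and should be tuned for the target site.

```python
# Possible additions to settings.py; the values are illustrative assumptions.
DOWNLOAD_DELAY = 1                    # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the wait around DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # limit parallel requests to one site
AUTOTHROTTLE_ENABLED = True           # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10           # upper bound on the delay, in seconds
```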