Basic usage of Scrapy: crawling images from ivsky.com (天堂图片网)

This article walks through the basic steps of crawling images from ivsky.com (天堂图片网) with the Scrapy framework: creating the project, configuring settings, locating Scrapy's built-in downloadermiddlewares/useragent.py under site-packages, and in particular applying a custom UserAgentMiddleware.


Basic usage of Scrapy
1. Create a project from the command line
scrapy startproject <project_name>
2. Open the project in PyCharm
3. Create a spider from the command line
scrapy genspider <spider_name> <domain>
4. Configure settings
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
COOKIES_ENABLED = False
5. Write a custom UserAgentMiddleware
You can paste in an existing implementation
or write your own after studying Scrapy's source code
6. Define the data model in the items file
7. Parse the data
1) Roughly plan how many functions you will need
2) Move from one function to the next with yield scrapy.Request(url, callback, meta, dont_filter)
3) Pack the data into items and remember to yield the item
8. Write custom pipelines to store the data in a database or file (a minimal sketch follows after this list)
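Step 8 is not demonstrated in the worked example below, so here is a minimal sketch of a pipeline that appends every item to a JSON-lines file. The class name JsonWriterPipeline and the filename items.jl are illustrative, and the ITEM_PIPELINES entry assumes the project is named IvskySpider, as in the rest of this article.

# pipelines.py
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields: write one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle
        self.file.close()

Enable it in settings.py (lower numbers run earlier, just like the middleware priorities):

ITEM_PIPELINES = {
    'IvskySpider.pipelines.JsonWriterPipeline': 300,
}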

Example: crawling images from ivsky.com
1. Create the project
  scrapy startproject IvskySpider
2. Open the project in PyCharm
3. Create the spider from the command line (run it in the PyCharm terminal)
  scrapy genspider ivsky ivsky.com
4. Configure settings
  The changes are explained in the code below; anything without a comment can stay at its default. There are four places that need to be modified.
# Obey robots.txt rules
# robots.txt is a convention that tells crawlers which parts of a site they must not crawl
# e.g. http://www.baidu.com/robots.txt
# Obeyed by default; you usually need to change it to False
ROBOTSTXT_OBEY = False
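# For reference, a robots.txt file is just a list of rules of the form below
# (illustrative entries only; check the target site's actual robots.txt):
#   User-agent: *
#   Disallow: /some-path/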
# Slow down the crawl so the site is less likely to block us
DOWNLOAD_DELAY = 0.5
# Disable cookie tracking
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    # The number is the priority: the smaller it is, the earlier the middleware runs; None disables it
    # Our own middleware
    'IvskySpider.middlewares.UserAgentMiddleware': 543,
    # Scrapy's built-in User-Agent middleware would otherwise take effect, so set it to None
    # External Libraries --> site-packages --> scrapy --> downloadermiddlewares --> useragent.py
    # That path leads to Scrapy's built-in UserAgentMiddleware class
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}
5. Write a custom UserAgentMiddleware
  Add a UserAgentMiddleware class to middlewares.py. You can first copy the class from the file at the path below and then modify it as needed.
  Path:

    External Libraries --> site-packages --> scrapy --> downloadermiddlewares --> useragent.py

  The full contents of middlewares.py are shown here. The UserAgentMiddleware class comes last; the other classes are auto-generated.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from fake_useragent import UserAgent


class IvskyspiderSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IvskyspiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


# Middleware that sets a random User-Agent on every request
class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self):
        self.user_agent = UserAgent()

    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        # crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    # Do not delete this method, or an error may be raised
    def spider_opened(self, spider):
        # self.user_agent = getattr(spider, 'user_agent', self.user_agent)
        pass

    def process_request(self, request, spider):
        if self.user_agent:
            # The b prefix makes the header name a bytes object
            request.headers.setdefault(b'User-Agent', self.user_agent.random)
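
A quick standalone check of the fake_useragent package used above (a sketch; it assumes the package has been installed, e.g. with pip install fake-useragent):

from fake_useragent import UserAgent

ua = UserAgent()
# .random returns a different browser User-Agent string on each access
print(ua.random)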
6. Define the data model in the items file
class ImgInfoItem(scrapy.Item):
    """
        item用于组装爬虫数据
        里面的字段根据实际情况定义即可,一般字段名和变量名保持一致
        item封装后的数据  可以经过管道做一些数据
    """
    big_title = scrapy.Field()
    small_title = scrapy.Field()
    thumb_src = scrapy.Field()
    thumb_alt = scrapy.Field()
    img_datail_src = scrapy.Field()
    path = scrapy.Field()
7. Parse the data
1) Roughly plan how many functions you will need
2) Move from one function to the next with yield scrapy.Request(url, callback, meta, dont_filter)
Code for the spider file ivsky.py:
# -*- coding: utf-8 -*-
import scrapy

# os is imported to create folders for the downloaded images
import os

import requests

# A single dot refers to the package the current file lives in; two dots refer to its parent package
# Import the data model class from the items file
from ..items import ImgInfoItem

# Scrapy ships with its own selector machinery built on lxml; only when writing a crawler without the framework
# do you need to import lxml yourself to parse pages with XPath
# from lxml import  etree

class IvskySpider(scrapy.Spider):
    # Spider name; it must be unique, so avoid changing it
    name = 'ivsky'
    # Domains the spider is allowed to crawl
    allowed_domains = ['ivsky.com']
    # The first URL(s) the spider crawls by default; usually set manually, and more than one URL can be listed
    start_urls = ['http://www.ivsky.com/tupian/ziranfengguang/']


    # The pages for the URLs in start_urls are downloaded automatically and parse() is called automatically;
    # the response parameter is the downloaded page object ("parse" = extract data from it)
    def parse(self, response):
        # This response is not quite the same as the one from the requests library; it has no content attribute
        # print(response.text)
        """

        参数1:url,爬取的网址
        参数2:callback回调,当网页下载好以后传给谁去解析
        参数3:method 默认get
        """
        # The URL is downloaded automatically and the page content is then passed to the callback function
        yield scrapy.Request(
            url=response.url,
            callback=self.parse_big_category,
            # Scrapy automatically filters URLs it has already crawled; True turns that filtering off
            dont_filter=True,

        )


    def parse_big_category(self,response):
        """
        Parse the top-level image categories
        :param response:
        :return:
        """

        # print(response.text)
        # response.selector.xpath() can be shortened to response.xpath()
        # response.selector.css()
        # response.selector.re()
        big_a_list=response.xpath("//ul[@class='tpmenu']/li/a")
        for big_a in big_a_list[1:]:
            # With plain lxml, xpath() returns a list of strings; here it returns a list of Selector objects instead
            # extract_first: converts the first matched element to a string, or returns the default value if nothing matched
            big_title = big_a.xpath("text()").extract_first("没有标题")
            big_href = big_a.xpath("@href").extract_first("没有地址")
            big_href="http://www.ivsky.com"+big_href
            # print(big_title,big_href)
            yield scrapy.Request(
                url=big_href,
                callback=self.parse_small_category,
                # meta carries data along to the callback
                meta={
                    "big_title":big_title
                },
                dont_filter=True,
            )


    def parse_small_category(self,response):
        """
        Parse the sub-categories
        :param response:
        :return:
        """
        # big_title=response.meta("big_title")
        small_a_list=response.xpath("//div[@class='sline']/div/a")
        for small_a in small_a_list:
            small_title = small_a.xpath("text()").extract_first("没有标题")
            # Add a new field to the meta dict that is already being passed along
            response.meta['small_title']=small_title
            small_href = small_a.xpath("@href").extract_first("没有地址")
            small_href = "http://www.ivsky.com" + small_href
            # print(small_title,small_href)
            yield scrapy.Request(
                url=small_href,
                callback=self.parse_img_list,
                meta=response.meta,
                dont_filter=True,
            )


    def parse_img_list(self,response):
        """
        Parse the thumbnail list page
        :param response:
        :return:
        """
        img_a_list=response.xpath("//ul[@class='pli']/li/div/a")
        for img_a in img_a_list:
            detail_href = img_a.xpath("@href").extract_first("没有地址")
            detail_href="http://www.ivsky.com"+detail_href
            thumb_src = img_a.xpath("img/@src").extract_first("没有图片地址")
            thumb_alt = img_a.xpath("img/@alt").extract_first("没有图片名称")
            response.meta['thumb_src']=thumb_src
            response.meta['thumb_alt']=thumb_alt
            # print(thumb_alt,thumb_src)
            yield scrapy.Request(
                url=detail_href,
                callback=self.parse_img_detail,
                meta=response.meta,
                dont_filter=True,
            )


    def parse_img_detail(self,response):
        """
        Parse the image detail page
        :param response:
        :return:
        """
        # Pull the accumulated data back out of meta
        big_title = response.meta.get('big_title')
        small_title = response.meta.get('small_title')
        thumb_src = response.meta.get('thumb_src')
        thumb_alt = response.meta.get('thumb_alt')
        img_datail_src=response.xpath("//img[@id='imgis']/@src").extract_first("没有详细地址")
        print(img_datail_src)
        path = 'img/'+big_title+'/'+small_title+'/'+thumb_alt
        if not os.path.exists(path):
            os.makedirs(path)
        picture_name = thumb_src.split('/')[-1]
        thumb_name = "缩略图"+picture_name
        datail_name = "高清图"+picture_name

        # Create the item object
        item=ImgInfoItem()
        # Pack the data into the item
        item["big_title"] = big_title
        item["small_title"] = small_title
        item["thumb_src"] = thumb_src
        item["thumb_alt"] = thumb_alt
        item["img_datail_src"] = img_datail_src
        item["path"] = path
        # Similar to return: return would also hand back the item, but nothing after it would execute
        # yield returns the data and the code after it still runs
        yield item

        with open(path+'/'+thumb_name,"wb") as f:
            # thumb_src is the thumbnail URL
            img_response = requests.get(thumb_src)
            f.write(img_response.content)
        with open(path+'/'+datail_name,"wb") as f:
            # img_datail_src is the full-size image URL
            img_response = requests.get(img_datail_src)
            f.write(img_response.content)


"""
Scrapy supports four export formats out of the box: .json, .jl (JSON lines), .csv and .xml
Usage: scrapy crawl <spider_name> -o <filename>
"""



"""
    Process: an independently running program with its own resources.
    Thread: belongs to a process. If a process is a person, a thread is a train of thought; by default there is
          only one thread, called the main thread.
          A single thread can only do one thing at a time, e.g. a download manager with its concurrent tasks set to 1.
          Multiple threads can do several things "at the same time", e.g. the same tool with concurrency raised to 5.
          Note: more threads are not automatically better; like ticket windows at a train station, more windows are
          faster but cost more.
          Note: Scrapy's concurrency comes from the Twisted asynchronous event loop rather than from many OS threads.
          Note: Python does have threads, but CPython's GIL keeps CPU-bound code from running truly in parallel.
    Coroutine: cooperatively switches between tasks inside a single thread.
"""
8. Run the spider via run.py
from scrapy import cmdline
# cmdline.execute(['scrapy','crawl','ivsky'])
# img.csv tends to come out with garbled characters (encoding issues)
# cmdline.execute(['scrapy','crawl','ivsky','-o','img.csv'])
# img.json keeps a fixed, structured format
# cmdline.execute(['scrapy','crawl','ivsky','-o','img.json'])

# Export a JSON file with an explicit UTF-8 encoding
cmdline.execute(['scrapy', 'crawl', 'ivsky', '-o', 'img.json', '-s', 'FEED_EXPORT_ENCODING=utf-8'])
# Export a CSV file encoded as gb18030
# cmdline.execute(['scrapy', 'crawl', 'ivsky', '-o', 'img.csv', '-s', 'FEED_EXPORT_ENCODING=gb18030'])
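
Equivalently, the same export can be started from a terminal in the project directory instead of through run.py (the output filename here is arbitrary):

scrapy crawl ivsky -o img.json -s FEED_EXPORT_ENCODING=utf-8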




 