scrapy
Scrapy is a framework built specifically for Python crawlers: it schedules requests asynchronously and runs them concurrently. You can build a concurrent crawler without Scrapy, but the result tends to be clumsy and a lot more work, while Scrapy makes it simple and ships many powerful features on top. If anything is unclear, the Chinese Scrapy documentation is at https://yiyibooks.cn/zomin/Scrapy15/index.html
scrapy-redis
scrapy-redis is a component built on top of Scrapy. It adds resumable crawling, URL deduplication, persistent crawling and distributed crawling, and it is very easy to use. First download the source from https://github.com/rmax/scrapy-redis; it only has three main files.
The part that matters here is the scrapy_redis package. Below it is used to crawl book information from JD (jd.com).
Creating the Scrapy project
1. Open a terminal (cmd) in the directory where the project should live and create it: scrapy startproject jingdong, where jingdong is the project name.
2. cd into the project and create the spider: scrapy genspider jdbook jd.com, where jdbook is the spider name and jd.com is the allowed domain. Always give the spider a domain restriction so it cannot wander off to unrelated URLs.
3. Copy the scrapy_redis package from the downloaded source into the project, because the scheduler and dedup settings below refer to the classes inside it.
Configuring settings.py
1. The first setting to change is ROBOTSTXT_OBEY: set it to False (the generated template sets it to True). With True, Scrapy fetches the site's robots.txt before crawling and obeys its rules, which can block exactly the pages we want.
2. Pretend to be a browser: set User-Agent and some default request headers.
These entries are commented out in the generated settings file; uncomment them and fill in values copied from a real browser. Cookies are not needed here. Cookies and proxy IPs would be configured in the downloader middleware, i.e. in middlewares.py, but JD's anti-crawling for this book data is not strict, so this crawl does not set them; add them yourself if you want.
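If you did want a proxy (or cookies), a minimal downloader-middleware sketch could look like the following; the class name and the proxy address are placeholders, not part of this project, and the class would still have to be enabled in DOWNLOADER_MIDDLEWARES:
# middlewares.py -- illustrative only, not used in this crawl
import random

class RandomProxyMiddleware(object):
    PROXIES = ["http://127.0.0.1:8888"]  # placeholder proxy pool

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)  # picked up by Scrapy's HttpProxyMiddleware
        # cookies could be attached here as well, e.g. request.cookies = {"key": "value"}
        return None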
3. Configure Redis and the persistence-related settings:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # dedup class that filters out requests already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # use the Redis-backed scheduler/queue
SCHEDULER_PERSIST = True # keep the queue and fingerprints in Redis when the spider stops
REDIS_URL = "redis://127.0.0.1:6379"
DUPEFILTER_CLASS points to the dedup class, which makes sure a URL that has already been requested is not requested again. The rule, roughly: the URL is hashed into a fixed-length string, the "fingerprint", which is stored in Redis. Every new URL is hashed the same way; if its fingerprint is already in the database the URL is skipped, otherwise it is crawled and its fingerprint is stored.
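The real RFPDupeFilter fingerprints the whole request (method, URL, body) with sha1, but the idea can be illustrated with a simplified sketch that only hashes the URL; redis-py and the key name are assumptions here, not the actual scrapy_redis code:
import hashlib
import redis

r = redis.StrictRedis.from_url("redis://127.0.0.1:6379")

def seen_before(url, key="jdbook:dupefilter"):
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()  # the "fingerprint"
    added = r.sadd(key, fp)  # SADD returns 0 if the fingerprint is already in the set
    return added == 0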
SCHEDULER points to the request queue. The URLs to visit are wrapped into request objects and pushed into the Redis-backed queue, then popped out one by one to be downloaded; this is where the concurrent crawling happens.
SCHEDULER_PERSIST keeps the crawl state. Without this flag, the queue and fingerprints are cleared from Redis when the program finishes, so there is no persistence and no resuming from where the crawl stopped.
REDIS_URL is simply the address of the Redis server.
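With SCHEDULER_PERSIST enabled you can stop the spider with Ctrl+C and check that the state is still in Redis. By default scrapy-redis keeps the pending requests under "<spider>:requests" (a sorted set, assuming the default priority queue) and the fingerprints under "<spider>:dupefilter"; a quick sketch with redis-py:
import redis

r = redis.StrictRedis.from_url("redis://127.0.0.1:6379")
print(r.zcard("jdbook:requests"))    # requests still waiting to be crawled
print(r.scard("jdbook:dupefilter"))  # fingerprints of requests already seen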
4. To actually save the data, the item pipeline also has to be enabled:
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
Analyzing the website
First open the book category page:
https://book.jd.com/booksort.html
Each major category has many sub-categories under it. Press F12 and check whether the data is actually in the response of this URL.
It is: the response contains the name and URL of every major category as well as the name and URL of every sub-category.
Extracting the major and minor book categories
First extract the names of the major categories:
all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # major categories
for i in all_list:
    item = JingdongItem()
    item['all_class'] = i.xpath("./a/text()").extract_first()
The sub-categories sit in the <dd> element that directly follows each major category's <dt>, so they can be reached through the following-sibling axis.
Then extract the name and URL of each sub-category:
    class_list = i.xpath("./following-sibling::dd[1]/em")  # the <dd> right after this <dt> holds the sub-categories
    for j in class_list:
        item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
        item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
Following the sub-category URL to get part of the book information
Next, open a sub-category listing page and see how much of the data we need is there.
Then check the page source to confirm that the data really is in the HTML.
The listing contains the book title and URL, the author, the publisher name, the store name and URL, and the publication date, all of which we want, so they are extracted here. The remaining two fields, price and comment count, are fetched separately from the book's detail data (see below).
def next_parse(self, response):
    item = response.meta["item"]
    book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
    for i in book_list:
        try:
            item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
        except AttributeError:  # extract_first() returned None
            item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
        item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
        item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
        item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
        try:
            item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
        except AttributeError:
            item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
        item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
Getting the book price and comment count
Where do the price and comment count come from? They are not in the book page's HTML; they are loaded by JavaScript as JSON. Searching the network requests shows that the price comes from this URL: https://p.3.cn/prices/mgets?type=1&skuIds=J_11892005346&pdtk=&pduid=1551774170386597393748&pdpin=&pdbp=0&callback=jQuery6622062&_=1560704913535
Most of those parameters turn out to be unnecessary; the URL can be trimmed down to https://p.3.cn/prices/mgets?skuIds=J_11892005346
To get any book's price, just change the number after J_. That number (the sku id) is present on the sub-category listing page, so extracting it there is enough to request each book's price; the response contains both the list price and the actual selling price.
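The sku id is taken from the data-sku attribute of each book's <li> on the listing page and dropped into the trimmed URL; inside next_parse this looks like:
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku), callback=self.parse_dateli, meta={"item": deepcopy(item)})
The callback below then reads the JSON that comes back: js[0]['m'] is the list price and js[0]['p'] the selling price.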
def parse_dateli(self, response):
    item = response.meta["item"]
    js = json.loads(response.body)
    item['original_price'] = js[0]['m']
    item['price'] = js[0]['p']
Extracting the comment count
The comment count is also delivered as JSON, so it can be located the same way the price endpoint was found; the details are not repeated here. Feel free to ask if anything is unclear.
def parse_comment(self, response):
    item = response.meta["item"]
    js = json.loads(response.text)
    item['comment'] = js['CommentsCount'][0]['CommentCount']
Setting up items
items.py declares the fields that the scraped data will be carried in (the pipeline that saves it comes next). Using it is simple: first define the fields:
import scrapy
class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
Then import the JingdongItem class from this file in the spider and create an instance to hold the data, e.g.
item = JingdongItem()
Remember that this instance is created inside the spider file; the full source below shows exactly where.
Setting up pipelines
pipelines.py is where the data gets saved. It could go into a MySQL database, into MongoDB, or into a CSV file; here it is written to CSV.
import pandas as pd
class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append each item to the CSV; header=False so a header row is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider finished")
Notes
Two things to watch out for:
1. Allowed domains. The sub-category pages, the price endpoint and the comment endpoint live on different domains, so each of those domains has to be added to allowed_domains.
2. The meta parameter of yield scrapy.Request. meta is how a value is handed on to the next callback, for example:
yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})
This uses deepcopy from the copy module to copy the item before passing it along. Without the copy the data would get mixed up: Scrapy issues requests asynchronously, so the loop may already be filling in the third sub-category before the request carrying the first one has been processed, and every callback would end up seeing the same, latest values. Copying the item first prevents that.
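The effect is easy to reproduce with a plain dictionary: whoever holds a reference to the same object sees later overwrites, while a deepcopy keeps its own snapshot:
from copy import deepcopy

item = {"next_class_name": "小说"}
shared = item            # a pending request holding the same object
copied = deepcopy(item)  # a pending request holding its own copy

item["next_class_name"] = "文学"   # the loop has already moved on to the next sub-category
print(shared["next_class_name"])   # 文学 -- the earlier request's value was overwritten
print(copied["next_class_name"])   # 小说 -- the deepcopy kept the right value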
Spider source code
import scrapy
from copy import deepcopy
import re
import json
from jingdong.items import JingdongItem
class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com', 'p.3.cn', 'club.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # major categories
        for i in all_list:
            item = JingdongItem()
            item['all_class'] = i.xpath("./a/text()").extract_first()
            class_list = i.xpath("./following-sibling::dd[1]/em")  # the <dd> right after this <dt> holds the sub-categories
            for j in class_list:
                item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
                item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
                yield scrapy.Request(url=item["next_class_url"], callback=self.next_parse, meta={"item": deepcopy(item)})

    def next_parse(self, response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except AttributeError:  # extract_first() returned None
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except AttributeError:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku), callback=self.parse_dateli, meta={"item": deepcopy(item)})
        next_page_url = "https://list.jd.com" + "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        judge = "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        if judge:  # an empty string means there is no next page
            yield scrapy.Request(url=next_page_url, callback=self.next_parse, meta={"item": deepcopy(item)})

    def parse_dateli(self, response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']
        id = js[0]["id"]
        id = "".join(re.findall(r'\d+', id))  # keep only the numeric sku id
        yield scrapy.Request(url="https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(id), callback=self.parse_comment, meta={"item": deepcopy(item)})

    def parse_comment(self, response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']
        yield deepcopy(item)
settings.py source code
# -*- coding: utf-8 -*-
# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jingdong'
SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # dedup class that filters out requests already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # use the Redis-backed scheduler/queue
SCHEDULER_PERSIST = True # keep the queue and fingerprints in Redis when the spider stops
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "DEBUG"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
REDIS_URL = "redis://127.0.0.1:6379" # the Redis connection used by the scheduler and the dupefilter
items.py source code
import scrapy
class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
pipelines.py source code
import pandas as pd
class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append each item to the CSV; header=False so a header row is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider finished")