How does a Python Scrapy crawler handle incremental crawling?

yoggie尤

Published 2024-10-21 20:45:00

Licensed under CC 4.0 BY-SA

Tags: python, scrapy, crawler

Article link: https://blog.youkuaiyun.com/yjq125931902/article/details/143115740

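In the era of big data, web crawlers are an important tool for data collection. In the Python ecosystem, the Scrapy framework is popular with developers for its efficiency and flexibility. As the volume of data keeps growing, however, re-fetching everything on every run becomes wasteful, so crawling incrementally matters. This article looks at how a Scrapy-based crawler can implement incremental crawling and offers some practical techniques and best practices.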

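What is incremental crawling?

Incremental crawling means that, once an initial full crawl has completed, later runs fetch only data that is new or has been updated. This cuts unnecessary resource consumption, shortens each run, and keeps the collected data current.

Advantages of incremental crawling

Saves resources: avoids re-crawling data you already have, reducing network requests and storage overhead.
Improves efficiency: focuses on new data, so each run finishes faster.
Keeps data fresh: picks up the latest data promptly.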

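Implementing incremental crawling in Scrapy

1. Use `Redis` as the deduplication store

Scrapy's default duplicate filter keeps request fingerprints in memory. On a large crawl that set can consume a lot of memory, and unless a `JOBDIR` is configured it is discarded when the process exits, so the next run has no record of what was already fetched. Keeping the fingerprints in Redis addresses both problems: they live outside the crawler process and survive between runs.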
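
To make the idea concrete, here is a simplified, standard-library-only sketch of fingerprint-based deduplication. It is not the actual Scrapy or scrapy-redis code: `canonicalize`, `fingerprint`, `request_seen`, and `InMemorySetStore` are hypothetical names, and the in-memory store only mimics the behavior of the Redis `SADD` command (1 when the member is new, 0 when it already exists), which is the kind of check a Redis-backed filter can rely on.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment, which is never sent to the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )


def fingerprint(method: str, url: str) -> str:
    """Hash the request method and canonical URL into a fixed-size key."""
    data = f"{method.upper()} {canonicalize(url)}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


class InMemorySetStore:
    """Stand-in for a Redis set; only SADD-like behavior is modeled."""

    def __init__(self):
        self._members = set()

    def sadd(self, member: str) -> int:
        if member in self._members:
            return 0
        self._members.add(member)
        return 1


def request_seen(store, method: str, url: str) -> bool:
    """Return True if an equivalent request was recorded earlier."""
    return store.sadd(fingerprint(method, url)) == 0


store = InMemorySetStore()
first = request_seen(store, "GET", "http://example.com/list?b=2&a=1")
second = request_seen(store, "GET", "http://EXAMPLE.com/list?a=1&b=2#top")
print(first, second)  # False True
```

Because each fingerprint has a fixed size, the store grows with the number of distinct requests rather than with URL length, and moving it out of the crawler process is what lets a later run see what an earlier run recorded.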

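Install the dependency

pip install scrapy-redis

Configure `settings.py`

# Use the Redis-backed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the Redis-backed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and fingerprints in Redis after the spider closes;
# without this, scrapy-redis clears them and the next run starts from scratch
SCHEDULER_PERSIST = True

# Redis connection parameters
REDIS_URL = 'redis://localhost:6379'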

2. Record crawled URLs

If the crawler records the URLs it has already fetched, the next run can skip them, which yields incremental crawling.

Sample code

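from scrapy_redis.spiders import RedisSpider


class IncrementalSpider(RedisSpider):
    name = 'incremental_spider'
    # Redis key from which the spider reads its start URLs
    redis_key = 'incremental:start_urls'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # In-memory only: persist this set (file, Redis, database)
        # if URLs should also be skipped on later runs
        self.crawled_urls = set()

    def parse(self, response):
        # Record the URL that was just crawled
        self.crawled_urls.add(response.url)

        # Process the page content
        item = {}
        item['title'] = response.xpath('//title/text()').get()
        yield item

        # Extract new URLs; resolve relative links first so they can be
        # compared with the absolute URLs stored in crawled_urls
        for href in response.css('a::attr(href)').getall():
            next_url = response.urljoin(href)
            if next_url not in self.crawled_urls:
                yield response.follow(next_url, self.parse)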
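
A set held only in memory is empty again on the next run, so skipping URLs across runs requires saving it somewhere. The following is a minimal sketch of one way to do that with a JSON file, using only the standard library. The helper names (`load_seen`, `save_seen`, `filter_new`) and the file name `crawled_urls.json` are assumptions made for illustration; they are not part of Scrapy or scrapy-redis.

```python
import json
import tempfile
from pathlib import Path


def load_seen(path: Path) -> set:
    """Load the URLs recorded by earlier runs; empty set on the first run."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(path: Path, seen: set) -> None:
    """Write to a temporary file first, then rename it over the target."""
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(sorted(seen)), encoding="utf-8")
    tmp.replace(path)


def filter_new(candidates, seen: set) -> list:
    """Keep only URLs not seen before, preserving order, and record them."""
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Simulate two consecutive runs that share one state file.
state_file = Path(tempfile.mkdtemp()) / "crawled_urls.json"

seen = load_seen(state_file)
run1 = filter_new(["http://example.com/a", "http://example.com/b"], seen)
save_seen(state_file, seen)

seen = load_seen(state_file)
run2 = filter_new(["http://example.com/b", "http://example.com/c"], seen)
save_seen(state_file, seen)

print(run1)  # ['http://example.com/a', 'http://example.com/b']
print(run2)  # ['http://example.com/c']
```

In a spider, one plausible arrangement is to call the load step in `__init__` and the save step in the spider's `closed(reason)` method. A flat file suits a single process; several workers sharing state would more likely use Redis or a database.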

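3. Use timestamps or version numbers

For some sites, you can check a page's timestamp or version number, for example the HTTP `Last-Modified` header, to decide whether the page needs to be crawled again. This approach suits sites that update their content regularly.

Sample code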

import scrapy
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


class TimestampSpider(scrapy.Spider):
    name = 'timestamp_spider'
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.last_crawled_time = self.get_last_crawled_time()

    def get_last_crawled_time(self):
        # Placeholder: read the previous crawl time from a database or file.
        # Timezone-aware so it can be compared with the parsed header below.
        return datetime(2023, 1, 1, tzinfo=timezone.utc)

    def parse(self, response):
        # Note: pages that send no Last-Modified header are never scraped here
        last_modified = response.headers.get('Last-Modified')
        if last_modified:
            # Parse the HTTP date without depending on the system locale
            last_modified = parsedate_to_datetime(last_modified.decode())
            if last_modified > self.last_crawled_time:
                # The page has been updated; scrape it
                item = {}
                item['title'] = response.xpath('//title/text()').get()
                yield item

        # Extract new URLs
        for next_url in response.css('a::attr(href)').getall():
            yield response.follow(next_url, self.parse)
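
Comparing `Last-Modified` after the download still transfers the whole page. As a hedged sketch of the decision logic around that, the helpers below build an `If-Modified-Since` request header so that a cooperating server can answer `304 Not Modified`, decide whether a response needs processing, and compute a content hash as a stand-in "version" for servers that send no timestamp. The function names are hypothetical, only the standard library is used, and the previous crawl time is assumed to be a timezone-aware UTC `datetime`.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_headers(last_crawled: datetime) -> dict:
    """Build an If-Modified-Since header from the previous crawl time (UTC)."""
    return {"If-Modified-Since": format_datetime(last_crawled, usegmt=True)}


def needs_processing(status: int, last_modified, last_crawled: datetime) -> bool:
    """Decide whether a response carries content newer than the last crawl."""
    if status == 304:
        return False  # server says nothing changed
    if last_modified is None:
        return True  # no timestamp available; process to be safe
    try:
        modified_at = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # unparseable date; process to be safe
    if modified_at.tzinfo is None:
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    return modified_at > last_crawled


def content_version(body: bytes) -> str:
    """Hash the body; a changed hash means the page content changed."""
    return hashlib.sha256(body).hexdigest()


last = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(conditional_headers(last))
print(needs_processing(200, "Wed, 21 Oct 2015 07:28:00 GMT", last))  # False
print(needs_processing(200, "Mon, 21 Oct 2024 12:45:00 GMT", last))  # True
```

In Scrapy, such headers could be passed through the `headers` argument of a `Request`. Non-2xx responses are normally filtered out before they reach the callback, so a spider that wants to see 304 responses would likely need to list that status in `handle_httpstatus_list`.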

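4. Record crawl state in a database

If crawl state is recorded in a database, the spider can check those records each time it starts and decide whether a given URL needs to be crawled.

Sample code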

import scrapy
import sqlite3
from datetime import datetime


class DatabaseSpider(scrapy.Spider):
    name = 'database_spider'
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conn = sqlite3.connect('crawled_urls.db')
        self.cursor = self.conn.cursor()
        self.create_table()

    def create_table(self):
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS crawled_urls (
                url TEXT PRIMARY KEY,
                last_crawled TIMESTAMP
            )
        ''')
        self.conn.commit()

    def is_crawled(self, url):
        self.cursor.execute('SELECT 1 FROM crawled_urls WHERE url = ?', (url,))
        return self.cursor.fetchone() is not None

    def mark_as_crawled(self, url):
        # Store the crawl time as ISO 8601 text
        self.cursor.execute(
            'INSERT OR REPLACE INTO crawled_urls (url, last_crawled) VALUES (?, ?)',
            (url, datetime.now().isoformat()),
        )
        self.conn.commit()

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; release the connection
        self.conn.close()

    def parse(self, response):
        if not self.is_crawled(response.url):
            # The page has not been crawled before; scrape it
            item = {}
            item['title'] = response.xpath('//title/text()').get()
            yield item
            self.mark_as_crawled(response.url)

        # Extract new URLs
        for next_url in response.css('a::attr(href)').getall():
            yield response.follow(next_url, self.parse)
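
A table that only records "crawled or not" never revisits a page, so later updates go unnoticed. One possible refinement, sketched here under stated assumptions, is to use the stored `last_crawled` value to re-crawl a URL once its record is older than a chosen interval. `CrawlStateStore`, `should_crawl`, `mark_crawled`, and the one-day interval are hypothetical names and values chosen for illustration; the example uses an in-memory SQLite database so it runs without leaving files behind.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


class CrawlStateStore:
    """Hypothetical helper that tracks when each URL was last crawled."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS crawled_urls ("
            "url TEXT PRIMARY KEY, last_crawled TEXT NOT NULL)"
        )

    def should_crawl(self, url: str, now: datetime, max_age: timedelta) -> bool:
        """True for unknown URLs and for URLs whose record is at least max_age old."""
        row = self.conn.execute(
            "SELECT last_crawled FROM crawled_urls WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return True
        return now - datetime.fromisoformat(row[0]) >= max_age

    def mark_crawled(self, url: str, now: datetime) -> None:
        """Insert or refresh the record; the with-block commits the change."""
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO crawled_urls (url, last_crawled) VALUES (?, ?)",
                (url, now.isoformat()),
            )

    def close(self) -> None:
        self.conn.close()


store = CrawlStateStore()
url = "http://example.com/page"
t0 = datetime(2024, 10, 21, tzinfo=timezone.utc)
one_day = timedelta(days=1)

decisions = [store.should_crawl(url, t0, one_day)]  # unknown URL
store.mark_crawled(url, t0)
decisions.append(store.should_crawl(url, t0 + timedelta(hours=1), one_day))  # still fresh
decisions.append(store.should_crawl(url, t0 + timedelta(days=2), one_day))  # stale
store.close()

print(decisions)  # [True, False, True]
```

Running the check before a request is scheduled, rather than after the response arrives, is what actually saves the network round trip. The trade-off is that links reachable only through skipped pages are not rediscovered until those pages become stale.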