Python crawler: creating a database connection pool in the Pipeline for batch writes to MySQL

This article explains the key parameters of a database connection pool and what they mean, then shows how to configure and use a MySQL connection pool in Python with DBUtils.PooledDB, covering pool initialization, batch inserts, and releasing resources when the spider finishes.

For reference, the constructor docstring of DBUtils' PooledDB documents the pool parameters:

    def __init__(self, creator,
            mincached=0, maxcached=0,
            maxshared=0, maxconnections=0, blocking=False,
            maxusage=None, setsession=None, reset=True,
            failures=None, ping=1,
            *args, **kwargs):
        """Set up the DB-API 2 connection pool.

        creator: either an arbitrary function returning new DB-API 2
            connection objects or a DB-API 2 compliant database module
        mincached: initial number of idle connections in the pool
            (0 means no connections are made at startup)
        maxcached: maximum number of idle connections in the pool
            (0 or None means unlimited pool size)
        maxshared: maximum number of shared connections
            (0 or None means all connections are dedicated)
            When this maximum number is reached, connections are
            shared if they have been requested as shareable.
        maxconnections: maximum number of connections generally allowed
            (0 or None means an arbitrary number of connections)
        blocking: determines behavior when exceeding the maximum
            (if this is set to true, block and wait until the number of
            connections decreases, otherwise an error will be reported)
        maxusage: maximum number of reuses of a single connection
            (0 or None means unlimited reuse)
            When this maximum usage number of the connection is reached,
            the connection is automatically reset (closed and reopened).
        setsession: optional list of SQL commands that may serve to prepare
            the session, e.g. ["set datestyle to ...", "set time zone ..."]
        reset: how connections should be reset when returned to the pool
            (False or None to rollback transactions started with begin(),
            True to always issue a rollback for safety's sake)
        failures: an optional exception class or a tuple of exception classes
            for which the connection failover mechanism shall be applied,
            if the default (OperationalError, InternalError) is not adequate
        ping: determines when the connection should be checked with ping()
            (0 = None = never, 1 = default = whenever fetched from the pool,
            2 = when a cursor is created, 4 = when a query is executed,
            7 = always, and all other bit combinations of these values)
        args, kwargs: the parameters that shall be passed to the creator
            function or the connection constructor of the DB-API 2 module

        """

The pipeline code below writes data in batches of 1,000 rows. The connection pool is created when the spider opens; when the spider closes, whatever is left in the buffer (possibly less than a full batch) is written to the database and the pool is closed:

# -*- coding: utf-8 -*-
import sys
import MySQLdb

# Python 2 workaround: force the default string encoding to UTF-8.
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)
from DBUtils.PooledDB import PooledDB

class MySQLStorePipeline(object):

    def open_spider(self, spider):
        # Buffer of rows waiting to be flushed to MySQL (one batch = 1000 rows).
        self.pgs = []
        # Create the connection pool when the spider starts.
        self.pool = PooledDB(creator=MySQLdb, maxcached=5, maxshared=5,
                             host='localhost', user='root', passwd='root',
                             db='vboxdb', port=3306, charset="utf8")
        self.conn = self.pool.connection()
        self.cursor = self.conn.cursor()
        # Empty the target table before each crawl:
        self.cursor.execute("truncate table epg")
        self.conn.commit()

    # Batch insert: write all buffered rows in a single executemany() call.
    def bulk_insert_to_mysql(self, bulkdata):
        try:
            print "inserting data in batch--->>>>>", len(bulkdata)
            sql = """INSERT INTO epg (date, channel, time, pg) VALUES (%s, %s, %s, %s)"""
            self.cursor.executemany(sql, bulkdata)
            self.conn.commit()
        except MySQLdb.Error as e:
            # Roll back the whole batch if the insert fails.
            print "batch insert failed:", e
            self.conn.rollback()

    def process_item(self, item, spider):
        # print item['date'], item['channel'], item['time'], item['pg']
        self.pgs.append((item['date'], item['channel'], item['time'], item['pg']))
        if len(self.pgs) == 1000:
            self.bulk_insert_to_mysql(self.pgs)
            # Clear the buffer once the batch has been written
            del self.pgs[:]
        return item

    # Called when the spider finishes.
    def close_spider(self, spider):
        print "closing spider, last commit", len(self.pgs)
        # Flush whatever is left in the buffer (possibly less than a full batch).
        self.bulk_insert_to_mysql(self.pgs)
        self.cursor.close()
        self.conn.close()   # return the connection to the pool
        self.pool.close()   # close all connections held by the pool
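
To enable this pipeline, register it in the Scrapy project's settings. The snippet below is only a sketch: 'myproject' stands for the actual project package, and the priority 300 is an arbitrary but typical value. The code also assumes that an epg table with date, channel, time and pg columns already exists in the vboxdb database.

# settings.py (sketch; 'myproject' is a placeholder for the real project package)
ITEM_PIPELINES = {
    'myproject.pipelines.MySQLStorePipeline': 300,
}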