Building a Scrapy User-Agent Pool from Scratch: 3 Efficient Strategies to Escape IP Bans

Chapter 1: The Core Value of a Scrapy User-Agent Pool

When building an efficient, stable web-scraping system, evading anti-scraping mechanisms is one of the key challenges. The User-Agent, an important part of the HTTP request headers, is commonly used by websites to identify the client type. A single, fixed User-Agent easily triggers the target site's blocking policies, leading to IP bans or useless responses. By introducing a User-Agent pool, a Scrapy crawler can randomly switch browser identities on every request, significantly reducing the risk of being detected as an automated tool.

Increasing request diversity

Randomized User-Agents bring crawler traffic closer to real user behavior, effectively simulating multi-device, multi-browser access patterns and making requests look more legitimate.

Strengthening counter-anti-scraping capability

Combining rotating proxy IPs with User-Agent rotation forms a multi-dimensional disguise strategy and greatly improves resilience against sophisticated anti-scraping systems.

A User-Agent pool is usually implemented as a custom downloader middleware. A typical middleware looks like this:
# middlewares.py
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent_list):
        super().__init__()  # keep the parent's default UA as a fallback
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # read the UA list from the project settings
        return cls(
            user_agent_list=crawler.settings.get('USER_AGENT_LIST')
        )

    def process_request(self, request, spider):
        if self.user_agent_list:
            ua = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', ua)
The code above defines a middleware class that randomly picks a User-Agent from a configured list and injects it into the request headers. You must enable the middleware in settings.py and supply the User-Agent list:
  1. Add the middleware to the DOWNLOADER_MIDDLEWARES setting
  2. Define a USER_AGENT_LIST variable in the settings, containing several mainstream browser identifiers
  3. Verify that the middleware is loaded and executed during request scheduling (see the sketch below)
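A minimal settings.py sketch covering steps 1 and 2; the module path myproject and the UA strings are placeholders, not values from the original project:
# settings.py
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
]

DOWNLOADER_MIDDLEWARES = {
    # enable the custom middleware...
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    # ...and disable Scrapy's built-in UA middleware so it cannot interfere
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}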
Strategy | Effect
Fixed User-Agent | Easily identified; high risk
Random User-Agent pool | Better stealth; recommended

Chapter 2: Building and Optimizing a Static User-Agent Pool

2.1 Understanding the Role of the User-Agent and Anti-Scraping Mechanisms

The User-Agent (UA) is a key HTTP request-header field that identifies the client's operating system, browser type, and version. By parsing the UA, a server can infer the request's origin and distinguish normal users from automated crawlers.
A typical User-Agent structure
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
This string encodes the platform (Windows NT 10.0), the architecture (Win64; x64), the rendering engine (AppleWebKit/537.36), and the browser (Chrome/124.0.0.0); the server can use it to build a model of access behavior.
UA checks in anti-scraping systems
  • Empty-UA interception: requests without a UA are usually treated as scripted
  • Blacklist filtering: identifying known crawler signatures such as Scrapy or Python-urllib
  • Frequency correlation: combining IP and UA to analyze anomalous access patterns
To get around these checks, crawlers commonly rotate through a UA pool to simulate a diverse user population.
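A minimal sketch of the server-side screening described above; the blacklist tokens are illustrative, not an exhaustive real-world list:
# Server-side UA screening sketch: empty-UA interception plus a token blacklist
BLACKLISTED_TOKENS = ('scrapy', 'python-urllib', 'python-requests')

def looks_like_bot(user_agent):
    if not user_agent:                      # empty or missing UA
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLACKLISTED_TOKENS)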

2.2 Practical Methods for Collecting High-Quality User-Agent Lists

Leveraging public data sources
The most direct way to obtain high-quality User-Agents (UAs) is to aggregate authoritative public sources: device vendors' official sites, browser documentation, and community-maintained open-source projects such as ua-parser/uap-core on GitHub.
  • Periodically crawl and parse mobile-device databases (such as DeviceAtlas)
  • Follow W3C updates on Client Hints request headers
  • Aggregate UA samples from real traffic via CDN logs
Automated collection and cleaning
Simulate mainstream browsers with a crawler, collect the User-Agent echoed back in the response, and normalize it.
# Example: extract the UA echoed back in an HTTP response
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('https://httpbin.org/headers', headers=headers)
collected_ua = response.json()['headers']['User-Agent']
This code has the server echo the UA back, which is useful for building a sample library that reflects real environments. Pair it with randomized request intervals to avoid bans.
Validity checks
Establish rules to filter forged or outdated UAs: check format compliance, match known browser fingerprint features, and cross-check against online validation APIs. For example:
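A rough format-compliance filter; the regex and length bound are illustrative heuristics, not a standard:
# Heuristic format check for collected UA strings
import re

UA_PATTERN = re.compile(r'^Mozilla/5\.0 \(.+\) .+')

def is_plausible_ua(ua):
    # require the conventional Mozilla/5.0 prefix and a sane length
    return bool(UA_PATTERN.match(ua)) and len(ua) < 512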

2.3 Implementing Static UA-Pool Middleware Logic in Scrapy

In a Scrapy project, a static User-Agent pool is a common defense against being identified and blocked by the target site. A custom downloader middleware rotates the UA in the request headers.
Middleware implementation

import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # getlist() always returns a list, even when the setting is absent
        return cls(user_agents=crawler.settings.getlist('USER_AGENT_POOL'))

    def process_request(self, request, spider):
        if self.user_agents:
            ua = random.choice(self.user_agents)
            request.headers.setdefault('User-Agent', ua)
The code above defines a middleware class: from_crawler reads the UA list from the configuration, and process_request sets a random User-Agent on every request.
Configuration notes
  • USER_AGENT_POOL: a list of UA strings defined in settings.py
  • setdefault: adds the UA only when none is set, preserving special per-request configurations

2.4 Designing and Benchmarking the Random-Selection Strategy

Random selection is a lightweight load-balancing algorithm, suited to scenarios where service instances perform similarly and request-processing times fluctuate little. Its core idea is to pick one target at random from the candidates, avoiding the complexity of weighted algorithms.
Implementation

// RandomSelect picks one server uniformly at random. There is no need to
// re-seed on every call; since Go 1.20 the global generator seeds itself.
func RandomSelect(servers []string) string {
    index := rand.Intn(len(servers))
    return servers[index]
}
The code makes an unbiased selection by drawing a random index in [0, len(servers)). rand.Intn yields a uniform distribution, suiting fast decisions under high concurrency.
Performance comparison
Strategy | Throughput (QPS) | Mean latency
Random | 8,500 | 12 ms
Round-robin | 8,200 | 13 ms

2.5 Maintaining the Static Pool and Choosing an Update Frequency

As the core mechanism for pre-allocating resources, the static pool's maintenance strategy directly affects system stability and performance overhead.
Update triggers
Common update mechanisms are periodic polling and event-driven refresh. A periodic update can be scheduled via cron:
0 */6 * * * /opt/pool_updater --refresh-static-pool
This entry refreshes the static pool every 6 hours, which suits stable-load scenarios.
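As a rough illustration, such an updater might fetch a fresh list and swap it in atomically; the pool path and source URL below are assumptions, not part of the original setup:
# Sketch of a static-pool refresher: fetch, write to a temp file, atomic swap
import json
import os
import tempfile
import urllib.request

POOL_PATH = '/opt/ua_pool.json'                   # hypothetical pool file
SOURCE_URL = 'https://example.com/ua-list.json'   # hypothetical data source

def refresh_static_pool():
    with urllib.request.urlopen(SOURCE_URL, timeout=10) as resp:
        agents = json.load(resp)
    # write to a temp file, then rename: readers never see a half-written pool
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(POOL_PATH))
    with os.fdopen(fd, 'w') as f:
        json.dump(agents, f)
    os.replace(tmp, POOL_PATH)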
Trade-offs in maintenance frequency
Updating too often burdens the system; too rarely, the data goes stale. The table below compares different update intervals:
Update interval | Memory fluctuation (±%) | Mean request latency (ms)
30 min | 8 | 12.4
2 h | 3 | 14.1
6 h | 2 | 16.8
On balance, updating once a day during off-peak hours best balances freshness against resource consumption.

Chapter 3: Techniques for Generating User-Agents Dynamically

3.1 How the fake-useragent Library Generates UAs in Real Time

The fake-useragent library generates User-Agents dynamically by scraping public browser UA databases (such as Browscap and UAList). Its core advantage is avoiding the detection risk that static strings carry.

Data synchronization

On first run, the library downloads the latest UA data and caches it locally; later requests read the cache first, reducing network overhead.

from fake_useragent import UserAgent

ua = UserAgent()  # loads or refreshes the cache at initialization
print(ua.chrome)  # fetch one random Chrome User-Agent

When the UserAgent object above is initialized, it checks whether the local cache is valid; if the cache is missing or expired (default 86,400 seconds), the data is pulled again from the remote source.

Randomization strategy
  • Returns UAs filtered by browser type (Chrome, Firefox, Safari)
  • Built-in weighting mimics real-world browser distribution
  • A fallback parameter handles fetch failures, as shown below
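For example, passing fallback lets a failed data fetch degrade to a fixed string instead of raising:
from fake_useragent import UserAgent

# fallback is returned instead of an exception when fresh data can't be fetched
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
print(ua.firefox)  # a random Firefox UA, or the fallback on failure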

3.2 Integrating fake-useragent into Scrapy Middleware, Step by Step

Integrating `fake-useragent` into a Scrapy project noticeably improves the crawler's resistance to detection. The key step is a custom downloader middleware that switches the User-Agent dynamically.
Installing dependencies
First install the core libraries:
pip install fake-useragent scrapy
This pulls in `fake-useragent`, which generates random User-Agent strings from real browser data.
Writing the middleware
Define the class in middlewares.py:
from fake_useragent import UserAgent

class FakeUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()  # initializes the UA pool

    def process_request(self, request, spider):
        # overwrite the UA on every outgoing request
        request.headers['User-Agent'] = self.ua.random
UserAgent() initializes the browser UA pool; process_request injects a random User-Agent into every request, strengthening the disguise.
Enabling the middleware
Activate it in settings.py:
  • Add the class to DOWNLOADER_MIDDLEWARES
  • Order it against (or disable) Scrapy's built-in UserAgentMiddleware so the two do not conflict, for example:
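A minimal settings.py sketch; 'myproject' is a placeholder module path:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FakeUserAgentMiddleware': 400,
    # disabling the built-in UA middleware is a common way to ensure the two
    # do not fight over the same header
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}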

3.3 Stability and Exception Handling for Dynamic Generation

With dynamically generated strategies, system stability hinges on anticipating and responding to failures. To keep the service continuous, introduce circuit breaking, degradation, and retry mechanisms.
Exception monitoring and automatic recovery
Monitor strategy execution to catch timeouts and failed calls promptly; once a threshold trips, switch to the fallback strategy immediately.
// Circuit-breaker sketch: trips after failureThreshold consecutive failures
// and stays open for timeout before permitting new attempts
type CircuitBreaker struct {
    failureThreshold int           // maximum tolerated failures
    timeout          time.Duration // how long the breaker stays open
}

func NewCircuitBreaker() *CircuitBreaker {
    return &CircuitBreaker{
        failureThreshold: 5,
        timeout:          30 * time.Second,
    }
}
The code defines a basic circuit-breaker structure: failureThreshold bounds the tolerated failures and timeout sets how long the breaker stays open, preventing cascading failure.
Designing the retry mechanism (a sketch follows this list)
  • Exponential backoff: avoid hammering an already loaded system with retries
  • A cap on retries: typically 3 attempts
  • Context-scoped timeouts to prevent goroutine leaks
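A minimal sketch of exponential backoff with jitter around any fetch callable (for example, a UA-pool refresh); the delay figures are illustrative:
# Retry with exponential backoff and jitter
import random
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise                      # give up after the final attempt
            # delays of ~1s, 2s, 4s plus up to 0.5s of jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))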

Chapter 4: Advanced Designs for a Distributed User-Agent Pool

4.1 Architecture of a Shared UA Pool on Redis

In a high-concurrency crawler system, UA diversity is critical to getting past anti-scraping defenses. A shared UA pool built on Redis gives multiple nodes unified management and real-time synchronization of UA data.
Core data-structure design
Store UA strings in a Redis Set or List for fast random retrieval and easy deduplication:

# Add a UA to the pool
SADD user_agent_pool "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
# Fetch one UA at random
SRANDMEMBER user_agent_pool
该设计支持高效写入与随机读取,适用于动态更新场景。
服务间协同机制
通过Redis的发布/订阅功能,实现UA池变更通知:
  • 当新增或删除UA时,触发频道消息
  • 各爬虫节点监听该频道并更新本地缓存
  • 保障集群视图一致性,降低请求冲突
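A sketch of that notification flow with redis-py; the channel and key names are assumptions:
# Pub/sub notification for UA-pool changes
import redis

r = redis.Redis()

def add_ua(ua):
    if r.sadd('user_agent_pool', ua):          # 1 if the UA was actually new
        r.publish('ua_pool_changed', 'added')  # notify subscribed nodes

def listen_for_changes(refresh_local_cache):
    pubsub = r.pubsub()
    pubsub.subscribe('ua_pool_changed')
    for message in pubsub.listen():            # blocks, yielding messages
        if message['type'] == 'message':
            refresh_local_cache()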

4.2 Coordinating User-Agents Across Multiple Crawler Instances

In a distributed crawler architecture, many instances run in parallel; with badly configured UAs, the traffic is easily flagged by the target site as automation. Centralized UA-pool management coordinates the instances.
UA-pool design
  • Maintain a shared UA pool supporting dynamic updates and rotating retrieval
  • Synchronize across nodes with Redis so each instance draws random, non-repeating UAs
import random
import redis

class UserAgentPool:
    def __init__(self, redis_host='localhost', key='user_agents'):
        self.client = redis.Redis(host=redis_host, db=0)
        self.key = key

    def get_random_ua(self):
        # read every entry, then sample one locally for uniform randomness
        agents = self.client.lrange(self.key, 0, -1)
        return random.choice(agents).decode('utf-8') if agents else None
The class above fetches UAs from Redis: lrange reads every entry and random.choice picks one, keeping request fingerprints from repeating.
Scheduling optimizations
Refresh the pool contents on a schedule and adapt the distribution pace to request frequency to sharpen the counter-anti-scraping edge; a refresh sketch follows.
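A sketch of a scheduled refresh, assuming load_fresh_agents() is a project-specific helper that returns a list of UA strings; building the new list under a temporary key and renaming it makes the swap atomic:
# Periodic UA-pool refresh with an atomic key swap
import threading
import redis

client = redis.Redis()

def refresh_pool(interval_sec=3600):
    agents = load_fresh_agents()               # hypothetical helper
    if agents:
        pipe = client.pipeline()
        pipe.delete('user_agents_tmp')
        pipe.rpush('user_agents_tmp', *agents)
        pipe.rename('user_agents_tmp', 'user_agents')
        pipe.execute()
    # re-arm the timer so the refresh repeats every interval_sec seconds
    threading.Timer(interval_sec, refresh_pool, args=(interval_sec,)).start()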

4.3 Coupling Proxy IPs for Greater Request-Fingerprint Diversity

As anti-scraping defenses grow more complex, merely rotating a single pool of proxy IPs cannot evade detection based on behavioral fingerprints. Coordinating proxy IPs with browser fingerprint parameters makes requests look far more natural.
Coordinated dynamic fingerprint parameters
Each request switches not only the IP but also the User-Agent, screen resolution, time zone, and other traits, simulating real device variation.
  • Group the proxy pool by region and pair each region with its typical device profiles
  • Randomize delays and request ordering to avoid rigid behavioral patterns
type RequestFingerprint struct {
    ProxyURL   string // proxy address
    UserAgent  string // browser identifier
    Resolution string // screen resolution
    TimeZone   string // time-zone setting
}

// NewRequestWithRandomProfile draws a proxy at random and derives matching
// fingerprint fields from it. pickRandomUA, pickResolutionByRegion and
// extractTZFromProxy are project-specific helpers; no per-call rand.Seed is
// needed with the auto-seeded global generator (Go 1.20+).
func NewRequestWithRandomProfile(proxyList []string) *RequestFingerprint {
    proxy := proxyList[rand.Intn(len(proxyList))]
    return &RequestFingerprint{
        ProxyURL:   proxy,
        UserAgent:  pickRandomUA(),
        Resolution: pickResolutionByRegion(proxy),
        TimeZone:   extractTZFromProxy(proxy),
    }
}
The code implements a basic fingerprint struct plus a randomizing constructor. Drawing the IP from the proxy list at random and matching the UA and resolution to its geography keeps each request's network-layer and application-layer traits consistent, lowering the chance of being flagged as an automated tool.

4.4 Pool-Capacity Monitoring and Automatic Scaling

Collecting real-time metrics
For precise pool-capacity management, the system uses the Prometheus client to expose connection counts, CPU usage, memory footprint, and other key metrics. The core logic:
// RegisterMetrics should run once (e.g., at pool construction). The callback
// is evaluated on every scrape, so the gauge always reports the live usage
// ratio; re-registering on each collection would panic as a duplicate.
func (p *Pool) RegisterMetrics() {
    prometheus.MustRegister(prometheus.NewGaugeFunc(
        prometheus.GaugeOpts{Name: "pool_usage_ratio"},
        func() float64 {
            return float64(p.CurrentSize) / float64(p.MaxSize)
        },
    ))
}
This registers a dynamic gauge that reflects the pool's usage ratio in real time, feeding later scaling decisions.
Automatic scaling policy
Based on the monitoring data, the system applies a tiered response policy (a toy decision sketch follows the table):
  • Usage above 80% sustained for 2 minutes triggers scale-out (+2 instances)
  • Usage below 30% with connections idle over 5 minutes triggers scale-in (−1 instance)
  • Each adjustment is followed by a 3-minute cooldown to prevent oscillation
State | Threshold | Action
High load | >80% | Scale out
Low load | <30% | Scale in
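A toy decision function mirroring the thresholds above; a real system would track state transitions and instance counts far more carefully:
# Tiered scaling decision with cooldown
import time

class AutoScaler:
    def __init__(self):
        self.last_action = 0.0

    def decide(self, usage, high_since, idle_since):
        now = time.time()
        if now - self.last_action < 180:              # 3-minute cooldown
            return 0
        if usage > 0.80 and now - high_since >= 120:  # high for 2 minutes
            self.last_action = now
            return +2                                 # scale out by 2 instances
        if usage < 0.30 and now - idle_since >= 300:  # idle for 5 minutes
            self.last_action = now
            return -1                                 # scale in by 1 instance
        return 0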

Chapter 5: From a UA Pool to a Full Counter-Anti-Scraping Stack

As target sites keep upgrading their anti-scraping machinery, relying solely on User-Agent rotation can no longer sustain stable data collection. Modern anti-bot systems inspect not only request-header features but also behavior analysis, IP reputation, and JavaScript fingerprints, pushing crawler architectures toward a more comprehensive counter-anti-scraping stack.
Dynamic fingerprint simulation
To defeat browser-fingerprint checks, the automation tool must simulate a real user environment, for example using Puppeteer with stealth plugins to hide WebDriver traits and inject forged Canvas and WebGL fingerprints:
// runs before any page script, so fingerprinting code sees the patched values
await page.evaluateOnNewDocument(() => {
  // hide the automation flag many fingerprinting scripts check
  Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
  });
  // headless builds lack window.chrome; fake a minimal runtime object
  window.chrome = { runtime: {} };
});
Request scheduling and IP governance
Blending high-anonymity proxies with local egress IPs becomes key. Build a proxy health-scoring model that evicts nodes with high latency or CAPTCHA triggers in real time (a scoring sketch follows this list):
  • Derive node trust from response-code frequencies
  • Track per-IP request counts per unit time and throttle dynamically
  • Use geolocation to sidestep risky anomalous redirects
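A rough per-node health score built from counters a crawler would collect itself; the weights are illustrative, not tuned values:
# Proxy health score: success rate minus CAPTCHA and latency penalties
def health_score(ok, errors, captchas, avg_latency_ms):
    total = ok + errors + captchas
    if total == 0:
        return 0.0
    success_rate = ok / total
    captcha_penalty = captchas / total
    latency_penalty = min(avg_latency_ms / 5000.0, 1.0)
    return max(0.0, success_rate - 0.5 * captcha_penalty - 0.2 * latency_penalty)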
Simulating behavior patterns
Mimicking human pacing markedly lowers the chance of a ban: random scrolling, generated mouse-movement trajectories, jittered click intervals.
Behavior type | Parameter range | Implementation
Page dwell | 3–15 s | Normal-distribution sampling
Scroll step | 200–600 px | Segmented constant-speed animation
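For instance, dwell time can be sampled from a clipped normal distribution to match the 3–15 s range in the table; the mean and standard deviation are illustrative choices:
# Clipped normal sampling for page dwell time
import random

def dwell_seconds(mean=8.0, stddev=3.0, lo=3.0, hi=15.0):
    return min(max(random.gauss(mean, stddev), lo), hi)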
[Client] → (Load Balancer) → [Fingerprint Pool] → [Proxy Gateway] → [Target Site]
                                                 ↘ [Request Log] → [Risk-Analysis Engine]