Common settings in Scrapy's settings.py.

This post walks through the Scrapy settings you will touch most often, including crawl-order strategy, concurrency, logging, and download delays, to help you understand how to tune crawler performance.


Below are some of the settings from scrapy.settings, typically configured in your project's settings.py.

# Scrapy crawls depth-first by default. To switch to breadth-first, add:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
# Note: SCHEDULER_ORDER = 'BFO' is a leftover from very old Scrapy versions;
# in current Scrapy the three settings above are sufficient.
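To check that the change took effect, a minimal spider sketch (the name and start URL below are placeholders, not from the original post) can log each response's depth; with the settings above, depths should be dequeued in roughly ascending order:

import scrapy

class DepthDemoSpider(scrapy.Spider):
    # Hypothetical spider, used only to observe crawl order.
    name = 'depth_demo'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # DepthMiddleware records the current depth in response.meta.
        self.logger.info('depth=%s url=%s', response.meta.get('depth', 0), response.url)
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)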

# Raise the global concurrency limit
CONCURRENT_REQUESTS = 100

# LOG_FILE = 'spider_name.log'  # ideally named after the spider
# A dated log file; this requires `import time` at the top of settings.py:
LOG_FILE = BOT_NAME + '_' + time.strftime("%Y%m%d", time.localtime()) + '.log'

# Log level
LOG_LEVEL = 'INFO'

# Whether logging is enabled
LOG_ENABLED = True  # default is True, so this only needs setting if you want to disable logging

# Log encoding
LOG_ENCODING = 'utf-8'

# If True, all standard output (including errors) of the process is redirected
# to the log, e.g. print() calls inside spider code.
LOG_STDOUT = False  # default is False

# If you pass custom cookies via Request(cookies=...), set COOKIES_ENABLED = True
# so CookiesMiddleware manages them across the session.
# If you hard-code a Cookie header in settings (e.g. in DEFAULT_REQUEST_HEADERS),
# set COOKIES_ENABLED = False so the middleware does not override it.
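A sketch of both cases (the URL, cookie values, and callback name are placeholders):

# Case 1: per-request cookies; keep COOKIES_ENABLED = True so
# CookiesMiddleware tracks them for the rest of the session.
yield scrapy.Request(
    'https://example.com/account',             # placeholder URL
    cookies={'sessionid': 'YOUR_SESSION_ID'},  # placeholder value
    callback=self.parse_account,               # hypothetical callback
)

# Case 2: a fixed Cookie header in settings.py; set COOKIES_ENABLED = False,
# otherwise CookiesMiddleware overrides the header.
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'sessionid=YOUR_SESSION_ID',     # placeholder value
}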

# Maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between requests to the same website (default: 0).
#DOWNLOAD_DELAY = 3
# DOWNLOAD_DELAY is how long the downloader waits before fetching the next page
# from the same site; use it to throttle the crawl and reduce load on the server.
# Fractional values like 0.25 are allowed; the unit is seconds.
# The download delay setting honors only one of the two limits below:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # max concurrent requests to any single domain
#CONCURRENT_REQUESTS_PER_IP = 16      # max concurrent requests to any single IP
# If CONCURRENT_REQUESTS_PER_IP is non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is
# ignored and the limit is enforced per IP instead of per website. This also
# affects DOWNLOAD_DELAY: with a non-zero per-IP setting, the delay is applied
# per IP rather than per website.
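As an illustration of how these interact, one possible per-IP throttle (the numbers are illustrative, not recommendations):

CONCURRENT_REQUESTS = 32          # global ceiling
CONCURRENT_REQUESTS_PER_IP = 8    # per-IP ceiling; PER_DOMAIN is now ignored
DOWNLOAD_DELAY = 0.5              # applied per IP, since PER_IP is non-zero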
# Override the default request headers (note: this does not include cookies):
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
#SPIDER_MIDDLEWARES = {
# 'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
#DOWNLOADER_MIDDLEWARES = {
# 'demo1.middlewares.MyCustomDownloaderMiddleware': 543,
#}
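For reference, a minimal downloader middleware sketch; the class and module names match the demo1 placeholders above, and the header it sets is purely illustrative:

# demo1/middlewares.py
class MyCustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # Stamp every outgoing request with a (made-up) header.
        request.headers.setdefault('X-Demo', 'demo1')
        return None  # None means: continue normal processing of this request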

# Enable or disable extensions
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
#ITEM_PIPELINES = {
# 'demo1.pipelines.Demo1Pipeline': 300,
#}
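And a minimal pipeline sketch to match the demo1 entry above (the required 'name' field is a made-up example):

# demo1/pipelines.py
from scrapy.exceptions import DropItem

class Demo1Pipeline:
    def process_item(self, item, spider):
        # Drop items missing the (hypothetical) required field.
        if not item.get('name'):
            raise DropItem('missing name in %r' % item)
        return item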

# Enable and configure the AutoThrottle extension (disabled by default)
#AUTOTHROTTLE_ENABLED = True

# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5 


# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
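A common development setup is a never-expiring local cache, so repeated runs replay responses from disk instead of hitting the site again; a sketch:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # created under the project's .scrapy/ directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # don't cache server errors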