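First of all, you should have a reasonably clear picture of the Scrapy project directory structure and have built at least one demo project.
I. Manually updating the IP pool
1. Add the IP pool to the settings configuration file: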
IPPOOL = [
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "120.204.85.29:3128"},
    {"ipaddr": "219.228.126.86:8123"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "218.82.33.225:53853"},
    {"ipaddr": "223.167.190.17:42789"}
]
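Note that the pool lists 61.152.81.193:9100 twice, so a uniform random pick selects that proxy twice as often as the others. If that is not intended, a minimal sketch of removing duplicates while keeping the original order might look like this. The helper name `dedupe_pool` is hypothetical and is not part of Scrapy, and the pool is shortened for the example.

```python
def dedupe_pool(pool):
    """Return pool entries with duplicate "ipaddr" values removed, order kept."""
    seen = set()
    unique = []
    for entry in pool:
        addr = entry["ipaddr"]
        if addr not in seen:
            seen.add(addr)
            unique.append(entry)
    return unique


# Shortened pool for illustration; addresses are taken from the list above.
IPPOOL = dedupe_pool([
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "61.152.81.193:9100"},  # duplicate entry, dropped by dedupe_pool
])
```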
2. Modify the middleware file middlewares.py
import random
from scrapy import signals
from myproxies.settings import IPPOOL


class MyproxiesSpiderMiddleware(object):
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # Pick a random proxy from the pool for every outgoing request.
        thisip = random.choice(IPPOOL)
        print("this is ip:" + thisip["ipaddr"])
        # HttpProxyMiddleware reads the proxy URL from request.meta["proxy"].
        request.meta["proxy"] = "http://" + thisip["ipaddr"]
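The selection logic in process_request can be tried outside a running crawl. The sketch below is only an illustration under stated assumptions: `FakeRequest` and `assign_proxy` are hypothetical names, `FakeRequest` stands in for a Scrapy request and carries nothing but a `meta` dict, and the pool is a shortened copy of the one in settings.

```python
import random

# Shortened stand-in for the IPPOOL setting; addresses copied from settings.
IPPOOL = [
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "120.204.85.29:3128"},
]


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request; it only carries a meta dict."""

    def __init__(self):
        self.meta = {}


def assign_proxy(request, pool):
    """Pick one pool entry at random and store it as the request's proxy URL."""
    chosen = random.choice(pool)
    request.meta["proxy"] = "http://" + chosen["ipaddr"]
    return chosen


request = FakeRequest()
chosen = assign_proxy(request, IPPOOL)
print(request.meta["proxy"])
```

Because the choice is random, the printed address varies between runs, but it is always one of the pool entries prefixed with http://.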
3. Set DOWNLOADER_MIDDLEWARES in settings
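DOWNLOADER_MIDDLEWARES = {
    # 'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    # Current Scrapy releases no longer ship the old scrapy.contrib.downloadermiddleware path.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    # Lower number runs first, so the proxy is set before HttpProxyMiddleware sees the request.
    'myproxies.middlewares.MyproxiesSpiderMiddleware': 125,
}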
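The numbers in DOWNLOADER_MIDDLEWARES decide the order in which Scrapy calls each middleware's process_request: lower values are called first. The sketch below only illustrates that ordering with the two priorities configured above. It is not how Scrapy is implemented internally, and the class names are shortened for readability.

```python
# Priorities as configured above; module paths omitted for brevity.
middlewares = {
    "HttpProxyMiddleware": 543,
    "MyproxiesSpiderMiddleware": 125,
}

# process_request hooks run in ascending priority order.
request_order = sorted(middlewares, key=middlewares.get)
print(request_order)
```

With these values the custom middleware runs first, so request.meta["proxy"] is already set when HttpProxyMiddleware processes the request.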
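This article describes how to manually update an IP pool in the Scrapy framework: adding the IP pool to the settings configuration file, modifying middlewares.py to pick an IP at random, and setting the DOWNLOADER_MIDDLEWARES parameter. Following these steps can help improve a crawler's stability and efficiency.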