Even when a crawler pauses with time.sleep(), many sites will still block it, so proxies are needed. Xicidaili (西刺代理) is one of the most commonly recommended sources. This post writes a simple crawler that scrapes IP:port pairs from Xicidaili's domestic high-anonymity proxy list. With those addresses you can build a proxy pool inside your crawler and keep rotating through different proxies when sending requests, which helps avoid getting blocked. The code is as follows:
from bs4 import BeautifulSoup         # parse the HTML
from fake_useragent import UserAgent  # generate a random User-Agent
import urllib.request, csv

# Fetch proxies from Xicidaili and build the proxy pool by parsing the page
def get_proxy():
    # Request a page with a random User-Agent header
    def header(website):
        ua = UserAgent()
        headers = ("User-Agent", ua.random)
        opener = urllib.request.build_opener()
        opener.addheaders = [headers]
        req = opener.open(website).read()
        return req

    # Download and parse the page
    proxy_api = 'http://www.xicidaili.com/nn'
    data = header(proxy_api).decode('utf-8')
    data_soup = BeautifulSoup(data, 'lxml')
    # The listing table (id "ip_list") alternates rows with class "odd" and rows
    # with no class, so take every row of the table and skip the header row
    data_rows = data_soup.select('#ip_list tr')[1:]

    # Parse the rows into an ip pool (about 100 entries per page)
    ip, port = [], []
    for row in data_rows:
        data_temp = row.get_text().strip().split('\n')
        while '' in data_temp:
            data_temp.remove('')
        ip.append(data_temp[0])
        port.append(data_temp[1])

    if len(ip) == len(port):
        proxy = [':'.join((ip[i], port[i])) for i in range(len(ip))]
        # print('Fetched proxy ip and port successfully!')
        return proxy
    else:
        print('The ip list and the port list have different lengths!')

proxy = get_proxy()
with open('proxy.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.writer(f)
    w.writerow(['proxy address:port'])
with open('proxy.csv', 'a+', encoding='utf-8-sig') as f:
    w = csv.writer(f, lineterminator='\n')
    for i in range(0, len(proxy)):
        try:
            w.writerow([proxy[i]])
        except Exception as e:
            print('row ', i, proxy[i], 'error: ', e, '\n')
print("Proxy list written to file!")
The results are written to a CSV file. Written as plain utf-8 the file shows up garbled when opened on Windows, so utf-8-sig is used instead and displays correctly.
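The difference is that utf-8-sig prepends a UTF-8 byte order mark (BOM) to the file, which is what lets Excel on Windows detect the encoding. A minimal sketch that shows the BOM (the bom_demo.csv file name is just for illustration):

import codecs

# Write one line with the utf-8-sig codec, then inspect the first raw bytes
with open('bom_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('proxy address:port\n')

with open('bom_demo.csv', 'rb') as f:
    head = f.read(3)
print(head)                     # b'\xef\xbb\xbf' -- the BOM
print(head == codecs.BOM_UTF8)  # True

With utf-8-sig in place, the contents of proxy.csv look like this: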
proxy address:port
110.73.42.240:8123
118.190.95.35:9001
118.190.95.26:9001
118.190.95.43:9001
122.114.31.177:808
118.114.77.47:8080
221.10.159.234:1337
112.87.103.61:8118
183.159.95.200:41373
110.73.6.50:8123
101.236.60.52:8866
114.231.65.99:18118
114.231.69.209:18118
180.118.243.59:61234
101.236.23.202:8866
114.246.244.131:8118
111.155.116.217:8123
61.135.155.82:443
101.236.60.48:8866
111.155.116.249:8123
117.86.16.50:18118
101.236.21.22:8866
122.246.49.130:8010
111.155.116.234:8123
117.86.9.188:18118
182.114.129.61:37152
106.56.102.20:8070
221.227.251.117:18118
125.120.201.22:6666
121.31.101.113:8123
180.212.26.202:8118
180.118.242.119:808
111.155.116.211:8123
59.62.165.95:53128
115.204.30.147:6666
180.125.137.37:8000

#### Check whether the proxy addresses are usable
import telnetlib

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'
}
print("{} proxy addresses in total".format(len(proxy)))
fp = open('proxy.csv', 'w+', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(['proxy address'])   # pass a list, otherwise every character becomes its own column
for i in range(len(proxy)):
    ip, port = proxy[i].split(':')
    try:
        # Telnet succeeds only if the proxy port accepts a connection
        telnetlib.Telnet(ip, port=int(port), timeout=20)
    except Exception:
        print('{}:{} check failed'.format(ip, port))
    else:
        print('{}:{} check passed'.format(ip, port))
        writer.writerow([proxy[i]])
print("Check finished")
fp.close()

## Load the usable addresses
csv_file = csv.reader(open('proxy.csv', 'r', encoding='utf-8-sig'))
print("The usable proxy IPs are:")
for ip in csv_file:
    print(ip)
.......
Of course, the proxy addresses scraped above are not guaranteed to work, which is why the telnet check above probes each one and stores only the addresses that respond.
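A telnet connection only shows that the proxy port is open; it does not prove that the proxy will actually forward HTTP traffic. A stricter (but slower) check routes a real request through each candidate. A minimal sketch of that idea, assuming the requests library is installed and using http://httpbin.org/ip purely as an illustrative test URL:

import requests

def check_proxy(proxy_addr, timeout=10):
    # Return True only if a request routed through proxy_addr comes back with HTTP 200
    proxies = {'http': 'http://' + proxy_addr}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

usable = [p for p in proxy if check_proxy(p)]
print('{} of {} proxies passed the HTTP check'.format(len(usable), len(proxy)))

Either approach works; telnet is faster, while the HTTP check is closer to how the proxy will actually be used.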
Once a pool of usable proxy addresses is available, you can send requests through a proxy. Example code:
# url is the page to crawl; proxy_addr is an address drawn at random from the pool.
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data
Then call use_proxy with a randomly chosen proxy:
import random

for i in range(len(url)):
    # random.choice avoids indexing past the end of the pool
    data = use_proxy(url[i], random.choice(proxy))
    .......
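Since any single proxy can still fail at any moment, it helps to wrap the call in a small retry loop: if one proxy is rejected or times out, pick another and try again. A minimal sketch of that idea, building on use_proxy above (the fetch_with_retry name and the retry count are just illustrative):

import random
import urllib.error

def fetch_with_retry(url, proxy_pool, max_tries=5):
    # Try up to max_tries random proxies before giving up on this url
    for _ in range(max_tries):
        proxy_addr = random.choice(proxy_pool)
        try:
            return use_proxy(url, proxy_addr)
        except (urllib.error.URLError, OSError) as e:
            print('proxy {} failed ({}), retrying with another one'.format(proxy_addr, e))
    raise RuntimeError('all {} attempts through the proxy pool failed for {}'.format(max_tries, url))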