Why write an article that barely counts as technical? Because not long ago I failed to use proxy IPs properly and wasted a fair bit of money, and only recently did I work out how to use proxies correctly with Python 3.
First, have a look at the code I wrote back then:
# encoding: utf-8
import requests
import sys
import io

# Re-wrap stdout so output prints cleanly in a GB18030 (Chinese Windows) console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder for the real target site

# Only an "http" entry: HTTPS requests will NOT be routed through this proxy.
proxie = {"http": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}
res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
Written this way, the proxy is never actually used: the page still loads, but the request goes out from my own IP rather than the proxy's. The reason is that requests picks a proxy by matching the target URL's scheme against the keys of the proxies dict; the site was HTTPS, and the dict only had an "http" key, so nothing matched. Crawling from your own IP like this gets it banned very quickly, which is why that project took me 10 days to finish, wasting both time and money.
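A minimal sketch of that scheme matching, assuming httpbin.org is reachable (its /ip endpoint echoes back whatever IP it sees) and reusing the hypothetical proxy address from above:

import requests

# Hypothetical proxy: only plain-HTTP traffic is routed through it.
proxie = {"http": "http://140.143.156.166:1080"}

# Matches the "http" key, so this request goes through the proxy.
print(requests.get("http://httpbin.org/ip", proxies=proxie).text)

# "https" matches no key, so this request reveals your real IP.
print(requests.get("https://httpbin.org/ip", proxies=proxie).text)

After digging through a lot of material online, the code turned into this: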
# encoding: utf-8
import requests
import sys
import io

# Re-wrap stdout so output prints cleanly in a GB18030 (Chinese Windows) console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder for the real target site

# Both schemes now point at the proxy, so HTTPS traffic is routed through it too.
proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}
res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
This time an https proxy entry was added, but the remote connection kept being refused.
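Before reaching for the fix, it helps to see what the failure actually looks like; a small sketch (same hypothetical proxy) that surfaces the underlying exception instead of dying with a bare traceback:

import requests

try:
    res = requests.get("https://www.xxx.com/",
                       proxies={"https": "140.143.156.166:1080"},
                       timeout=10)
except requests.exceptions.RequestException as exc:
    # With a misbehaving proxy this is typically a ProxyError or SSLError.
    print(type(exc).__name__, exc)

Which led to the final code below that uses proxies correctly: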
# encoding: utf-8
import requests
import sys
import io

# Re-wrap stdout so output prints cleanly in a GB18030 (Chinese Windows) console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder for the real target site

proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}
# verify=False skips SSL certificate verification (see the note below).
res = requests.get(url, verify=False, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
Adding verify=False, which skips SSL certificate verification, made it work. From this point on, the crawler really is reaching the site through the proxy IP!!!
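One side effect worth knowing: with verify=False, requests emits an InsecureRequestWarning on every call (and skipping certificate checks does weaken security, so treat this as a workaround rather than a default). A sketch for silencing the noise via urllib3:

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)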
Time to set off on a happy crawling journey~~~