1. Sending a POST request with requests
response = requests.post(
    url,
    data=data,  # dict of form fields sent as the request body
)
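The POST call above can be inspected without touching the network. The sketch below is only an illustration: the URL and the form fields are made-up placeholders. It builds the request with requests.Request and looks at the encoded body instead of sending it.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://example.com/login"
data = {"username": "alice", "page": "1"}

# Build the request without sending it, so the encoded body can be inspected.
prepared = requests.Request("POST", url, data=data).prepare()

print(prepared.method)                    # POST
print(prepared.headers["Content-Type"])   # application/x-www-form-urlencoded
print(prepared.body)                      # username=alice&page=1

# Actually sending it would be: response = requests.post(url, data=data)
```

A dict passed as data is form-encoded; the server receives it as an ordinary HTML form submission.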
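2. Proxies
Difference between forward and reverse proxies
Reverse proxy: seen from the client side, a proxy that stands in for the server.
Forward proxy: seen from the client side, a proxy that stands in for the client.
Forward proxy: the browser knows the server's real address, e.g. a VPN.
Reverse proxy: the browser does not know the server's real address, e.g. nginx.
Usage:
requests.get(url, proxies=proxies)
Format of proxies: a dictionary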
proxies = {
    "http": "http://xxx.xx.xxx.xx",
    "https": "https://xxx.xx.xxx.xx",
}
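As a rough sketch of how such a dictionary might be assembled, the helper below builds the mapping from a host and a port. build_proxies is a hypothetical name, not part of requests, and the address is a reserved documentation IP rather than a working proxy.

```python
def build_proxies(host: str, port: int) -> dict:
    """Return a requests-style proxies mapping for an HTTP proxy server."""
    proxy_url = f"http://{host}:{port}"
    # Both plain and TLS traffic are routed through the same proxy here;
    # the key is the scheme of the target URL, the value is the proxy address.
    return {"http": proxy_url, "https": proxy_url}


# 203.0.113.10 is a reserved documentation address, not a working proxy.
proxies = build_proxies("203.0.113.10", 8080)
print(proxies)
# The mapping would then be passed as: requests.get(url, proxies=proxies)
```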
Types of proxies
- Transparent proxy
- Anonymous proxy
- High-anonymity (elite) proxy
Protocols used for the request
- HTTP
- HTTPS
- socket (SOCKS) proxy
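Using cookies in a crawler
- Pros:
  With cookies, pages behind a login can be accessed
  Helps get past some anti-scraping measures
- Cons:
  One set of cookies corresponds to only one user, so requests must not be sent too frequently
How requests handles cookies
- Put the cookie string in headers
  requests.get(url, headers={"User-Agent": ua, "Cookie": cookie_string})
- Pass a cookie dict to the cookies parameter of the request method
  Build the cookie dict
  cookie: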
cookies_set = 'has_recent_activity=1; _octo=GH1.1.989874598.1564974045; _ga=GA1.2.503644495.1564974080; _gat=1; tz=Asia%2FShanghai; _device_id=40315822662c352b84396c5c807a7128; user_session=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; __Host-user_session_same_site=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; logged_in=yes; dotcom_user=changanbaimao; _gh_sess=NDdmeEcrVlEyQVdyRnFhOUtRWEJ5S2NJRjN6SDAvRkF0b09iK3I0YVhlMHJXVnZNVXdTQnJyRy9IOXV2RkpWWkVIRDBHOU5aRS84YmUwMGdRZzBzYlpwM3QveHBqczVBUXJoRFA1N0VFK1MyNTdvVU05WlBheTdkTlhRRm5nVG1wSkNza3ZQTjg3STZFMWJoanpyVld3UzB1SW83d3N0MDBZVForcFhPV0RGREs1SWFWNEh3N0VyL0ZZdlBJK242RDF3dGR6aXRiY2J3MXUwaE9xVUg1dVJ2RzIvdEpoT2grLzZteXRNMTFBSjVTWmlMU3JSTjhqZHhZdmZKdDZHYnM4ZWx1Qk8yNU5wcytzREs0ZTNTY0E9PS0tS1FPMUhzVC9GaURZT3NDQmZVd0FLUT09--b9eefb15a3ea8dc350874018950afc9b0aa64287'
Dict comprehension:
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[1] for cookie in cookies_set.split('; ')}
How the comprehension works:
1. Split cookies_set on '; ' and loop over the pieces.
2. Build a dictionary of key-value pairs.
3. Split each cookie in the loop on its first '='; index 0 is the key and index 1 is the value.
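The three steps above can be written out as an ordinary loop, which may be easier to follow than the one-liner. This is a minimal sketch: parse_cookie_string is a hypothetical helper name and the cookie string is a shortened, made-up sample.

```python
def parse_cookie_string(cookie_string: str) -> dict:
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in cookie_string.split("; "):
        # Splitting only once keeps any '=' that appears inside the value.
        key, value = pair.split("=", 1)
        cookies[key] = value
    return cookies


sample = "logged_in=yes; tz=Asia%2FShanghai; token=abc=="
print(parse_cookie_string(sample))
# {'logged_in': 'yes', 'tz': 'Asia%2FShanghai', 'token': 'abc=='}
```

Splitting on every '=' and taking index 1 would cut the last value down to 'abc', which is why the split is limited to the first '='.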
requests.get(
    url,
    headers=headers,
    cookies=cookies_dict
)
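Both ways of attaching cookies end up as the same Cookie header on the outgoing request. The sketch below compares them offline by preparing the requests rather than sending them; the URL and cookie values are placeholders.

```python
import requests

url = "https://example.com/profile"  # hypothetical URL
cookies_dict = {"logged_in": "yes", "tz": "Asia%2FShanghai"}

# Way 1: the raw cookie string goes into the headers.
via_header = requests.Request(
    "GET", url, headers={"Cookie": "logged_in=yes; tz=Asia%2FShanghai"}
).prepare()

# Way 2: the dict goes into the cookies parameter.
via_param = requests.Request("GET", url, cookies=cookies_dict).prepare()

print(via_header.headers["Cookie"])
print(via_param.headers["Cookie"])
```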
3. Using the session object provided by requests
- A session must be instantiated first
Logging in to GitHub with a session
import requests
import re
# Instantiate a session
session = requests.session()
headers = {
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
# 'Cookie': 'has_recent_activity=1; _octo=GH1.1.989874598.1564974045; _ga=GA1.2.503644495.1564974080; _gat=1; tz=Asia%2FShanghai; _device_id=40315822662c352b84396c5c807a7128; user_session=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; __Host-user_session_same_site=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; logged_in=yes; dotcom_user=changanbaimao; _gh_sess=NDdmeEcrVlEyQVdyRnFhOUtRWEJ5S2NJRjN6SDAvRkF0b09iK3I0YVhlMHJXVnZNVXdTQnJyRy9IOXV2RkpWWkVIRDBHOU5aRS84YmUwMGdRZzBzYlpwM3QveHBqczVBUXJoRFA1N0VFK1MyNTdvVU05WlBheTdkTlhRRm5nVG1wSkNza3ZQTjg3STZFMWJoanpyVld3UzB1SW83d3N0MDBZVForcFhPV0RGREs1SWFWNEh3N0VyL0ZZdlBJK242RDF3dGR6aXRiY2J3MXUwaE9xVUg1dVJ2RzIvdEpoT2grLzZteXRNMTFBSjVTWmlMU3JSTjhqZHhZdmZKdDZHYnM4ZWx1Qk8yNU5wcytzREs0ZTNTY0E9PS0tS1FPMUhzVC9GaURZT3NDQmZVd0FLUT09--b9eefb15a3ea8dc350874018950afc9b0aa64287'
}
# 1. Fetch the login page
url = 'https://github.com/login'
res = session.get(url, headers=headers)
authenticity_token = re.search(r'name="authenticity_token" value="(.*?)" />', res.text).group(1)
print(authenticity_token)
# 2. Send the POST request
url = 'https://github.com/session'
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'xxxxx',  # your username
    'password': 'xxxxxx',  # your password
}
# The credentials could also be read into variables with input()
session.post(url, data=data, headers=headers)
# 3. Fetch the page that confirms the login worked
url = 'https://github.com/changanbaimao'
res = session.get(url, headers=headers)
print(res.content.decode())
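The token extraction in step 1 can be tried offline. In the sketch below the HTML fragment is invented to stand in for res.text, so the real login page may well look different; the regular expression is the one used in the script.

```python
import re

# Made-up HTML fragment standing in for res.text; the real page may differ.
sample_html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123==" />'
    '</form>'
)

match = re.search(r'name="authenticity_token" value="(.*?)" />', sample_html)
# re.search returns None when nothing matches, so check before calling group().
authenticity_token = match.group(1) if match else None
print(authenticity_token)  # abc123==
```

Checking for None first avoids an AttributeError if the page layout changes and the pattern no longer matches.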
session = requests.session()
response = session.get(url, headers=headers)
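The point of a session is that cookies and headers stored on it are attached to every later request. A small offline sketch of that behaviour follows; the User-Agent string, the cookie, and the URL are all placeholders, and the cookie is set by hand to imitate what a login response would do.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "example-crawler/0.1"})  # hypothetical UA

# Simulate a cookie that a login response would normally set.
session.cookies.set("logged_in", "yes")

# prepare_request merges session-level headers and cookies into the request.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```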
Other features of the requests module
- Converting between a cookie dict and a CookieJar
  requests.utils.dict_from_cookiejar(cj) -> dict
  requests.utils.cookiejar_from_dict(cd) -> cookiejar
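A short round trip shows the two conversion helpers working together. The cookie values here are sample data only.

```python
import requests

cookies_dict = {"logged_in": "yes", "tz": "Asia%2FShanghai"}

jar = requests.utils.cookiejar_from_dict(cookies_dict)   # dict -> CookieJar
back = requests.utils.dict_from_cookiejar(jar)           # CookieJar -> dict

print(type(jar).__name__)
print(back == cookies_dict)
```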
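- Sites whose HTTPS certificate cannot be verified raise an exception
  requests.packages.urllib3.disable_warnings()  # hides the security warning; turning warnings off is not recommended
  requests.get(url, verify=False)  # skips certificate verification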
- Timeout parameter (timeout)
  response = requests.get(url, timeout=3)  # wait at most 3 seconds for the server to respond; if nothing comes back, an exception is raised
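One possible way to handle the timeout exception is sketched below. fetch_text and always_times_out are hypothetical names; the get parameter exists only so the function can be tried without a network, and is not something requests itself requires.

```python
import requests


def fetch_text(url, timeout=3, get=requests.get):
    """Return the response body, or None if the request times out.

    `get` is injectable so the function can be exercised without a network.
    """
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return None
    return response.text


# Stand-in for requests.get that always times out (illustration only).
def always_times_out(url, timeout):
    raise requests.exceptions.Timeout(f"no response within {timeout}s")


print(fetch_text("https://example.com", get=always_times_out))  # None
```

In real use the get argument would simply be left at its default.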
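- retry decorator
  If the decorated function raises an exception, it is executed again,
  up to the maximum number of attempts given by the parameter; after that the exception is raised.
  from retrying import retry
  @retry(stop_max_attempt_number=3)
  def func(): pass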
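To make the behaviour concrete, here is a hand-written decorator that imitates the idea using only the standard library. It is a simplified sketch, not the retrying package's implementation, and max_attempts is a name chosen for this example.

```python
import functools


def retry(max_attempts=3):
    """Re-run the decorated function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the last exception escape
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3)
def flaky():
    """Fail on the first two calls, succeed on the third."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky(), len(calls))  # ok 3
```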