requests (an HTTP module) is used to send requests and receive responses. There are several alternative modules, such as urllib, but requests is the one used most in practice: its code is concise and easy to read, and compared with the bloated urllib module, a crawler written with requests needs less code to implement the same feature.
Official documentation for the requests module:
https://requests.readthedocs.io/projects/cn/zh_CN/latest/
What the requests module does:
Sends HTTP requests and fetches the response data.
#import requests
#call the get method to send a request to the target url
import requests
url="https://www.baidu.com"
response=requests.get(url)
#print the page source as str-typed data
print(response.text)
The response object
import requests
url="https://www.baidu.com"
response=requests.get(url)
#set the encoding manually
response.encoding='utf-8'#print(response.encoding)
print(response.text)#str type
#response.content holds the response body as bytes, so it can be decoded
print(response.content.decode())#bytes decoded to str, fixes garbled Chinese
#common response object attributes and methods
#the url of the response
print(response.url)
#the status code
print(response.status_code)
#the request headers that produced this response
print(response.request.headers)
#the headers of the response
print(response.headers)
#the cookies set by the response
print(response.cookies)#<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
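One more method worth knowing: response.json() parses a JSON response body straight into Python objects, and response.apparent_encoding can guess the charset when response.text comes out garbled. A minimal sketch, using httpbin.org/get purely as an illustrative JSON endpoint:
import requests
#httpbin.org/get echoes the request back as JSON (illustrative endpoint)
response=requests.get("https://httpbin.org/get")
data=response.json()#parse the JSON body into a dict
print(data["url"])
#if response.text is garbled, let requests guess the charset:
#response.encoding=response.apparent_encoding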
1. Sending a request with headers
Build a headers dictionary and pass it to the headers parameter.
import requests
url="https://www.baidu.com"
response=requests.get(url)
print(len(response.content.decode()))
print(response.content.decode())
#build a headers dictionary to disguise the request as a browser
headers={
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
#send the request with the headers attached
response1=requests.get(url,headers=headers)
print(len(response1.content.decode()))
print(response1.content.decode())
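The length difference above comes from the User-Agent: without an explicit one, requests announces itself in the request headers, and many sites serve a stripped-down page in response. A quick way to see what was actually sent:
import requests
response=requests.get("https://www.baidu.com")
#without an explicit User-Agent, requests identifies itself
print(response.request.headers["User-Agent"])#e.g. python-requests/2.31.0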
2. Sending a request with parameters
Carry the parameters directly in the URL
import requests
url="https://www.baidu.com/s?wd=python"
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
response=requests.get(url,headers=headers)
with open("baidu.html",'wb') as f:
    f.write(response.content)
print(response.content.decode())
Using params
Build a parameter dictionary
Pass the dictionary to the params argument when sending the request
import requests
url="https://www.baidu.com/s?"
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
#build the parameter dictionary
data={
    "wd":"python"
}
response=requests.get(url,headers=headers,params=data)
print(response.url)
with open("baidu1.html",'wb') as f:
    f.write(response.content)
print(response.content.decode())
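One practical reason to prefer params over a hand-built URL: the values are URL-encoded automatically, which matters as soon as a query contains non-ASCII characters. A small sketch (the keyword is an arbitrary example):
import requests
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
#non-ASCII values are percent-encoded for us
data={"wd":"爬虫"}
response=requests.get("https://www.baidu.com/s?",headers=headers,params=data)
print(response.url)#...s?wd=%E7%88%AC%E8%99%AB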
3. Carrying cookies in headers
Copy the user-agent and cookie from the browser
The header fields and values in the headers parameter must match the ones the browser sends
The value of the Cookie key in the headers dictionary is a single string
import requests
url="https://www.baidu.com"
#build the request headers
headers={
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
"Cookie":"BIDUPSID=036366C70551E926EF5FF474405A7FD6; PSTM=1688460862; BAIDUID=036366C70551E9268DCB3970471BBAF3:FG=1; BAIDUID_BFESS=036366C70551E9268DCB3970471BBAF3:FG=1; ZFY=mLraZDju05ubOZe2LCCRdpIq7B172CwZVAYF:BmV0cAU:C; newlogin=1; BDUSS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BDUSS_BFESS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BD_HOME=1; BD_UPN=12314753; BA_HECTOR=2k81a4a10h24a18l0k0ka08m1ibf59m1o; BD_CK_SAM=1; PSINO=1; delPer=0; H_PS_PSSID=36552_38642_39026_39022_38942_38955_39037_38809_38990_39085_26350_39041_39100_39044; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[feWj1Vr5u3D]=mk3SLVN4HKm; B64_BOT=1; sugstore=1; COOKIE_SESSION=73_2_4_6_1_9_1_0_3_5_4_1_0_7_0_0_1688461506_1688461500_1689756532%7C6%230_2_1688461487%7C1"
}
response=requests.get(url,headers=headers)
with open("with_cookie","wb") as f:
    f.write(response.content)
4. Using the cookies parameter to maintain a session
The cookies parameter takes the form of a dictionary
Build a cookies dictionary
Pass it to the cookies parameter when sending the request
import requests
url="https://www.baidu.com"
#build the request headers
headers={
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
}
#build the cookies dictionary
temp="BIDUPSID=036366C70551E926EF5FF474405A7FD6; PSTM=1688460862; BAIDUID=036366C70551E9268DCB3970471BBAF3:FG=1; BAIDUID_BFESS=036366C70551E9268DCB3970471BBAF3:FG=1; ZFY=mLraZDju05ubOZe2LCCRdpIq7B172CwZVAYF:BmV0cAU:C; newlogin=1; BDUSS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BDUSS_BFESS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BD_HOME=1; BD_UPN=12314753; BA_HECTOR=2k81a4a10h24a18l0k0ka08m1ibf59m1o; BD_CK_SAM=1; PSINO=1; delPer=0; H_PS_PSSID=36552_38642_39026_39022_38942_38955_39037_38809_38990_39085_26350_39041_39100_39044; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[feWj1Vr5u3D]=mk3SLVN4HKm; B64_BOT=1; sugstore=1; COOKIE_SESSION=73_2_4_6_1_9_1_0_3_5_4_1_0_7_0_0_1688461506_1688461500_1689756532%7C6%230_2_1688461487%7C1"
cookie_list=temp.split("; ")#browser cookie strings are "; "-separated
cookies={}
for cookie in cookie_list:
    #split on the first "=" only, since cookie values may themselves contain "="
    key,value=cookie.split("=",1)
    cookies[key]=value
#cookies={cookie.split("=",1)[0]:cookie.split("=",1)[1] for cookie in cookie_list}
print(cookies)
response=requests.get(url,headers=headers,cookies=cookies)
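Instead of copying cookies by hand, requests also offers a Session object that stores every cookie a server sets and sends it back on later requests automatically, which is usually the simpler way to maintain a session. A minimal sketch:
import requests
session=requests.Session()
#cookies set by any response are kept on the session automatically
session.get("https://www.baidu.com")
print(session.cookies)#RequestsCookieJar populated by the server
#later requests through the same session carry those cookies along
response=session.get("https://www.baidu.com")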
5. Converting a cookieJar object into a cookies dictionary
import requests
url="https://www.baidu.com"
response=requests.get(url)
print(response.cookies)#<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
dict_cookies=requests.utils.dict_from_cookiejar(response.cookies)
print(dict_cookies)#{'BDORZ': '27315'}
jar_cookies=requests.utils.cookiejar_from_dict(dict_cookies)
print(jar_cookies)#<RequestsCookieJar[<Cookie BDORZ=27315 for />]>
6. Using the timeout parameter
import requests
url="https://twitter.com"
response=requests.get(url,timeout=3) #raise an exception if no response arrives within 3 seconds
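When the deadline passes without a response, requests raises requests.exceptions.Timeout rather than returning, so the call is usually wrapped in a try/except:
import requests
url="https://twitter.com"
try:
    response=requests.get(url,timeout=3)
    print(response.status_code)
except requests.exceptions.Timeout:
    #no response arrived within 3 seconds
    print("request timed out")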
7. Using proxies
A proxy IP is the address of a proxy server, which forwards our requests to the target server.
Forward proxy: the browser knows the real IP of the server that ultimately handles the request, e.g. a VPN.
Reverse proxy: the browser does not know the server's real address, e.g. nginx.
Proxy IPs classified by anonymity:
1. Transparent Proxy: the request is forwarded but your real IP still leaks, so the target server can tell exactly who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=Proxy IP, HTTP_X_FORWARDED_FOR=Your IP
2. Anonymous Proxy: the target server can tell that a proxy is being used, but cannot tell who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=Proxy IP, HTTP_X_FORWARDED_FOR=Proxy IP
3. Elite Proxy (High Anonymity Proxy): the target server can tell neither that a proxy is in use nor who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=not determined, HTTP_X_FORWARDED_FOR=not determined
Proxy IPs classified by protocol: http, https, socks.
Using the proxies parameter
proxies takes the form of a dictionary
import requests
url="https://www.baidu.com"
proxies={
    "http":"http://106.14.5.129:80",#proxy used for plain-http targets
    "https":"https://ip:port"#proxy used for https targets (placeholder)
}
response=requests.get(url,proxies=proxies)
print(response.text)
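To confirm that traffic really goes through the proxy, a common check is to ask an IP-echo service which address it saw; httpbin.org/ip is used below purely as an illustration, and free proxies like the one above are often already dead, hence the try/except:
import requests
proxies={
    "http":"http://106.14.5.129:80"#example address from above; may well be offline
}
try:
    #httpbin.org/ip returns the IP address the server saw, as JSON
    response=requests.get("http://httpbin.org/ip",proxies=proxies,timeout=5)
    print(response.json()["origin"])#should print the proxy IP, not yours
except requests.exceptions.RequestException as e:
    print("proxy failed:",e)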
8. Using the verify parameter to skip CA certificate verification
import requests
url="https://sam.huat.edu.cn:8433/selfservice/"
response=requests.get(url,verify=False)
print(response.content)
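verify=False makes urllib3 emit an InsecureRequestWarning on every request; it can be silenced explicitly if the risk is understood:
import requests
import urllib3
#suppress the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
url="https://sam.huat.edu.cn:8433/selfservice/"
response=requests.get(url,verify=False)
print(response.status_code)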