day01 - Crawler Basics - Sending GET Requests

The requests module is an HTTP library used to send requests and receive responses. It has several alternatives, such as the standard-library urllib module, but requests is by far the most common choice in practice: its API is concise and easy to read, so compared with the bulkier urllib, a crawler written with requests needs less code to accomplish the same task.

Official requests documentation:

https://requests.readthedocs.io/projects/cn/zh_CN/latest/

What the requests module does:

Send HTTP requests and get the response data.

# Import requests
# Call the get method to send a request to the target URL
import requests

url = "https://www.baidu.com"
response = requests.get(url)
# Print the page source as a str
print(response.text)

The response object

import requests

url = "https://www.baidu.com"

response = requests.get(url)
# Set the encoding manually
response.encoding = 'utf-8'  # print(response.encoding)
print(response.text)  # str
# response.content holds the raw response body as bytes and can be decoded
print(response.content.decode())  # bytes; decoding fixes garbled Chinese text
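If you don't want to hard-code the encoding, requests can also guess it from the response body. A minimal sketch using apparent_encoding (requests' charset detection, which can occasionally guess wrong):

import requests

url = "https://www.baidu.com"
response = requests.get(url)
# Let requests detect the charset from the body instead of the headers
response.encoding = response.apparent_encoding
print(response.text)  # usually renders Chinese correctly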

# Common response-object attributes and methods
# The response URL
print(response.url)

# The status code
print(response.status_code)

# The request headers that produced this response
print(response.request.headers)
# The response headers
print(response.headers)

# The cookies set by the response
print(response.cookies)  # <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
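For APIs that return JSON, the response object also offers a json() method that parses the body into Python objects. A small sketch, using the public echo service httpbin.org as an assumed test endpoint:

import requests

# httpbin.org/get echoes the request back as JSON (assumed reachable for testing)
response = requests.get("https://httpbin.org/get")
data = response.json()  # parse the JSON body into a dict
print(data["url"])      # 'https://httpbin.org/get'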

1. Sending a request with request headers

Build a request-header dictionary and pass it via the headers parameter.

import requests

url = "https://www.baidu.com"
response = requests.get(url)
print(len(response.content.decode()))
print(response.content.decode())
# Build a request-header dictionary to disguise the request as a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Send the request with the headers attached
response1 = requests.get(url, headers=headers)
print(len(response1.content.decode()))
print(response1.content.decode())
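The two len() calls make the effect visible: with the default python-requests User-Agent, Baidu typically returns a much shorter, stripped-down page than it does for a browser-like User-Agent, so the second response is noticeably longer.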

2. Sending a request with parameters

Carry the parameters directly in the URL

import requests

url = "https://www.baidu.com/s?wd=python"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
with open("baidu.html", 'wb') as f:
    f.write(response.content)
print(response.content.decode())

Using params

Build a parameter dictionary

Pass the dictionary via the params argument when sending the request

import requests

url = "https://www.baidu.com/s?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Build the parameter dictionary
data = {
    "wd": "python"
}

response = requests.get(url, headers=headers, params=data)
print(response.url)
with open("baidu1.html", 'wb') as f:
    f.write(response.content)
print(response.content.decode())
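A side benefit of params is that requests URL-encodes the values for you, which matters for non-ASCII keywords. A small sketch (the search term is just an example):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Non-ASCII parameter values are percent-encoded automatically
response = requests.get("https://www.baidu.com/s",
                        headers=headers,
                        params={"wd": "爬虫"})
print(response.url)  # .../s?wd=%E7%88%AC%E8%99%AB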

3. Carrying cookies in the headers

Copy the User-Agent and Cookie from the browser.

The header field names and values in the headers parameter must match what the browser sends.

The value of the Cookie key in the headers dictionary is a string.

import requests

url = "https://www.baidu.com"
# Build the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Cookie": "BIDUPSID=036366C70551E926EF5FF474405A7FD6; PSTM=1688460862; BAIDUID=036366C70551E9268DCB3970471BBAF3:FG=1; BAIDUID_BFESS=036366C70551E9268DCB3970471BBAF3:FG=1; ZFY=mLraZDju05ubOZe2LCCRdpIq7B172CwZVAYF:BmV0cAU:C; newlogin=1; BDUSS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BDUSS_BFESS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BD_HOME=1; BD_UPN=12314753; BA_HECTOR=2k81a4a10h24a18l0k0ka08m1ibf59m1o; BD_CK_SAM=1; PSINO=1; delPer=0; H_PS_PSSID=36552_38642_39026_39022_38942_38955_39037_38809_38990_39085_26350_39041_39100_39044; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[feWj1Vr5u3D]=mk3SLVN4HKm; B64_BOT=1; sugstore=1; COOKIE_SESSION=73_2_4_6_1_9_1_0_3_5_4_1_0_7_0_0_1688461506_1688461500_1689756532%7C6%230_2_1688461487%7C1"
}
response = requests.get(url, headers=headers)
with open("with_cookie", "wb") as f:
    f.write(response.content)

4. Using the cookies parameter to keep the session

Form of the cookies parameter: a dictionary.

Build the cookies dictionary.

Pass the dictionary via the cookies argument when sending the request.


import requests

url = "https://www.baidu.com"
# Build the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
}
# Build the cookies dictionary from the raw Cookie string
temp = "BIDUPSID=036366C70551E926EF5FF474405A7FD6; PSTM=1688460862; BAIDUID=036366C70551E9268DCB3970471BBAF3:FG=1; BAIDUID_BFESS=036366C70551E9268DCB3970471BBAF3:FG=1; ZFY=mLraZDju05ubOZe2LCCRdpIq7B172CwZVAYF:BmV0cAU:C; newlogin=1; BDUSS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BDUSS_BFESS=2lhSkpqVzdsNGZ1eE50VXdmaXg3S2tmaS1JRHc1Mn5HWmVhcTJtN2t2dENxODFrRVFBQUFBJCQAAAAAAAAAAAEAAADevGQ8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIepmRCHqZka; BD_HOME=1; BD_UPN=12314753; BA_HECTOR=2k81a4a10h24a18l0k0ka08m1ibf59m1o; BD_CK_SAM=1; PSINO=1; delPer=0; H_PS_PSSID=36552_38642_39026_39022_38942_38955_39037_38809_38990_39085_26350_39041_39100_39044; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[feWj1Vr5u3D]=mk3SLVN4HKm; B64_BOT=1; sugstore=1; COOKIE_SESSION=73_2_4_6_1_9_1_0_3_5_4_1_0_7_0_0_1688461506_1688461500_1689756532%7C6%230_2_1688461487%7C1"

# Split on "; " so keys don't keep a leading space
cookie_list = temp.split("; ")
cookies = {}

for cookie in cookie_list:
    # Split on the first '=' only, in case a value itself contains '='
    key, value = cookie.split("=", 1)
    cookies[key] = value
# cookies = {cookie.split("=", 1)[0]: cookie.split("=", 1)[-1] for cookie in cookie_list}
print(cookies)
response = requests.get(url, headers=headers, cookies=cookies)
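If the goal is simply to keep a session alive across requests, requests also provides a Session object that stores and resends cookies automatically, so you rarely need to build the dictionary by hand. A minimal sketch:

import requests

session = requests.Session()
# Cookies set by this response are stored on the session...
session.get("https://www.baidu.com")
# ...and sent automatically with every later request
response = session.get("https://www.baidu.com")
print(session.cookies)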

5. Converting a CookieJar object to a cookies dictionary

import requests

url = "https://www.baidu.com"

response = requests.get(url)

print(response.cookies)  # <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(dict_cookies)  # {'BDORZ': '27315'}

jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
print(jar_cookies)  # <RequestsCookieJar[<Cookie BDORZ=27315 for />]>

6. Using the timeout parameter

import requests

url = "https://twitter.com"

response = requests.get(url, timeout=3)  # stop waiting after 3 seconds
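When the timeout expires, requests raises requests.exceptions.Timeout instead of returning a response, so a real crawler would normally catch it. A minimal sketch:

import requests

url = "https://twitter.com"
try:
    response = requests.get(url, timeout=3)
    print(response.status_code)
except requests.exceptions.Timeout:
    # The server did not respond within 3 seconds
    print("request timed out")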

7. Using proxies

A proxy IP is an IP address pointing to a proxy server, which sends requests to the target server on our behalf.

Forward proxy: the browser knows the real IP of the server that ultimately handles the request, e.g. a VPN.

Reverse proxy: the browser does not know the server's real address, e.g. nginx.

Categories of proxy IPs:

1. Transparent proxy: hides your IP address on the surface, but the target server can still find out who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=Proxy IP, HTTP_X_FORWARDED_FOR=Your IP

2. Anonymous proxy: the target server only knows that a proxy is being used; it cannot find out who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=Proxy IP, HTTP_X_FORWARDED_FOR=Proxy IP

3. Elite proxy (high-anonymity proxy): the target server cannot even tell that a proxy is in use, let alone who you are. The request headers it receives look like: REMOTE_ADDR=Proxy IP, HTTP_VIA=not determined, HTTP_X_FORWARDED_FOR=not determined

By protocol, proxy IPs are classified as: http, https, socks

Using the proxies parameter
Form of proxies: a dictionary

import requests

url = "https://www.baidu.com"

proxies = {
    "http": "http://106.14.5.129:80",  # proxy used for http:// URLs
    "https": "https://ip:port"         # placeholder: replace with a real https proxy
}

response = requests.get(url, proxies=proxies)
print(response.text)
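Free proxy IPs die quickly, so it is worth catching requests.exceptions.ProxyError (and a timeout) when testing one. A minimal sketch, reusing the sample proxy above, which may well be dead by now:

import requests

proxies = {"http": "http://106.14.5.129:80"}  # sample proxy, may be unusable
try:
    response = requests.get("http://www.baidu.com",
                            proxies=proxies, timeout=3)
    print(response.status_code)
except (requests.exceptions.ProxyError, requests.exceptions.Timeout):
    print("proxy is unusable")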

8. Using the verify parameter to ignore the CA certificate

import requests

url = "https://sam.huat.edu.cn:8433/selfservice/"

response = requests.get(url, verify=False)
print(response.content)
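With verify=False, requests still emits an InsecureRequestWarning on every call. If the warning is noisy you can silence it through urllib3, bearing in mind that you are giving up certificate checking either way:

import urllib3
import requests

# Suppress the InsecureRequestWarning emitted when verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://sam.huat.edu.cn:8433/selfservice/"
response = requests.get(url, verify=False)
print(response.status_code)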
