The requests module

走马走马

Category: python  Tags: web scraping, python

Article link: https://blog.youkuaiyun.com/weixin_48232848/article/details/118416431

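1. Introduction to the requests module

Official documentation for the requests module: https://docs.python-requests.org/zh_CN/latest/user/quickstart.html

1.1 What the requests module does

It sends HTTP requests and retrieves the response data.

1.2 requests is a third-party module and must be installed separately in the Python environment

pip/pip3 install requests

1.3 Sending a GET request with requests

Goal: use requests to send a request to the Baidu home page and fetch the page source.
Run the code below and observe the printed output.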

# Import the requests module
import requests

# Target URL
url = "https://www.baidu.com/"

# Send a GET request to the target URL
response = requests.get(url)

# Print the response body
print(response.text)

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç™¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿
è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


Process finished with exit code 0

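2. The response object

The output above contains a lot of garbled text. This happens because the bytes were decoded with a different character set from the one used to encode them. We can try the following approach to fix the garbled Chinese.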

# Import the requests module
import requests

# Target URL
url = "https://www.baidu.com/"

# Send a GET request to the target URL
response = requests.get(url)

# Print the response body, decoded explicitly (utf-8 by default)
# print(response.text)
print(response.content.decode())

D:\Python\python.exe C:/Users/26217/Desktop/note/Python--Spider/requests代码/1_requests模拟发送get请求.py
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


Process finished with exit code 0

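response.text is the result of decoding the response body with response.encoding. requests takes that encoding from the HTTP response headers and only falls back to a guess from the chardet module (charset_normalizer in newer versions) when it cannot determine one. For a text response whose Content-Type header names no charset, as in the Baidu example above, it assumes ISO-8859-1, which is why the Chinese text came out garbled.
Data received over the network is always bytes, so response.text is roughly response.content.decode(response.encoding).
You can search the page source for charset and try that character set, but be aware that it is not always accurate.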

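2.1 The difference between response.text and response.content

response.text
- Type: str
- Encoding: requests makes an educated guess at the text encoding, based on the HTTP response headers, and decodes with it
response.content
- Type: bytes
- Encoding: none; these are the raw, undecoded bytes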
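
The difference can be reproduced without any network access. The following is a minimal sketch using only built-in str and bytes methods; the variable names are illustrative, and the text is the title of the Baidu page from the output above.

```python
# The page title as it travels over the network: UTF-8 encoded bytes.
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong character set does not fail; it silently yields garbage.
garbled = raw.decode("iso-8859-1")

# Decoding with the character set the server actually used restores the text.
restored = raw.decode("utf-8")

print(type(raw).__name__)        # bytes, like response.content
print(type(restored).__name__)   # str, like response.text
print(len(raw), len(restored))   # 27 bytes versus 9 characters
print(garbled == restored)       # False
```

Here raw plays the role of response.content, and garbled is what response.text looks like when the assumed encoding is wrong.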

2.2 Fixing garbled Chinese by decoding response.content

response.content.decode() uses utf-8 by default
response.content.decode('GBK')
Common character sets:
- utf-8
- GBK
- gb2312
- ascii
- iso-8859-1
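
When the right character set is not known in advance, one option is to try a few candidates in order. The helper below is a hypothetical sketch, not part of requests; decode_with_fallback and its default candidate list are assumptions made for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # iso-8859-1 maps every byte to a character, so it never fails.
    # It is only a last resort and may well produce garbled text.
    return raw.decode("iso-8859-1"), "iso-8859-1"


gbk_bytes = "百度".encode("gbk")   # stands in for response.content of a GBK page
text, used = decode_with_fallback(gbk_bytes)
print(used)
```

utf-8 is tried first on purpose: bytes that are not valid UTF-8 raise an error quickly, whereas more permissive encodings can accept the wrong bytes and return nonsense.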

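2.3 Other commonly used attributes and methods of the response object

response.url: the URL of the response; it sometimes differs from the request URL, for example after a redirect
response.status_code: the response status code
response.request.headers: the headers of the request that produced this response
response.headers: the response headers
response.request._cookies: the cookies of the request that produced this response, returned as a cookieJar
response.cookies: the cookies of the response (set through Set-Cookie), returned as a cookieJar
response.json(): parses a JSON response body into a Python object (dict or list)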
# Import the requests module
import requests

# Target URL
url = "https://www.baidu.com/"

# Send a GET request to the target URL
response = requests.get(url)

print(response.url)
print(response.status_code)
# Headers of the request that produced this response
print(response.request.headers)
# Headers of the response
print(response.headers)
print(response.request._cookies)
print(response.cookies)

https://www.baidu.com/
200
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Thu, 01 Jul 2021 04:43:22 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
<RequestsCookieJar[]>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

Process finished with exit code 0

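3. Sending requests with the requests module

3.1 Sending a request with headers

Find the User-Agent in the request headers shown by the browser, build a headers dict, and complete the code below.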

import requests

url = 'https://www.baidu.com/'

response = requests.get(url)

print(len(response.content.decode()))  # 2349
print(response.content.decode())

# Build the request headers dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/11'
}

# Send the request with the headers
response1 = requests.get(url, headers=headers)

print(len(response1.content.decode()))  # 304750
print(response1.content.decode())
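
The two responses differ in length because, without a custom header, requests announces itself as python-requests, as the request headers printed in section 2.3 show. The sketch below makes this visible without sending anything, by preparing both requests through a Session; the shortened User-Agent value is illustrative only.

```python
import requests

url = "https://www.baidu.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}  # shortened, illustrative value

session = requests.Session()

# prepare_request builds the request as it would be sent, without sending it.
default_request = session.prepare_request(requests.Request("GET", url))
custom_request = session.prepare_request(requests.Request("GET", url, headers=headers))

print(default_request.headers["User-Agent"])  # python-requests/<installed version>
print(custom_request.headers["User-Agent"])
```

A server that inspects User-Agent can therefore tell the default client apart from a browser, which is a plausible reason for the much shorter first response.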

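3.2 Sending a request with parameters

When you search on Baidu, the URL contains a ?. Everything after the question mark is the request parameters, also called the query string.

3.2.1 Carrying parameters in the URL

Request the URL that already contains the parameters directly.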

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/11'
}

url = 'https://www.baidu.com/s?wd=python'

response = requests.get(url, headers=headers)

with open('baidu-python.html', 'wb') as f:
    f.write(response.content)

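3.2.2 Carrying a parameter dict through params

Build a dict of request parameters.
When sending the request, pass the parameter dict through the params argument.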

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/11'
}

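# url1 = 'https://www.baidu.com/s?wd=python'

# The result is the same with or without the trailing question mark
url = 'https://www.baidu.com/s?'

# The request parameters are a dict: wd=python
kw = {'wd': 'python'}

# Send the request with the parameters and get the response
response = requests.get(url, headers=headers, params=kw)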

print(response.content)
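
To check how requests combines the URL and the params dict, the request can be prepared without being sent. This is a small sketch using requests.Request and its prepare() method; it also confirms the comment about the trailing question mark.

```python
import requests

kw = {"wd": "python"}

# Prepare (but do not send) the same request with and without a trailing "?".
with_mark = requests.Request("GET", "https://www.baidu.com/s?", params=kw).prepare()
without_mark = requests.Request("GET", "https://www.baidu.com/s", params=kw).prepare()

print(with_mark.url)
print(without_mark.url)
print(with_mark.url == without_mark.url)
```

Passing a dict through params also takes care of percent-encoding, which matters once the search term contains spaces or non-ASCII characters.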

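3.3 Using the cookies parameter

Form of the cookies parameter: a dict

cookies = {'cookie name': 'cookie value'}

The dict corresponds to the Cookie string in the request headers, in which the key-value pairs are separated by a semicolon followed by a space.
The part to the left of the equals sign is a cookie's name and maps to a key of the cookies dict.
The part to the right of the equals sign maps to the corresponding value of the cookies dict.

How to use the cookies parameter

response = requests.get(url, cookies=cookies)

Converting a cookie string into the dict the cookies parameter expects

cookie_dict = dict(cookie.split('=', 1) for cookie in cookies_str.split('; '))

Note: cookies usually have an expiry time; once they expire you need to obtain them again.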
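
The conversion can be wrapped in a small helper that tolerates stray whitespace and values containing an equals sign. This is a sketch under assumptions: cookie_string_to_dict is a hypothetical name, and the cookie string is made up for illustration.

```python
import requests


def cookie_string_to_dict(cookies_str):
    """Convert 'name1=value1; name2=value2' into {'name1': 'value1', 'name2': 'value2'}."""
    cookies = {}
    for pair in cookies_str.split(";"):
        # partition splits on the first "=" only, so values may contain "=".
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


cookies_str = "BDORZ=27315; token=a=b"  # hypothetical string copied from a browser
cookies = cookie_string_to_dict(cookies_str)
print(cookies)

# Prepare a request to see the Cookie header that would be sent.
prepared = requests.Request("GET", "https://www.baidu.com/", cookies=cookies).prepare()
print(prepared.headers["Cookie"])
```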

3.4 Converting a cookieJar object into a cookies dict

A response object obtained with requests has a cookies attribute. It is a cookieJar and contains the cookies the remote server has set locally.

Conversion

cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)

response.cookies returns the cookieJar object
requests.utils.dict_from_cookiejar returns the cookies dict

import requests

url = 'https://www.baidu.com/'

response = requests.get(url)

print(response.cookies)

dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)

jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
print(jar_cookies)

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
<RequestsCookieJar[<Cookie BDORZ=27315 for />]>

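3.5 Using the timeout parameter

How to use the timeout parameter

response = requests.get(url, timeout=3)

timeout=3 means that if the server has not responded within 3 seconds of the request being sent, an exception is raised.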

import requests

url = 'https://twitter.com'

response = requests.get(url, timeout=3)  # set the timeout
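
Since a timeout raises an exception, the call is usually wrapped in try/except. The following is one possible sketch; the fetch helper is hypothetical, and the localhost URL is an assumption chosen so that the example fails fast instead of depending on an external site.

```python
import requests


def fetch(url, timeout=3):
    """Return the response, or None if the request timed out or otherwise failed."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"no response from {url} within {timeout} seconds")
    except requests.exceptions.RequestException as exc:
        print(f"request to {url} failed: {exc}")
    return None


# Assumption: nothing listens on port 9 of localhost, so this fails quickly.
result = fetch("http://127.0.0.1:9/", timeout=1)
print(result)
```

requests.exceptions.Timeout is caught first because it is more specific than RequestException, the base class of the exceptions requests raises.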

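3.6 Understanding proxies and using the proxies parameter

3.6.1 How using a proxy works

A proxy IP is an IP address that points to a proxy server.
The proxy server forwards our requests to the target server on our behalf.

3.6.2 The difference between forward and reverse proxies

Forward and reverse proxies are distinguished from the point of view of the party sending the request.
A proxy that forwards requests on behalf of the browser or client (the sending party) is a forward proxy.
- The browser knows the real IP address of the server that ultimately handles the request, for example a VPN
A proxy that forwards requests not on behalf of the browser or client, but on behalf of the server that ultimately handles them, is a reverse proxy.
- The browser does not know the real address of the server, for example nginx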

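3.6.3 Types of proxy IP

By degree of anonymity, proxy IPs fall into the following three classes:

Transparent proxy: although a transparent proxy "hides" your IP address at the connection level, the target server can still find out who you are, because your real IP is passed along in a header. The request headers received by the target server look like this: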

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP

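Anonymous proxy: with an anonymous proxy, others can only tell that you are using a proxy; they cannot tell who you are. The request headers received by the target server look like this: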

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP

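High-anonymity proxy: with a high-anonymity proxy, others cannot detect that you are using a proxy at all, which makes it the best choice. The request headers received by the target server look like this: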

REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
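
The three header patterns above amount to a simple decision rule, seen from the target server's side. The sketch below encodes it; classify_proxy, the dict layout, and the two IP addresses are hypothetical, and a missing key stands for "not determined".

```python
def classify_proxy(server_vars, your_ip):
    """Classify a connection from the values the target server sees."""
    if server_vars.get("REMOTE_ADDR") == your_ip:
        return "no proxy"
    via = server_vars.get("HTTP_VIA")
    forwarded_for = server_vars.get("HTTP_X_FORWARDED_FOR")
    if forwarded_for == your_ip:
        return "transparent"      # your real IP is passed along
    if via is not None or forwarded_for is not None:
        return "anonymous"        # the proxy reveals itself, but not you
    return "high anonymity"       # nothing hints at a proxy


YOUR_IP = "203.0.113.5"     # hypothetical client address
PROXY_IP = "198.51.100.7"   # hypothetical proxy address

samples = [
    {"REMOTE_ADDR": PROXY_IP, "HTTP_VIA": PROXY_IP, "HTTP_X_FORWARDED_FOR": YOUR_IP},
    {"REMOTE_ADDR": PROXY_IP, "HTTP_VIA": PROXY_IP, "HTTP_X_FORWARDED_FOR": PROXY_IP},
    {"REMOTE_ADDR": PROXY_IP},
]
for sample in samples:
    print(classify_proxy(sample, YOUR_IP))
```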

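Depending on the protocol the website uses, you need a proxy service for the matching protocol. By the protocol of the proxied request, proxies can be divided into:
- http proxy: the target uses the http protocol
- https proxy: the target uses the https protocol
- socks tunnel proxy:
  - a socks proxy simply relays packets and does not care which protocol is in use
  - a socks proxy takes less time than http and https proxies
  - a socks proxy can forward both http and https requests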

3.6.4 Using the proxies parameter

Usage

response = requests.get(url, proxies=proxies)

Form of proxies: a dict
For example:

import requests

url = 'https://baidu.com'

proxies = {
    'http': 'http://114.230.123.74:9999',
    'https': 'https://114.230.123.74:9999',
}

response = requests.get(url, proxies=proxies)

print(response.text)

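Note: if the proxies dict contains several key-value pairs, the proxy IP is chosen according to the protocol of the URL being requested.
- If the proxy works, you get a response
- If the proxy fails, the request hangs or raises an error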
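
The selection by protocol can be observed without contacting any proxy. requests exposes the lookup it performs as requests.utils.select_proxy; the sketch below calls it with the proxies dict from the example above.

```python
import requests

proxies = {
    "http": "http://114.230.123.74:9999",
    "https": "https://114.230.123.74:9999",
}

# select_proxy returns the entry that matches the scheme of the URL.
for url in ("http://baidu.com", "https://baidu.com"):
    print(url, "->", requests.utils.select_proxy(url, proxies))
```

Free proxy addresses such as the one above tend to be short-lived, so combining proxies with a timeout keeps a dead proxy from blocking the program.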

3.7 Skipping CA certificate verification with the verify parameter

Some sites trigger a certificate warning. Reason: the site's CA certificate has not been certified by a [trusted root certification authority].

3.7.1 Run the code to see what happens when a request is sent to an insecure link

import requests
url = 'https://sam.huat.edu.cn:8433/selfservice/'
response = requests.get(url)

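3.7.2 Solution

To make the request succeed in code, pass verify=False. requests then sends the request without verifying the CA certificate; in other words, the verify parameter lets you skip certificate verification.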

import requests
url = 'https://sam.huat.edu.cn:8433/selfservice/'
response = requests.get(url, verify=False)
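
With verify=False each request emits an InsecureRequestWarning. A possible way to silence it and to handle failures is sketched below; fetch_without_verification is a hypothetical helper, and skipping verification is only reasonable for hosts you already trust.

```python
import requests
import urllib3


def fetch_without_verification(url, timeout=3):
    """GET url without checking its certificate; return None on failure."""
    # Without this call, every unverified request prints an InsecureRequestWarning.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    try:
        return requests.get(url, verify=False, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        print(f"request to {url} failed: {exc}")
        return None
```

It would be called as fetch_without_verification('https://sam.huat.edu.cn:8433/selfservice/'). As an alternative to turning verification off, verify also accepts the path to a CA bundle file.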

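4. Sending POST requests with the requests module

4.1 How to send a POST request with requests

response = requests.post(url, data)
The data parameter accepts a dict
The other parameters of the POST function are exactly the same as those of the GET request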
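
To see what the data dict turns into, a POST request can be prepared without being sent. This is a minimal sketch; the example.com URL is a placeholder, and the form fields are borrowed from the exercise in 4.2.

```python
import requests

post_data = {"from": "auto", "to": "en", "q": "China"}

# Prepare (but do not send) the POST request to inspect its body and headers.
prepared = requests.Request("POST", "https://example.com/translate", data=post_data).prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

The dict is form-encoded into the request body, and the Content-Type header is set accordingly.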

4.2 POST request exercise

import requests
import json


class King(object):

    def __init__(self, word):
        self.url = 'https://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_ciba&sign=b99808e480828203'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3776.400 QQBrowser/10.6.4212.400',
        }
        self.word = word
        self.post_data = {
            'from': 'auto',
            'to': 'en',
            'q': self.word,
        }

    def get_data(self):
        # Send a POST request with the post method; data is the dict for the request body
        response = requests.post(self.url, headers=self.headers, data=self.post_data)
        return response.content.decode()

    def parse_data(self, data):
        # json.loads converts the JSON string into a Python dict
        dict_data = json.loads(data)

        print(dict_data['content']['out'])

    def run(self):
        # Send the request and get the response
        response = self.get_data()
        # print(response)
        # Parse the data
        self.parse_data(response)


if __name__ == '__main__':
    # word = input('Enter the sentence or word to translate: ')
    king = King('China')
    king.run()

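5. Keeping state with requests.session

5.1 What requests.session does and when to use it

Purpose:
- Handles cookies automatically: the next request automatically carries the cookies from the previous one
Use case:
- Automatically handling the cookies produced across a series of consecutive requests

5.2 How to use requests.session

session = requests.session()  # instantiate a session object
response = session.get(url, headers=headers, ...)
response = session.post(url, data=data, ...)

The parameters a session object takes for get or post requests are exactly the same as those of the request functions in the requests module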
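
The cookie handling can be observed without a server. In the sketch below a cookie is placed in the session by hand, standing in for one received from an earlier response, and a later request is prepared through the same session; the cookie name, its value, and the example.com URL are made up for illustration.

```python
import requests

session = requests.Session()

# Stand-in for a cookie set by an earlier response (hypothetical value).
session.cookies.set("sessionid", "abc123")

# Any later request prepared through the same session carries the cookie.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/profile"))
print(prepared.headers["Cookie"])
```

In real use the cookie jar is filled by the Set-Cookie headers of responses, for example after a login POST made with the same session object.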
