Learning the Python requests library


舞动的獾


Article link: https://blog.youkuaiyun.com/Yu_csdnstory/article/details/107686626


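This article introduces Python's requests library: installation, listing the available methods, adding a proxy, handling status codes, printing the response body, controlling redirects, changing the encoding, inspecting request headers and parameters, saving results, response encoding, adding request headers, setting cookies, using the POST method, handling JSON data, and parsing web pages with BeautifulSoup. Worked examples show how requests is used for web scraping.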


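Before we start
Installing requests takes a single pip command. When installing Python itself, be sure to tick the option that adds Python (and with it pip) to the PATH environment variable.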

pip install requests

Listing what the requests library exposes

import requests
print(dir(requests))
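
The output of dir(requests) is long, so it can help to check only for the names you care about. A small sketch, assuming the usual HTTP verb helpers are what you are looking for (the verbs list is my own choice, not something from the original notes):

```python
import requests

# HTTP verb helpers that requests exposes at module level.
verbs = ["get", "post", "put", "delete", "head", "options", "patch"]

# Keep the names that exist on the module and are callable.
available = [name for name in verbs if callable(getattr(requests, name, None))]
print(available)
```

Each of these takes a URL as its first argument, so requests.head(url) and requests.get(url) are called the same way.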

Adding a proxy

r = requests.get(url, proxies=proxies)
proxies is a dict that maps a URL scheme to the proxy URL.

import requests


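url = "https://www.baidu.com"
# The scheme in each value describes how to reach the proxy itself.
# A local intercepting proxy normally speaks plain HTTP, so both entries use http://.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# For a proxy that requires a username and password:
# proxies = {
#     "http": "http://user:pass@10.10.1.10:3128/",
# }

# verify=False skips TLS certificate verification
r = requests.get(url, proxies=proxies, verify=False)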

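Without verify=False the request fails outright with an SSL error. This comes from the HTTPS handshake, typically because the proxy presents a certificate that requests cannot verify. With verify=False, requests only prints a warning (InsecureRequestWarning) and the request still goes through.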
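
If every request should go through the same proxy, the settings can live on a Session instead of being repeated on each call. A minimal sketch, assuming a local proxy on 127.0.0.1:8080 as in the example above; silencing the warning with urllib3.disable_warnings is optional and only sensible for local testing:

```python
import requests
import urllib3

# Assumed local proxy address; adjust to your own setup.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

session = requests.Session()
session.proxies.update(proxies)  # applied to every request made with this session
session.verify = False           # skip certificate verification for this session

# Silence the InsecureRequestWarning that verify=False would otherwise trigger.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

print(session.proxies, session.verify)
```

After this, session.get(url) behaves like requests.get(url, proxies=proxies, verify=False).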

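Status codes

301 redirect (moved permanently)
403 forbidden (insufficient permissions)
404 page not found
500 and above: server error

Getting the status code: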

print(r.status_code)
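
The codes listed above fall into ranges, so a script usually branches on the range rather than on single values. A sketch with a hypothetical helper, describe_status, that takes the integer you would get from r.status_code:

```python
def describe_status(status_code):
    """Map an HTTP status code to a coarse category (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirect"
    if status_code == 403:
        return "forbidden"
    if status_code == 404:
        return "not found"
    if 400 <= status_code < 500:
        return "client error"
    if status_code >= 500:
        return "server error"
    return "informational"


for code in (200, 301, 403, 404, 500):
    print(code, describe_status(code))
```

When you only need to stop on failures, r.raise_for_status() raises requests.HTTPError for any 4xx or 5xx response.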

Printing the response body

print(r.content)
print(r.text)

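r.content is the raw bytes and r.text is the decoded string. content is the safer choice: when the page is in Chinese, text may come out garbled if requests picks the wrong encoding.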
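
The garbling is just bytes decoded with the wrong codec. The sketch below reproduces it without any network access; raw stands in for r.content, and the two decode calls stand in for what r.text does with a wrong and a right encoding:

```python
# Stand-in for r.content: UTF-8 bytes of a Chinese string.
raw = "中文网页".encode("utf-8")

# Decoding with the wrong codec does not fail, it silently produces mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec the page was actually written in gives the text back.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

This is why keeping the bytes from r.content and decoding them yourself is a reliable fallback.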

Redirects

Some sites redirect automatically, and requests follows redirects by default. Pass allow_redirects=False to turn that off.

r = requests.get(url, allow_redirects=False)

r.history  # list of the redirect responses that were followed; its length is the number of redirects

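Changing the system default encoding

This step applies to Python 2 only. In Python 3, str is already Unicode and sys.setdefaultencoding no longer exists, so the step can be skipped.

import sys
reload(sys)  # Python 2 only
sys.setdefaultencoding('utf-8')  # Python 2 only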

Viewing the request headers

print(r.request.headers)

Viewing the response headers

print(r.headers)

Viewing the request body

print(r.request.body)
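
The same attributes can be inspected before anything is sent, which is handy for checking what requests would put on the wire. A sketch using requests.Request(...).prepare(); the example.com URL, the q parameter and the header value are placeholders I made up:

```python
import requests

# Build the request without sending it.
req = requests.Request(
    "GET",
    "https://example.com/search",      # placeholder URL
    params={"q": "requests"},          # placeholder query parameter
    headers={"user-agent": "demo/0.1"},
)
prepared = req.prepare()

print(prepared.method)   # HTTP verb
print(prepared.url)      # URL with the query string appended
print(prepared.headers)  # headers that would be sent
print(prepared.body)     # None, because a GET with params has no body
```

After a real call, r.request is a prepared request of the same kind, so r.request.url and r.request.headers read the same way.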

Saving the output

# Save the response body as raw bytes
with open("abc.txt", 'wb') as fh:
    fh.write(r.content)
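
In Python 3, r.content is bytes, and calling str() on bytes gives the b'...' representation rather than the page text, so writing in binary mode is the safer route. A sketch that shows the difference; content is a stand-in for r.content and the file goes to a temporary directory:

```python
import os
import tempfile

# Stand-in for r.content.
content = "你好, requests".encode("utf-8")

# str() on bytes yields the literal representation, including the b'' wrapper.
as_text = str(content)
print(as_text)

# Writing in binary mode stores the bytes unchanged.
path = os.path.join(tempfile.mkdtemp(), "abc.txt")
with open(path, "wb") as fh:
    fh.write(content)

with open(path, "rb") as fh:
    saved = fh.read()

print(saved.decode("utf-8"))
```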

Response encoding

Checking the encoding in use

print(r.encoding)

Setting the encoding

r.encoding = "utf-8"
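
The value of r.encoding comes from the Content-Type response header. The sketch below calls requests.utils.get_encoding_from_headers directly to show that guess; the two header dicts are made-up examples:

```python
from requests.utils import get_encoding_from_headers

# A text response that does not declare a charset falls back to ISO-8859-1.
no_charset = get_encoding_from_headers({"content-type": "text/html"})

# A declared charset is used as-is.
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})

print(no_charset)
print(with_charset)
```

The ISO-8859-1 fallback is the usual reason Chinese pages look garbled in r.text, and setting r.encoding = "utf-8" (or r.encoding = r.apparent_encoding) before reading r.text overrides it.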

Adding request headers

headers = {'user-agent': 'my-aasdasdaspp/0.0.1', 'asd': 'hello world!'}  # passed as a dict
r = requests.get(url, headers=headers)
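
Custom headers are merged with the defaults that requests adds on its own, and header names are matched case-insensitively. A sketch that prepares a request through a Session without sending it; the URL and header values are placeholders:

```python
import requests

headers = {"user-agent": "my-app/0.0.1", "asd": "hello world!"}

session = requests.Session()
# prepare_request merges the session's default headers with the ones given here.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/", headers=headers)
)

for name, value in prepared.headers.items():
    print(name, "=", value)
```

The custom user-agent replaces the default python-requests one, while defaults such as Accept-Encoding stay in place.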

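Setting cookies

cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)

Cookies can also be sent by adding a Cookie header directly:

headers = {'user-agent': 'my-aasdasdaspp/0.0.1', 'asd': 'hello world!', 'Cookie': 'cookies_are=working'}

Reading a cookie set by the response (replace 'domain' with the cookie's name):
print(r.cookies['domain'])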
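
On the wire, cookies travel in a single header named Cookie, in name=value form. A sketch that shows the header requests builds from a cookies dict, plus how a cookie jar (the type of r.cookies) can be read; the URL and cookie names are placeholders:

```python
import requests
from requests.cookies import cookiejar_from_dict
from requests.utils import dict_from_cookiejar

# Prepare a request without sending it, to see the generated header.
prepared = requests.Request(
    "GET", "https://example.com/", cookies={"cookies_are": "working"}
).prepare()
cookie_header = prepared.headers["Cookie"]
print(cookie_header)

# r.cookies is a cookie jar; this one is built by hand as a stand-in.
jar = cookiejar_from_dict({"session": "abc123"})
value = jar.get("session")        # returns None if the cookie is missing
as_dict = dict_from_cookiejar(jar)
print(value, as_dict)
```

jar.get(name) is a gentler lookup than jar[name], which raises KeyError when the server did not set that cookie.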

The POST method

First work out which fields the endpoint expects, and put them in a dict

payload = {'name': 'loop', 'age': '12'}
r = requests.post(url, data=payload)
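
data= sends the dict as a form-encoded body. Some endpoints expect a JSON body instead, and requests has a json= argument for that. A sketch comparing the two without sending anything; the URL is a placeholder:

```python
import json

import requests

payload = {"name": "loop", "age": "12"}
url = "https://example.com/submit"  # placeholder endpoint

# data= encodes the dict as an HTML form body.
form = requests.Request("POST", url, data=payload).prepare()
print(form.headers["Content-Type"], form.body)

# json= serialises the dict to JSON and sets the matching Content-Type.
as_json = requests.Request("POST", url, json=payload).prepare()
print(as_json.headers["Content-Type"], as_json.body)
```

Which one to use depends on what the server reads; a form handler will generally not understand a JSON body, and the reverse is also true.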

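JSON data

Converting a dict to JSON

import json

_dic = {"name": "loop", "age": "12"}
s = json.dumps(_dic, indent=4)  # indent sets the number of spaces per nesting level
print(s)
# The type of s turns out to be str
print(type(s))

Converting JSON to a dict

d = json.loads(s)
print(d)
print(type(d))

If the response body is JSON, r.json() parses it and returns the result directly.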
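
One detail that matters for Chinese data: json.dumps escapes non-ASCII characters by default. A small sketch, with a made-up record, showing the ensure_ascii=False switch that keeps them readable:

```python
import json

record = {"name": "循环", "age": 12}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms parse back to the same dict.
print(json.loads(escaped) == json.loads(readable))
```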

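BeautifulSoup

Installation

pip install beautifulsoup4

Importing BeautifulSoup

import requests
from bs4 import BeautifulSoup

Making the "soup"

r = requests.get(url)
# The parser name is an argument to BeautifulSoup, not to requests.get
soup = BeautifulSoup(r.content, "html.parser")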

Extracting values

Get the header element from the result (body, title and other tags use the same syntax)

print(soup.header)

Get the first a tag

soup.find('a')

Get all a tags

soup.find_all("a")

Extract the href of every a tag and save them to a file

with open("jiaosm.txt", 'w', encoding='utf-8') as fh:
    # href=True skips a tags that have no href attribute
    for i in soup.find_all("a", href=True):
        fh.write(i['href'] + '\n')

By id

soup.find_all(id="file2")

Get the href
soup.find_all(id="file2")[0]["href"]

Get every tag that has an id
soup.find_all(id=True)
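
These lookups can be tried without any network access by parsing an HTML string. A sketch with a made-up snippet; the file1 and file2 ids and the paths are illustrative only:

```python
from bs4 import BeautifulSoup

# Made-up page for experimenting offline.
html = """
<html><head><title>Demo</title></head>
<body>
  <a id="file1" href="/a.txt">A</a>
  <a id="file2" href="/b.txt">B</a>
  <a>no link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                                   # text of the title tag
hrefs = [a["href"] for a in soup.find_all("a", href=True)]  # only a tags with href
file2_href = soup.find(id="file2")["href"]                  # first tag with that id
ids = [tag["id"] for tag in soup.find_all(id=True)]         # every tag carrying an id

print(title, hrefs, file2_href, ids)
```

Note that soup.find returns None when nothing matches, so real pages need a check before indexing into the result.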

Official documentation

Scraping example
Download the images from a website and save them locally.

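from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Directory the images are saved to (it must already exist)
PATH = "C:\\Users\\aaaaa\\Desktop\\buf\\"


def get_image(img_url):
    """Download one image into PATH and return the local file path."""
    img_name = img_url[img_url.rfind('/') + 1:]
    file = PATH + img_name
    r = requests.get(img_url)
    with open(file, 'wb') as o:
        o.write(r.content)
    return file


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for img in soup.find_all('img'):
        try:
            # Resolve relative src values against the page URL
            img_url = urljoin(url, img['src'])
            # Point the tag at the local copy
            img['src'] = get_image(img_url)
        except (KeyError, OSError, requests.RequestException):
            # Skip images that have no src or that fail to download or save
            continue

    # Save the page with its img tags rewritten to the local files
    with open("C:\\Users\\aaaaa\\Desktop\\" + "test.html", 'w', encoding='utf-8') as o:
        o.write(str(soup))


if __name__ == "__main__":
    url = ""  # fill in the page to scrape
    main(url)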
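
Relative image paths are the fragile part of this script. The standard library's urllib.parse.urljoin resolves them the way a browser would, which plain string concatenation does not. A sketch with made-up URLs:

```python
from urllib.parse import urljoin

page = "https://example.com/gallery/index.html"  # made-up page URL

# Typical src values: relative, root-relative, and already absolute.
candidates = ["img/a.png", "/static/b.png", "https://cdn.example.com/c.png"]

resolved = [urljoin(page, src) for src in candidates]

# Same file-name extraction as in get_image: everything after the last slash.
names = [u[u.rfind("/") + 1:] for u in resolved]

for src, full, name in zip(candidates, resolved, names):
    print(src, "->", full, "->", name)
```

Image URLs that carry a query string would need extra handling, because the text after the last slash would then include it.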