Web Scraping: Using the urllib Standard Library

License: CC 4.0 BY-SA


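This article walks through Python's urllib package: sending GET and POST requests, handling URLs, parsing robots.txt files, and using Request objects and Handlers for more advanced operations. It also covers how to handle the exceptions that can occur while making requests.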

urllib bundles four modules for working with URLs: request (opening and reading URLs), error (the exceptions raised by request), parse (parsing URLs), and robotparser (parsing robots.txt files).
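As a quick orientation, the sketch below touches each of the four modules once without sending anything over the network. The URLs and the robots.txt rules are made up for illustration; normally robotparser would fetch the rules itself with set_url() and read().

```python
from urllib import error, parse, robotparser, request

# urllib.parse: split a URL into components and build a query string.
parts = parse.urlparse('https://httpbin.org/get?page=1#top')
query = parse.urlencode({'q': 'urllib', 'page': 2})

# urllib.robotparser: evaluate rules from an in-memory robots.txt
# (hypothetical rules, supplied directly instead of being downloaded).
rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
allowed = rp.can_fetch('*', 'https://example.com/index.html')
blocked = rp.can_fetch('*', 'https://example.com/private/data')

# urllib.request: build (but do not send) a request object.
req = request.Request('https://httpbin.org/get?' + query)

# urllib.error: HTTPError is a subclass of URLError.
is_subclass = issubclass(error.HTTPError, error.URLError)

print(parts.netloc, parts.path, parts.query, parts.fragment)
print(query)
print(allowed, blocked)
print(req.full_url)
print(is_subclass)
```

Because HTTPError derives from URLError, an `except error.URLError` clause also catches HTTP status errors; put `except error.HTTPError` first when the two need different handling.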

I. Sending requests

1. urlopen()

Use urllib.request.urlopen() to send a request:

https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen

For an HTTP or HTTPS URL, the call returns an HTTPResponse object, whose methods and attributes expose the status, headers, and body of the response:

https://docs.python.org/3/library/http.client.html#httpresponse-objects

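Code example:

# -*- coding:utf-8 -*-
from urllib import request, error, parse
import socket

# GET request
url = 'https://wx.zsxq.com/dweb/#/login'  # login page of Zhishi Xingqiu (知识星球)
res = request.urlopen(url)  # urllib.request sends the request and returns an HTTPResponse object
web_server = res.getheader('Server')  # which web server software serves the site
print(web_server)   # Tengine (see http://tengine.taobao.org/)

# POST request: passing data= turns the request into a POST
data = bytes(parse.urlencode({'data': '请求的数据'}), encoding='utf-8')  # urllib.parse encodes the form data
try:
    res = request.urlopen('https://httpbin.org/post', data=data, timeout=0.01)  # timeout of 0.01 s
except error.URLError as e:  # urllib.error: a timeout while connecting is wrapped in URLError
    if isinstance(e.reason, socket.timeout):
        print('timed out')
    else:
        raise  # do not silently swallow other network errors
except socket.timeout:  # a timeout while reading the response is raised directly
    print('timed out')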
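To look at the response object in more detail without depending on an external site, the following sketch starts a throwaway local HTTP server and reads its reply. DemoHandler, the response body, and the port selection are assumptions made for this illustration, not part of the original example. The opener built with an empty ProxyHandler only keeps a system proxy from intercepting the local request; it returns the same kind of response object as urlopen().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = 'hello urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# Bypass any configured proxy for this local request.
opener = request.build_opener(request.ProxyHandler({}))

# The response is a context manager; leaving the block closes it.
with opener.open(url, timeout=5) as res:
    status = res.status                            # numeric status code
    reason = res.reason                            # reason phrase
    content_type = res.getheader('Content-Type')   # a single header value
    all_headers = res.getheaders()                 # list of (name, value) tuples
    text = res.read().decode('utf-8')              # body bytes decoded to str

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(text)
```

read() returns bytes, so the body has to be decoded with the charset the server declares (UTF-8 in this sketch) before it can be treated as text.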

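2. Request

The arguments accepted by urlopen() cannot describe a complete request: there is no way to set headers or choose the HTTP method, for example. Request objects fill that gap: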

https://docs.python.org/3/library/urllib.request.html#request-objects

A Request object is created by calling the urllib.request.Request() constructor:

https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

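Code example:

# -*- coding:utf-8 -*-
from urllib import request, parse

# urlopen()'s own arguments cannot describe a complete request,
# so build a urllib.request.Request object and pass it to urlopen() instead
url = 'https://httpbin.org/post'
data = bytes(parse.urlencode({'data': '请求数据'}), encoding='utf-8')  # form data must be bytes
headers = {
    'Host': 'httpbin.org'
}
req = request.Request(url=url, data=data, headers=headers)  # data is present, so the method defaults to POST
res = request.urlopen(req)
print(res.read().decode('utf-8'))  # httpbin echoes the request back as JSON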
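A Request object can also be inspected before it is sent, which is a convenient way to check what urlopen() would transmit. The sketch below builds a few requests without sending them; the PUT endpoint, the JSON payload, and the User-Agent string are assumptions chosen for illustration.

```python
import json
from urllib import request

url = 'https://httpbin.org/put'
payload = json.dumps({'data': 'request data'}).encode('utf-8')

# method= overrides the default verb (GET without data, POST with data).
req = request.Request(
    url=url,
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='PUT',
)
# Headers can also be added after construction.
req.add_header('User-Agent', 'demo-client/0.1')  # hypothetical UA string

print(req.get_method())
print(req.full_url)
print(req.get_header('Content-type'))
print(req.has_header('User-agent'))

# Without method=, the presence of data decides the verb.
default_get = request.Request(url).get_method()
default_post = request.Request(url, data=payload).get_method()
print(default_get, default_post)
```

Request stores header names in capitalized form (`Content-type`, `User-agent`), so get_header() and has_header() need to be called with that spelling.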