urllib Module Explained - 优快云 Blog

Article link: https://blog.youkuaiyun.com/weixin_43171448/article/details/108000986

Urllib

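urllib has four main modules: request, error, parse, and robotparser.

request: the HTTP request module.

error: the exception handling module.

parse: a utility module for working with URLs.

robotparser: reads robots.txt to decide which parts of a site may be crawled.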

request

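The urlopen() method returns an HTTPResponse object.

Parameters: [url] the address of the page to fetch (string)

[data] the parameters to send with the request (bytes)

[timeout] the timeout in seconds (int or float)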

import urllib.parse 
import urllib.request 
# urlopen() expects data as bytes, so convert it with bytes()
# The first argument of bytes() is a string: parse.urlencode() turns the dict into one; the second argument specifies the encoding
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8') 
response = urllib.request.urlopen('http://httpbin.org/post', data=data) 
print(response.read())

* Aside: the HTTPResponse object

An HTTPResponse object mainly provides the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.
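
To try the data and timeout parameters and look at a few HTTPResponse attributes without depending on an external site, the following is a minimal sketch that starts a throwaway HTTP server on 127.0.0.1 and posts form data to it. The EchoHandler class, the random port, and the opener built with an empty ProxyHandler (used here only so that proxy environment variables are ignored) are assumptions made for this illustration and do not appear in the original examples.

```python
import threading
import urllib.parse
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that echoes the POST body back to the client."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet: suppress the default request log.
        pass


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# data must be bytes; passing it turns the request into a POST.
data = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")

# An empty ProxyHandler bypasses any proxy configured in the environment.
opener = build_opener(ProxyHandler({}))
with opener.open(url, data=data, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader("Content-Type")
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(body)
```

Against a real site, the same attributes are available on the object returned by urllib.request.urlopen().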

* The Request class

The constructor of Request is as follows:

class urllib.request.Request ( url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

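Parameters: [url] the URL to request, required (string)

[data] the data to send with the request (bytes); if it is a dict, encode it first with urlencode() from the urllib.parse module

[headers] the request headers (dict); they can be passed as the headers argument when creating the Request, or added afterwards by calling add_header() on the instance

[origin_req_host] the host name or IP address of the requester

[unverifiable] whether the request is unverifiable (bool)

[method] the HTTP method to use: GET, POST, PUT…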

from urllib import request, parse
url = "http://httpbin.org/post"
# headers = {
#     'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
#     'Host': 'httpbin.org'
# }
# Named form rather than dict so the built-in dict type is not shadowed
form = {
    'name': 'Germey'
}
# Encode the dict with parse.urlencode() first, then convert the result to bytes
data = bytes(parse.urlencode(form), encoding='utf8')
# Create the Request instance
# req = request.Request(url=url, data=data, headers=headers, method='POST')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
# Send the request
response = request.urlopen(req)  # returns an HTTPResponse object
print(response.read().decode('utf-8'))
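
A Request can also be inspected before it is sent, which is a convenient way to check how the constructor arguments described above end up on the object. The sketch below sends nothing over the network; the extra Accept header and the /get and /put paths are illustrative assumptions.

```python
from urllib import parse, request

form = {"name": "Germey"}
data = parse.urlencode(form).encode("utf-8")

# Headers can be passed to the constructor ...
req = request.Request(
    url="http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"},
)
# ... or added afterwards (the Accept header is only an illustration).
req.add_header("Accept", "application/json")

# With data present and no explicit method, the request defaults to POST.
print(req.get_method())
print(req.data)

# Request stores header names via str.capitalize(), so "User-Agent"
# is looked up as "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())

# Without data the default method is GET; method= overrides the default.
get_req = request.Request("http://httpbin.org/get")
put_req = request.Request("http://httpbin.org/put", data=data, method="PUT")
print(get_req.get_method(), put_req.get_method())
```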

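* Handler and Opener

Handler classes inherit from the BaseHandler parent class. The handler classes below are all subclasses of BaseHandler; HTTPPasswordMgr is the exception, being a helper class used by the authentication handlers rather than a handler itself.

HTTPDefaultErrorHandler: handles HTTP response errors; every error is raised as an HTTPError exception.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies; when no proxies are passed in, they are read from the environment variables.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this handler can take care of it.

The OpenerDirector class can be called an Opener. The urlopen() method used earlier is in fact an Opener that urllib provides for us.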

# HTTPBasicAuthHandler is for sites that require authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
username = 'username'
password = 'password'
url = 'http://localhost:5000/'
# Create the password manager p
p = HTTPPasswordMgrWithDefaultRealm()
# Add the user name and password to p with add_password()
p.add_password(None, url, username, password)
# Pass p to HTTPBasicAuthHandler to create auth_handler, a handler that deals with authentication
auth_handler = HTTPBasicAuthHandler(p)  # authentication handler
# Build an Opener from this handler with build_opener()
opener = build_opener(auth_handler)
try:
    result = opener.open(url)
    # The result is the page source returned after authentication succeeds
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
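
The same build_opener() pattern applies to the other handlers listed above. As a minimal sketch, the code below builds an Opener around a ProxyHandler and lists the handlers it ends up with, without sending any request. The proxy address 127.0.0.1:8080 and the User-Agent string are hypothetical values chosen for illustration.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical local proxy address, used only for illustration.
proxy_handler = ProxyHandler({"http": "http://127.0.0.1:8080"})

# build_opener() adds the default handlers and uses our ProxyHandler
# instance in place of the default one.
opener = build_opener(proxy_handler)

# Headers set here are sent with every request made through this opener.
opener.addheaders = [("User-Agent", "my-crawler/0.1")]

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Requests sent with opener.open(url) would then go through the configured proxy, whereas urlopen() keeps using the default Opener.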

* cookies

import http.cookiejar
import urllib.request
# First create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Use the CookieJar object to create an HTTPCookieProcessor handler
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
for item in cookie:
    print(item.name + "=" + item.value)
# Save the received cookies to a file (Mozilla format)
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)
# Save the received cookies in LWP format
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)
# Send a request using the saved cookies
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar()
# Use load() to read the local cookies file
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener; the request has to be sent with opener.open() for the cookies to be used
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))
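
The save() and load() round trip can also be checked offline. In the sketch below the cookie is built by hand with a hypothetical make_cookie() helper, standing in for what a server's Set-Cookie header would normally create; the name, value, and domain are made-up values.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand for the demo."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "cookies.txt")

    jar = MozillaCookieJar(filename)
    jar.set_cookie(make_cookie("session", "abc123", "example.com"))
    # A session cookie has discard=True, so it is only written
    # when ignore_discard=True is passed.
    jar.save(ignore_discard=True, ignore_expires=True)

    # A fresh jar reads the same file back.
    restored = MozillaCookieJar()
    restored.load(filename, ignore_discard=True, ignore_expires=True)

loaded = {cookie.name: cookie.value for cookie in restored}
print(loaded)
```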

error

* URLError

URLError is the base class of the exceptions in the error module.

# The error module of urllib defines the exceptions raised by request
# This page does not exist
from urllib import request, error 
try: 
    response = request.urlopen('https://cuiqingcai.com/index.htm') 
except error.URLError as e: 
    print(e.reason)

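* HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors. It has three attributes:

code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error.
reason: the cause of the error, as in the parent class.
headers: the response headers.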

from urllib import request, error 
try: 
    response = request.urlopen('https://cuiqingcai.com/index.htm') 
# Catch the subclass HTTPError first
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
# Then catch the parent class URLError
except error.URLError as e: 
    print() 
    print(e.reason) 
else: 
    print('Request Successfully')
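
The order of the except clauses matters because HTTPError is a subclass of URLError. The sketch below shows the same subclass-first check without any network access by constructing the exceptions directly; the describe() helper, the example URL, and the header values are assumptions made for this illustration.

```python
import io
import socket
from email.message import Message
from urllib import error


def describe(exc):
    """Hypothetical helper: check the subclass before the parent class."""
    if isinstance(exc, error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        # reason is not always a string; for a timeout it is an exception object.
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return "URL error: %s" % exc.reason
    return "unexpected: %r" % (exc,)


# Build the exceptions by hand instead of triggering them over the network.
headers = Message()
headers["Content-Type"] = "text/html"
http_error = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", headers, io.BytesIO(b"")
)
url_error = error.URLError("no host given")
timeout_error = error.URLError(socket.timeout("timed out"))

print(describe(http_error))
print(describe(url_error))
print(describe(timeout_error))
print(issubclass(error.HTTPError, error.URLError))
```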

parse

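This module is used to parse all kinds of URLs.

* urlparse()

Parameters: [urlstring] the URL to parse (string)

[scheme] the default scheme (http, https, ftp…), used only when the URL itself does not contain one

[allow_fragments] whether to parse the fragment (anchor); if False, the fragment is not split off and stays in the preceding component (bool)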

from urllib.parse import urlparse  
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment') 
print(type(result), result) 
--------------------------------------- 
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
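
The scheme and allow_fragments parameters are easiest to understand by comparing results. The following is a small sketch; the URLs are variations of the one used above, chosen for illustration.

```python
from urllib.parse import urlparse

# scheme is only a fallback: it applies when the URL has no scheme of its own.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme)   # https
# Without a leading "//" the host is parsed as part of the path.
print(no_scheme.netloc)   # (empty string)
print(no_scheme.path)     # www.baidu.com/index.html

# A scheme present in the URL wins over the default.
has_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
print(has_scheme.scheme)  # http

# With allow_fragments=False the "#comment" part stays in the path.
no_fragment = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(no_fragment.path)      # /index.html#comment
print(no_fragment.fragment)  # (empty string)
```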

* urlunparse()

from urllib.parse import urlunparse 
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment'] 
print(urlunparse(data)) 
------------------------------------------------- 
http://www.baidu.com/index.html;user?a=6#comment

* urlsplit() 
* urlunsplit() 
* urljoin() 
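
The three functions listed above come without an example, so here is a minimal sketch of each. The URLs reuse the ones from the urlparse() and urlunparse() examples; the /about/ paths are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# urlsplit() is like urlparse() but returns 5 parts:
# params are not split off and stay inside path.
parts = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(parts.path)   # /index.html;user
print(parts.query)  # id=5

# urlunsplit() is the inverse and takes exactly 5 components.
rebuilt = urlunsplit(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
print(rebuilt)      # http://www.baidu.com/index.html?a=6#comment

# urljoin() resolves the second argument against the base URL.
base = 'http://www.baidu.com/about/index.html'
relative = urljoin(base, 'faq.html')
rooted = urljoin(base, '/faq.html')
absolute = urljoin(base, 'https://cuiqingcai.com/faq.html')
print(relative)     # http://www.baidu.com/about/faq.html
print(rooted)       # http://www.baidu.com/faq.html
print(absolute)     # https://cuiqingcai.com/faq.html
```
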
* parse_qs()

# Deserialization: parse_qs turns a GET query string into a dict
from urllib.parse import parse_qs 
query = 'name=germey&age=22' 
print(parse_qs(query))

* parse_qsl()

# Deserialization: parse_qsl() turns a GET query string into a list of tuples
from urllib.parse import parse_qsl 
query = 'name=germey&age=22' 
print(parse_qsl(query))
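
The two functions differ in the shape of their result, which the sketch below makes explicit. The repeated tag parameter is an assumption added to show why parse_qs() returns lists.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'name=germey&age=22'

# parse_qs() maps each key to a LIST of values, all of them strings.
as_dict = parse_qs(query)
print(as_dict)       # {'name': ['germey'], 'age': ['22']}

# parse_qsl() keeps the original order as a list of (key, value) tuples.
as_pairs = parse_qsl(query)
print(as_pairs)      # [('name', 'germey'), ('age', '22')]

# A key that appears more than once shows why the values are lists.
repeated = parse_qs('tag=a&tag=b')
print(repeated)      # {'tag': ['a', 'b']}

# urlencode() accepts the tuple list, so the round trip restores the query.
round_trip = urlencode(as_pairs)
print(round_trip)    # name=germey&age=22
```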

* urlencode()

# For a GET request, use urlencode() to turn the params dict into GET query parameters
from urllib.parse import urlencode 
params = { 
    'name': 'germey', 
    'age': 22 
} 
base_url = 'http://www.baidu.com?' 
url = base_url + urlencode(params) 
print(url)
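
Beyond plain ASCII values, urlencode() also escapes spaces and non-ASCII text and can expand list values. A minimal sketch, with the wd, q, and tag keys chosen purely for illustration:

```python
from urllib.parse import urlencode

# Numbers are converted to strings.
basic = urlencode({'name': 'germey', 'age': 22})
print(basic)    # name=germey&age=22

# Non-ASCII values are percent-encoded as UTF-8 by default.
chinese = urlencode({'wd': '壁纸'})
print(chinese)  # wd=%E5%A3%81%E7%BA%B8

# Spaces become "+" because urlencode() uses quote_plus() by default.
spaced = urlencode({'q': 'a b'})
print(spaced)   # q=a+b

# doseq=True expands a list into one key=value pair per element.
multi = urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)    # tag=a&tag=b
```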

* quote(): percent-encode Chinese characters for a URL

# Convert a Chinese string to its URL-encoded form with quote
from urllib.parse import quote 
keyword = '壁纸' 
url = 'https://www.baidu.com/s?wd=' + quote(keyword) 
print(url)

* unquote(): decode a percent-encoded URL back to Chinese characters

# Decode the percent-encoded Chinese characters in a URL with unquote
from urllib.parse import unquote 
url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8' 
print(unquote(url))
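
quote() and unquote() are inverses, and the safe parameter of quote() controls which characters are left alone. The sketch below illustrates this; the 'a b/c' sample string is an assumption, while the Chinese keyword is the one used above.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

keyword = '壁纸'
encoded = quote(keyword)
print(encoded)              # %E5%A3%81%E7%BA%B8
print(unquote(encoded))     # 壁纸

# By default "/" is considered safe and is not encoded.
path_like = quote('a b/c')
print(path_like)            # a%20b/c

# safe='' encodes "/" too, which suits a single query value.
strict = quote('a b/c', safe='')
print(strict)               # a%20b%2Fc

# quote_plus() encodes spaces as "+", as HTML forms do.
plus = quote_plus('a b')
print(plus)                 # a+b
print(unquote_plus(plus))   # a b
```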

robotparser

robots.txt

User-agent: *  
Disallow: / 
Allow: /public/

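Code

# The robots protocol
# Check whether a URL may be crawled
from urllib.robotparser import RobotFileParser
# Create a RobotFileParser instance
rp = RobotFileParser()
# Then set the robots.txt URL on the instance
rp.set_url('http://www.jianshu.com/robots.txt')
# read() must be called to fetch and parse the file before can_fetch() can be used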
rp.read() 
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d')) 
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
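
RobotFileParser can also parse rules that are already in memory through its parse() method, which makes it possible to test rules without fetching anything. The sketch below is based on the sample robots.txt above, with one deliberate change: the Allow line is listed before the Disallow line, on the assumption that Python's parser checks rules in file order and applies the first one that matches. The example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the sample robots.txt, with Allow listed first
# (assumption: the first matching rule decides the result).
rules = [
    "User-agent: *",
    "Allow: /public/",
    "Disallow: /",
]

rp = RobotFileParser()
# parse() takes the lines of a robots.txt file; no network access is needed.
rp.parse(rules)

public_ok = rp.can_fetch('*', 'http://example.com/public/index.html')
private_ok = rp.can_fetch('*', 'http://example.com/private/index.html')
print(public_ok)
print(private_ok)
```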
