Python Web Scraping with urllib

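Introduction to the urllib library

Python 2 had two libraries for sending requests, urllib2 and urllib; in Python 3 they have been merged into a single urllib package. urllib is Python's built-in HTTP request library, so it can be used without installing anything extra. It contains the following four modules: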

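request: the most basic HTTP request module in urllib, used to simulate sending requests. Passing a URL and a few extra parameters is enough to simulate a browser sending a request to a server.
error: urllib's exception handling module. If a request fails, these exceptions can be caught so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: parses a site's robots.txt file and determines which pages may be crawled and which may not (rarely used in practice).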

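The request module

The urlopen() function performs simple requests and page fetches. The most basic case is a GET request for a web page, taking the Python website as an example: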

import urllib.request
response = urllib.request.urlopen("https://www.python.org")
print(response.read().decode("utf-8"))
print(type(response))
Output (the page source printed first is omitted):
<class 'http.client.HTTPResponse'>

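As the code above shows, urlopen() returns an object of type HTTPResponse. We assign it to the variable response and can then call its methods and read its attributes to get information about the response.

The main methods and attributes of an HTTPResponse object are:

Methods: read(), readinto(), getheader(name), getheaders(), fileno(), etc.
Attributes: msg, version, status, reason, debuglevel, closed, etc.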
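
To try a few of these methods and attributes without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and fetches from it. The HelloHandler class, the automatically chosen port, and the response body are assumptions made for this illustration; they are not part of the original example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default request logging.
        pass


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

with urllib.request.urlopen(url) as response:
    status = response.status                           # numeric status code
    reason = response.reason                           # reason phrase from the server
    content_type = response.getheader("Content-Type")  # a single header value
    all_headers = response.getheaders()                # list of (name, value) pairs
    body = response.read().decode("utf-8")             # response body as text

server.shutdown()
server.server_close()

print(status, reason, content_type, body)
```

Using the response as a context manager closes the connection automatically once the block ends.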

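Fetching pages with urlopen() parameters:

The API of the urlopen() function is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
data parameter: additional data to send. It must be converted to bytes, for example with bytes(), and once it is supplied the request method changes from GET to POST.
timeout parameter: the timeout for the request, in seconds.
context parameter: must be of type ssl.SSLContext; used to specify SSL settings.
cafile parameter: a CA certificate file.
capath parameter: the path to a directory of CA certificates.
cadefault parameter: deprecated; its default value is False.
All parameters other than url are optional. The most commonly used are data (additional data) and timeout (request timeout). Note that cafile, capath, and cadefault were removed in Python 3.13; use context instead.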

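The data parameter: the following adds a data parameter to a request to http://httpbin.org/post. This URL can be used to test POST requests; it echoes back the request information, including the data that was passed.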

from  urllib.request import urlopen
import urllib.parse
# urllib.parse.urlencode() converts the dict to a query string
# bytes() converts the string to bytes
data = bytes(urllib.parse.urlencode({'work':'Hello'}),encoding='utf-8')
response = urlopen('http://httpbin.org/post',data=data)
print(response.read().decode('utf-8'))

Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "work": "Hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "1.85.205.85", 
  "url": "http://httpbin.org/post"
}

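The timeout parameter: sets the timeout in seconds. If the request has not received a response within this time, an exception is raised. If the parameter is not specified, the global default timeout is used.

The following requests the test URL http://httpbin.org/get with a timeout of 0.1 seconds. Since the server is unlikely to respond within 0.1 seconds, a URLError is raised; if its reason is socket.timeout, that is, a timeout, the code prints Time Out.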

from urllib import request,error
import socket
try:
    response = request.urlopen('http://httpbin.org/get',timeout = 0.1)
except error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('Time Out')
Output:
    Time Out
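
One caveat worth sketching: as far as I can tell from CPython's implementation, a timeout that occurs while connecting or sending is wrapped in URLError, but a timeout while waiting for the response can surface as a bare socket.timeout. The helper below catches both. The function name fetch_or_timeout and the silent local socket used to provoke a timeout are assumptions made for this illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_timeout(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts raised while connecting or sending arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts raised while waiting for the response are not wrapped.
        return None


# Hypothetical fixture: a local socket that lets clients connect
# but never sends an HTTP response, so the read has to time out.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]

result = fetch_or_timeout(url, timeout=0.5)
silent.close()
print(result)
```

Catching URLError first and socket.timeout second keeps non-timeout network errors visible, because they are re-raised rather than swallowed.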

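The Request class in the request module

We still use urlopen() to send the request, but its argument is no longer a URL but an object of type Request. Building this data structure makes the request a standalone object and allows its parameters to be configured more richly and flexibly.

The constructor of the Request class is as follows:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
url: the URL to request; required.
data: optional. It must be of type bytes; if the data is a dict, encode it first with urlencode() from the urllib.parse module.
headers: optional. The request headers, as a dict. A common use is to change the User-Agent to imitate a browser.
origin_req_host: optional. The host name or IP address of the requester.
unverifiable: optional. Indicates whether the request is unverifiable; defaults to False.
method: optional. A string giving the request method, such as GET, POST, or PUT.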
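
Because a Request is just an object, it can be built and inspected without sending anything over the network. The sketch below shows how the data and method parameters affect the HTTP method that would be used; the User-Agent string is an arbitrary value chosen for illustration.

```python
from urllib import parse, request

form = {"name": "Germey"}
data = parse.urlencode(form).encode("utf-8")  # dict -> query string -> bytes

get_req = request.Request("http://httpbin.org/get")
post_req = request.Request(
    "http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo)"},
)
put_req = request.Request("http://httpbin.org/put", data=data, method="PUT")

print(get_req.get_method())   # GET: no data was supplied
print(post_req.get_method())  # POST: supplying data switches the default method
print(put_req.get_method())   # PUT: an explicit method takes precedence
print(post_req.data)          # b'name=Germey'

# Header names are stored capitalized, so look the header up as 'User-agent'.
print(post_req.get_header("User-agent"))
```

Nothing is sent until the object is passed to urlopen() or to an opener's open() method.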

Simple code example:

import urllib.request
req = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

Example passing parameters:

from urllib import request,parse

url = "http://httpbin.org/post"
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {
    'name' : 'Germey'
}

data = bytes(parse.urlencode(form_data),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5; Windows NT)"
  }, 
  "json": null, 
  "origin": "1.85.205.85", 
  "url": "http://httpbin.org/post"
}

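Handlers in the request module

Tasks such as cookie handling and proxy settings need a more powerful tool, the Handler. There are dedicated handler classes for login authentication, cookies, and proxy settings. urllib.request has a BaseHandler class that is the parent of all other handlers; commonly used classes include:

HTTPDefaultErrorHandler: handles HTTP response errors by raising an HTTPError exception.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets a proxy; if no proxies are passed, the settings are read from the environment.
HTTPPasswordMgr: manages passwords by maintaining a table of user names and passwords.
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, it can be used to handle it.

Site login authentication (a site that asks for a user name and password), code example: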

from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Create the password manager
p = HTTPPasswordMgrWithDefaultRealm()
# Add the user name and password
p.add_password(None,url,username,password)
# Create the handler that performs authentication
auth_handler = HTTPBasicAuthHandler(p)
# Build the opener
opener = build_opener(auth_handler)
try:
    # Open the link with the opener's open() method to complete authentication
    result = opener.open(url)
    # Get the source of the page after authentication
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
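
The example above needs a server running on localhost:5000. As a minimal offline sketch, the password manager can also be queried directly to see which credentials it would hand to the handler for a given URL. The realm name "Any Realm" and the example.com URL are placeholders chosen for illustration.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

manager = HTTPPasswordMgrWithDefaultRealm()
# A realm of None registers the credentials as the default for this URL prefix.
manager.add_password(None, "http://localhost:5000/", "username", "password")

# URLs under the registered prefix resolve to the stored credentials.
found = manager.find_user_password("Any Realm", "http://localhost:5000/admin")
# Other hosts have no credentials registered.
missing = manager.find_user_password("Any Realm", "http://example.com/")

opener = build_opener(HTTPBasicAuthHandler(manager))
has_auth = any(isinstance(h, HTTPBasicAuthHandler) for h in opener.handlers)

print(found)
print(missing)
print(has_auth)
```

build_opener() adds the supplied handler alongside the default ones, which is why the opener can still perform ordinary HTTP requests.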

Proxy settings, code example:

from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener
# Set the proxies
proxy_handler = ProxyHandler({
    'http' : 'http://127.0.0.1:9743',
    'https' : 'https://127.0.0.1:9743'
})
# Build the opener
opener = build_opener(proxy_handler)
try:
    # Send the request
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Getting cookies, code example:

import http.cookiejar,urllib.request
# Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build the handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener with build_opener()
opener = urllib.request.build_opener(handler)
# Open the URL with the opener
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
Output:
BAIDUID=1FEB885D7B33FA158112D324A43A0BC7:FG=1
BIDUPSID=1FEB885D7B33FA158112D324A43A0BC7
H_PS_PSSID=1990_1458_21116_28131_26350_27751_28140_22157
PSTM=1545228628
delPer=0
BDSVRTM=0
BD_HOME=0

Use MozillaCookieJar to save cookies in the cookies format used by Mozilla-type browsers:

import http.cookiejar,urllib.request

filename = 'cookies.txt'
# Create a MozillaCookieJar object
cookie = http.cookiejar.MozillaCookieJar(filename)
# Build the handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener with build_opener()
opener = urllib.request.build_opener(handler)
# Open the URL with the opener
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

Contents of cookies.txt:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com TRUE   /  FALSE  3692712603 BAIDUID    7F25DC1378FA5F15462633205F9F8AFD:FG=1
.baidu.com TRUE   /  FALSE  3692712603 BIDUPSID   7F25DC1378FA5F15462633205F9F8AFD
.baidu.com TRUE   /  FALSE     H_PS_PSSID 26525_1448_21085_28131_27751_28139_27542
.baidu.com TRUE   /  FALSE  3692712603 PSTM   1545228954
.baidu.com TRUE   /  FALSE     delPer 0
www.baidu.com  FALSE  /  FALSE     BDSVRTM    0
www.baidu.com  FALSE  /  FALSE     BD_HOME    0

LWPCookieJar can likewise save and read cookie files. Code example:

import http.cookiejar,urllib.request

filename = 'lwp_cookies.txt'
# Create an LWPCookieJar object
cookie = http.cookiejar.LWPCookieJar(filename)
# Build the handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener with build_opener()
opener = urllib.request.build_opener(handler)
# Open the URL with the opener
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

Contents of lwp_cookies.txt:
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="BA7B220FE2C25EDD724D2AB77AFEE19A:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-01-06 17:36:58Z"; version=0
Set-Cookie3: BIDUPSID=BA7B220FE2C25EDD724D2AB77AFEE19A; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-01-06 17:36:58Z"; version=0
Set-Cookie3: H_PS_PSSID=1449_21094_28132_27751_28139_22158; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1545229369; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-01-06 17:36:58Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Reading cookies from a file, taking the LWPCookieJar-format file lwp_cookies.txt as an example:

import http.cookiejar,urllib.request

filename = 'lwp_cookies.txt'
# Create an LWPCookieJar object
cookie = http.cookiejar.LWPCookieJar()
# Load the cookies saved earlier from the file
cookie.load(filename,ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
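
To check what load() actually reads, without sending a request, the following sketch writes a one-cookie LWP file into a temporary directory and loads it back. The file content reuses one Set-Cookie3 line from the sample above; the temporary path and variable names are assumptions made for this illustration.

```python
import http.cookiejar
import os
import tempfile

# A minimal LWP cookie file: the magic first line plus one session cookie.
LWP_TEXT = (
    "#LWP-Cookies-2.0\n"
    'Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; '
    "path_spec; discard; version=0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "lwp_cookies.txt")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(LWP_TEXT)

    jar = http.cookiejar.LWPCookieJar()
    # ignore_discard=True keeps session cookies, which are marked "discard".
    jar.load(filename, ignore_discard=True, ignore_expires=True)

loaded = {cookie.name: cookie.value for cookie in jar}
print(loaded)
```

Without ignore_discard=True, the cookie in this file would be skipped on load because it carries the discard flag.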

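The error module: exception handling

The URLError class comes from urllib's error module and inherits from OSError. It is the base class of the error module, and exceptions raised by the request module can be handled by catching URLError.

URLError attribute reason: returns the cause of the error.

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as authentication failures. Its attributes are:

code: the HTTP status code.
reason: as in the parent class, returns the cause of the error; the value may be an object rather than a string.
headers: the response headers.

The reason attribute of URLError, opening a page that does not exist, code example: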

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.html')
except error.URLError as e:
    print(e.reason)
Output: Not Found

HTTPError attributes: catching the exception and printing reason, code, and headers. Code example:

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.html')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep = '\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully.')
Output:
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Wed, 19 Dec 2018 14:55:22 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
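
The order of the except clauses in the example matters: HTTPError is a subclass of URLError, so the more specific class has to be tested first. The sketch below illustrates this offline by constructing the exceptions by hand instead of making a request; the describe() helper and the sample URL are hypothetical.

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Summarize an exception, testing the most specific class first."""
    if isinstance(exc, HTTPError):
        # The server answered, but with an error status.
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, URLError):
        # The request never produced an HTTP response.
        return "URL error: %s" % exc.reason
    return "other: %r" % (exc,)


# Hand-built exceptions standing in for real failures.
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
unreachable = URLError("connection refused")

print(describe(not_found))
print(describe(unreachable))
print(issubclass(HTTPError, URLError), issubclass(URLError, OSError))
```

If the URLError check came first, it would also match every HTTPError and the status code would never be reported.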

The reason attribute is not always a string; it may be an object. Code example:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com',timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason,socket.timeout):
        print('Time Out')
Output:
<class 'socket.timeout'>
Time Out

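The parse module

The parse module defines a standard interface for handling URLs, such as extracting, combining, and converting the parts of a URL. Supported URL schemes include file, ftp, gopher, hdl, http, https, imap, telnet, sftp, rsync, and svn. Commonly used functions are as follows.

urlparse(): identifies a URL and splits it into components. Its parameters are:

urlstring: required; the URL to parse.
scheme: the default scheme (for example http or https). It is used only if the URL itself carries no scheme; if the URL has a scheme, the parsed scheme is returned instead.
allow_fragments: whether to recognize the fragment. If set to False, the fragment is not split out; it is parsed as part of the path, params, or query, and the fragment component is empty.

The urlstring parameter, code example: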

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

Output:

<class 'urllib.parse.ParseResult'>

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The scheme parameter, code example:

from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=#5comment',scheme='https')
print(type(result))
print(result)

Output:
<class 'urllib.parse.ParseResult'>

# The result is a named tuple
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=', fragment='5comment')
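
Note that netloc is empty in the result above: without a leading //, urlparse() treats the host name as part of the path. The short sketch below compares three inputs to show when the default scheme applies and when the host is recognized; the variable names are only for illustration.

```python
from urllib.parse import urlparse

# No scheme and no "//": the host name ends up in path, netloc stays empty.
no_slashes = urlparse("www.baidu.com/index.html", scheme="https")

# A leading "//" marks the start of the network location.
with_slashes = urlparse("//www.baidu.com/index.html", scheme="https")

# A scheme in the URL itself overrides the default.
explicit = urlparse("http://www.baidu.com/index.html", scheme="https")

for result in (no_slashes, with_slashes, explicit):
    print(result.scheme, repr(result.netloc), result.path)
```

When URLs come from user input without a scheme, prefixing them with // before parsing is one way to get the host into netloc.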

The allow_fragments parameter, code example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=#5comment',allow_fragments=False)
print(type(result))
print(result)
Output:
<class 'urllib.parse.ParseResult'>
# The result is a named tuple
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=#5comment', fragment='')

The allow_fragments parameter when the URL has no params or query: the fragment is parsed as part of the path. Code example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(type(result))
print(result)

Output:
<class 'urllib.parse.ParseResult'>
# The result is a named tuple
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

urlunparse() is the inverse of urlparse(). It takes an iterable whose length must be exactly 6:

from urllib.parse import urlunparse

data = ['http','www.baidu.com','index.html','user','a=6','comment']
result = urlunparse(data)
print(type(result))
print(result)

Output:
<class 'str'>
http://www.baidu.com/index.html;user?a=6#comment

urlsplit(): similar to urlparse(), but it does not parse params separately (they remain part of path) and returns only 5 components:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
print(result.scheme,result[0])

Output:
<class 'urllib.parse.SplitResult'>
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
http http

urlunsplit(): like urlunparse(), it combines the parts into a complete URL. The argument is an iterable whose length must be exactly 5:

from urllib.parse import urlunsplit

data = ['http','www.baidu.com','index.html','a=6','comment']
result = urlunsplit(data)
print(type(result))
print(result)

Output:
<class 'str'>
http://www.baidu.com/index.html?a=6#comment

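urljoin(): unlike urlunparse() and urlunsplit(), this function takes a base URL as its first argument and a new link as its second. It analyzes the scheme, netloc, and path of the base URL and uses them to fill in whichever of these the new link lacks; any of them that the new link does provide take precedence. The params, query, and fragment of the base URL have no effect.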

from urllib.parse import urljoin

print(urljoin("http://www.baidu.com",'FAQ.html'))
print(urljoin('http://www.baidu.com','https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc','https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com#comment','?category=2'))
Output:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

urlencode(): used to build GET request parameters. For example, when the parameters are held in a dict, urlencode() serializes them into a GET query string:

from urllib.parse import urlencode

params = {
    'name' : 'germey',
    'age' : 27
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:
http://www.baidu.com?name=germey&age=27

parse_qs(): converts a GET query string back into a dict:

from urllib.parse import parse_qs

query = 'name=germey&age=2'
print(parse_qs(query))

Output:
{'name': ['germey'], 'age': ['2']}

parse_qsl(): converts the query parameters into a list of tuples:

from urllib.parse import parse_qsl

query = 'name=germey&age=2'
print(parse_qsl(query))

Output:
[('name', 'germey'), ('age', '2')]
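
parse_qs() wraps every value in a list because a key may appear more than once in a query string. The round trip below sketches that case, using the doseq argument of urlencode() to expand a list value into repeated keys; the tag parameter is made up for this illustration.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

params = {"name": "germey", "tag": ["python", "urllib"]}

# doseq=True emits one key=value pair per list element.
query = urlencode(params, doseq=True)
print(query)

# parse_qs groups repeated keys into one list per key.
as_dict = parse_qs(query)
print(as_dict)

# parse_qsl keeps every pair in order, so it can be re-encoded unchanged.
as_pairs = parse_qsl(query)
print(as_pairs)
print(urlencode(as_pairs) == query)
```

All parsed values come back as strings, so numeric parameters such as age need converting after parsing.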

quote(): a URL containing Chinese parameters can end up garbled; this function converts the Chinese text to URL-encoded (percent-encoded) form:

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
Output:
https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote(): decodes a URL-encoded string:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Output:
https://www.baidu.com/s?wd=壁纸

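The robotparser module

The robotparser module analyzes a site's robots protocol. The robots protocol, also called the crawler protocol, is formally the Robots Exclusion Standard; it tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt in the site's root directory. A crawler visiting the site first checks whether this file exists. If it does, the crawler crawls within the range the file defines; if it does not, the crawler visits every page that is directly accessible.

Sample robots.txt file:

User-Agent : * (the name of the crawler the rules apply to; * means all crawlers. If set to Baiduspider, the rules apply to the Baidu crawler; several such entries restrict several crawlers)
Disallow : / (directories that must not be crawled)
Allow : /public/ (directories that may be crawled)

Methods of the RobotFileParser class in the robotparser module:

set_url(): sets the URL of the robots.txt file.
read(): reads the robots.txt file and analyzes it. This method performs the fetch and the parse; if the file has never been read or parsed, every subsequent can_fetch() check returns False. It does not return anything.
parse(): parses robots.txt content; the argument is a sequence of lines from robots.txt.
can_fetch(): takes two arguments, the User-Agent and the URL to fetch, and returns True or False to indicate whether that crawler may fetch the URL.
mtime(): returns the time robots.txt was last fetched and analyzed; useful for long-running crawlers that need to re-check the file periodically.
modified(): sets the time robots.txt was last fetched and analyzed to the current time; also helpful for long-running crawlers.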
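
The parse() method makes it possible to try these rules without fetching anything. The sketch below feeds a small in-memory robots.txt to the parser; the rules and the example.com URLs are made up for illustration. As far as I know, Python's parser applies the first rule that matches a path, so the Allow line is placed before the broader Disallow line here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everything is off limits except /public/.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()

# Before anything has been read or parsed, every check is False.
before = rp.can_fetch("*", "http://example.com/public/index.html")

# parse() takes the lines of a robots.txt file.
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("*", "http://example.com/public/index.html")
blocked = rp.can_fetch("*", "http://example.com/private/index.html")

print(before, allowed, blocked)
print(rp.mtime() > 0)  # parse() records the time of the last check
```

The same parse() call can be used with content fetched by other means, for example when the robots.txt request needs custom headers.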

Code example:

from urllib.robotparser import RobotFileParser
# Create a RobotFileParser object
rp = RobotFileParser()
# Set the robots.txt URL
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
# Check whether the pages may be fetched
print(rp.can_fetch('*','http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*','http://www.jianshu.com/search?q=python&page=1&type=collections'))

Output:
False
False