I've recently been practicing writing Python crawlers and frequently use functions from the urllib2 library, so I am reposting this from http://www.cnblogs.com/youxin/archive/2013/05/07/3064434.html.
urllib2 is a curl-like Python module that ships with the standard library, so it is installed by default. Official docs: http://docs.python.org/2/library/urllib2.html
The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
This function returns a file-like object with additional methods: geturl(), info(), and (since Python 2.6) getcode().
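A minimal sketch of those extras (assuming Python 2.6+ for getcode() and timeout; http://www.example.com is just a stand-in URL):
import urllib2
response = urllib2.urlopen('http://www.example.com', timeout=10)  # timeout in seconds is optional (Python 2.6+)
print response.geturl()   # final URL, after any redirects
print response.info()     # the response headers
print response.getcode()  # HTTP status code, e.g. 200
print response.read(100)  # read like a file object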
class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
This class is an abstraction of a URL request.
Usage:
import urllib2
req = urllib2.Request("http://www.baidu.com")  # urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
Or:
import urllib2
html = urllib2.urlopen('http://piratebay.se/browse/200').read()
Passing parameters with POST
import urllib
import urllib2
url = 'http://localhost/php/GetPost.php'
dataArr = {'name': 'jack', 'password': 'pass'}
data = urllib.urlencode(dataArr)  # encodes a dict or a sequence of two-element tuples
req = urllib2.Request(url, data)
res = urllib2.urlopen(req)
html = res.read()
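Note: supplying the data argument makes urllib2 send a POST request; to pass the same parameters with GET, append the urlencoded string to the URL instead, as shown below.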
Passing parameters with GET
import urllib
import urllib2
url = 'http://localhost/php/GetPost.php'
dataArr = {'name': 'jack', 'password': 'pass'}
data = urllib.urlencode(dataArr)  # encodes a dict or a sequence of two-element tuples
full_url = url + '?' + data  # for GET, append the query string to the URL
req = urllib2.Request(full_url)
res = urllib2.urlopen(req)
html = res.read()
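Here full_url becomes something like http://localhost/php/GetPost.php?name=jack&password=pass (parameter order may vary, since Python 2 dicts are unordered).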
Setting headers on an HTTP request
Some sites dislike being visited by programs rather than humans, or serve different versions to different browsers.
By default, urllib2 identifies itself as 'Python-urllib/x.y', e.g. 'Python-urllib/2.7'.
This identity may confuse the site, or simply not work at all.
A browser identifies itself through the User-Agent header (http://baike.baidu.com/view/3398471.htm); when you create a Request object, you can give it a dictionary of header data.
The following example sends the same data as above, but masquerades as Internet Explorer.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
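As an alternative (a sketch, not from the original post), the same header can be set after constructing the request via urllib2.Request's add_header() method:
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
data = urllib.urlencode({'name': 'WHY', 'location': 'SDU', 'language': 'Python'})
req = urllib2.Request(url, data)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # same effect as passing a headers dict
response = urllib2.urlopen(req)
the_page = response.read()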