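Web Programming, Part 2: urllib
The urllib module provides functions for downloading data from a given URL. It can also encode and decode strings so that they are safe to use as part of a valid URL string.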
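Core urllib module functions
urlopen(urlstr, postQueryData=None)    Opens urlstr, sending postQueryData with the request if it is given
urlretrieve(url, filename=None,        Downloads the file located at url to filename, or to a temporary file;
    reporthook=None, data=None)        if reporthook is given, it receives download statistics
quote(urldata, safe='/')               Encodes the characters in urldata that are invalid in a URL; characters listed in safe are not encoded
unquote(urldata)                       Decodes the encoded characters in urldata
quote_plus(urldata, safe='')           Encodes spaces as +; otherwise the same as quote (note that safe defaults to '', so / is encoded too)
unquote_plus(urldata)                  Decodes + into a space; otherwise the same as unquote
urlencode(dict)                        Encodes the key-value pairs of a dictionary into a valid CGI request string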
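urllib functions
urlopen
urlopen opens a connection to the Web resource named by the given URL string and returns a file-like object.
urlopen(urlstr,postQueryData=None)
Once the connection succeeds, a file-like object is returned. Its methods are as follows:
f.read([bytes])    Reads all of f, or at most bytes bytes
f.readline()       Reads one line from f
f.readlines()      Reads all lines from f and returns them as a list
f.close()          Closes the file handle of f
f.info()           Returns the MIME headers of f
f.geturl()         Returns the real URL that f was opened from
Example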
>>> import urllib
>>> url='http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'
>>> f=urllib.urlopen(url)
>>> f.info()
<httplib.HTTPMessage instance at 0x10e2a1f38>
>>> f.geturl()
'http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'
>>> data=f.read()
>>> len(data)
684547
>>> out=open('httpd.tar.bz2','wb')   # binary mode; the download is a .tar.bz2 archive
>>> out.write(data)
>>> out.close()
>>> f.close()
>>> url='http://www.126.com'
>>> f=urllib.urlopen(url)
>>> lines=f.readlines()
>>> for line in lines: print line
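The file-like methods listed above can also be tried without network access by pointing urlopen at a local file URL. The following is only a minimal sketch under a few assumptions: the temporary file and its three lines are made up for illustration, and the import falls back between Python 3's urllib.request and the Python 2 urllib module used in this article.

```python
import os
import tempfile

try:  # Python 3: these functions live in urllib.request
    from urllib.request import urlopen, pathname2url
except ImportError:  # Python 2, as used in this article
    from urllib import urlopen, pathname2url

# Create a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as tmp:
    tmp.write(b'first line\nsecond line\nthird line\n')

f = urlopen('file:' + pathname2url(path))
try:
    content_length = f.info()['Content-Length']  # header value, as a string
    head = f.read(5)              # read exactly 5 bytes
    rest_of_line = f.readline()   # read up to and including the next newline
    remaining = f.readlines()     # list of all remaining lines
finally:
    f.close()
    os.remove(path)
```

Each read continues from where the previous one stopped, which is why readline() returns only the rest of the first line here.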
urlretrieve
urlretrieve(url, filename=None, reporthook=None, data=None)
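url is the address of the resource to download
filename is the local path to save to; if it is omitted, a temporary file is used
reporthook is a callback function
The return value is the tuple (filename, headers)
reporthook(block_read,block_size,total_size)
This is the signature of the callback function. It is called once when the connection is opened and again after each block of data has been transferred, where
block_size is the size of each block, in bytes,
block_read is the number of blocks transferred so far,
total_size is the total size of the file in bytes, or a negative value if the size is unknown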
eg 1:
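import urllib

url = 'http://www.baidu.com'
try:
    # with no filename given, the page is saved to a temporary file
    filename = urllib.urlretrieve(url)[0]
except IOError:
    print "Can't open the url..."
else:
    # read the file back only if the download succeeded
    f = open(filename)
    for line in f.readlines():
        print line
    f.close()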
eg 2:
import urllib
import os

url = 'http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'

def reporthook(block_read, block_size, total_size):
    # block_read is 0 on the first call, made when the connection opens
    if not block_read:
        print "connection opened"
        return
    amount_read = block_read * block_size
    if total_size < 0:
        # unknown size
        print "read %d blocks (%dbytes)" % (block_read, amount_read)
    else:
        print 'Read %d blocks,or %d/%d' % (block_read, amount_read, total_size)

try:
    filename, header = urllib.urlretrieve(url, reporthook=reporthook)
except IOError:
    print "Can't open the url..."
else:
    # report on the downloaded file only if the download succeeded
    print 'filename:%s' % filename
    print 'file header:%s' % header
    print "exists?", os.path.exists(filename)
Output:
Read 1 blocks,or 8192/4994460
...
Read 610 blocks,or 4997120/4994460
filename:/var/folders/k3/z8ly7gy9635f58bbsdcrbr0c0000gn/T/tmpE5mxzo.bz2
file header:Date: Sun, 23 Mar 2014 13:42:24 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Sun, 16 Mar 2014 17:22:21 GMT
ETag: "b2e0d27-4c359c-4f4bc8ba94d40"
Accept-Ranges: bytes
Content-Length: 4994460
Connection: close
Content-Type: application/x-bzip2
exists? True
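In the output above, the last progress line reports 4997120 bytes against a total of 4994460. The callback multiplies the block count by the block size, and the final block is usually only partly filled, so the product overshoots. One possible way to turn the three callback arguments into a percentage is sketched below; percent_done is a hypothetical helper name, not part of urllib.

```python
def percent_done(block_read, block_size, total_size):
    """Return download progress in percent, or None if the size is unknown.

    Hypothetical helper for use inside a reporthook; not part of urllib.
    """
    if total_size <= 0:
        # the server did not report a usable size
        return None
    # clamp so that a partly filled last block never reports more than 100%
    amount_read = min(block_read * block_size, total_size)
    return 100.0 * amount_read / total_size
```

Called with the values from the last line of the output, percent_done(610, 8192, 4994460), the helper returns 100.0 rather than a value slightly above it.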
quote*()
quote(urldata,safe='/')
The quote* functions take URL data and encode it so that it can be used in a URL string.
>>> base='http://www/~foo/cig-bin/s.py'
>>> name='Joe mama'
>>> num=6
>>> final='%s?name=%s&num=%d'%(base,name,num)
>>> final
'http://www/~foo/cig-bin/s.py?name=Joe mama&num=6'
>>> urllib.quote(final)
'http%3A//www/%7Efoo/cig-bin/s.py%3Fname%3DJoe%20mama%26num%3D6'
>>> urllib.quote_plus(final,'/')
'http%3A//www/%7Efoo/cig-bin/s.py%3Fname%3DJoe+mama%26num%3D6'
>>> urllib.quote_plus(final)
'http%3A%2F%2Fwww%2F%7Efoo%2Fcig-bin%2Fs.py%3Fname%3DJoe+mama%26num%3D6'
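The session above quotes the complete URL, which also escapes the ':', '?', '=' and '&' characters that give the URL its structure. A common alternative, sketched here as an assumption about typical usage rather than as something this article prescribes, is to quote only the individual values before inserting them. build_url is a hypothetical helper, and the import falls back between Python 3's urllib.parse and the Python 2 urllib module.

```python
try:  # Python 3: the quoting functions live in urllib.parse
    from urllib.parse import quote_plus
except ImportError:  # Python 2, as used in this article
    from urllib import quote_plus


def build_url(base, name, num):
    """Hypothetical helper: quote only the value, not the URL structure."""
    return '%s?name=%s&num=%d' % (base, quote_plus(name), num)


final = build_url('http://www/~foo/cig-bin/s.py', 'Joe mama', 6)
tricky = build_url('http://www/~foo/cig-bin/s.py', 'Tom & Jerry', 1)
```

With this approach the '&' inside a value is escaped, while the '&' that separates parameters stays intact.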
unquote*()
The unquote*() functions convert every character encoded in the "%xx" form back to the corresponding ASCII character.
>>> quoted
'http%3A%2F%2Fwww%2F%7Efoo%2Fcig-bin%2Fs.py%3Fname%3DJoe+mama%26num%3D6'
>>> urllib.unquote(quoted)
'http://www/~foo/cig-bin/s.py?name=Joe+mama&num=6'
>>> urllib.unquote_plus(quoted)
'http://www/~foo/cig-bin/s.py?name=Joe mama&num=6'
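The difference between the two decoders matters most when the data itself contains a plus sign. The following minimal sketch uses a made-up sample value and assumes the same Python 2 / Python 3 import fallback; it shows that quote_plus followed by unquote_plus restores the original, whereas plain unquote leaves the encoded space as '+'.

```python
try:  # Python 3: the quoting functions live in urllib.parse
    from urllib.parse import quote_plus, unquote, unquote_plus
except ImportError:  # Python 2, as used in this article
    from urllib import quote_plus, unquote, unquote_plus

original = 'a+b c&d=e/f'          # made-up sample value
encoded = quote_plus(original)    # '+' becomes %2B, the space becomes '+'

plain = unquote(encoded)          # decodes %xx only, leaves '+' alone
restored = unquote_plus(encoded)  # also turns '+' back into a space
```

Because the literal '+' was encoded as %2B, unquote_plus can tell it apart from an encoded space.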
urlencode()
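The urlencode() function takes the key-value pairs of a dictionary and encodes them into the part of a URL string that carries a CGI request. Each pair has the form "key=value", pairs are joined with the & separator, and keys and values are passed through quote_plus for proper encoding.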
>>> params={'name':'Joe mama','dir':'~/tem'}   # avoid shadowing the built-in dict
>>> urllib.urlencode(params)
'name=Joe+mama&dir=%7E%2Ftem'
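urlencode also accepts a sequence of (key, value) pairs, which keeps the parameters in a fixed order. The sketch below is an illustration under stated assumptions: the parameter names are reused from the earlier quote example, and parse_qs (from urlparse in Python 2, urllib.parse in Python 3) is brought in only to show the reverse direction.

```python
try:  # Python 3
    from urllib.parse import urlencode, parse_qs
except ImportError:  # Python 2, as used in this article
    from urllib import urlencode
    from urlparse import parse_qs

# A list of pairs keeps the parameter order fixed.
pairs = [('name', 'Joe mama'), ('num', 6)]
query = urlencode(pairs)

# Append the query string to a base URL to form a GET request URL.
full_url = 'http://www/~foo/cig-bin/s.py' + '?' + query

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(query)
```

Note that the number 6 is converted to the string '6' on the way out and is not converted back when the query string is parsed.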
pathname2url,url2pathname
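These two functions are rarely called directly. They are mostly used inside the interface functions that offer a uniform way of locating network and local resources.
* urllib.pathname2url(path): converts a local path into a URL path;
* urllib.url2pathname(path): converts a URL path into a local path;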
eg:
>>> data=urllib.pathname2url(r'd:\Programe Files\phpcode\login.php')   # raw string keeps the backslashes literal
>>> print data
d%3A%5CPrograme%20Files%5Cphpcode%5Clogin.php
>>> urllib.url2pathname(data)
'd:\\Programe Files\\phpcode\\login.php'
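The two functions are inverses of each other, which a short round trip can show. This is a minimal sketch with a made-up relative path, built with os.path.join so that it uses the separator of whatever platform it runs on; the import fallback between Python 3's urllib.request and Python 2's urllib is again an assumption about the reader's environment.

```python
import os

try:  # Python 3: these functions live in urllib.request
    from urllib.request import pathname2url, url2pathname
except ImportError:  # Python 2, as used in this article
    from urllib import pathname2url, url2pathname

# Made-up relative path containing a space.
local_path = os.path.join('docs', 'my file.txt')

url_path = pathname2url(local_path)   # separators become '/', the space is quoted
restored = url2pathname(url_path)     # back to the platform's own path syntax
```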
Classes in urllib
CLASSES
URLopener
FancyURLopener
class FancyURLopener(URLopener)
| Derived class with handlers for errors we can handle (perhaps).
|
| Methods defined here:
|
| __init__(self, *args, **kwargs)
|
| get_user_passwd(self, host, realm, clear_cache=0)
|
| http_error_xxx(self, url, fp, errcode, errmsg, headers, data=None)
| Error xxx -- also relocated (permanently).
|
| http_error_default(self, url, fp, errcode, errmsg, headers)
| Default error handling -- don't raise an exception.
|
| prompt_user_passwd(self, host, realm)
| Override this in a GUI environment!
|
| redirect_internal(self, url, fp, errcode, errmsg, headers, data)
|
| retry_http_basic_auth(self, url, realm, data=None)
|
| retry_https_basic_auth(self, url, realm, data=None)
|
| retry_proxy_http_basic_auth(self, url, realm, data=None)
|
| retry_proxy_https_basic_auth(self, url, realm, data=None)
|
| ----------------------------------------------------------------------
| Methods inherited from URLopener:
|
| __del__(self)
|
| addheader(self, *args)
| Add a header to be used by the HTTP interface only
| e.g. u.addheader('Accept', 'sound/basic')
|
| cleanup(self)
|
| close(self)
|
| http_error(self, url, fp, errcode, errmsg, headers, data=None)
| Handle http errors.
| Derived class can override this, or provide specific handlers
| named http_error_DDD where DDD is the 3-digit error code.
|
| open(self, fullurl, data=None)
| Use URLopener().open(file) instead of open(file, 'r').
|
| open_data(self, url, data=None)
| Use "data" URL.
|
| open_file(self, url)
| Use local file or FTP depending on form of URL.
|
| open_ftp(self, url)
| Use FTP protocol.
|
| open_http(self, url, data=None)
| Use HTTP protocol.
|
| open_https(self, url, data=None)
| Use HTTPS protocol.
|
|
| open_local_file(self, url)
| Use local file.
|
| open_unknown(self, fullurl, data=None)
| Overridable interface to open unknown URL type.
|
| open_unknown_proxy(self, proxy, fullurl, data=None)
| Overridable interface to open unknown URL type.
|
| retrieve(self, url, filename=None, reporthook=None, data=None)
| retrieve(url) returns (filename, headers) for a local object
| or (tempfilename, headers) for a remote object.
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from URLopener:
|
| version = 'Python-urllib/1.17'
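The help text for http_error says that a derived class can provide handlers named http_error_DDD, where DDD is the 3-digit error code. The naming convention can be illustrated with a toy dispatcher. The classes below are hypothetical stand-ins written for this illustration, with simplified method signatures; they are not urllib's own implementation.

```python
class ToyOpener(object):
    """Hypothetical stand-in that mimics the handler naming convention."""

    def http_error(self, errcode, errmsg):
        # Look for a method named after the 3-digit status code.
        handler = getattr(self, 'http_error_%d' % errcode, None)
        if handler is not None:
            return handler(errcode, errmsg)
        # No specific handler: fall back to the default one.
        return self.http_error_default(errcode, errmsg)

    def http_error_default(self, errcode, errmsg):
        return 'unhandled %d %s' % (errcode, errmsg)


class NotFoundAwareOpener(ToyOpener):
    """Adds a specific handler for status code 404 only."""

    def http_error_404(self, errcode, errmsg):
        return 'custom 404 page'


opener = NotFoundAwareOpener()
handled = opener.http_error(404, 'Not Found')
fallback = opener.http_error(500, 'Server Error')
```

Adding support for another status code then only requires defining one more method with the matching name.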