Web Programming, Part 2: urllib

License: CC 4.0 BY-SA

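This article introduces Python's urllib module and explains how to use the urlopen and urlretrieve functions, including downloading files and handling callback functions. It also covers the quote, unquote, and urlencode functions.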


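The urllib module provides functions for downloading data from a given URL. It can also encode and decode strings so that they are valid as part of a URL string. The examples in this article use the Python 2 urllib module; in Python 3 these functions live in urllib.request and urllib.parse.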

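Core urllib module functions

urlopen(urlstr, postQueryData=None)      Opens urlstr, sending postQueryData with the request if given

urlretrieve(url, filename=None,          Downloads the file at url to filename, or to a temporary file;
            reporthook=None, data=None)  reporthook, if given, receives download statistics

quote(urldata, safe='/')                 Encodes characters in urldata that are invalid in a URL; characters listed in safe are left unencoded

unquote(urldata)                         Decodes the encoded characters in urldata

quote_plus(urldata, safe='')             Encodes spaces as +, and by default also encodes /; otherwise the same as quote

unquote_plus(urldata)                    Decodes + to a space; otherwise the same as unquote

urlencode(dict)                          Encodes the key-value pairs of a dictionary into a valid CGI request string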

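urllib functions

urlopen

urlopen opens a Web connection to the given URL string and returns a file-like object.

urlopen(urlstr, postQueryData=None)

Once the connection succeeds, a file-like object is returned. Its methods are:

f.read([bytes]) Reads all bytes from f, or at most bytes bytes
f.readline()    Reads one line from f
f.readlines()   Reads all lines from f and returns them as a list
f.close()       Closes the file handle of f
f.info()        Returns the MIME headers of f
f.geturl()      Returns the actual URL that was opened for f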

Example

>>> import urllib
>>> url='http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'
>>> f=urllib.urlopen(url)
>>> f.info()
<httplib.HTTPMessage instance at 0x10e2a1f38>
>>> f.geturl()
'http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'
>>> data=f.read()
>>> len(data)
684547
>>> # open in binary mode so the archive bytes are written unchanged
>>> out=open('httpd-2.4.9.tar.bz2','wb')
>>> out.write(data)
>>> out.close()
>>> f.close()

>>> url='http://www.126.com'
>>> f=urllib.urlopen(url)
>>> lines=f.readlines()
>>> for line in lines:
...     print line
...
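The file-like interface can also be tried without network access. The following is a minimal sketch, assuming a temporary local file opened through a file: URL instead of the HTTP URLs used above; the temporary file and its contents are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3

# Create a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as tmp:
    tmp.write(b'first line\nsecond line\n')

# Open the local file through a file: URL (hypothetical stand-in for a Web URL).
f = urlopen('file:' + pathname2url(path))
first = f.readline()   # one line, including its trailing newline
rest = f.readlines()   # the remaining lines, as a list
f.close()
os.remove(path)
```

The same read, readline, readlines, and close calls apply to the object returned for an HTTP URL.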

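urlretrieve

urlretrieve(url, filename=None, reporthook=None, data=None)

url is the URL to download.
filename is the local path to save to; if it is omitted, a temporary file is used.
reporthook is a callback function.
The return value is a tuple (filename, headers).

reporthook(block_read, block_size, total_size) defines the callback. It is called once when the connection is opened and again after each block of data has been transferred, where
block_size is the size in bytes of each block,
block_read is the number of blocks transferred so far,
total_size is the total size of the file in bytes, which may be negative when the size is unknown.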

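eg 1:

import urllib

url='http://www.baidu.com'
try:
    # urlretrieve returns (filename, headers); only the filename is needed here
    filename=urllib.urlretrieve(url)[0]
except IOError:
    print "Can't open the url..."
else:
    # the page was saved to a temporary file; print it line by line
    f=open(filename)
    lines=f.readlines()
    for line in lines:
        print line
    f.close()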

eg 2:

import urllib
import os

url='http://mirror.esocc.com/apache//httpd/httpd-2.4.9.tar.bz2'

def reporthook(block_read,block_size,total_size):
    # block_read is 0 on the first call, made when the connection opens
    if not block_read:
        print "connection opened"
        return
    amount_read=block_read*block_size
    if total_size<0:
        # unknown size
        print "read %d blocks (%d bytes)" %(block_read,amount_read)
    else:
        print 'Read %d blocks,or %d/%d' %(block_read,amount_read,total_size)

try:
    filename,header = urllib.urlretrieve(url,reporthook = reporthook)
except IOError:
    print "Can't open the url..."
else:
    print 'filename:%s'%filename
    print 'file header:%s'%header
    print "exists?",os.path.exists(filename)

Output:

Read 1 blocks,or 8192/4994460
...
Read 610 blocks,or 4997120/4994460
filename:/var/folders/k3/z8ly7gy9635f58bbsdcrbr0c0000gn/T/tmpE5mxzo.bz2
file header:Date: Sun, 23 Mar 2014 13:42:24 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Sun, 16 Mar 2014 17:22:21 GMT
ETag: "b2e0d27-4c359c-4f4bc8ba94d40"
Accept-Ranges: bytes
Content-Length: 4994460
Connection: close
Content-Type: application/x-bzip2

exists? True
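The last progress line above reports 4997120/4994460, more than the file size, because the count of blocks times the block size overshoots when the final block is only partly filled. A minimal sketch of a helper that turns the three reporthook arguments into a capped fraction is shown below; the name fraction_done is hypothetical and not part of urllib.

```python
def fraction_done(block_read, block_size, total_size):
    """Return the downloaded fraction (0.0 to 1.0), or None if the size is unknown."""
    if total_size <= 0:
        # The server did not report a size, so no fraction can be computed.
        return None
    # Cap the estimate at total_size: the last block is usually partial.
    amount_read = min(block_read * block_size, total_size)
    return amount_read / float(total_size)

# Values taken from the last progress line of the output above.
print(fraction_done(610, 8192, 4994460))
```

A reporthook could call such a helper and print a percentage instead of raw byte counts.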

quote*()

quote(urldata, safe='/')

The quote* functions take URL data and encode it so that it can be used in a URL string.

>>> base='http://www/~foo/cig-bin/s.py'
>>> name='Joe mama'
>>> num=6
>>> final='%s?name=%s&num=%d'%(base,name,num)
>>> final
'http://www/~foo/cig-bin/s.py?name=Joe mama&num=6'

>>> urllib.quote(final)
'http%3A//www/%7Efoo/cig-bin/s.py%3Fname%3DJoe%20mama%26num%3D6'

>>> urllib.quote_plus(final,'/')
'http%3A//www/%7Efoo/cig-bin/s.py%3Fname%3DJoe+mama%26num%3D6'

>>> urllib.quote_plus(final)
'http%3A%2F%2Fwww%2F%7Efoo%2Fcig-bin%2Fs.py%3Fname%3DJoe+mama%26num%3D6'
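Quoting a complete URL, as in the session above, also encodes the ':' and '?' that give the URL its structure. A common alternative is to quote only the individual components before assembling the URL. The sketch below assumes made-up values (the base URL, the name, and the file name are illustrative), and the import fallback is only there so it runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote, quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus  # Python 3

base = 'http://www/cgi-bin/s.py'   # hypothetical script URL
name = 'Joe mama & co'             # value containing spaces and '&'

# Encode only the parameter value, then assemble the URL.
url = '%s?name=%s&num=%d' % (base, quote_plus(name), 6)

# quote() leaves '/' alone by default; pass safe='' to encode it too.
segment = quote('my file.txt')
no_slash = quote('a/b', safe='')
```

Here the '&' inside the value becomes %26, so it cannot be mistaken for the separator between parameters.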

unquote*()

The unquote*() functions convert every character encoded in "%xx" form back to its ASCII equivalent.

>>> quoted=urllib.quote_plus(final)
>>> quoted
'http%3A%2F%2Fwww%2F%7Efoo%2Fcig-bin%2Fs.py%3Fname%3DJoe+mama%26num%3D6'
>>> urllib.unquote(quoted)
'http://www/~foo/cig-bin/s.py?name=Joe+mama&num=6'
>>> urllib.unquote_plus(quoted)
'http://www/~foo/cig-bin/s.py?name=Joe mama&num=6'
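The difference between the two decoders matters when the original data contains a literal plus sign. The following sketch, using a made-up string, shows that quote_plus encodes a literal '+' as %2B, so only unquote_plus restores the original text; the import fallback is only there so it runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus, unquote, unquote_plus  # Python 3

original = 'a+b = c'              # contains a literal '+' and spaces
encoded = quote_plus(original)    # '+' becomes %2B, spaces become '+'

with_plus = unquote_plus(encoded)  # decodes %xx and turns '+' back into spaces
without_plus = unquote(encoded)    # decodes %xx only, leaving '+' in place
```

Data encoded with quote_plus should therefore be decoded with unquote_plus, and data encoded with quote with unquote.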

urlencode()

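urlencode() takes a dictionary of key-value pairs and encodes it into part of a URL string for a CGI request. Each pair has the form "key=value", the pairs are joined with the & separator, and keys and values are passed through quote_plus for proper encoding.

>>> params={'name':'Joe mama','dir':'~/tem'}
>>> urllib.urlencode(params)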
'name=Joe+mama&dir=%7E%2Ftem'
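urlencode also accepts a sequence of two-element tuples, which keeps the parameters in a fixed order. The sketch below builds a query string that way and decodes it again with parse_qsl, a function the article does not cover (it lives in urlparse on Python 2 and urllib.parse on Python 3). The '~' from the example above is left out because newer Python versions do not encode it.

```python
try:
    from urllib import urlencode                      # Python 2
    from urlparse import parse_qsl
except ImportError:
    from urllib.parse import urlencode, parse_qsl     # Python 3

# A list of pairs keeps the parameter order predictable.
pairs = [('name', 'Joe mama'), ('dir', '/tem')]
query = urlencode(pairs)

# Decode the query string back into a list of (key, value) pairs.
decoded = parse_qsl(query)
```

The resulting query string can be appended to a URL after '?', or passed as the data argument of urlopen to send it with the request.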

pathname2url,url2pathname

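These two functions are rarely called directly; they are used inside interface functions that locate network and local resources in a uniform way.

* urllib.pathname2url(path): converts a local path to a URL path;
* urllib.url2pathname(path): converts a URL path to a local path;

eg:

>>> data=urllib.pathname2url(r'd:\Programe Files\phpcode\login.php')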
>>> print data
d%3A%5CPrograme%20Files%5Cphpcode%5Clogin.php
>>> urllib.url2pathname(data)
'd:\\Programe Files\\phpcode\\login.php'
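The result of these functions depends on the platform. In the session above the backslashes come out as %5C, which suggests it ran on a non-Windows system where the path was simply quoted. A minimal sketch of a round trip that should behave the same on any platform is shown below; the relative path is made up, and the import fallback is only there so it runs on both Python 2 and Python 3.

```python
import os

try:
    from urllib import pathname2url, url2pathname          # Python 2
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# Build the path with the platform's own separator (hypothetical file name).
local_path = os.path.join('docs', 'my file.txt')

url_path = pathname2url(local_path)   # separators become '/', space becomes %20
round_trip = url2pathname(url_path)   # back to the platform's path form
```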

Classes in urllib

CLASSES
    URLopener
        FancyURLopener

class FancyURLopener(URLopener)
 |  Derived class with handlers for errors we can handle (perhaps).
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *args, **kwargs)
 |  
 |  get_user_passwd(self, host, realm, clear_cache=0)
 |  
 |  http_error_xxx(self, url, fp, errcode, errmsg, headers, data=None)
 |      Error xxx -- also relocated (permanently).
 |
 |  http_error_default(self, url, fp, errcode, errmsg, headers)
 |      Default error handling -- don't raise an exception.
 |  
 |  prompt_user_passwd(self, host, realm)
 |      Override this in a GUI environment!
 |  
 |  redirect_internal(self, url, fp, errcode, errmsg, headers, data)
 |  
 |  retry_http_basic_auth(self, url, realm, data=None)
 |  
 |  retry_https_basic_auth(self, url, realm, data=None)
 |  
 |  retry_proxy_http_basic_auth(self, url, realm, data=None)
 |  
 |  retry_proxy_https_basic_auth(self, url, realm, data=None)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from URLopener:
 |  
 |  __del__(self)
 |  
 |  addheader(self, *args)
 |      Add a header to be used by the HTTP interface only
 |      e.g. u.addheader('Accept', 'sound/basic')
 |  
 |  cleanup(self)
 |  
 |  close(self)
 |  
 |  http_error(self, url, fp, errcode, errmsg, headers, data=None)
 |      Handle http errors.
 |      Derived class can override this, or provide specific handlers
 |      named http_error_DDD where DDD is the 3-digit error code.
 |  
 |  open(self, fullurl, data=None)
 |      Use URLopener().open(file) instead of open(file, 'r').
 |  
 |  open_data(self, url, data=None)
 |      Use "data" URL.
 |  
 |  open_file(self, url)
 |      Use local file or FTP depending on form of URL.
 |  
 |  open_ftp(self, url)
 |      Use FTP protocol.
 |  
 |  open_http(self, url, data=None)
 |      Use HTTP protocol.
 |  
 |  open_https(self, url, data=None)
 |      Use HTTPS protocol.
 |  
 |  
 |  open_local_file(self, url)
 |      Use local file.
 |  
 |  open_unknown(self, fullurl, data=None)
 |      Overridable interface to open unknown URL type.
 |  
 |  open_unknown_proxy(self, proxy, fullurl, data=None)
 |      Overridable interface to open unknown URL type.
 |  
 |  retrieve(self, url, filename=None, reporthook=None, data=None)
 |      retrieve(url) returns (filename, headers) for a local object
 |      or (tempfilename, headers) for a remote object.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from URLopener:
 |  
 |  version = 'Python-urllib/1.17'