python urllib urllib2

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/liujitao79/p/5411962.html

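This article describes the main functions of the urllib and urllib2 modules in Python 2 and how to use them, including urlopen, urlretrieve and related functions, and how to instantiate the Request class. It also compares how the two modules handle HTTP requests, for example setting headers and encoding the data of a POST request.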

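Differences
1) urllib2.urlopen accepts an instance of the Request class, which lets you set the headers of a URL request; urllib.urlopen accepts only a URL. This means that with urllib alone you cannot, for example, spoof the User-Agent string.
2) urllib provides the urlencode function for encoding the data to send; urllib2 does not. This is why urllib and urllib2 are so often used together.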

urllib

1 urllib.urlopen(url[,data[,proxies]])

Opens a URL and returns a file-like object

>>> req = urllib.urlopen('http://www.baidu.com')
>>> req.readline() # read one line
'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\n'

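The object returned by urlopen provides these methods:
- read(), readline(), readlines(), fileno(), close(): used exactly as on a file object
- info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server
- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
- geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect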
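
A minimal sketch of these methods, run against a local file: URL so that it needs no network access. The temporary file and its contents are made up for illustration, and the try/except import is an addition so the snippet also runs on Python 3, where urlopen lives in urllib.request.

```python
import os
import tempfile

try:  # Python 2
    from urllib import urlopen, pathname2url
except ImportError:  # Python 3
    from urllib.request import urlopen, pathname2url

# Write a small local file to stand in for a remote page (illustrative content).
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(b'<html>\n<body>hello</body>\n</html>\n')

resp = urlopen('file:' + pathname2url(path))
first_line = resp.readline()  # same call as on a file object
rest = resp.read()            # everything after the first line
headers = resp.info()         # header object; includes Content-Length here
status = resp.getcode()       # None here: a file: URL carries no HTTP status
resp.close()
os.remove(path)
```

For an http:// URL, getcode() would return the numeric status such as 200 or 404 instead of None.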

2 urllib.urlretrieve(url[,filename[,reporthook[,data]]])

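urlretrieve downloads the resource that url points to and saves it on your local disk. If filename is not given, the data is saved to a temporary file.
urlretrieve() returns a 2-tuple (filename, mime_hdrs), where mime_hdrs is the header object of the response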

>>> filename = urllib.urlretrieve('http://www.baidu.com')
>>> type(filename)
<type 'tuple'>
>>> filename
('/tmp/tmphngDjh', <httplib.HTTPMessage instance at 0x7fd5e03ea248>)

>>> filename = urllib.urlretrieve('http://www.baidu.com/',filename='/tmp/baidu') 
>>> type(filename)
<type 'tuple'>
>>> filename
('/tmp/baidu', <httplib.HTTPMessage instance at 0x7fd5e03dbb48>)
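
The reporthook argument in the signature can be used to follow download progress. The sketch below copies a local file through a file: URL so it runs offline; the file names, the 20000-byte payload, and the report function are hypothetical, and the import fallback is an addition for Python 3.

```python
import os
import shutil
import tempfile

try:  # Python 2
    from urllib import urlretrieve, pathname2url
except ImportError:  # Python 3
    from urllib.request import urlretrieve, pathname2url

progress = []

def report(block_num, block_size, total_size):
    # Called once before the first block and again after each block is read.
    progress.append((block_num, block_size, total_size))

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'source.bin')  # stands in for the remote resource
dst = os.path.join(workdir, 'copy.bin')    # where the download is saved
with open(src, 'wb') as f:
    f.write(b'x' * 20000)

saved_to, headers = urlretrieve('file:' + pathname2url(src), dst, report)
copied_size = os.path.getsize(dst)
shutil.rmtree(workdir)  # remove the illustrative files again
```

Each tuple in progress holds the block count so far, the block size, and the total size taken from the Content-Length header, which is enough to compute a percentage.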

3 urllib.urlcleanup()

Clears the cache left behind by earlier calls to urllib.urlretrieve()

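4 urllib.quote(url) and urllib.quote_plus(url)

Percent-encode a string so that it can be used inside a URL, printed, and accepted by a web server. quote leaves '/' unescaped by default; quote_plus escapes it as well and replaces spaces with '+'.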

>>> urllib.quote('http://www.baidu.com')
'http%3A//www.baidu.com'
>>> urllib.quote_plus('http://www.baidu.com')
'http%3A%2F%2Fwww.baidu.com'

5 urllib.unquote(url) and urllib.unquote_plus(url)

The inverse of the functions in 4.
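
A short round-trip sketch of the four functions. The sample string is made up, and the import fallback is an addition so the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:  # Python 2
    from urllib import quote, quote_plus, unquote, unquote_plus
except ImportError:  # Python 3
    from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c&d=e'  # illustrative input with a space and reserved characters

quoted = quote(text)                 # space -> %20, '/' left alone
quoted_strict = quote(text, safe='') # nothing treated as safe, '/' escaped too
quoted_plus = quote_plus(text)       # space -> '+', '/' escaped

# Decoding with the matching function restores the original string.
restored = unquote(quoted)
restored_plus = unquote_plus(quoted_plus)
```

Use the _plus pair for form data and query-string values, where '+' stands for a space; use the plain pair for path segments.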

6 urllib.urlencode(query)

Encodes a mapping (or a sequence of two-element tuples) as a query string: key=value pairs joined by &

GET method

>>> import urllib
>>> params=urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> params
'eggs=2&bacon=0&spam=1'
>>> f=urllib.urlopen("http://python.org/query?%s" % params)
>>> print f.read()

POST method

>>> import urllib
>>> params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> f=urllib.urlopen("http://python.org/query", params)
>>> f.read()
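
The encoding step can be tried on its own, without sending anything. This sketch passes a list of tuples so the output order is fixed; the 'tag' and 'name' examples are made up, and the import fallback is an addition for Python 3. On Python 3 the POST body must also be bytes, hence the final encode call.

```python
try:  # Python 2
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.parse import urlencode

# A sequence of pairs keeps the order stable, unlike a Python 2 dict.
params = urlencode([('spam', 1), ('eggs', 2), ('bacon', 0)])

# GET: the encoded string goes after '?' in the URL.
get_url = 'http://python.org/query?%s' % params

# Values are escaped the quote_plus way, so a space becomes '+'.
with_space = urlencode([('name', 'Michael Foord')])

# doseq=True expands a list value into one pair per element.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)

# POST: the same string is passed as the request body.
post_body = params.encode('ascii')
```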

urllib2

http://www.codefrom.com/paper/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3urllib%E3%80%81urllib2%E5%8F%8Arequests
http://zhuoqiang.me/python-urllib2-usage.html

1 urllib2.urlopen()

>>> import urllib2
>>> url = 'http://www.baidu.com'
>>> req = urllib2.urlopen(url)
>>> req.readline()
'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\n'

2 urllib2.Request()

>>> url = 'http://www.baidu.com'
>>> req = urllib2.Request(url)
>>> resp = urllib2.urlopen(req) # pass the Request object instead of a URL string
>>> resp.readline()
'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\n'

3 urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

import urllib, urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord', 'location' : 'Northampton', 'language' : 'Python' }
data = urllib.urlencode(values)    # encode the dict as a form body
req = urllib2.Request(url, data)   # data is present, so this is sent as a POST
resp = urllib2.urlopen(req)
resp.read()

5 header

import urllib, urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord', 'location' : 'Northampton', 'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
resp = urllib2.urlopen(req)
resp.read()
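
Whether a Request goes out as GET or POST depends only on whether data is supplied, and this can be checked without contacting the server. A minimal sketch, reusing the URL and values from the examples above; the import fallback is an addition so it also runs on Python 3, where Request lives in urllib.request.

```python
try:  # Python 2
    from urllib import urlencode
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
data = urlencode([('name', 'Michael Foord'), ('language', 'Python')]).encode('ascii')

get_req = Request(url)                   # no data: a GET request
post_req = Request(url, data, headers)   # data given: a POST request

get_method = get_req.get_method()
post_method = post_req.get_method()
post_url = post_req.get_full_url()
```

Nothing is sent until the Request is passed to urlopen, so the object can be built and inspected freely beforehand.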

6 add_header(key, val)

import urllib2

req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')    
resp = urllib2.urlopen(req)
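
One detail worth knowing when reading headers back: Request stores each header name with only its first letter capitalized, so lookups must use that form. A small sketch; the User-Agent value is hypothetical, and the import fallback is an addition for Python 3.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

req = Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
req.add_header('User-Agent', 'my-client/0.1')  # hypothetical client name

# header_items() lists the headers as (name, value) pairs.
stored = dict(req.header_items())

# The name was stored as 'User-agent', so only that spelling is found.
found_normalized = req.has_header('User-agent')
found_as_written = req.has_header('User-Agent')
```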

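7 PUT and DELETE methods

import urllib2

uri = 'http://www.example.com/resource'  # hypothetical endpoint
data = 'payload'                         # hypothetical request body
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)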
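
The same override can be wrapped in a small subclass instead of patching each instance. This is only a sketch: MethodRequest and its method keyword are hypothetical names, not part of urllib2, and the import fallback is an addition for Python 3.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

class MethodRequest(Request):
    """Request whose HTTP verb can be chosen explicitly (hypothetical helper)."""

    def __init__(self, *args, **kwargs):
        # Pull out our own keyword before the base class sees the arguments.
        self._method = kwargs.pop('method', None)
        Request.__init__(self, *args, **kwargs)

    def get_method(self):
        # Fall back to the usual GET/POST choice when no verb was given.
        return self._method or Request.get_method(self)

url = 'http://www.example.com/resource'  # hypothetical endpoint
put_req = MethodRequest(url, data=b'payload', method='PUT')
delete_req = MethodRequest(url, method='DELETE')
plain_req = MethodRequest(url)
```

An instance is passed to urlopen exactly like an ordinary Request.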

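Notes

1. To simply download something, or to show download progress, without processing the downloaded content (images, CSS, JS files and so on), use urllib.urlretrieve()
2. If the request needs a form filled in, such as an account name and password, use urllib2.urlopen(urllib2.Request())
3. To encode dictionary data, use urllib.urlencode()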

posted on 2016-04-20 11:42 by 北京涛子

Reposted from: https://www.cnblogs.com/liujitao79/p/5411962.html
