I've recently been practicing writing Python crawlers and frequently use functions from the urllib2 library, so I am reposting this from http://www.cnblogs.com/youxin/archive/2013/05/07/3064434.html.
urllib2 is a curl-like Python module that ships with the standard library, so it is installed by default. Official docs: http://docs.python.org/2/library/urllib2.html
The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
This function returns a file-like object with additional methods: geturl(), info(), and (since Python 2.6) getcode().
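A minimal sketch of those extras (assuming Python 2.6+ for getcode() and timeout; http://www.example.com is just a stand-in URL):
import urllib2
response = urllib2.urlopen('http://www.example.com', timeout=10)  # timeout in seconds is optional (Python 2.6+)
print response.geturl()   # final URL, after any redirects
print response.info()     # the response headers
print response.getcode()  # HTTP status code, e.g. 200
print response.read(100)  # read like a file object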
class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
This class is an abstraction of a URL request.
Usage:
import urllib2
req = urllib2.Request("http://www.baidu.com")  # urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
Or:
import urllib2
html = urllib2.urlopen('http://piratebay.se/browse/200').read()
Passing parameters with POST
import urllib
import urllib2
url = 'http://localhost/php/GetPost.php'
dataArr = {'name': 'jack', 'password': 'pass'}
data = urllib.urlencode(dataArr)  # encodes a dict or a sequence of two-element tuples
req = urllib2.Request(url, data)
res = urllib2.urlopen(req)
html = res.read()
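Note: supplying the data argument makes urllib2 send a POST request; to pass the same parameters with GET, append the urlencoded string to the URL instead, as shown below.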
Passing parameters with GET
import urllib
import urllib2
url = 'http://localhost/php/GetPost.php'
dataArr = {'name': 'jack', 'password': 'pass'}
data = urllib.urlencode(dataArr)  # encodes a dict or a sequence of two-element tuples
full_url = url + '?' + data  # for GET, append the query string to the URL
req = urllib2.Request(full_url)
res = urllib2.urlopen(req)
html = res.read()
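Here full_url becomes something like http://localhost/php/GetPost.php?name=jack&password=pass (parameter order may vary, since Python 2 dicts are unordered).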
Setting headers on an HTTP request
Some sites dislike being visited by programs rather than humans, or serve different versions to different browsers.
By default, urllib2 identifies itself as 'Python-urllib/x.y', e.g. 'Python-urllib/2.7'.
This identity may confuse the site, or simply not work at all.
A browser identifies itself through the User-Agent header (http://baike.baidu.com/view/3398471.htm); when you create a Request object, you can give it a dictionary of header data.
The following example sends the same data as above, but masquerades as Internet Explorer.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
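As an alternative (a sketch, not from the original post), the same header can be set after constructing the request via urllib2.Request's add_header() method:
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
data = urllib.urlencode({'name': 'WHY', 'location': 'SDU', 'language': 'Python'})
req = urllib2.Request(url, data)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # same effect as passing a headers dict
response = urllib2.urlopen(req)
the_page = response.read()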