DiveIntoPython(十)

This article digs into the basics of the HTTP protocol and its implementation details, including how to exchange data with GET and POST and how to support key features such as redirects, cache validation (Last-Modified and ETag), and compression.

The English edition of the book is at:
http://diveintopython.org/toc/index.html

Chapter 11.HTTP Web Services
11.1.Diving in
If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE.)

In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending, modifying, and deleting data.

example 11.1.openanything.py

11.2.How not to fetch data over HTTP
Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. You want to download it over and over again, every hour, to get the latest news from the site that's offering the news feed.

example 11.2.Downloading a feed the quick-and-dirty way
>>> import urllib
>>> data = urllib.urlopen("http://hi.baidu.com/luohuazju/rss").read()
>>> print data
<?xml version="1.0" encoding="gb2312"?>
<rss version="2.0">
<channel>

11.3.Features of HTTP
There are five important features of HTTP which you should support.
11.3.1.User-Agent
When the client requests a resource, it should always announce who it is, as specifically as possible. By default, Python sends a generic User-Agent: Python-urllib/1.15.
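The chapter's examples use Python 2's urllib2. As a hedge for readers on Python 3, where urllib and urllib2 were merged into urllib.request, the same add_header technique still applies; the agent string below is a made-up example, not one from the book:

```python
import urllib.request

# Python 3 sketch: urllib2 became urllib.request, but setting a
# descriptive User-Agent works the same way via add_header.
req = urllib.request.Request("http://localhost/binary.xml")
req.add_header("User-Agent", "OpenAnything/1.0 +http://example.com/")

# urllib normalizes header names to "Xxxx-xxxx" capitalization internally.
print(req.get_header("User-agent"))
```

Note that urllib.request stores the header under the normalized key "User-agent", which is why get_header is queried with that spelling.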

11.3.2.Redirects
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything's normal, here's the page you asked for”. Status code 404 means “page not found”. (You've probably seen 404 errors while browsing the web.)

HTTP has two different ways of signifying that a resource has moved. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location: header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location: header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you're supposed to use the new address from then on.
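A tiny helper (hypothetical, not from the book) makes the 301/302 difference concrete: a well-behaved client adopts the Location of a 301 permanently, but keeps retrying the original address after a 302:

```python
def next_request_url(original_url, status, location):
    # 301: permanent move -- adopt the new address for all future requests.
    # 302: temporary move -- follow Location this time, but keep asking at
    #      the original address next time.
    return location if status == 301 else original_url

print(next_request_url("http://example.com/old", 301, "http://example.com/new"))
print(next_request_url("http://example.com/old", 302, "http://example.com/new"))
```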

11.3.3.Last-Modified/If-Modified-Since
HTTP provides a way for the server to include this last-modified date along with the data you requested.

If you ask for the same data a second time (or third, or fourth), you can tell the server the last-modified date that you got last time: you send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn't changed since then, the server sends back a special HTTP status code 304, which means “this data hasn't changed since the last time you asked for it”.

Python's URL library has no built-in support for last-modified date checking, but since you can add arbitrary headers to each request and read arbitrary headers in each response, you can add support for it yourself.
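Adding that support yourself amounts to one extra header. A Python 3 sketch (the date below is just a sample Last-Modified value, such as a server might have returned on the previous request):

```python
import urllib.request

# Send the server's Last-Modified date back as If-Modified-Since; if the
# resource is unchanged, opening this request would yield 304 Not Modified.
last_modified = "Thu, 21 Feb 2002 05:45:46 GMT"
req = urllib.request.Request("http://localhost/binary.xml")
req.add_header("If-Modified-Since", last_modified)

# The header is now queued on the request (stored under a normalized key).
print(req.get_header("If-modified-since"))
```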

11.3.4.ETag/If-None-Match
ETags are an alternate way to accomplish the same thing as the last-modified date checking: don't re-download data that hasn't changed.

11.3.5.Compression
The last important HTTP feature is gzip compression.

XML is text, and quite verbose text at that, and text generally compresses well. When you request a resource over HTTP, you can ask the server that, if it has any new data to send you, to please send it in compressed format. You include the Accept-encoding: gzip header in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with a Content-encoding: gzip header.

urllib has no built-in support for gzip compression either, but you can add arbitrary headers to the request. And Python comes with a separate gzip module, which has functions you can use to decompress the data yourself.
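The round trip can be tried entirely locally; the XML snippet below is a stand-in for a real feed body:

```python
import gzip
import io

# Stand-in for a feed body the server might compress.
payload = b"<?xml version='1.0'?><feed><title>sample</title></feed>"

# What the server would do before sending "Content-encoding: gzip":
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(payload)
compressed = buf.getvalue()

# What the client does with the compressed response body:
restored = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()
print(restored == payload)
```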

11.4.Debugging HTTP web services
example 11.3.Debugging HTTP
>>> import httplib
>>> httplib.HTTPConnection.debuglevel = 1
>>> import urllib
>>> data = urllib.urlopen("http://www.baidu.com").read()
send: 'GET / HTTP/1.0\r\nHost: www.baidu.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 12 Mar 2010 07:15:25 GMT
header: Server: BWS/1.0
header: Content-Length: 3521
header: Content-Type: text/html;charset=gb2312
header: Cache-Control: private
header: Expires: Fri, 12 Mar 2010 07:15:25 GMT
header: Set-Cookie: BAIDUID=95623D0FCCC60D79598FD3AF7CFA462C:FG=1; expires=Fri, 12-Mar-40 07:15:25 GMT; path=/; domain=.baidu.com
header: P3P: CP=" OTI DSP COR IVA OUR IND COM "

>>> data = urllib.urlopen("http://localhost/binary.xml").read()
send: 'GET /binary.xml HTTP/1.0\r\nHost: localhost\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 12 Mar 2010 07:38:13 GMT
header: Server: Apache/2.2.15 (Win32)
header: Last-Modified: Thu, 21 Feb 2002 05:45:46 GMT
header: ETag: "6000000002762-155-39a7937adf680"
header: Accept-Ranges: bytes
header: Content-Length: 341
header: Connection: close
header: Content-Type: application/xml

11.5.Setting the User-Agent
example 11.4.Introducing urllib2
tmp.py:
import urllib2
request = urllib2.Request("http://localhost/binary.xml")
opener = urllib2.build_opener()
opener.handle_open["http"][0].set_http_debuglevel(1)
feeddata = opener.open(request).read()

console:
send: 'GET /binary.xml HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: localhost\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 12 Mar 2010 07:56:46 GMT
header: Server: Apache/2.2.15 (Win32)
header: Last-Modified: Thu, 21 Feb 2002 05:45:46 GMT
header: ETag: "6000000002762-155-39a7937adf680"
header: Accept-Ranges: bytes
header: Content-Length: 341
header: Connection: close
header: Content-Type: application/xml

example 11.5.Adding headers with the Request
>>> request
<urllib2.Request instance at 0x0145D800>
>>> request.get_full_url()
'http://localhost/binary.xml'
>>> request.add_header("User-Agent", "OpenThing/1.0 +http://localhost")
>>> feeddata = opener.open(request).read()

In my test, however, the custom header failed to take effect.

11.6.Handling Last-Modified and ETag
example 11.6.Testing Last-Modified
>>> import urllib2
>>> request = urllib2.Request("http://localhost/binary.xml")
>>> opener = urllib2.build_opener()
>>> firstdatastream = opener.open(request)
>>> firstdatastream.headers.dict
{'content-length': '341', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.15 (Win32)', 'last-modified': 'Thu, 21 Feb 2002 05:45:46 GMT', 'connection': 'close', 'etag': '"6000000002762-155-39a7937adf680"', 'date': 'Fri, 12 Mar 2010 08:09:02 GMT', 'content-type': 'application/xml'}
>>> request.add_header("If-Modified-Since",firstdatastream.headers.get("Last-Modified"))
>>> seconddatastream = opener.open(request)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python26\lib\urllib2.py", line 395, in open
response = meth(req, response)
File "C:\Python26\lib\urllib2.py", line 508, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python26\lib\urllib2.py", line 433, in error
return self._call_chain(*args)
File "C:\Python26\lib\urllib2.py", line 367, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 516, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 304: Not Modified

firstdatastream.headers is an object that acts like a dictionary and allows you to get any of the individual headers returned from the HTTP server.

Sure enough, the data hasn't changed. You can see from the traceback that urllib2 throws a special exception, HTTPError, in response to the 304 status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending you any data. That's not an error; that's exactly what you were hoping for.

urllib2 also raises an HTTPError exception for conditions that you would think of as errors, such as 404 (page not found). In fact, it will raise HTTPError for any status code other than 200 (OK), 301 (permanent redirect), or 302 (temporary redirect).

example 11.7.Defining URL handlers
This custom URL handler is part of openanything.py.
class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    def http_error_default(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
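For readers on Python 3, a hedged equivalent: HTTPError already carries the code in its .code attribute (and .status is a read-only alias in recent versions), so the handler can simply return the error object instead of raising it:

```python
import urllib.request
import urllib.error

class DefaultErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    # Hand the HTTPError back as a normal result instead of raising it;
    # callers can then inspect .code for statuses like 304.
    def http_error_default(self, req, fp, code, msg, headers):
        return urllib.error.HTTPError(
            req.get_full_url(), code, msg, headers, fp)

# Exercise the handler directly with a simulated 304 response.
handler = DefaultErrorHandler()
req = urllib.request.Request("http://localhost/binary.xml")
resp = handler.http_error_default(req, None, 304, "Not Modified", {})
print(resp.code)
```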

example 11.8.Using custom URL handlers
>>> request.headers
{'If-modified-since': 'Thu, 21 Feb 2002 05:45:46 GMT'}
>>> import openanything
>>> opener = urllib2.build_opener(openanything.DefaultErrorHandler)
>>> seconddatastream = opener.open(request)
>>> seconddatastream.status
304
>>> seconddatastream.read()
''

Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use seconddatastream.headers.dict to access them), also contains the HTTP status code. In this case, as you expected, the status is 304, meaning this data hasn't changed since the last time you asked for it.

Note that when the server sends back a 304 status code, it doesn't re-send the data. That's the whole point: to save bandwidth by not re-downloading data that hasn't changed. So if you actually want that data, you'll need to cache it locally the first time you get it.
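A minimal sketch of such a local cache (hypothetical helper names, keyed by URL): keep the body from a 200 response, and serve it again whenever a later conditional request comes back 304:

```python
# Hypothetical in-memory cache: store fresh bodies, replay them on 304.
cache = {}

def remember(url, status, data):
    if status == 200:
        cache[url] = data      # fresh data: store and return it
    elif status == 304:
        data = cache[url]      # unchanged: fall back to the saved copy
    return data

body = remember("http://localhost/binary.xml", 200, b"<grammar/>")
again = remember("http://localhost/binary.xml", 304, b"")
print(again == body)
```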

Handling ETag works much the same way, but instead of checking for Last-Modified and sending If-Modified-Since, you check for ETag and send If-None-Match.

example 11.9.Supporting ETag/If-None-Match
>>> import urllib2,openanything
>>> request = urllib2.Request("http://localhost/binary.xml")
>>> opener = urllib2.build_opener(openanything.DefaultErrorHandler())
>>> firstdatastream = opener.open(request)
>>> firstdatastream.headers.get("ETag")
'"6000000002762-155-39a7937adf680"'
>>> firstdata = firstdatastream.read()
>>> print firstdata
<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
<p>0</p>
<p>1</p>
</ref>
<ref id="byte">
<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

>>> request.add_header("If-None-Match",firstdatastream.headers.get("ETag"))
>>> seconddatastream = opener.open(request)
>>> seconddatastream.status
304
>>> seconddatastream.read()
''

11.7.Handling redirects

Since the URL in the book is no longer valid, I did not type these examples in one by one, but the useful code is noted here.

example 11.10.Accessing web services without a redirect handler
>>> import urllib2, httplib
>>> httplib.HTTPConnection.debuglevel = 1
>>> request = urllib2.Request(
... 'http://diveintomark.org/redir/example301.xml')
>>> opener = urllib2.build_opener()
>>> f = opener.open(request)

example 11.11.Defining the redirect handler
This class is defined in openanything.py.


class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

example 11.12.Using the redirect handler to detect permanent redirects
>>> request = urllib2.Request('http://diveintomark.org/redir/example301.xml')
>>> import openanything, httplib
>>> opener = urllib2.build_opener(
... openanything.SmartRedirectHandler())
>>> f = opener.open(request)
>>> f.status
301
>>> f.url
'http://diveintomark.org/xml/atom.xml'

The object you get back from the opener contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent address).

example 11.13. Using the redirect handler to detect temporary redirects
>>> request = urllib2.Request(
... 'http://diveintomark.org/redir/example302.xml')
>>> f = opener.open(request)
>>> f.status
302
>>> f.url
'http://diveintomark.org/xml/atom.xml'

11.8.Handling compressed data
Servers won't give you compressed data unless you tell them you can handle it.

Again, since the URL in the book is no longer valid, I did not type these examples in one by one, but the useful code is noted here.

example 11.14.Telling the server you would like compressed data
>>> import urllib2, httplib
>>> httplib.HTTPConnection.debuglevel = 1
>>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
>>> request.add_header('Accept-encoding', 'gzip')
>>> opener = urllib2.build_opener()
>>> f = opener.open(request)

example 11.15.Decompressing the data
>>> compresseddata = f.read()
>>> len(compresseddata)
6289
>>> import StringIO
>>> compressedstream = StringIO.StringIO(compresseddata)
>>> import gzip
>>> gzipper = gzip.GzipFile(fileobj=compressedstream)
>>> data = gzipper.read()
>>> print data
<?xml version="1.0" encoding="iso-8859-1"?>
<feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
<title mode="escaped">dive into mark</title>
<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
<!-- rest of feed omitted for brevity -->
>>> len(data)
15955

example 11.16.Decompressing the data directly from the server is wrong
>>> f = opener.open(request)
>>> f.headers.get('Content-Encoding')
'gzip'
>>> data = gzip.GzipFile(fileobj=f).read()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "c:\python23\lib\gzip.py", line 217, in read
self._read(readsize)
File "c:\python23\lib\gzip.py", line 252, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'

Since opener.open returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data, why not simply pass that file-like object directly to GzipFile? As you “read” from the GzipFile instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't work. Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream. So the inelegant hack of using StringIO is the best solution: download the compressed data, create a file-like object out of it with StringIO, and then decompress the data from that.
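As a side note for Python 3 readers: StringIO became io.BytesIO for binary data, and the gzip module grew one-shot helpers, so the detour is no longer needed for data you have already downloaded:

```python
import gzip

# In Python 3, gzip.compress/gzip.decompress operate on in-memory bytes,
# replacing the GzipFile-over-StringIO hack for already-downloaded data.
original = b"dive into mark feed body"
compressed = gzip.compress(original)
print(gzip.decompress(compressed) == original)
```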

11.9.Putting it all together
All of the pieces above are assembled in the openanything.py file.

example 11.19.Using openanything.py
>>> import openanything
>>> useragent = "MyHTTPWebServiceApp/1.0"
>>> url = "http://localhost/binary.xml"
>>> params = openanything.fetch(url,agent=useragent)
>>> params
{'url': 'http://localhost/binary.xml', 'lastmodified': 'Thu, 21 Feb 2002 05:45:46 GMT', 'etag': '"6000000002762-155-39a7937adf680"', 'data': '<?xml version="1.0"?>\n<!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">\n<grammar>\n<ref id="bit">\n <p>0</p>\n <p>1</p>\n</ref>\n<ref id="byte">\n <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>\n</ref>\n</grammar>\n', 'status': 200}
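The returned etag and lastmodified values are meant to be fed back into the next fetch call. A hypothetical helper (my own sketch, mirroring what openanything.fetch does internally) shows how those validators become request headers:

```python
# Hypothetical helper: turn the validators returned by a previous fetch
# into the conditional headers for the next request.
def conditional_headers(etag=None, lastmodified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if lastmodified:
        headers["If-Modified-Since"] = lastmodified
    return headers

hdrs = conditional_headers(
    etag='"6000000002762-155-39a7937adf680"',
    lastmodified="Thu, 21 Feb 2002 05:45:46 GMT")
print(sorted(hdrs))
```

If the server then answers 304, the status key of the next result set tells you to reuse your cached copy of the data.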