DiveIntoPython (Part 7)


English book URL:
http://diveintopython.org/toc/index.html

Chapter 8.HTML Processing
8.1.Diving in
example 8.1.BaseHTMLProcessor.py
example 8.2.dialect.py
Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone.

8.2.Introducing sgmllib.py
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.

SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:

Start tag
An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method.
End tag
An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname.
Character reference
An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
Entity reference
An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
Comment
An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction
An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
Declaration
An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
Text data
A block of text. Anything that doesn't fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
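
sgmllib was removed from the standard library in Python 3. As a rough sketch of the same one-handler-per-construct model, the block below uses html.parser.HTMLParser instead (a substitution made for this note, not the book's code). EventLogger and log_events are hypothetical names. Note that HTMLParser routes all tags through a single handle_starttag/handle_endtag pair rather than looking up start_tagname methods.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler fired for each piece of HTML."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    """Feed a complete HTML string and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Calling log_events on a small document gives a list of (kind, value) tuples, one per start tag, end tag, reference, comment, declaration, or text block, in document order.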

example 8.4.Sample test of sgmllib.py
In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.

C:\Python26\Lib>python sgmllib.py "d:\data\diveintopython.html" >>d:\data\out.txt

The content of out.txt is:
data: '\n\n'
start tag: <html>
data: '\n '
start tag: <head>
data: '\n '
start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
data: '\n \n '
start tag: <title>
data: 'Dive Into Python'
end tag: </title>
... rest of output omitted for brevity ...

8.3.Extracting data from HTML documents
To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.

example 8.5.Introducing urllib
>>> import urllib
>>> sock = urllib.urlopen("http://diveintopython.org/")
>>> htmlSource = sock.read()
>>> sock.close()
>>> print htmlSource

<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<title>Dive Into Python</title>
...snip...

The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).

The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object.

The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list.

When you're done with the object, make sure to close it, just like a normal file object.
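
In Python 3 the urlopen function lives in urllib.request, and read returns bytes rather than a string. A minimal sketch under those assumptions follows; fetch_html is a hypothetical helper name, and the utf-8 default is an assumption (the page shown above declares ISO-8859-1, so that value would be passed for it).

```python
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8"):
    """Return the decoded body of a URL as one string.

    The with-block closes the file-like object even if read() fails,
    which replaces the explicit sock.close() call.
    """
    with urlopen(url) as sock:
        raw = sock.read()  # bytes in Python 3
    return raw.decode(encoding, errors="replace")
```

Usage would look like fetch_html("http://diveintopython.org/", encoding="iso-8859-1"); whether that host still answers is not something this sketch can promise.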

example 8.6.Introducing urllister.py

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.

start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.

You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.

String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
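
For readers on Python 3, where sgmllib is gone, the same idea can be sketched with html.parser.HTMLParser. This is an adaptation, not the book's code: LinkLister and list_links are hypothetical names, and the per-tag start_a method becomes a tag check inside handle_starttag.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset(), so per-instance state set here
        # is also re-initialized whenever the parser is reused.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag and attribute names
        # arrive already lowercased.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_links(html):
    """Return the href values of all <a> tags in an HTML string."""
    parser = LinkLister()
    parser.feed(html)
    parser.close()
    return parser.urls
```

An <a> tag with only a name attribute contributes nothing, because the generator finds no href pair in its attrs list.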

example 8.7.Using urllister.py
>>> import urllib, urllister
>>> usock = urllib.urlopen("http://diveintopython.org/")
>>> parser = urllister.URLLister()
>>> parser.feed(usock.read())
>>> usock.close()
>>> parser.close()
>>> for url in parser.urls: print url
...
http://www.amazon.com/exec/obidos/ASIN/1590593561/ref%3Dnosim/diveintomark20
http://www.amazon.com/exec/obidos/ASIN/1590593561/ref%3Dnosim/diveintomark20
toc/index.html
#download
#languages
toc/index.html
http://diveintopython.org/toc/index.html
appendix/history.html
download/diveintopython-html-5.4.zip

... rest of output omitted for brevity ...

Call the feed method, defined in SGMLParser, to get HTML into the parser. It takes a string, which is what usock.read() returns.

You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
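
The buffering behaviour is easy to observe by feeding a document in pieces. The sketch below uses Python 3's html.parser.HTMLParser as a stand-in for SGMLParser (an assumption of this note); TagCollector and tags_after_each_chunk are hypothetical names. A tag that is cut in half at the end of a chunk is held back until the rest arrives.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    def reset(self):
        super().reset()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_after_each_chunk(chunks):
    """Feed chunks one by one; return (snapshots, final_tags).

    snapshots[i] is the list of start tags seen after feeding chunk i.
    """
    parser = TagCollector()
    snapshots = []
    for chunk in chunks:
        parser.feed(chunk)
        snapshots.append(list(parser.tags))
    parser.close()  # flush whatever is still buffered
    return snapshots, parser.tags
```

With the chunks '<p>hi</p><a hre' and 'f="x.html">link</a>', the first snapshot contains only the p tag; the a tag shows up once the second chunk completes it.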

8.4.Introducing BaseHTMLProcessor.py
BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.

example 8.8.Introducing BaseHTMLProcessor

from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        self.pieces.append("&%(ref)s" % locals())
        # only known entities get the trailing semicolon
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        self.pieces.append("<!%(text)s>" % locals())

self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.

example 8.9.BaseHTMLProcessor output
def output(self):
    return "".join(self.pieces)

8.5.locals and globals
Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.

example
def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.

Each function has its own namespace, called the local namespace, which keeps track of the function's variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any other imported modules, and module-level variables and constants.

When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:
1.local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching.
2.global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching.
3.built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.

If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with a message such as name 'x' is not defined, which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.
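
The lookup order can be checked with a few small functions. This is an illustrative sketch in Python 3 syntax; the function names and the name undefined_name are made up for the example.

```python
x = "global x"  # lives in the module's (global) namespace


def use_local():
    x = "local x"  # a local name shadows the global one
    return x


def use_global():
    return x  # no local x, so the global namespace is searched next


def use_builtin():
    return len("abc")  # neither local nor global: found in the built-in namespace


def use_missing():
    try:
        return undefined_name  # present in none of the three namespaces
    except NameError as exc:
        return str(exc)
```

use_local and use_global return different strings for the same name x, and use_missing returns the text of the NameError that ends the search.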

Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function.

example 8.10.Introducing locals
>>> def foo(arg):
...     x = 1
...     print locals()
...
>>> foo(7)
{'x': 1, 'arg': 7}
>>> foo('bar')
{'x': 1, 'arg': 'bar'}

The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function.

locals returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values of the dictionary are the actual values of the variables. So calling foo with 7 prints the dictionary containing the function's two local variables: arg (7) and x (1).
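
The same experiment in Python 3 syntax, together with a standalone version of the formatting trick used in unknown_starttag. render_tag is a hypothetical free function written for this note; it is not part of BaseHTMLProcessor.py.

```python
def foo(arg):
    x = 1
    # Copy the mapping so the caller gets an ordinary dict snapshot.
    return dict(locals())


def render_tag(tag, attrs):
    strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
    # locals() supplies 'tag' and 'strattrs' to the %(name)s markers;
    # any other local names in the dictionary are simply ignored.
    return "<%(tag)s%(strattrs)s>" % locals()
```

foo(7) returns a dictionary with the keys arg and x, and render_tag("a", [("href", "index.html")]) builds the start tag from its two local variables.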

Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from. With the globals function, you can actually see this happen.

example 8.11.Introducing globals
if __name__ == "__main__":
    for k, v in globals().items():
        print k, "=", v

Running the script from the command line:
E:\book\opensource\python\diveintopython-5.4\py>python.exe BaseHTMLProcessor.py
__copyright__ = Copyright (c) 2001 Mark Pilgrim
__version__ = $Revision: 1.2 $
SGMLParser = sgmllib.SGMLParser
__license__ = Python
__builtins__ = <module '__builtin__' (built-in)>
__file__ = BaseHTMLProcessor.py
htmlentitydefs = <module 'htmlentitydefs' from 'C:\Python26\lib\htmlentitydefs.pyc'>
__author__ = Mark Pilgrim (mark@diveintopython.org)
__date__ = $Date: 2004/05/05 21:57:19 $
BaseHTMLProcessor = __main__.BaseHTMLProcessor
__name__ = __main__

SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module's namespace, and here it is.

Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not.

This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class.
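
The difference between the two import styles can also be seen without writing a module file. In the sketch below, exec runs an import statement inside a fresh dictionary that plays the role of a module's global namespace; names_after is a hypothetical helper, and the Python 3 modules html and html.parser stand in for htmlentitydefs and sgmllib.

```python
def names_after(import_statement):
    """Run one import statement in an empty namespace and list what it bound."""
    namespace = {}
    exec(import_statement, namespace)
    # exec adds __builtins__ on its own; hide it and other dunder names.
    return sorted(name for name in namespace if not name.startswith("__"))
```

names_after("import html") reports only the module name, while names_after("from html.parser import HTMLParser") reports the class name itself, which mirrors the SGMLParser and htmlentitydefs entries in the output above.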

example 8.12.locals is read-only, globals is not
>>> def foo(arg):
...     x = 1
...     print locals()
...     locals()['x'] = 2
...     print 'x =', x
...
>>> z = 7
>>> print 'z =', z
z = 7
>>> globals()['z'] = 8
>>> print 'z =', z
z = 8
>>> foo(3)
{'x': 1, 'arg': 3}
x = 1
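
What the session shows: assigning through locals() changes only the dictionary that locals() handed back, while assigning through globals() changes the real module namespace. A Python 3 sketch of the same experiment, with function names invented for this note:

```python
z = 7


def try_to_change_local():
    x = 1
    locals()["x"] = 2  # modifies only the dictionary returned by locals()
    return x           # the real local variable is untouched


def change_global():
    globals()["z"] = 8  # modifies the actual module-level namespace
    return z            # the global lookup sees the new value
```

try_to_change_local still returns the original value of x, whereas change_global returns the value written through the dictionary.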

8.6.Dictionary-based string formatting
There is an alternative form of string formatting that uses dictionaries instead of tuples of values.

example 8.13.Introducing dictionary-based string formatting
>>> params = {'server':'mpilgrim', 'database':'master', 'uid':'sa', 'pwd':'secret'}
>>> "%(pwd)s" % params
'secret'
>>> "%(pwd)s is not a good password for %(uid)s" % params
'secret is not a good password for sa'
>>> "%(database)s of mind, %(database)s of body" % params
'master of mind, master of body'

Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker.

Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the formatting will fail with a KeyError.
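
A small sketch of both behaviours, written so that it runs on Python 3 as well. describe and safe_format are hypothetical helpers, and catching the KeyError is just one way to surface the missing name.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}


def describe(params):
    # Each %(name)s marker is looked up as a key in the dictionary.
    return "%(pwd)s is not a good password for %(uid)s" % params


def safe_format(template, params):
    """Format template, reporting the first missing key instead of raising."""
    try:
        return template % params
    except KeyError as exc:
        return "missing key: %s" % exc.args[0]
```

Keys that the template never mentions, such as server here, are ignored; only a marker without a matching key causes the failure.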

example 8.14.Dictionary-based string formatting in BaseHTMLProcessor.py
def handle_comment(self, text):
    self.pieces.append("<!--%(text)s-->" % locals())

example 8.15.More dictionary-based string formatting
def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

8.7.Quoting attribute values
The question is: "I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?"

example 8.16.Quoting attribute values
>>> htmlSource = """
... <html>
... <head>
... <title>Test page</title>
... </head>
... <body>
... <ul>
... <li><a href=index.html>Home</a></li>
... <li><a href=toc.html>Table of contents</a></li>
... <li><a href=history.html>revision history</a></li>
... </body>
... </html>
... """
>>> from BaseHTMLProcessor import BaseHTMLProcessor
>>> parser = BaseHTMLProcessor()
>>> parser.feed(htmlSource)
>>> print parser.output()

<html>
<head>
<title>Test page</title>
</head>
<body>
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="toc.html">Table of contents</a></li>
<li><a href="history.html">revision history</a></li>
</body>
</html>
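
The quoting happens as a side effect of rebuilding every start tag with the ' %s="%s"' pattern. For Python 3, where sgmllib is unavailable, a rough equivalent can be sketched on top of html.parser.HTMLParser. This is an adaptation, not the book's BaseHTMLProcessor: the class name RoundTripProcessor and the helper requote are hypothetical, and unlike the original it always writes a semicolon after entity references and re-escapes attribute values, because HTMLParser hands them over unescaped.

```python
from html import escape
from html.parser import HTMLParser


class RoundTripProcessor(HTMLParser):
    def __init__(self):
        # Keep &name; and &#NNN; as separate events so they can be re-emitted.
        super().__init__(convert_charrefs=False)

    def reset(self):
        super().reset()
        self.pieces = []

    def _format_attrs(self, attrs):
        parts = []
        for key, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                parts.append(" %s" % key)
            else:
                parts.append(' %s="%s"' % (key, escape(value, quote=True)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.pieces.append("<%s%s>" % (tag, self._format_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.pieces.append("<%s%s />" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def output(self):
        return "".join(self.pieces)


def requote(html):
    """Parse html and rebuild it with every attribute value quoted."""
    parser = RoundTripProcessor()
    parser.feed(html)
    parser.close()
    return parser.output()
```

Feeding it the unquoted links from the example returns the same markup with each href value wrapped in double quotes.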

8.8.Introducing dialect.py
example 8.17.Handling specific tags
def start_pre(self, attrs):
    self.verbatim += 1
    self.unknown_starttag("pre", attrs)

def end_pre(self):
    self.unknown_endtag("pre")
    self.verbatim -= 1

In the reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter. (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use.

end_pre is called every time SGMLParser finds a </pre> tag. Since end tags can not contain attributes, the method takes no parameters.
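
The counter idea can be tried in isolation. The sketch below is a Python 3 adaptation using html.parser.HTMLParser (sgmllib and its start_pre/end_pre hooks do not exist there); PreDepthTracker and pre_depths are hypothetical names. It records, for each block of text, how many <pre> elements enclose it.

```python
from html.parser import HTMLParser


class PreDepthTracker(HTMLParser):
    def reset(self):
        super().reset()
        self.verbatim = 0  # number of currently open <pre> tags
        self.seen = []     # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.verbatim += 1

    def handle_endtag(self, tag):
        if tag == "pre":
            self.verbatim -= 1

    def handle_data(self, data):
        self.seen.append((self.verbatim, data))


def pre_depths(html):
    parser = PreDepthTracker()
    parser.feed(html)
    parser.close()
    return parser.seen
```

Because the attribute is a counter rather than a flag, text that follows an inner </pre> but is still inside an outer <pre> keeps a depth greater than zero.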

At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.

example 8.18.SGMLParser
def finish_starttag(self, tag, attrs):
    try:
        method = getattr(self, 'start_' + tag)
    except AttributeError:
        try:
            method = getattr(self, 'do_' + tag)
        except AttributeError:
            self.unknown_starttag(tag, attrs)
            return -1
        else:
            self.handle_starttag(tag, method, attrs)
            return 0
    else:
        self.stack.append(tag)
        self.handle_starttag(tag, method, attrs)
        return 1

def handle_starttag(self, tag, method, attrs):
    method(attrs)

In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.

At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
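
The dispatch technique does not depend on sgmllib and can be reproduced in a few lines. The following is a simplified sketch, not SGMLParser itself: MiniDispatcher and PreAware are hypothetical classes, the do_ fallback and the tag stack are left out, and getattr's default argument replaces the try/except.

```python
class MiniDispatcher:
    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Build the method name from the tag and look it up on the instance.
        method = getattr(self, "start_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
            return -1
        # The bound method is called without knowing where it was defined.
        method(attrs)
        return 1

    def unknown_starttag(self, tag, attrs):
        self.log.append(("unknown", tag))


class PreAware(MiniDispatcher):
    def start_pre(self, attrs):
        self.log.append(("start_pre", attrs))
```

A subclass opts in to handling a tag simply by defining a method with the right name; the base class never needs to be edited.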

example 8.19.Overriding the handle_data method
def handle_data(self, text):
    self.pieces.append(self.verbatim and text or self.process(text))

If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer.
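
The and-or expression predates conditional expressions, which were added in Python 2.5. The two forms agree except when text is an empty string: an empty string is false, so the and-or form falls through to the processing branch even inside a <pre> block. That is harmless when processing an empty string yields an empty string, but the sketch below (with hypothetical function names) makes the difference visible.

```python
def and_or(verbatim, text, process):
    # The idiom from handle_data: pick text when verbatim is true...
    # unless text itself is false, in which case process(text) runs.
    return verbatim and text or process(text)


def conditional(verbatim, text, process):
    # Equivalent intent using a conditional expression (Python 2.5 and later).
    return text if verbatim else process(text)
```

With a process function that returns a non-empty result for empty input, and_or(1, "", process) and conditional(1, "", process) give different answers.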

You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags.

8.9.Putting it all together
example 8.20.The translate function, part 1
def translate(url, dialectName="chef"):
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

Hey, wait a minute, there's an import statement in this function! That's perfectly legal in Python. You're used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function.

example 8.21.The translate function, part 2: curiouser and curiouser
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()

capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase.

You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.

Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName.

Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.
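
The name-to-class lookup can be tried without the rest of dialect.py. In this sketch the class bodies are placeholders invented for illustration, not the book's real substitution rules, and make_dialectizer is a hypothetical stand-in for the middle part of translate. It assumes the classes are defined at module level, since that is where globals() looks.

```python
class BaseDialectizer:
    def process(self, text):
        return text


class ChefDialectizer(BaseDialectizer):
    def process(self, text):
        # Placeholder rule, not the real Swedish Chef substitutions.
        return text + " Bork bork bork!"


class FuddDialectizer(BaseDialectizer):
    def process(self, text):
        # Placeholder rule, not the real Elmer Fudd substitutions.
        return text.replace("r", "w")


def make_dialectizer(dialect_name="chef"):
    # 'chef' -> 'ChefDialectizer'; capitalize also lowercases the rest.
    parser_name = "%sDialectizer" % dialect_name.capitalize()
    # The module namespace is a dictionary, so the class can be fetched by name.
    parser_class = globals()[parser_name]
    return parser_class()
```

Adding another class whose name ends in Dialectizer makes it available to make_dialectizer with no change to the function; an unknown dialect name surfaces as a KeyError from the dictionary lookup.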

example 8.22.The translate function, part 3
    parser.feed(htmlSource)
    parser.close()
    return parser.output()