DiveIntoPython (9)

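This article shows how to parse and process XML files with Python, including reading XML from files, URLs, and strings, using standard input and output for stream processing, and techniques such as caching node lookups and choosing a random child element.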

The English book is available at:
http://diveintopython.org/toc/index.html

Chapter 10.Scripts and Streams
10.1.Abstracting input sources
In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.
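To make the definition concrete, here is a minimal sketch of a file-like object, written for Python 3 (the examples in these notes are Python 2). The class name ChunkSource is hypothetical and not part of the book; it only implements the read method described above.

```python
class ChunkSource:
    """Hypothetical minimal file-like object: only a read method with an optional size."""

    def __init__(self, data):
        self._data = data
        self._pos = 0  # where the next read starts

    def read(self, size=-1):
        # With no size (or a negative one), return everything that is left.
        if size is None or size < 0:
            end = len(self._data)
        else:
            end = min(self._pos + size, len(self._data))
        chunk = self._data[self._pos:end]
        self._pos = end  # the next call picks up where this one left off
        return chunk


source = ChunkSource("<grammar><ref id='bit'>01</ref></grammar>")
first = source.read(9)   # the first 9 characters
rest = source.read()     # everything after that
done = source.read()     # nothing left to read
```

Anything that behaves like this, whether it is backed by a disk file, a network socket, or a string, can be handed to code that only calls read.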

example 10.1.Parsing XML from a file
>>> from xml.dom import minidom
>>> fsock = open("d:/data/binary.xml")
>>> xmldoc = minidom.parse(fsock)
>>> fsock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?><!DOCTYPE grammar PUBLIC '-//diveintopython.org//DTD Kant Generator Pro v1.0//EN' 'kgp.dtd'><grammar>
<ref id="bit">
0
1
</ref>
<ref id="byte">
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>
</ref>
</grammar>

First, you open the file on disk. This gives you a file object.
You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk.
Be sure to call the close method of the file object after you're done with it. minidom.parse will not do this for you.

example 10.2.Parsing XML from a URL
>>> import urllib
>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf')
>>> xmldoc = minidom.parse(usock)
>>> usock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?><rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<channel>
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
</channel>

<image>
<title>Slashdot</title>
<url>http://a.fsdn.com/sd/topics/topicslashdot.gif</url>
<link>http://slashdot.org/</link>
</image>
...snip..

urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page.

Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.

example 10.3.Parsing XML from a string (the easy but inflexible way)
>>> contents = "<grammar><ref id='bit'>01</ref></grammar>"
>>> xmldoc = minidom.parseString(contents)
>>> print xmldoc.toxml()
<?xml version="1.0" ?><grammar><ref id="bit">01</ref></grammar>

minidom has a method, parseString, which takes an entire XML document as a string and parses it. You can use this instead of minidom.parse if you know you already have your entire XML document in a string.

If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.

example 10.4.Introducing StringIO
>>> contents = "<grammar><ref id='bit'>01</ref></grammar>"
>>> import StringIO
>>> ssock = StringIO.StringIO(contents)
>>> ssock.read()
"<grammar><ref id='bit'>01</ref></grammar>"
>>> ssock.read()
''
>>> ssock.seek(0)
>>> ssock.read(15)
'<grammar><ref i'
>>> ssock.read()
"d='bit'>01</ref></grammar>"
>>> ssock.close()

The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance.
Calling read again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any more without explicitly seeking to the beginning of the file. The StringIO object works the same way.
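For readers on Python 3, the same idea can be sketched with io.StringIO, since the separate StringIO module used in these notes exists only in Python 2. This is an illustrative sketch and not code from the book; the variable names are chosen here.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'>01</ref></grammar>"

ssock = io.StringIO(contents)   # wrap the string in a file-like object
head = ssock.read(15)           # the first 15 characters
tail = ssock.read()             # the remainder
empty = ssock.read()            # already at the end, so nothing is returned
ssock.seek(0)                   # rewind before handing it to the parser
xmldoc = minidom.parse(ssock)   # parse only needs the read method
ssock.close()

root_name = xmldoc.documentElement.tagName
```

The seek(0) call matters: without it the parser would start reading at the end of the buffer and find no document.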

example 10.5.Parsing XML from a string (the file-like object way)
>>> contents
"<grammar><ref id='bit'>01</ref></grammar>"
>>> ssock=StringIO.StringIO(contents)
>>> xmldoc = minidom.parse(ssock)
>>> ssock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?><grammar><ref id="bit">01</ref></grammar>

example 10.6.openAnything
def openAnything(source):
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except (IOError, OSError):
        pass
    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except (IOError, OSError):
        pass
    # treat source as string
    import StringIO
    return StringIO.StringIO(str(source))
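A rough Python 3 counterpart is sketched below. It is an adaptation, not the book's code: the name open_anything is chosen here, urllib.request.urlopen replaces urllib.urlopen, and ValueError is caught as well because in Python 3 urlopen raises it when the text has no URL scheme. The demonstration uses a temporary file and a literal string so that it does not touch the network.

```python
import io
import os
import tempfile
import urllib.request


def open_anything(source):
    """Return a file-like object for a URL, a pathname, or a literal string."""
    # Try a URL first. urlopen raises ValueError when the text has no URL scheme.
    try:
        return urllib.request.urlopen(source)
    except (OSError, ValueError):
        pass
    # Then try a pathname on disk.
    try:
        return open(source)
    except (OSError, ValueError):
        pass
    # Fall back to treating the source as the data itself.
    return io.StringIO(str(source))


# A literal string falls through to StringIO.
from_string = open_anything("<grammar/>").read()

# A pathname is opened as an ordinary file.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<grammar><ref id='bit'>01</ref></grammar>")
fsock = open_anything(tmp.name)
from_file = fsock.read()
fsock.close()
os.remove(tmp.name)
```

The order of the attempts is the design choice to notice: each step only runs when the previous one failed, so the caller never has to say what kind of source it is passing.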

example 10.7.Using openAnything in kgp.py
class KantGenerator:
    def _load(self, source):
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

10.2.Standard input, output, and error
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)

example 10.8.Introducing stdout and stderr
>>> for i in range(3):
...     print 'dive in'
...
dive in
dive in
dive in
>>> import sys
>>> for i in range(3):
...     sys.stdout.write('dive in')
...
dive indive indive in
>>> for i in range(3):
...     sys.stderr.write('dive in')
...
dive indive indive in

stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print statement really does; it adds a newline to the end of the string you're printing, and calls sys.stdout.write.

stdout and stderr are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.

example 10.9.Redirecting output
E:\book\opensource\python\diveintopython-5.4\py\kgp>python.exe stdout.py
Dive in

the code of stdout.py:
import sys

print 'Dive in'
saveout = sys.stdout
fsock = open('out.log', 'w')
sys.stdout = fsock
print 'This message will be logged instead of displayed'
sys.stdout = saveout
fsock.close()

Always save stdout before redirecting it, so you can set it back to normal later.

Redirect all further output to the new file you just opened.

This will be “printed” to the log file only; it will not be visible in the IDE window or on the screen.

Set stdout back to the way it was before you mucked with it.
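The save, redirect, and restore steps can also be wrapped in try/finally so that stdout is put back even if the code in between raises. The sketch below is written for Python 3 and is not from the book; it uses an in-memory io.StringIO as a stand-in for the log file so that the result can be inspected.

```python
import io
import sys

log = io.StringIO()        # stands in for the log file in this sketch

saveout = sys.stdout       # remember the real stdout
sys.stdout = log           # redirect all further output
try:
    print("This message will be logged instead of displayed")
finally:
    sys.stdout = saveout   # restored even if the block above raises

logged = log.getvalue()
```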

example 10.10.Redirecting error information
the code of the file stderr.py:
import sys

fsock = open('error.log', 'w')
sys.stderr = fsock
raise Exception, 'this error will be logged'

Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log.

Also note that you're not explicitly closing your log file, nor are you setting stderr back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for us, and it doesn't make any difference that stderr is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for stdout, if you expect to go do other stuff within the same script afterwards.

example 10.11.Printing to stderr
>>> print 'entering function'
entering function
>>> import sys
>>> print >> sys.stderr, 'entering function'
entering function

example 10.12.Chaining commands
E:\book\opensource\python\diveintopython-5.4\py\kgp>type binary.xml | python.exe kgp.py -g -
11100000

This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the next command, which in this case calls your Python script.

example 10.13.Reading from standard input in kgp.py
def openAnything(source):
    if source == "-":
        import sys
        return sys.stdin
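The effect of the "-" convention can be tried without a real pipe by temporarily swapping sys.stdin for an in-memory buffer. This is a Python 3 sketch under that assumption; open_source is a hypothetical helper that only handles "-" and pathnames, not the full openAnything from kgp.py.

```python
import io
import sys
from xml.dom import minidom


def open_source(source):
    """Hypothetical helper: "-" means standard input, anything else is a pathname."""
    if source == "-":
        return sys.stdin
    return open(source)


# Simulate piping a grammar file into the script by swapping in a fake stdin.
real_stdin = sys.stdin
sys.stdin = io.StringIO("<grammar><ref id='bit'>01</ref></grammar>")
try:
    xmldoc = minidom.parse(open_source("-"))
finally:
    sys.stdin = real_stdin   # always put the real stdin back

ref_id = xmldoc.getElementsByTagName("ref")[0].getAttribute("id")
```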

10.3.Caching node lookups
The slow way to look up a ref element by its id would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary.

example 10.14.loadGrammar
def loadGrammar(self, grammar):
    self.grammar = self._load(grammar)
    self.refs = {}
    for ref in self.grammar.getElementsByTagName("ref"):
        self.refs[ref.attributes["id"].value] = ref

example 10.15.Using the ref element cache
def do_xref(self, node):
    id = node.attributes["id"].value
    self.parse(self.randomChildElement(self.refs[id]))

You'll explore the randomChildElement function in the next section.
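The caching idea from this section can be tried on its own, outside the KantGenerator class. The sketch below is Python 3 and uses a small made-up grammar; the document contents and variable names are illustrative only.

```python
from xml.dom import minidom

grammar = minidom.parseString(
    "<grammar>"
    "<ref id='bit'><p>0</p><p>1</p></ref>"
    "<ref id='byte'><p><xref id='bit'/><xref id='bit'/></p></ref>"
    "</grammar>"
).documentElement

# Build the cache once: a single pass over every ref element.
refs = {}
for ref in grammar.getElementsByTagName("ref"):
    refs[ref.attributes["id"].value] = ref

# Later lookups are plain dictionary accesses instead of repeated searches.
xref = grammar.getElementsByTagName("xref")[0]
target = refs[xref.attributes["id"].value]
cached_ids = sorted(refs)
```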

10.4.Finding direct children of a node
You might think you could simply use getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself.

example 10.16.Finding direct child elements
def randomChildElement(self, node):
    choices = [e for e in node.childNodes
               if e.nodeType == e.ELEMENT_NODE]
    chosen = random.choice(choices)
    return chosen

As you saw in Example 9.9, “Getting child nodes”, the childNodes attribute returns a list of all the child nodes of an element.

However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodes contains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the children that are elements.

Each node has a nodeType attribute, which can be ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, or any number of other values.

Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called random which includes several useful functions. The random.choice function takes a list of any number of items and returns a random item. For example, if the ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random.
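The difference between the recursive search and the direct-children filter can be seen on a small made-up document with a nested p element. This is a Python 3 sketch with illustrative data, not an excerpt from kgp.py.

```python
import random
from xml.dom import minidom

doc = minidom.parseString(
    "<ref id='section'>\n"
    "  <p>outer one <p>nested</p></p>\n"
    "  <p>outer two</p>\n"
    "</ref>"
)
ref = doc.documentElement

# Recursive search: the nested p is included in the result.
all_p = ref.getElementsByTagName("p")

# Direct children only, with whitespace text nodes filtered out.
direct = [e for e in ref.childNodes
          if e.nodeType == e.ELEMENT_NODE]

chosen = random.choice(direct)   # one of the two outer p elements
```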

10.5.Creating separate handlers by node type
example 10.17.Class names of parsed XML objects
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('d:/data/kant.xml')
>>> xmldoc
<xml.dom.minidom.Document instance at 0x01473990>
>>> xmldoc.__class__
<class xml.dom.minidom.Document at 0x0138FD80>
>>> xmldoc.__class__.__name__
'Document'

example 10.18.parse, a generic XML node dispatcher
def parse(self, node):
    parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
    parseMethod(node)

example 10.19.Functions called by the parse dispatcher
def parse_Document(self, node):
    self.parse(node.documentElement)

def parse_Text(self, node):
    text = node.data
    if self.capitalizeNextWord:
        self.pieces.append(text[0].upper())
        self.pieces.append(text[1:])
        self.capitalizeNextWord = 0
    else:
        self.pieces.append(text)

def parse_Comment(self, node):
    pass

def parse_Element(self, node):
    handlerMethod = getattr(self, "do_%s" % node.tagName)
    handlerMethod(node)
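The dispatch-by-class-name pattern can be exercised in a self-contained form. The sketch below is Python 3 and deliberately simplified: TextCollector is a hypothetical class, and its parse_Element just recurses into child nodes instead of dispatching to do_ handlers by tag name as kgp.py does.

```python
from xml.dom import minidom


class TextCollector:
    """Hypothetical handler class that dispatches on the node's class name."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Document -> parse_Document, Element -> parse_Element, and so on.
        method = getattr(self, "parse_%s" % node.__class__.__name__)
        method(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Element(self, node):
        # Simplification: visit every child rather than looking up a do_ handler.
        for child in node.childNodes:
            self.parse(child)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        pass  # comments contribute nothing


collector = TextCollector()
collector.parse(minidom.parseString("<p>dive <!-- skip me --><b>in</b></p>"))
text = "".join(collector.pieces)
```

Supporting a new node type then only requires adding a method with the matching name; the parse method itself does not change.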

10.6.Handling command-line arguments
example 10.20.Introducing sys.argv
#argecho.py
import sys

for arg in sys.argv:
    print arg

example 10.21.The contents of sys.argv
E:\book\opensource\python\diveintopython-5.4\py>python.exe argecho.py
argecho.py

E:\book\opensource\python\diveintopython-5.4\py>python.exe argecho.py abc def
argecho.py
abc
def

E:\book\opensource\python\diveintopython-5.4\py>python.exe argecho.py --help
argecho.py
--help

E:\book\opensource\python\diveintopython-5.4\py>python.exe argecho.py -m kant.xml
argecho.py
-m
kant.xml

The first thing to know about sys.argv is that it contains the name of the script you're calling.

example 10.22.Introducing getopt
def main(argv):
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    ...

if __name__ == "__main__":
    main(sys.argv[1:])

First off, look at the bottom of the example and notice that you're calling the main function with sys.argv[1:]. Remember, sys.argv[0] is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off and pass the rest of the list.

This is where all the interesting processing happens. The getopt function of the getopt module takes three parameters: the argument list (which you got from sys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is explained in more detail below.

"hg:d"
-h        print usage summary
-g ...    use specified grammar file or URL
-d        show debugging information while parsing

The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change state (turn on debugging). However, the second flag (-g) must be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address, and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tell getopt this by putting a colon after the g in that second parameter to the getopt function.

To further complicate things, the script accepts either short flags (like -h) or long flags (like --help), and you want them to do the same thing. This is what the third parameter to getopt is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.

["help", "grammar="]
--help           print usage summary
--grammar ...    use specified grammar file or URL

The --grammar flag must always be followed by an additional argument, just like the -g flag. This is notated by an equals sign, "grammar=".
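To see what getopt actually returns for these two specifications, the call can be run against a made-up argument list. The sketch is Python 3; the argument values are hypothetical and only the getopt call itself matches the book.

```python
import getopt

# Hypothetical command line: kgp.py -d -g binary.xml --help extra.xml
argv = ["-d", "-g", "binary.xml", "--help", "extra.xml"]
opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])

# opts holds (flag, argument) pairs; flags without an argument get ''.
# args holds whatever is left after the recognized flags.

# A flag that is not in either specification raises GetoptError.
try:
    getopt.getopt(["-x"], "hg:d", ["help", "grammar="])
    unknown_rejected = False
except getopt.GetoptError:
    unknown_rejected = True
```

Note that getopt stops scanning at the first argument that is not a flag, so everything from extra.xml onward ends up in args.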

example 10.23.Handling command-line arguments in kgp.py
def main(argv):
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg

    source = "".join(args)

    k = KantGenerator(grammar, source)
    print k.output()

The opts variable that you get back from getopt contains a list of tuples: flag and argument. If the flag doesn't take an argument, then arg will simply be an empty string. This makes it easier to loop through the flags.

10.7.Putting it all together