Dive Into Python (8)

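This article introduces how to parse XML documents with Python, covering both the SAX and DOM approaches, and explains in detail how to load and parse an XML file with DOM and how to look up specific elements and their attributes.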

English edition of the book:
http://diveintopython.org/toc/index.html

Chapter 9.XML Processing
9.1.Diving in
There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds.

The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure.
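The contrast between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python 3 syntax (the sessions in these notes are Python 2.6), an inline sample string rather than the book's files, and a hypothetical handler class named TagCollector.

```python
# Sketch only: Python 3 syntax, inline sample XML (not the book's files).
import xml.sax
from xml.dom import minidom

SAMPLE = b"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"


class TagCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records each element name as SAX reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # SAX calls this once per opening tag, in document order.
        self.tags.append(name)


# SAX: the parser streams through the input and fires callbacks as it goes.
collector = TagCollector()
xml.sax.parseString(SAMPLE, collector)
sax_tags = collector.tags

# DOM: the whole document is loaded into a tree that you walk afterwards.
dom_root = minidom.parseString(SAMPLE).documentElement
dom_tags = [dom_root.tagName] + [
    element.tagName for element in dom_root.getElementsByTagName("*")
]

print(sax_tags)
print(dom_tags)
```

Both lists contain the same element names; the difference is that SAX never holds the whole document in memory, while DOM keeps a tree you can revisit.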

example 9.1. kgp.py
example 9.2. toolbox.py
example 9.3.Sample output of kgp.py
python.exe kgp.py

example 9.4.Simpler output from kgp.py
python.exe kgp.py -g binary.xml
00011011

You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.

9.2.Packages
example 9.5.Loading an XML document(a sneak peek)
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse("d:/data/binary.xml")

This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom.

example 9.6.File layout of a package
Python26            root Python installation (home of the executable)
|
+--lib/             library directory (home of the standard library modules)
   |
   +--xml/          xml package (really just a directory with other stuff in it)
      |
      +--sax/       xml.sax package (again, just a directory)
      |
      +--dom/       xml.dom package (contains minidom.py)
      |
      +--parsers/   xml.parsers package (used internally)
So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”.

example 9.7.Packages are modules, too
>>> from xml.dom import minidom
>>> minidom
<module 'xml.dom.minidom' from 'C:\Python26\lib\xml\dom\minidom.py'>
>>> minidom.Element
<class xml.dom.minidom.Element at 0x02E625D0>
>>> from xml.dom.minidom import Element
>>> Element
<class xml.dom.minidom.Element at 0x02E625D0>
>>> from xml import dom
>>> dom
<module 'xml.dom' from 'C:\Python26\lib\xml\dom\__init__.py'>
>>> import xml
>>> xml
<module 'xml' from 'C:\Python26\lib\xml\__init__.pyc'>

Here you're importing a module (minidom) from a nested package (xml.dom).

Here you are importing the dom package (a nested package of xml) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even have its own attributes and methods, just like the modules you've seen before.

So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file.

A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, but it has to exist. But if __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
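The relationship between a package and its __init__.py file can be checked from the interpreter. The following is a minimal sketch in Python 3 syntax (an assumption; the notes use Python 2.6), and the variable names are illustrative only. Note that Python 3.3 and later can also import a directory without __init__.py as a namespace package, so the "it has to exist" rule describes the Python version used in these notes.

```python
# Sketch only: inspects the standard library's xml.dom package on Python 3.
import os
import xml.dom
from xml.dom import minidom

# A package's module object is backed by its __init__ file...
package_file = os.path.basename(xml.dom.__file__)
# ...while a plain module is backed by its own file.
module_file = os.path.basename(minidom.__file__)

# Names defined in xml/dom/__init__.py become attributes of the package.
has_node_class = hasattr(xml.dom, "Node")

# Only packages carry a __path__ attribute (where submodules are looked up).
dom_is_package = hasattr(xml.dom, "__path__")
minidom_is_package = hasattr(minidom, "__path__")

print(package_file, module_file, has_node_class, dom_is_package, minidom_is_package)
```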

9.3.Parsing XML
example 9.8.Loading an XML document(for real this time)
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse("d:/data/binary.xml")
>>> xmldoc
<xml.dom.minidom.Document instance at 0x0152FAD0>
>>> print xmldoc.toxml()
<?xml version="1.0" ?><!DOCTYPE grammar PUBLIC '-//diveintopython.org//DTD Kant Generator Pro v1.0//EN' 'kgp.dtd'><grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

The object returned from minidom.parse is a Document object, a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to minidom.parse.

example 9.9.Getting child nodes
>>> xmldoc.childNodes
[<xml.dom.minidom.DocumentType instance at 0x0152FB48>, <DOM Element: grammar at 0x152fbc0>]
>>> xmldoc.childNodes[0]
<xml.dom.minidom.DocumentType instance at 0x0152FB48>
>>> xmldoc.firstChild
<xml.dom.minidom.DocumentType instance at 0x0152FB48>
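
Every Node has a childNodes attribute, which is a list of Node objects. A Document has exactly one root element (in this case, the grammar element), but it can have other child nodes as well: here the DOCTYPE declaration comes first as a DocumentType node, so the grammar element is childNodes[1].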

Since getting the first child node of a node is a useful and common activity, the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].)
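The child-node attributes described above can be tried without a file on disk. This is a sketch under assumptions: Python 3 syntax, and an inline string standing in for binary.xml (shortened to one ref element).

```python
# Sketch only: inline stand-in for binary.xml, parsed with Python 3's minidom.
from xml.dom import minidom

BINARY_XML = """<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

# The Document's children: the DOCTYPE node first, then the root element.
child_types = [type(node).__name__ for node in xmldoc.childNodes]

# firstChild and lastChild are shortcuts for childNodes[0] and childNodes[-1].
first_is_index_zero = xmldoc.firstChild is xmldoc.childNodes[0]
last_is_index_minus_one = xmldoc.lastChild is xmldoc.childNodes[-1]

# documentElement always points at the root element, whatever precedes it.
root = xmldoc.documentElement

print(child_types, root.tagName)
```

Using documentElement avoids depending on whether a DOCTYPE node happens to come first.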

example 9.10.toxml works on any node
>>> node1 = xmldoc.firstChild
>>> node2 = xmldoc.childNodes[1]
>>> print node1.toxml()
<!DOCTYPE grammar PUBLIC '-//diveintopython.org//DTD Kant Generator Pro v1.0//EN' 'kgp.dtd'>
>>> print node2.toxml()
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element.

example 9.11.Child nodes can be text
>>> node2.childNodes
[<DOM Text node "u'\n'">, <DOM Element: ref at 0x152fbe8>, <DOM Text node "u'\n'">, <DOM Element: ref at 0x152fe40>, <DOM Text node "u'\n'">]
>>> print node2.firstChild.toxml()

>>> print node2.childNodes[1].toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print node2.lastChild.toxml()

>>>

Looking at the XML in binary.xml, you might think that the grammar has only two child nodes, the two ref elements. But you're missing something: the line breaks! After the '<grammar>' and before the first '<ref>' is a line break (a newline character), and this text counts as a child node of the grammar element. Similarly, there is a line break after each '</ref>'; these also count as child nodes. So grammar.childNodes is actually a list of 5 objects: 3 Text objects and 2 Element objects.

The first child is a Text object representing the line break after the '<grammar>' tag and before the first '<ref>' tag.
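Because whitespace between tags shows up as Text nodes, code that only cares about elements usually filters by node type. A minimal sketch, assuming Python 3 syntax and a shortened inline grammar (the ref elements are left empty here for brevity):

```python
# Sketch only: separating whitespace Text nodes from Element nodes.
from xml.dom import minidom

GRAMMAR = "<grammar>\n<ref id='bit'/>\n<ref id='byte'/>\n</grammar>"
grammar = minidom.parseString(GRAMMAR).documentElement

all_children = grammar.childNodes

# Keep only the element children.
elements_only = [
    node for node in all_children if node.nodeType == node.ELEMENT_NODE
]

# The remaining children are Text nodes holding nothing but whitespace.
whitespace_only = [
    node
    for node in all_children
    if node.nodeType == node.TEXT_NODE and not node.data.strip()
]

print(len(all_children), len(elements_only), len(whitespace_only))
```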

example 9.12.Drilling down all the way to text
>>> node2
<DOM Element: grammar at 0x152fbc0>
>>> refNode = node2.childNodes[1]
>>> refNode
<DOM Element: ref at 0x152fbe8>
>>> refNode.childNodes
[<DOM Text node "u'\n '">, <DOM Element: p at 0x152fcd8>, <DOM Text node "u'\n '">, <DOM Element: p at 0x152fd78>, <DOM Text node "u'\n'">]
>>> pNode = refNode.childNodes[1]
>>> pNode
<DOM Element: p at 0x152fcd8>
>>> pNode.toxml()
u'<p>0</p>'
>>> pNode.firstChild
<DOM Text node "u'0'">
>>> pNode.firstChild.data
u'0'
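The drill-down above can be wrapped in a small helper. This is a sketch, not part of the book: text_of is a hypothetical function name, the XML is an inline stand-in for the first ref element, and the syntax is Python 3.

```python
# Sketch only: reading the text inside elements with a hypothetical helper.
from xml.dom import minidom

doc = minidom.parseString("<ref id='bit'>\n  <p>0</p>\n  <p>1</p>\n</ref>")
ref_node = doc.documentElement


def text_of(element):
    """Join the data of an element's direct Text children."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


p_node = ref_node.childNodes[1]      # index 0 is the whitespace Text node
p_text = p_node.firstChild.data      # same drill-down as in the session above
all_bits = [text_of(p) for p in ref_node.getElementsByTagName("p")]

print(p_node.toxml(), p_text, all_bits)
```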

9.4.Unicode
Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.

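To represent characters from many languages in a single document, unicode assigns each character a unique number. The book describes this as a 2-byte number from 0 to 65535, and every character in these examples fits in that range; current versions of Unicode also define code points beyond 65535 (up to 0x10FFFF).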

example 9.13.Introducing unicode
>>> s = u'Dive In'
>>> s
u'Dive In'
>>> print s
Dive In
>>> a = u'中国'
>>> print a
中国
>>> a
u'\u4e2d\u56fd'

To create a unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode.

When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference.

example 9.14.Storing non-ASCII characters
>>> s = u'La Pe\xf1a'
>>> print s
La Peña

The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ñ” (n with a tilde over it). The unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1.
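For readers on Python 3, where every str is already unicode and the u prefix is optional, the same character codes can be checked as follows. This is a sketch in Python 3 syntax, which is an assumption since the sessions above are Python 2.

```python
# Sketch only: character codes in Python 3, where str is always unicode.
s = "La Pe\xf1a"                 # \xf1 is the escape for the tilde-n
code_point = ord("\xf1")         # numeric code of that character
same_char = "\xf1" == "\u00f1" == chr(241)

han = "中国"
# unicode_escape shows the \uXXXX form that the Python 2 repr displayed.
han_escaped = han.encode("unicode_escape").decode("ascii")

print(s, code_point, han_escaped)
```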

example 9.15.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

sys.setdefaultencoding('iso-8859-1'): The setdefaultencoding function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.
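Python 3 has no sys.setdefaultencoding and never coerces text to bytes implicitly; instead, you name the encoding at the point of conversion. The sketch below assumes Python 3 and is meant only to show what the different encodings do to the same string.

```python
# Sketch only: explicit encoding in Python 3 instead of a global default.
import sys

default_encoding = sys.getdefaultencoding()   # 'utf-8' on Python 3

s = "La Pe\xf1a"
latin1_bytes = s.encode("iso-8859-1")   # one byte per character here
utf8_bytes = s.encode("utf-8")          # the tilde-n takes two bytes

# ASCII cannot represent the tilde-n, so this conversion fails.
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

round_trip = latin1_bytes.decode("iso-8859-1")

print(default_encoding, latin1_bytes, utf8_bytes, ascii_ok)
```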

example 9.17.Specifying encoding in .py files
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
#coding=utf-8

example 9.19.Parsing russiansample.xml
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('d:/data/russiansample.xml')
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
>>> print title
Предисловие
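The same result can be reproduced without the sample file. This is a sketch under assumptions: Python 3 syntax, and a one-element inline document encoded as UTF-8, which is a hypothetical stand-in for russiansample.xml (its real contents and encoding are not shown in these notes).

```python
# Sketch only: parsing non-ASCII XML from bytes with a declared encoding.
from xml.dom import minidom

RUSSIAN_XML = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<preface><title>Предисловие</title></preface>"
).encode("utf-8")

xmldoc = minidom.parseString(RUSSIAN_XML)
title = xmldoc.getElementsByTagName("title")[0].firstChild.data

# The parser hands back unicode text regardless of the file's encoding.
escaped = title.encode("unicode_escape").decode("ascii")

print(title)
print(escaped)
```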

9.5.Searching for elements
If you're looking for something in particular, buried deep within your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.

example 9.20.binary.xml
<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits.

example 9.21.Introducing getElementsByTagName
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('d:/data/binary.xml')
>>> reflist = xmldoc.getElementsByTagName("ref")
>>> reflist
[<DOM Element: ref at 0x13d0b98>, <DOM Element: ref at 0x13d0e18>]
>>> print reflist[0].toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print reflist[1].toxml()
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>

getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements.
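A self-contained version of this lookup, for experimenting without binary.xml on disk. It is a sketch in Python 3 syntax; the inline string is a stand-in for the file, and "nosuchtag" is a made-up name used only to show the empty result.

```python
# Sketch only: getElementsByTagName on an inline stand-in for binary.xml.
from xml.dom import minidom

BINARY_XML = (
    "<grammar>\n"
    '<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>\n'
    '<ref id="byte">\n  <p>' + '<xref id="bit"/>' * 8 + "</p>\n</ref>\n"
    "</grammar>"
)

xmldoc = minidom.parseString(BINARY_XML)

# Returns a list-like NodeList of every matching element.
reflist = xmldoc.getElementsByTagName("ref")
ref_ids = [ref.getAttribute("id") for ref in reflist]

# A name that matches nothing gives an empty list rather than an error.
missing = xmldoc.getElementsByTagName("nosuchtag")

print(ref_ids, len(missing))
```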

example 9.22.Every element is searchable
>>> firstref = reflist[0]
>>> print firstref.toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> plist = firstref.getElementsByTagName("p")
>>> plist
[<DOM Element: p at 0x13d0c88>, <DOM Element: p at 0x13d0d50>]
>>> print plist[0].toxml()
<p>0</p>
>>> print plist[1].toxml()
<p>1</p>

example 9.23.Searching is actually recursive
>>> plist = xmldoc.getElementsByTagName("p")
>>> plist
[<DOM Element: p at 0x13d0c88>, <DOM Element: p at 0x13d0d50>, <DOM Element: p at 0x13d0f30>]
>>> plist[0].toxml()
u'<p>0</p>'
>>> plist[1].toxml()
u'<p>1</p>'
>>> plist[2].toxml()
u'<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'

Note carefully the difference between this and the previous example. Previously, you were searching for p elements within firstref, but here you are searching for p elements within xmldoc, the root-level object that represents the entire XML document. This does find the p elements nested within the ref elements within the root grammar element.
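The difference in scope can be made explicit by counting matches. A sketch in Python 3 syntax, using an inline stand-in for binary.xml:

```python
# Sketch only: element-scoped versus document-scoped searches.
from xml.dom import minidom

BINARY_XML = (
    "<grammar>\n"
    '<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>\n'
    '<ref id="byte">\n  <p>' + '<xref id="bit"/>' * 8 + "</p>\n</ref>\n"
    "</grammar>"
)

xmldoc = minidom.parseString(BINARY_XML)
grammar = xmldoc.documentElement
firstref = xmldoc.getElementsByTagName("ref")[0]

# Searching from one element only looks beneath that element.
p_in_first_ref = firstref.getElementsByTagName("p")

# Searching from the document descends through every level.
p_in_document = xmldoc.getElementsByTagName("p")

# For comparison: no p element is a *direct* child of grammar.
direct_p_children = [
    node
    for node in grammar.childNodes
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "p"
]

print(len(p_in_first_ref), len(p_in_document), len(direct_p_children))
```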

9.6.Accessing element attributes
Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents.

example 9.24.Accessing element attributes
>>> xmldoc = minidom.parse("d:/data/binary.xml")
>>> reflist = xmldoc.getElementsByTagName("ref")
>>> bitref = reflist[0]
>>> print bitref.toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> bitref.attributes
<xml.dom.minidom.NamedNodeMap object at 0x013EBAD0>
>>> bitref.attributes.keys()
[u'id']
>>> bitref.attributes.values()
[<xml.dom.minidom.Attr instance at 0x013DFE18>]
>>> bitref.attributes["id"]
<xml.dom.minidom.Attr instance at 0x013DFE18>

Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it.
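The dictionary-like behavior can be sketched as follows, assuming Python 3 syntax and a single inline ref element in place of binary.xml. In Python 3, keys() returns a view rather than a list, so it is wrapped in list() here.

```python
# Sketch only: treating a NamedNodeMap like a dictionary.
from xml.dom import minidom

bitref = minidom.parseString('<ref id="bit"><p>0</p><p>1</p></ref>').documentElement

attrs = bitref.attributes          # a NamedNodeMap
names = list(attrs.keys())         # attribute names
pairs = attrs.items()              # (name, value) pairs
count = len(attrs)
id_node = attrs["id"]              # lookup by name returns an Attr object

print(names, pairs, count, type(id_node).__name__)
```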

example 9.25.Accessing individual attributes
>>> a = bitref.attributes["id"]
>>> a
<xml.dom.minidom.Attr instance at 0x013DFE18>
>>> a.name
u'id'
>>> a.value
u'bit'
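When only the value is needed, minidom's Element also offers getAttribute, which skips the intermediate Attr object. A short sketch in Python 3 syntax; the attribute name "nosuch" is made up to show the behavior for a missing attribute.

```python
# Sketch only: Attr objects versus the getAttribute shortcut.
from xml.dom import minidom

bitref = minidom.parseString('<ref id="bit"/>').documentElement

id_attr = bitref.attributes["id"]
name, value = id_attr.name, id_attr.value

# getAttribute returns the value directly, or "" if the attribute is absent.
shortcut = bitref.getAttribute("id")
absent = bitref.getAttribute("nosuch")
has_id = bitref.hasAttribute("id")

print(name, value, shortcut, repr(absent), has_id)
```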