Core Java (Volume II) 03: Chapter 2 - XML

Latest recommended article published on 2022-03-17 23:15:03

License: CC 4.0 BY-SA

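This article explains the main approaches to parsing XML, covering the DOM, SAX, and StAX parsers and the use of XPath to locate data precisely. It also discusses the constraints expressed by DTDs and XML Schema, and shows how to enable XML validation in code so that documents can be checked for correctness.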

XML

Document Object Model (DOM) parser / tree parser

We use a DOM parser to parse the following XML document:

<?xml version="1.0" encoding="utf-8"?>
<font>
	<name>Helvetica</name>
	<size unit="pt">36</size>
</font>

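Reading the XML document with a DocumentBuilder

// Get a document builder factory instance
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Get a document builder instance
DocumentBuilder builder = factory.newDocumentBuilder();
// Read a document from a file
File f = ...;
Document document = builder.parse(f);
// Read from a URL (parse has no URL overload, so pass the URL as a string)
URL u = ...;
document = builder.parse(u.toString());
// Read from an input stream
InputStream in = ...;
document = builder.parse(in);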

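Parsing the document

// Get the root element
Element root = document.getDocumentElement();
// Get all child nodes of the root element
NodeList nodes = root.getChildNodes();
// Get the number of child nodes
int length = nodes.getLength();

In this example the root has more than 2 child nodes. The list also includes the whitespace between <font> and <name>, between </name> and <size>, and between </size> and </font>, which the parser reports as text nodes. We need to filter out this whitespace when processing the children.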

// Iterate over the child nodes
for (int i = 0; i < length; i++) {
    // Get the child node
    Node node = nodes.item(i);
    // Skip the whitespace text nodes
    if (node instanceof Element) {
        Element child = (Element) node;
        // Get the tag name
        String tagName = child.getTagName();
        // <name> is a leaf element, so we get its text node directly
        Text text = (Text) child.getFirstChild(); // getLastChild() works as well
        // Get the text content; trim() because the text may not be on the same line as the tags
        String data = text.getData().trim();
        if (tagName.equals("size")) {
            // Get the attributes of the element
            NamedNodeMap attributes = child.getAttributes();
            // Iterate over the attributes
            for (int j = 0; j < attributes.getLength(); j++) {
                Node attribute = attributes.item(j);
                String aName = attribute.getNodeName();
                String aValue = attribute.getNodeValue();
            }
            // The value of a single attribute can also be read directly
            String unit = child.getAttribute("unit");
        }
    }
}
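The fragments above can be put together into a small runnable sketch. It assumes the sample document is held in a string rather than a file, and the class and method names (DomFontDemo, parseRoot, readFont) are made up for illustration. It uses getTextContent() instead of casting the first child to Text, which is a small simplification of the approach in the text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomFontDemo {
    // The sample document from the text, kept in a string (an assumption for self-containment).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<font>\n"
        + "\t<name>Helvetica</name>\n"
        + "\t<size unit=\"pt\">36</size>\n"
        + "</font>\n";

    // Parses the XML text and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Collects tag name -> trimmed text for each child element, plus the unit attribute if present.
    static Map<String, String> readFont(Element root) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList nodes = root.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (!(node instanceof Element)) {
                continue; // skip whitespace text nodes
            }
            Element child = (Element) node;
            result.put(child.getTagName(), child.getTextContent().trim());
            if (child.hasAttribute("unit")) {
                result.put("unit", child.getAttribute("unit"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Element root = parseRoot(XML);
        // Child count includes the whitespace text nodes around <name> and <size>.
        System.out.println(root.getChildNodes().getLength());
        System.out.println(readFont(root));
    }
}
```

The child count printed first includes the whitespace text nodes, which is why the instanceof check is needed before casting to Element.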

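That is the whole process of traversing an XML document. The inheritance hierarchy of the Node interface is as follows:

[Figure: the Node interface and its subinterfaces]

Every node is a Node, and every attribute of a node is also a Node.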

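Validating XML documents

DTD

An XML document usually starts with a header, such as: <?xml version="1.0"?>

The header is usually followed by a document type definition (DTD), such as:

<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN" "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">

A DTD is an important mechanism for ensuring that a document is correct, but it is not required.

1. Constraining elements: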
```
<!ELEMENT font (name,size)>
<!ELEMENT name (#PCDATA)>
```
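This requires a <font> element to contain a <name> element followed by a <size> element, and the content of a <name> element to be text. Element content can be described with regular-expression-like rules:

| Rule | Meaning |
| --- | --- |
| E* | 0 or more E |
| E+ | 1 or more E |
| E? | 0 or 1 E |
| E1\|E2\|…\|En | One of E1, E2, …, En |
| E1,E2,…En | E1 followed by E2, …, En |
| (#PCDATA\|E1\|E2…\|En)* | 0 or more occurrences of text and E1, E2, …, En in any order (mixed content) |
| ANY | Any child elements allowed |
| EMPTY | No child elements allowed |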

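2. Constraining attributes

<!ATTLIST size unit CDATA #REQUIRED>

This requires the <size> element to have a unit attribute.

Attribute value types:

| Type | Meaning | Explanation |
| --- | --- | --- |
| CDATA | Any character string | |
| (A1\|A2\|…\|An) | One of the string attributes A1, A2, …, An | |
| NMTOKEN, NMTOKENS | One or more name tokens | NMTOKENS is a whitespace-separated list of tokens |
| ID | A unique ID | A name token that is unique in the document; the parser checks its uniqueness |
| IDREF, IDREFS | One or more references to a unique ID | IDREF refers to an ID that exists in the same document; IDREFS is a whitespace-separated list of ID references |
| ENTITY, ENTITIES | One or more unparsed entities | References to unparsed external entities |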

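Entities

Entity definition: <!ENTITY back.label "back">

Entity reference: <menuitem label="&back.label;"/>. The parser replaces the entity reference with the string "back". This is handy for internationalization, because only the string in the entity definition has to change.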
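As a rough illustration of entity replacement, the sketch below declares the entity in an internal DTD subset and reads the attribute back through DOM. The enclosing <menu> element and the names EntityDemo and firstLabel are hypothetical, added only to make the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EntityDemo {
    // Hypothetical document: the entity from the text, declared in an internal DTD subset.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE menu [\n"
        + "  <!ENTITY back.label \"back\">\n"
        + "]>\n"
        + "<menu><menuitem label=\"&back.label;\"/></menu>";

    // Returns the label attribute of the first <menuitem>, after entity replacement.
    static String firstLabel(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element item = (Element) doc.getElementsByTagName("menuitem").item(0);
            return item.getAttribute("label");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLabel(XML)); // prints "back"
    }
}
```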

Enabling XML validation in code

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
// Turn on validation during parsing
documentBuilderFactory.setValidating(true);
// Ignore whitespace between elements
documentBuilderFactory.setIgnoringElementContentWhitespace(true);
// Get the builder only after validation is turned on; otherwise it will not validate
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
// Install an error handler
documentBuilder.setErrorHandler(new ErrorHandler() {
    @Override
    public void warning(SAXParseException exception) throws SAXException {
        exception.printStackTrace();
    }

    @Override
    public void error(SAXParseException exception) throws SAXException {
        exception.printStackTrace();
    }

    @Override
    public void fatalError(SAXParseException exception) throws SAXException {
        exception.printStackTrace();
    }
});
Document document = documentBuilder.parse("config.xml");
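To see the effect of these settings without a config.xml on disk, here is a minimal sketch that validates two in-memory documents against the font DTD from earlier, placed in an internal subset. The <size> element declaration, the sample documents, and the names DtdValidationDemo and parse are assumptions made for the example; validity errors are collected in a list instead of being printed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // Internal DTD subset built from the rules shown earlier (the <size> rule is an assumption).
    static final String DTD =
        "<!DOCTYPE font [\n"
        + "  <!ELEMENT font (name,size)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT size (#PCDATA)>\n"
        + "  <!ATTLIST size unit CDATA #REQUIRED>\n"
        + "]>\n";
    static final String VALID = DTD
        + "<font>\n  <name>Helvetica</name>\n  <size unit=\"pt\">36</size>\n</font>";
    // Same document without the required unit attribute.
    static final String INVALID = DTD
        + "<font>\n  <name>Helvetica</name>\n  <size>36</size>\n</font>";

    // Parses with DTD validation on; validity problems are appended to errors.
    static Document parse(String xml, List<String> errors) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    errors.add("warning: " + e.getMessage());
                }

                @Override
                public void error(SAXParseException e) {
                    errors.add("error: " + e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // the document is not well-formed, so stop parsing
                }
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        Document doc = parse(VALID, errors);
        System.out.println(errors.isEmpty());
        // With validation on, the whitespace between elements is dropped from the tree.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());

        List<String> errors2 = new ArrayList<>();
        parse(INVALID, errors2);
        System.out.println(errors2);
    }
}
```

Because the parser knows from the DTD that <font> has element-only content, the whitespace nodes disappear and the instanceof filtering used earlier is no longer needed.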

XML Schema

To reference a Schema file in a document, add attributes to the root element:

<?xml version="1.0" encoding="utf-8"?>
<font xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="config.xsd">
	...
</font>

This declaration says that the Schema file config.xsd should be used to validate the document.

Built-in simple types:
```
xsd:string
xsd:int
xsd:boolean
```

Custom simple types:

<!-- An enumerated type -->
<xsd:simpleType name="styleType">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="PLAIN"/>
    <xsd:enumeration value="BOLD"/>
    <xsd:enumeration value="ITALIC"/>
  </xsd:restriction>
</xsd:simpleType>

Defining elements

<xsd:element name="name" type="xsd:string"/>
<xsd:element name="size" type="xsd:int"/>

An element definition must specify a type.

Complex types
```
<xsd:complexType name="FontType">
  <xsd:sequence>
    <xsd:element ref="name"/>
    <xsd:element ref="size"/>
    <xsd:element ref="style"/>
  </xsd:sequence>
</xsd:complexType>
```
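When defining a complex type, use ref to refer to definitions located elsewhere in the Schema. FontType is a sequence of <name>, <size>, and <style>. <xsd:sequence> is equivalent to the concatenation operator (',') in a DTD.

<xsd:choice>

<xsd:complexType name="contactinfo">
  <xsd:choice>
    <xsd:element ref="phone"/>
    <xsd:element ref="email"/>
  </xsd:choice>
</xsd:complexType>

This is equivalent to the | operator in a DTD.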

minOccurs and maxOccurs
```
<xsd:element name="item" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
```
This is equivalent to (item)* in a DTD: the <item> element may occur 0 or more times within its parent element.

Specifying attributes

<xsd:element name="size">
  <xsd:complexType>
    <xsd:attribute name="unit" type="xsd:string" use="optional" default="cm"/>
  </xsd:complexType>
</xsd:element>

To specify attributes, place xsd:attribute elements inside a complexType definition.

Enclosing the definitions

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
	...
</xsd:schema>

Parsing an XML file that has a Schema requires two things:

Namespace support must be turned on, even if the XML file does not use namespaces.

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);

The factory must be prepared for handling Schemas with the following "magic incantation":

final String JAXP_SCHEMA_LANGUAGE = "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
documentBuilderFactory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
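For a quick self-contained check of a Schema, one alternative is the javax.xml.validation API, which is a different route from the factory attribute shown above: the Schema is compiled from a source and a Validator is run over the document. The sketch below assumes a tiny in-memory Schema for <font> (without the unit attribute, to keep it short); the names SchemaDemo and isValid are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaDemo {
    // Hypothetical Schema: <font> is a sequence of a string <name> and an int <size>.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xsd:element name=\"font\">\n"
        + "    <xsd:complexType>\n"
        + "      <xsd:sequence>\n"
        + "        <xsd:element name=\"name\" type=\"xsd:string\"/>\n"
        + "        <xsd:element name=\"size\" type=\"xsd:int\"/>\n"
        + "      </xsd:sequence>\n"
        + "    </xsd:complexType>\n"
        + "  </xsd:element>\n"
        + "</xsd:schema>";
    static final String GOOD = "<font><name>Helvetica</name><size>36</size></font>";
    static final String BAD = "<font><name>Helvetica</name><size>big</size></font>";

    // Returns true if xml conforms to the Schema text in xsd.
    // A Schema that fails to compile also yields false in this sketch.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no error handler installed, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(XSD, GOOD)); // prints "true"
        System.out.println(isValid(XSD, BAD));  // prints "false"
    }
}
```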

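Using XPath to locate information

The XPath language makes it easy to access tree nodes. Instead of traversing the nodes of the XML document, we can reach node information directly with an expression.

// Create an XPath factory and use it to instantiate an XPath object
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath path = xPathFactory.newXPath();
// Evaluate an XPath expression to access node information
String size = path.evaluate("/font/size", document);
// @attrName yields the value of an attribute
String sizeUnit = path.evaluate("/font/size/@unit", document);
// [] selects a particular element; this gets the text of the first row child of gridbag (indexes start at 1)
String row1 = path.evaluate("/gridbag/row[1]", document);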

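Main methods of the javax.xml.xpath.XPath interface:

String evaluate(String expression, Object item)
Object evaluate(String expression, Object item, QName returnType)

Here returnType can be one of the static fields of javax.xml.xpath.XPathConstants: STRING, NODE, NODESET, NUMBER, or BOOLEAN.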
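The three-argument evaluate can be tried out with a short sketch like the following, which runs a few expressions against the font document with different return types. The in-memory document and the names XPathDemo, parse, and evaluate are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    static final String XML =
        "<font><name>Helvetica</name><size unit=\"pt\">36</size></font>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Evaluates expression against doc and returns the result in the requested type.
    static Object evaluate(Document doc, String expression, QName returnType) {
        try {
            XPath path = XPathFactory.newInstance().newXPath();
            return path.evaluate(expression, doc, returnType);
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        // STRING gives the text value, NUMBER converts it to a Double,
        // NODESET returns every matching node as a NodeList.
        String unit = (String) evaluate(doc, "/font/size/@unit", XPathConstants.STRING);
        Double size = (Double) evaluate(doc, "/font/size", XPathConstants.NUMBER);
        NodeList children = (NodeList) evaluate(doc, "/font/*", XPathConstants.NODESET);
        System.out.println(unit + " " + size + " " + children.getLength());
    }
}
```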

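Using namespaces

XML uses namespaces to avoid name clashes; namespaces apply to both element names and attribute names. A namespace is identified by a Uniform Resource Identifier (URI), and the HTTP URL form is the most common.

<element xmlns="namespaceURI">
	children
</element>

The <element> element and its children all belong to the namespaceURI namespace.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
	<xsd:element name="gridbag" type="GridBagType"/>
</xsd:schema>

xmlns:prefix="namespaceURI" defines a namespace together with a prefix. In the example above, xsd is the prefix, so xsd:schema actually means schema in the namespace http://www.w3.org/2001/XMLSchema.

The DOM parser is not namespace-aware by default, so namespace awareness has to be turned on:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);

Each node then has three properties:

The qualified name with prefix (xsd:schema): returned by the getNodeName and getTagName methods.
The namespace URI (http://www.w3.org/2001/XMLSchema): returned by the getNamespaceURI method.
The local name without prefix or namespace (schema): returned by the getLocalName method.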
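These three properties can be inspected with a namespace-aware parser, as in the minimal sketch below. The one-element document and the names NamespaceDemo and describeRoot are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String XML =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"/>";

    // Returns { qualified name, namespace URI, local name } of the root element.
    static String[] describeRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // must be set before the builder is created
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return new String[] {
                root.getNodeName(), root.getNamespaceURI(), root.getLocalName()
            };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String part : describeRoot(XML)) {
            System.out.println(part);
        }
    }
}
```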

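Streaming parsers

A DOM parser reads the entire XML document and builds a tree data structure. If the document is large and the processing algorithm is simple, a DOM parser is inefficient, and a streaming parser should be used instead.

SAX parser

A SAX parser parses an XML document through event callbacks. The main callback methods are:

startElement, endElement
characters
startDocument, endDocument

These methods are defined in the org.xml.sax.ContentHandler interface, and we override the ones we need. We can extend the org.xml.sax.helpers.DefaultHandler class, which implements this interface and provides empty implementations of the callback methods above.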

DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startDocument() {
        ...
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // uri - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
        // localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
        // qName - The qualified name (with prefix), or the empty string if qualified names are not available.
        // attributes - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // ch - The characters.
        // start - The start position in the character array.
        // length - The number of characters to use from the character array.
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // uri - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
        // localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
        // qName - The qualified name (with prefix), or the empty string if qualified names are not available.
    }

    @Override
    public void endDocument() {
        ...
    }
};

Using the parser:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
// Process the document
parser.parse(source, handler);
// source can be a file, a URL string, or an input stream.
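To make the callback order concrete, here is a small sketch that records the events fired for the font document. The event labels ("start:", "text:", "end:") and the names SaxEventsDemo and collectEvents are invented for the example; text is buffered because characters may be called several times for one run of text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    static final String XML =
        "<font>\n\t<name>Helvetica</name>\n\t<size unit=\"pt\">36</size>\n</font>";

    // Returns the sequence of element and text events, skipping whitespace-only text.
    static List<String> collectEvents(String xml) {
        List<String> events = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                text.setLength(0); // discard whitespace seen before this tag
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String content = text.toString().trim();
                if (!content.isEmpty()) {
                    events.add("text:" + content);
                }
                text.setLength(0);
                events.add("end:" + qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(collectEvents(XML));
    }
}
```

Nothing is kept in memory apart from what the handler chooses to store, which is the main difference from the DOM approach.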

StAX parser

We use the following basic loop to iterate over all events:

InputStream in = url.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(in);
while (parser.hasNext()) {
    int event = parser.next();
    switch (event) {
        case XMLStreamConstants.START_DOCUMENT:
            ...;
            break;
        case XMLStreamConstants.START_ELEMENT:
            parser.getLocalName();
            break;
        ...
    }
}
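A filled-in version of this loop might look like the sketch below, which pulls the text and attributes of the font document into a map. It reads from a string instead of a URL, and the map key format ("size@unit") and the names StaxDemo and readValues are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxDemo {
    static final String XML =
        "<font>\n\t<name>Helvetica</name>\n\t<size unit=\"pt\">36</size>\n</font>";

    // Maps element name -> text and "element@attribute" -> attribute value.
    static Map<String, String> readValues(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to deliver adjacent character data as one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
            String current = null; // element whose start tag was seen most recently
            while (parser.hasNext()) {
                int event = parser.next();
                switch (event) {
                    case XMLStreamConstants.START_ELEMENT:
                        current = parser.getLocalName();
                        for (int i = 0; i < parser.getAttributeCount(); i++) {
                            result.put(current + "@" + parser.getAttributeLocalName(i),
                                       parser.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (current != null && !parser.isWhiteSpace()) {
                            result.merge(current, parser.getText().trim(), String::concat);
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        current = null;
                        break;
                    default:
                        break;
                }
            }
            parser.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(readValues(XML));
    }
}
```

Unlike SAX, the code here asks the parser for the next event, so the processing logic stays in one ordinary loop rather than being spread across callbacks.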