1 Understanding JAXP
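JAXP stands for Java API for XML Processing. The core of the JAXP API lives in the javax.xml.parsers package. This package provides two key factory classes, SAXParserFactory and DocumentBuilderFactory, along with the two parser classes they create, SAXParser and DocumentBuilder.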
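The relationship between the factories and the parsers they create can be sketched as follows. This is a minimal illustration and not part of the original listing: the class name JaxpFactoriesSketch and its helper methods are made up for this example, and turning on namespace awareness is just one possible configuration choice.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, used only for illustration.
class JaxpFactoriesSketch {

    // SAXParserFactory -> SAXParser
    static SAXParser newSaxParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // configure the factory before creating parsers
        return factory.newSAXParser();
    }

    // DocumentBuilderFactory -> DocumentBuilder
    static DocumentBuilder newDomBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    public static void main(String[] args) throws Exception {
        // The concrete class names depend on the JAXP implementation in use.
        System.out.println("SAX parser:  " + newSaxParser().getClass().getName());
        System.out.println("DOM builder: " + newDomBuilder().getClass().getName());
    }
}
```

Settings are applied to the factory, and every parser created from it afterwards inherits them, which is why the configuration calls come before newSAXParser() and newDocumentBuilder().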
SAX was defined by the XML-DEV mailing list; DOM was defined by the W3C. Let's take a look at the API packages involved.
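javax.xml.parsers
The JAXP API itself; defines a common interface to SAX and DOM parsers
org.w3c.dom
Defines all the components of the DOM
org.xml.sax
Defines the full SAX API
javax.xml.transform
Defines the XSLT API, which lets you transform XML into other forms, such as a page that can be displayed to users.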
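As a rough illustration of the javax.xml.transform package, the sketch below applies a small XSLT stylesheet to an OPML fragment and produces an HTML list. The class name XsltOutlineSketch, the stylesheet, and the shortened sample document with English labels are all assumptions made for this example; none of them come from the original listing.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class, used only for illustration.
class XsltOutlineSketch {

    // Shortened, made-up OPML sample with English placeholder labels.
    static final String SAMPLE_OPML =
        "<opml version='1.0'><body>"
      + "<outline text='News center'>"
      + "<outline text='Top news' xmlUrl='http://rss.sina.com.cn/news/marquee/ddt.xml'/>"
      + "<outline text='World news' xmlUrl='http://rss.sina.com.cn/news/world/focus15.xml'/>"
      + "</outline></body></opml>";

    // Illustrative stylesheet: one <li> per outline that carries an xmlUrl attribute.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='/'>"
      + "<ul><xsl:for-each select='//outline[@xmlUrl]'>"
      + "<li><xsl:value-of select='@text'/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String toHtmlList(String opml) throws Exception {
        // Compile the stylesheet into a Transformer, then run it over the input.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(opml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtmlList(SAMPLE_OPML));
    }
}
```

The exact whitespace of the generated HTML depends on the XSLT processor, so treat the output as markup to be rendered rather than as a fixed string.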
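SAX is an "event-driven" way of processing XML: the parser reads the document serially, from start to finish, and reports each piece to your code as it encounters it. Because it never holds the whole document in memory, it is well suited to server-side code and to other places where speed matters.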
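A minimal sketch of the event-driven style might look like this: the handler is called once for every start tag, and it collects the xmlUrl attribute of each outline element. The class name SaxOutlineSketch and the shortened sample document with English labels are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, used only for illustration.
class SaxOutlineSketch {

    // Shortened, made-up OPML sample with English placeholder labels.
    static final String SAMPLE_OPML =
        "<opml version='1.0'><body>"
      + "<outline text='News center'>"
      + "<outline text='Top news' xmlUrl='http://rss.sina.com.cn/news/marquee/ddt.xml'/>"
      + "<outline text='World news' xmlUrl='http://rss.sina.com.cn/news/world/focus15.xml'/>"
      + "</outline></body></opml>";

    static List<String> feedUrls(String xml) throws Exception {
        final List<String> urls = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Called by the parser each time it reads a start tag.
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("outline".equals(qName)) {
                    String xmlUrl = attrs.getValue("xmlUrl");
                    if (xmlUrl != null) {
                        urls.add(xmlUrl);
                    }
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return urls;
    }

    public static void main(String[] args) throws Exception {
        for (String url : feedUrls(SAMPLE_OPML)) {
            System.out.println(url);
        }
    }
}
```

Nothing is retained beyond what the handler chooses to store, so memory use stays flat no matter how large the input is.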
By comparison, DOM is simpler to use. It reads the entire XML document into memory and organizes the data as a tree, so you can then read, navigate, and modify any part of it.
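To illustrate what working on the in-memory tree means, the sketch below parses a small OPML fragment, appends a new outline element, and counts the outlines before and after. The class name DomEditSketch, its helper methods, and the shortened sample document with English labels are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class, used only for illustration.
class DomEditSketch {

    // Shortened, made-up OPML sample with an English placeholder label.
    static final String SAMPLE_OPML =
        "<opml version='1.0'><body>"
      + "<outline text='Top news' xmlUrl='http://rss.sina.com.cn/news/marquee/ddt.xml'/>"
      + "</body></opml>";

    // Reads the whole document into memory as a tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Adds a new <outline> under the first <body> element of the tree.
    static Element addOutline(Document doc, String text, String xmlUrl) {
        Element outline = doc.createElement("outline");
        outline.setAttribute("text", text);
        outline.setAttribute("xmlUrl", xmlUrl);
        Element body = (Element) doc.getElementsByTagName("body").item(0);
        body.appendChild(outline);
        return outline;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE_OPML);
        System.out.println("outlines before: " + doc.getElementsByTagName("outline").getLength());
        addOutline(doc, "World news", "http://rss.sina.com.cn/news/world/focus15.xml");
        System.out.println("outlines after:  " + doc.getElementsByTagName("outline").getLength());
    }
}
```

The trade-off against SAX is memory: the whole tree has to fit in memory, in exchange for random access and the ability to edit the document.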
<opml version="1.0">
  <head>
    <title>news-新浪rss</title>
  </head>
  <body>
    <outline title="新闻中心-新浪RSS" text="新闻中心-新浪RSS">
      <outline text="新闻要闻" title="新闻要闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/marquee/ddt.xml" htmlUrl="www.sina.com.cn" />
      <outline text="国内要闻" title="国内要闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/china/focus15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="国际要闻" title="国际要闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/world/focus15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="社会新闻" title="社会新闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/society/focus15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="时政要闻" title="时政要闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/china/politics15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="港澳台新闻" title="港澳台新闻" type="rss" xmlUrl="http://rss.sina.com.cn/news/china/hktaiwan15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="法制要闻" title="法制要闻" type="rss" xmlUrl="http://rss.sina.com.cn/legal/import.xml" htmlUrl="www.sina.com.cn" />
      <outline text="社会与法" title="社会与法" type="rss" xmlUrl="http://rss.sina.com.cn/news/society/law15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="社会万象" title="社会万象" type="rss" xmlUrl="http://rss.sina.com.cn/news/society/misc15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="真情时刻" title="真情时刻" type="rss" xmlUrl="http://rss.sina.com.cn/news/society/feeling15.xml" htmlUrl="www.sina.com.cn" />
      <outline text="奇闻轶事" title="奇闻轶事" type="rss" xmlUrl="http://rss.sina.com.cn/news/society/wonder15.xml" htmlUrl="www.sina.com.cn" />
    </outline>
  </body>
</opml>
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" title="XSL Formatting" href="/show_new_final.xsl" media="all"?>
<!-- SINA Corporation (NASDAQ: SINA) is a leading online media company and value-added information service (VAS) provider for China and for Chinese communities worldwide. With a branded network of localized websites targeting Greater China and overseas Chinese, SINA provides services through five major business lines including SINA.com (online news and content), SINA Mobile (mobile value-added services), SINA Online (community-based services and games), SINA.net (search and enterprise services) and SINA E-commerce (online shopping), offering Internet users and government and business clients an array of services including online media and entertainment, online fee-based VAS/wireless VAS, and e-commerce and enterprise e-solutions. With 230 million registered users worldwide, 450 million daily page views and over 60 million active users for a variety of fee-based services, SINA is the most recognized Internet brand name in China and among Chinese communities globally. In various surveys and polls, SINA has been recognized as the most valuable brand and the most popular website in China. For 2003 and 2005, SINA was ranked the "Most Preferred Website" in China according to the Chinese Academy of Social Sciences and considered "The Most Respected Chinese Company" for three consecutive years in 2003, 2004 and 2005 by the Economic Observer and the Management Case Study Center of Beijing University. At the same time, South China Weekend in both 2003 and 2004 honored SINA with the prestigious award of the "Chinese Language Medium of the Year."
****[pid:1,tid:60,did:3,fid:1086]****-->
<rss version="2.0">
  <channel>
    <title> <![CDATA[新闻要闻-新浪新闻]]> </title>
    <image>
      <title> <![CDATA[新闻中心]]> </title>
      <link>http://news.sina.com.cn</link>
      <url>http://www.sinaimg.cn/home/deco/2009/0330/logo_home_news.gif</url>
    </image>
    <description> <![CDATA[新闻中心-新闻要闻]]> </description>
    <link>http://roll.news.sina.com.cn/s/</link>
    <language>zh-cn</language>
    <generator>WWW.SINA.COM.CN</generator>
    <ttl>5</ttl>
    <copyright> <![CDATA[Copyright 1996 - 2011 SINA Inc. All Rights Reserved]]> </copyright>
    <pubDate>Tue, 3 May 2011 07:32:02 GMT</pubDate>
    <category> <![CDATA[]]> </category>
    <item>
      <title> <![CDATA[[财经]收评:沪基指收平 两市传统封基七成下跌(05/03 15:22)]]> </title>
      <link>http://go.rss.sina.com.cn/redirect.php?url=http://finance.sina.com.cn/money/fund/20110503/15229784932.shtml</link>
      <author>WWW.SINA.COM.CN</author>
      <guid>http://go.rss.sina.com.cn/redirect.php?url=http://finance.sina.com.cn/money/fund/20110503/15229784932.shtml</guid>
      <category> <![CDATA[]]> </category>
      <pubDate>Tue, 3 May 2011 07:22:45 GMT</pubDate>
      <comments></comments>
      <description> <![CDATA[ 新浪财经讯 周一沪深基指低开低走,午后回稳。沪基指报收于4668.33点,与上一交易日收平;深基指报收于5792.28%点,下跌0.18%;股指方面,沪综指报2932.19点,上涨0.71%;深成指报收12423.46点,上涨0.90%。 两市传统封基七成下跌。基金通乾下跌1.22%,唯一跌幅超1%;....]]> </description>
    </item>
  </channel>
</rss>
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
public class TestJAXPParse {

    public static void main(String[] args) throws Exception {
        // Optional JAXP configuration file: <java.home>/lib/jaxp.properties
        String configFile = System.getProperty("java.home") + File.separator
                + "lib" + File.separator + "jaxp.properties";
        File file = new File(configFile);
        System.out.println("Parser configuration file exists: " + file.exists());

        // System property that overrides the factory implementation (null if not set)
        String key = "javax.xml.parsers.DocumentBuilderFactory";
        System.out.println(key + "=" + System.getProperty(key));

        System.out.println("Creating the factory:");
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // On the JDK's built-in parser this is typically
        // com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
        System.out.println("Factory implementation: " + factory.getClass().getName());
        DocumentBuilder docBuilder = factory.newDocumentBuilder();

        // Parse a local copy of http://rss.sina.com.cn/sina_news_opml.xml from the classpath
        Document doc;
        try (InputStream in = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("sina_news_opml.xml")) {
            if (in == null) {
                throw new IllegalStateException("sina_news_opml.xml not found on the classpath");
            }
            doc = docBuilder.parse(new InputSource(in));
        }

        // Get the root element
        Element root = doc.getDocumentElement();
        System.out.println("Root element name: " + root.getNodeName());

        // Print the text and xmlUrl attributes of every <outline> element
        NodeList nodeList = doc.getElementsByTagName("outline");
        for (int i = 0; i < nodeList.getLength(); i++) {
            if (nodeList.item(i) instanceof Element) {
                Element outlineElement = (Element) nodeList.item(i);
                System.out.print("text:" + outlineElement.getAttribute("text"));
                System.out.println(" xmlUrl:" + outlineElement.getAttribute("xmlUrl"));
            }
        }

        // Parse a remote file
        parseChannel("http://rss.sina.com.cn/news/marquee/ddt.xml");
    }

    public static void parseChannel(String urlXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = factory.newDocumentBuilder();
        // Create the URL
        URL url = new URL(urlXml);
        Document doc;
        // Open the connection and get the input stream; it is closed once parsing is done
        try (InputStream is = url.openConnection().getInputStream()) {
            doc = docBuilder.parse(new InputSource(is));
        }

        // Print the text of every child element of each <item>, with whitespace removed
        NodeList nodeList = doc.getElementsByTagName("item");
        for (int i = 0; i < nodeList.getLength(); i++) {
            if (nodeList.item(i) instanceof Element) {
                Element itemElement = (Element) nodeList.item(i);
                NodeList itemNodeChilds = itemElement.getChildNodes();
                for (int j = 0; j < itemNodeChilds.getLength(); j++) {
                    if (itemNodeChilds.item(j) instanceof Element) {
                        Element itemChildElement = (Element) itemNodeChilds.item(j);
                        System.out.println(itemChildElement.getTextContent().replaceAll("\\s+", ""));
                    }
                }
            }
        }
    }
}