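Problem:
The DOM tree produced by NekoHTML has every element name in uppercase, even after the names/elems property is set.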
Analysis:
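// import org.cyberneko.html.parsers.DOMParser;
// import org.w3c.dom.Document;
DOMParser parser = new DOMParser();
// Ask NekoHTML to keep element names in the case used by the source document
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
// Parse the HTML page
parser.parse("http://www.baidu.com");
// Get the parsed DOM tree
Document document = parser.getDocument();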
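After setting this property, I found that it had no effect at all. To make matters worse, the NekoHTML website was unreachable, whether because it is blocked or for some other reason. Luckily I later found a mirror on GitHub and located the documentation there: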
Why are the DOM element names always uppercase?
The HTML DOM specification explicitly states that element and attribute names follow the semantics, including case-sensitivity, specified in the HTML 4 specification. In addition, section 1.2.1 of the HTML 4.01 specification states:
Element names are written in uppercase letters (e.g., BODY). Attribute names are written in lowercase letters (e.g., lang, onsubmit).
The Xerces HTML DOM implementation (used by default in the NekoHTML DOMParser class) follows this convention. Therefore, even if the "http://cyberneko.org/html/properties/names/elems" property is set to "lower", the DOM will still uppercase the element names.
To get around this problem, instantiate a Xerces2 DOMParser object using the NekoHTML parser configuration. By default, the Xerces DOM parser class creates a standard XML DOM tree, not an HTML DOM tree. Therefore, the element and attribute names will follow the settings for the "http://cyberneko.org/html/properties/names/elems" and "http://cyberneko.org/html/properties/names/attrs" properties. However, realize that the application will not be able to cast the document nodes to the HTML DOM interfaces for accessing the document's information.
The following sample code shows how to instantiate a DOM parser using the NekoHTML parser configuration:
// import org.apache.xerces.parsers.DOMParser;
// import org.cyberneko.html.HTMLConfiguration;
DOMParser parser = new DOMParser(new HTMLConfiguration());
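In short: to comply with the HTML 4.01 specification, the HTML DOM implementation that NekoHTML's own DOMParser uses by default converts tag names to uppercase, whether or not the property above is set. The fix is to use org.apache.xerces.parsers.DOMParser, constructed with NekoHTML's HTMLConfiguration, in place of the original DOMParser. The trade-off is that the nodes can no longer be cast to the HTML DOM interfaces. See the solution below for the actual code.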
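Before moving on, the point that a standard XML DOM tree leaves names alone can be checked without NekoHTML at all. The following is only a minimal sketch using the JDK's built-in DOM implementation, not the Xerces HTML DOM discussed in the FAQ; the class name XmlDomCaseDemo and the helper tagNameOf are hypothetical names made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: shows that a plain XML DOM keeps element names as given.
class XmlDomCaseDemo {

    // Creates an element with the given name in a fresh XML DOM document
    // and returns the tag name that the DOM reports back.
    static String tagNameOf(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element element = doc.createElement(name);
            return element.getTagName();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tagNameOf("div")); // prints "div"
        System.out.println(tagNameOf("DIV")); // prints "DIV"
    }
}
```

Because the plain XML DOM does no case normalization of its own, the case of the names is decided entirely by whatever builds the tree, which is why the NekoHTML names/elems setting takes effect once the Xerces DOMParser is used.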
Solution:
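// import org.cyberneko.html.HTMLConfiguration;
// import org.w3c.dom.Document;
// import org.xml.sax.InputSource;
// Configure NekoHTML to keep element names in the case used by the source document
HTMLConfiguration htmlConfiguration = new HTMLConfiguration();
htmlConfiguration.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
// Use the Xerces DOMParser (plain XML DOM) together with the NekoHTML configuration
org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser(htmlConfiguration);
InputSource inputSource = new InputSource("http://www.baidu.com");
parser.parse(inputSource);
// Print the property to confirm that the setting is in effect
System.out.println(parser.getXMLParserConfiguration().getProperty("http://cyberneko.org/html/properties/names/elems"));
// Get the parsed DOM tree
Document document = parser.getDocument();
// XMLDocument is a third-party helper; the original post does not say where it is imported from
String xml = new XMLDocument(document).toString();
System.out.println(xml);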
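The XMLDocument class used above is not part of NekoHTML or the JDK, and the post does not say which library provides it. If that helper is not on the classpath, the DOM tree can also be turned into a string with the JDK's own javax.xml.transform API. The following is a sketch under that assumption; DomSerializer, toXml and sampleDocument are hypothetical names, and the small hand-built document merely stands in for the tree returned by parser.getDocument().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes a DOM tree to a string using only the JDK.
class DomSerializer {

    // Runs an identity transform from the DOM tree into a StringWriter.
    static String toXml(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Force XML output so a root element named "html" does not switch the output method
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize DOM tree", e);
        }
    }

    // Builds a tiny stand-in document; in the post this would be parser.getDocument().
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

With this helper, the last two lines of the solution could be written as System.out.println(DomSerializer.toXml(document)); the lowercase element names should then be visible directly in the printed markup.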
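The relevant dependency for pom.xml is shown below (NekoHTML should pull in Xerces transitively, which is where org.apache.xerces.parsers.DOMParser comes from):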
<dependency>
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.22</version>
</dependency>