Fixing NekoHTML always producing uppercase tag names when parsing HTML into XML

Article link: https://blog.youkuaiyun.com/kydkong/article/details/78021838

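Summary: NekoHTML follows the HTML 4 convention, so element names in the DOM stay uppercase even when "http://cyberneko.org/html/properties/names/elems" is set to "lower". The fix is to create a Xerces2 DOMParser with the NekoHTML parser configuration; it builds a standard XML DOM tree whose element and attribute names follow the corresponding property settings. The sample code shows how to instantiate such a DOM parser.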


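Problem:

While parsing HTML with NekoHTML in Java, I found that NekoHTML always converts tag names to uppercase, which broke all the XPath expressions I had written earlier. I could run a script to convert the existing XPath expressions, but a new operations colleague who does not know about this quirk could still run into unnecessary trouble.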
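
To make the failure concrete, here is a minimal sketch that uses only JDK classes (no NekoHTML). XPath name tests are case-sensitive, so a lowercase expression selects nothing once the element names in the tree are uppercase. The class name `XPathCaseDemo`, the helper `countMatches`, and the sample markup are made up for illustration and do not come from the original post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCaseDemo {

    // Counts the nodes that the XPath expression selects in the given XML text.
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a tree whose element names were uppercased.
        String upper = "<HTML><BODY><DIV>a</DIV><DIV>b</DIV></BODY></HTML>";
        System.out.println(countMatches(upper, "//div")); // prints 0
        System.out.println(countMatches(upper, "//DIV")); // prints 2
    }
}
```

This is why every lowercase XPath written earlier would need rewriting unless the parser can be made to keep the original case.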

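Analysis:

After some searching online, I found that I had already written about this on my own blog and simply had not paid attention: NekoHTML provides configuration options that control its behavior precisely.

The setting relevant to this problem is: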

// import org.cyberneko.html.parsers.DOMParser;
DOMParser parser = new DOMParser();
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
// Parse the HTML document
parser.parse("http://www.baidu.com");
// Get the parsed DOM tree
Document document = parser.getDocument();

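After setting it, I found that it had no effect. To make matters worse, the NekoHTML website was unreachable, whether because it was blocked or for some other reason. Fortunately, I later found a mirror on GitHub and located the documentation.

The documentation says: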

Why are the DOM element names always uppercase?

The HTML DOM specification explicitly states that element and attribute names follow the semantics, including case-sensitivity, specified in the HTML 4 specification. In addition, section 1.2.1 of the HTML 4.01 specification states:

Element names are written in uppercase letters (e.g., BODY). Attribute names are written in lowercase letters (e.g., lang, onsubmit).

The Xerces HTML DOM implementation (used by default in the NekoHTML DOMParser class) follows this convention. Therefore, even if the "http://cyberneko.org/html/properties/names/elems" property is set to "lower", the DOM will still uppercase the element names.

To get around this problem, instantiate a Xerces2 DOMParser object using the NekoHTML parser configuration. By default, the Xerces DOM parser class creates a standard XML DOM tree, not an HTML DOM tree. Therefore, the element and attribute names will follow the settings for the "http://cyberneko.org/html/properties/names/elems" and "http://cyberneko.org/html/properties/names/attrs" properties. However, realize that the application will not be able to cast the document nodes to the HTML DOM interfaces for accessing the document's information.

The following sample code shows how to instantiate a DOM parser using the NekoHTML parser configuration:

// import org.apache.xerces.parsers.DOMParser;
// import org.cyberneko.html.HTMLConfiguration;

DOMParser parser = new DOMParser(new HTMLConfiguration());

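In short, to follow the HTML 4.01 convention, the default HTML DOM uppercases tag names whether or not the setting above is applied. The fix is to use org.apache.xerces.parsers.DOMParser, constructed with the NekoHTML HTMLConfiguration, in place of the original DOMParser. See the solution below for the actual code.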

Solution:

Here is the code:

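        // import org.cyberneko.html.HTMLConfiguration;
        // import org.w3c.dom.Document;
        // import org.xml.sax.InputSource;
        HTMLConfiguration htmlConfiguration = new HTMLConfiguration();
        // "match" keeps element names in the case used by the source document
        htmlConfiguration.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
        org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser(htmlConfiguration);
        InputSource inputSource = new InputSource("http://www.baidu.com");
        parser.parse(inputSource);
        // Print the property to confirm that the setting was applied
        System.out.println(parser.getXMLParserConfiguration().getProperty("http://cyberneko.org/html/properties/names/elems"));
        // Get the parsed DOM tree
        Document document = parser.getDocument();
        // XMLDocument is not part of the JDK or NekoHTML (likely com.jcabi.xml.XMLDocument; the post does not say)
        String xml = new XMLDocument(document).toString();
        System.out.println(xml);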
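
The `XMLDocument` class used above is not covered by the dependency listed in this post. If it is not on your classpath, the DOM tree can also be turned into a string with the JDK's own `javax.xml.transform` API. The following is a minimal sketch under that assumption; the class `DomToString`, its `serialize` helper, and the hand-built sample document are hypothetical stand-ins for the `Document` returned by `parser.getDocument()`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToString {

    // Serializes a DOM document to a string using the JDK identity transformer.
    static String serialize(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Request XML output explicitly so a root element named "html"
            // does not switch the serializer to HTML output rules.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    // Builds a tiny lowercase document that stands in for the parsed page.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.setTextContent("hi");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

In the solution above, passing `document` to `serialize` would take the place of the `new XMLDocument(document).toString()` call; printing the result is a quick way to check the case of the element names.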

The relevant dependency in pom.xml:

        <dependency>
            <groupId>net.sourceforge.nekohtml</groupId>
            <artifactId>nekohtml</artifactId>
            <version>1.9.22</version>
        </dependency>