由于html文件不是标准的xml文件
比如:
<OBJECT type="text/site properties">
<param name="Window Styles" value="0x800025">
<param name="comment" value="title:Online Help">
<param name="comment" value="base:index.htm">
</OBJECT>
<param>标签不是成对出现的,所以用dom4j是不能解析的
解决方法:
利用cobra.jar架包先把html文件解析成org.w3c.dom.Document类型的dom
然后利用dom4j的方法
DOMReader xmlReader = new DOMReader();
xmlReader.read(org.w3c.dom.Document document)
代码:
UserAgentContext uacontext = new SimpleUserAgentContext();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream in = new FileInputStream("F:/iKnow/chmimport/U2000FAQ_WebHome_123/U2000.hhc");
Reader reader = new InputStreamReader(in, "GBK");
org.w3c.dom.Document document = builder.newDocument();
// Here is where we use Cobra's HTML parser.
HtmlParser parser = new HtmlParser(uacontext, document);
parser.parse(reader);
DOMReader xmlReader = new DOMReader();
System.out.println(xmlReader.read(document).asXML());
还有一个jsoup
http://www.ibm.com/developerworks/cn/java/j-lo-jsouphtml/
比如:
<OBJECT type="text/site properties">
<param name="Window Styles" value="0x800025">
<param name="comment" value="title:Online Help">
<param name="comment" value="base:index.htm">
</OBJECT>
<param>标签不是成对出现的,所以用dom4j是不能解析的
解决方法:
利用cobra.jar架包先把html文件解析成org.w3c.dom.Document类型的dom
然后利用dom4j的方法
DOMReader xmlReader = new DOMReader();
xmlReader.read(org.w3c.dom.Document document)
代码:
UserAgentContext uacontext = new SimpleUserAgentContext();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream in = new FileInputStream("F:/iKnow/chmimport/U2000FAQ_WebHome_123/U2000.hhc");
Reader reader = new InputStreamReader(in, "GBK");
org.w3c.dom.Document document = builder.newDocument();
// Here is where we use Cobra's HTML parser.
HtmlParser parser = new HtmlParser(uacontext, document);
parser.parse(reader);
DOMReader xmlReader = new DOMReader();
System.out.println(xmlReader.read(document).asXML());
还有一个jsoup
http://www.ibm.com/developerworks/cn/java/j-lo-jsouphtml/