Notes on Using HtmlParser

License: CC 4.0 BY-SA

Scripts and comments need special handling when parsing HTML; adjusting two HTMLParser flags lets the parser cope with non-standard markup.

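Parsing HTML with htmlparser 1.6

After running a large number of HTML test pages through htmlparser, I ran into two problems.

Calling them problems is not quite fair, because htmlparser itself provides a way around each of them.

The two are really the same kind of problem: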

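Problem 1:

When parsing a <script> element, if the script text contains something like '<span></span>',

the parser treats the first '</' it reaches as the start of </script> and considers the script finished, so the rest of the script text is left out of the script node.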

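Problem 2:

When parsing comments, by default htmlparser only recognizes comments of the form '<!-- ... -->'.

If a comment is closed in another way, for example with '--->' or '--!>', it is never closed, and the HTML that follows is treated as part of the comment.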

The designers of htmlparser anticipated both problems and provided a way around each; it just took me a while to find them.

Solution 1:

org.htmlparser.scanners.ScriptScanner.STRICT = false;

Solution 2:

org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

Official documentation for org.htmlparser.scanners.ScriptScanner.STRICT:

/**
 * Strict parsing of CDATA flag.
 * If this flag is set true, the parsing of script is performed without
 * regard to quotes. This means that erroneous script such as:
 * <pre>
 * document.write("</script>");
 * </pre>
 * will be parsed in strict accordance with appendix
 * <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data">
 * B.3.2 Specifying non-HTML data</a> of the
 * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
 * hence will be split into two or more nodes. Correct javascript would
 * escape the ETAGO:
 * <pre>
 * document.write("<\/script>");
 * </pre>
 * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
 * whether it is quoted or not. If false, balanced quotes (either single or
 * double) will shield an ETAGO. Because of the possibility of quotes within
 * single or multiline comments, these are also parsed. In most cases,
 * users prefer non-strict handling since there is so much broken script
 * out in the wild.
 */
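To make the difference between the two modes concrete, here is a minimal, self-contained sketch of the idea described in the documentation above. It is not HTMLParser's actual code: the class ScriptEndSketch and the method findScriptEnd are hypothetical names invented for this illustration, and the sketch only handles quoted strings, whereas the documentation says the real scanner also parses script comments.

```java
/** Simplified illustration of strict vs. quote-aware script-end detection (not HTMLParser code). */
final class ScriptEndSketch {

    /**
     * Returns the index of the "</" that ends the script text, or -1 if there is none.
     * strict = true:  the first "</" wins, whether it is quoted or not.
     * strict = false: a "</" inside a single- or double-quoted string is skipped.
     */
    static int findScriptEnd(String text, boolean strict) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character inside the string
                } else if (c == quote) {
                    quote = 0; // closing quote reached
                }
                continue;
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c; // opening quote: start shielding "</"
                continue;
            }
            if (c == '<' && i + 1 < text.length() && text.charAt(i + 1) == '/') {
                return i; // ETAGO found outside any shielded region
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "document.write('<span></span>'); var done = 1;</script>";
        // Strict mode cuts the script short at the quoted "</span>".
        System.out.println(page.substring(0, findScriptEnd(page, true)));
        // Lax mode keeps going until the real "</script>".
        System.out.println(page.substring(0, findScriptEnd(page, false)));
    }
}
```

With the sample input, the strict call stops at the '</' of the quoted '</span>', which is the truncation described in Problem 1, while the lax call stops at the real '</script>'.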

Official documentation for org.htmlparser.lexer.Lexer.STRICT_REMARKS:

/**
 * Process remarks strictly flag.
 * If <code>true</code>, remarks are not terminated by ---&gt;
 * or --!&gt;, i.e. more than two dashes. If <code>false</code>,
 * a more lax (and closer to typical browser handling) remark parsing
 * is used.
 * Default <code>true</code>.
 */
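The effect of this flag can be illustrated with a small, self-contained sketch. It is a deliberate simplification and not the Lexer's real state machine: RemarkEndSketch and findRemarkEnd are hypothetical names, and the sketch assumes that strict mode accepts only a plain '-->' terminator while lax mode also accepts '--->' and '--!>'.

```java
/** Simplified illustration of strict vs. lax comment termination (not HTMLParser code). */
final class RemarkEndSketch {

    /**
     * Given text that starts with "<!--", returns the index just past the
     * comment terminator, or -1 if the comment is never closed.
     * strict = true:  only "-->" without an extra leading dash closes the comment.
     * strict = false: "--->" and "--!>" close it as well.
     */
    static int findRemarkEnd(String html, boolean strict) {
        final int bodyStart = 4; // length of "<!--"
        if (!html.startsWith("<!--")) {
            return -1;
        }
        for (int i = bodyStart; i < html.length(); i++) {
            if (html.startsWith("-->", i)) {
                // An extra dash right before "-->" means the author wrote "--->".
                boolean extraDash = i > bodyStart && html.charAt(i - 1) == '-';
                if (!strict || !extraDash) {
                    return i + 3;
                }
            } else if (!strict && html.startsWith("--!>", i)) {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<!-- note ---><p>text</p>";
        System.out.println(findRemarkEnd(html, true));  // strict: comment never closes
        System.out.println(findRemarkEnd(html, false)); // lax: comment closes before <p>
    }
}
```

In this sketch a strict scan of '<!-- note ---><p>text</p>' never finds a terminator, so the paragraph would be swallowed by the comment as in Problem 2, while the lax scan ends the comment right before '<p>'.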

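By default, htmlparser parses strictly according to the HTML standard, so non-standard markup can cause it to go wrong.

Once these two flags are set to false, parsing is no longer strict, and htmlparser tolerates the malformed scripts and comments described above.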