[size=medium]
org.htmlparser.util.EncodingChangeException: character mismatch (new: — [0x2014] != old: [0x2015―]) for encoding change from GB2312 to GBK at character offset 16022
at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at org.htmlparser.Parser.parse(Parser.java:701)
at crawl.ParserLink.getLink(ParserLink.java:36)
at main.Test.main(Test.java:93)
Solution:
Modify org.htmlparser.tags.MetaTag.java in htmlparser.jar so that doSemanticAction() reads as follows:
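public void doSemanticAction() throws ParserException {
    String httpEquiv;
    String charset;

    httpEquiv = getHttpEquiv();
    if ("Content-Type".equalsIgnoreCase(httpEquiv)) {
        // Only apply the charset from the META tag while the page is still
        // on the default charset; an encoding that was already set is kept,
        // so no GB2312 -> GBK switch happens in the middle of the stream.
        // Strings are compared by value, not by reference.
        if (Page.DEFAULT_CHARSET.equalsIgnoreCase(getPage().getEncoding())) {
            charset = getPage().getCharset(getAttribute("CONTENT"));
            getPage().setEncoding(charset);
        }
    }
}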
[/size]Alternatively:
1. Fetch the page with HttpClient and get the HTML back as a string.
2. Pass that string directly to parser.setInputHTML(the HTML returned by HttpClient);
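
The decoding half of this alternative can be sketched with the JDK alone. This is only an illustration under stated assumptions: HtmlDecoder, resolveCharset and decode are hypothetical names, not part of htmlparser or HttpClient, and the HTTP fetch itself is left out. The sketch assumes the response body is available as raw bytes and that a page declared as GB2312 can be decoded as GBK, which is generally treated as a superset of GB2312. Because the text is decoded once before parsing, the parser receives a finished string and should not need to change encodings partway through.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of htmlparser or HttpClient): decodes a raw
// HTTP response body exactly once, before the HTML is handed to the parser.
final class HtmlDecoder {
    private HtmlDecoder() {}

    // Chooses the charset used for decoding.
    // Assumption: pages declared as GB2312 are decoded as GBK.
    static Charset resolveCharset(String declared) {
        if (declared == null || declared.trim().isEmpty()) {
            return StandardCharsets.UTF_8; // assumed fallback for this sketch
        }
        String name = declared.trim();
        if (name.equalsIgnoreCase("GB2312")) {
            name = "GBK";
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    // Turns the response bytes into the string that is later parsed.
    static String decode(byte[] body, String declaredCharset) {
        return new String(body, resolveCharset(declaredCharset));
    }
}
```

The resulting string would then go to parser.setInputHTML(html), as in step 2 above. Where the declared charset comes from (HTTP header or META tag) is left open here.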