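When htmlParser parses a site whose declared character set is GB2312, Traditional Chinese characters can come out garbled. The cause is that GB2312 does not cover Traditional Chinese characters. Decoding the page as GBK, a superset of GB2312 that does cover them, fixes the problem.
Reading the source, I found that calling parser.setEncoding(encoding); to change the encoding has no effect here. In Page.java, however, I found the method that actually decides which encoding is used: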
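As a quick check of the coverage claim above, here is a minimal sketch that asks the JDK whether each charset can encode a Simplified and a Traditional character. The class name CharsetCoverage and the two sample characters are chosen only for illustration, and the sketch assumes a JDK that ships the GB2312 and GBK charsets (standard full JDK builds do).

```java
import java.nio.charset.Charset;

final class CharsetCoverage {
    /** Returns true if the named charset has an encoding for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char simplified = '\u4F53';  // Simplified form of "body"
        char traditional = '\u9AD4'; // Traditional form of the same character
        for (String name : new String[] {"GB2312", "GBK"}) {
            System.out.println(name
                + ": simplified=" + canEncode(name, simplified)
                + ", traditional=" + canEncode(name, traditional));
        }
    }
}
```

The characters are written as Unicode escapes so the result does not depend on the encoding of the source file.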
/**
* Get a CharacterSet name corresponding to a charset parameter.
* @param content A text line of the form:
* <pre>
* text/html; charset=Shift_JIS
* </pre>
* which is applicable both to the HTTP header field Content-Type and
* the meta tag http-equiv="Content-Type".
* Note this method also handles non-compliant quoted charset directives
* such as:
* <pre>
* text/html; charset="UTF-8"
* </pre>
* and
* <pre>
* text/html; charset='UTF-8'
* </pre>
* @return The character set name to use when reading the input stream.
* For JDKs that have the Charset class this is qualified by passing
* the name to findCharset() to render it into canonical form.
* If the charset parameter is not found in the given string, the default
* character set is returned.
* @see #findCharset
* @see #DEFAULT_CHARSET
*/
public String getCharset (String content)
{
    final String CHARSET_STRING = "charset";
    int index;
    String ret;
    if (null == mSource)
        ret = DEFAULT_CHARSET;
    else
        // use existing (possibly supplied) character set:
        // bug #1322686 when illegal charset specified
        ret = mSource.getEncoding ();
    if (null != content)
    {
        index = content.indexOf (CHARSET_STRING);
        if (index != -1)
        {
            content = content.substring (index +
                CHARSET_STRING.length ()).trim ();
            if (content.startsWith ("="))
            {
                content = content.substring (1).trim ();
                index = content.indexOf (";");
                if (index != -1)
                    content = content.substring (0, index);
                //remove any double quotes from around charset string
                if (content.startsWith ("\"") && content.endsWith ("\"")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);
                //remove any single quote from around charset string
                if (content.startsWith ("'") && content.endsWith ("'")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);
                ret = findCharset (content, ret);
                // Charset names are not case-sensitive;
                // that is, case is always ignored when comparing
                // charset names.
                // if (!ret.equalsIgnoreCase (content))
                // {
                //     System.out.println (
                //         "detected charset \""
                //         + content
                //         + "\", using \""
                //         + ret
                //         + "\"");
                // }
            }
        }
    }
    return (ret);
}
This method sets the encoding to whatever charset the page itself declares.
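So all we need to do is add the following at the end of the method, just before return (ret);
// charset names are not case-sensitive, so compare ignoring case
if ("GB2312".equalsIgnoreCase (ret)) {
    ret = "GBK";
}
and the garbled Traditional characters are gone.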
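The same substitution can be tried outside the library. Below is a minimal standalone sketch, not part of htmlParser: the class CharsetWidening and its methods widen and decode are hypothetical names introduced here. It assumes the declared charset name is available as a plain string and that the JDK provides GBK.

```java
import java.nio.charset.Charset;

final class CharsetWidening {
    /** Maps a declared GB2312 charset to its superset GBK; other names pass through unchanged. */
    static String widen(String charsetName) {
        // Charset names are not case-sensitive, so "gb2312" is matched as well.
        if ("GB2312".equalsIgnoreCase(charsetName)) {
            return "GBK";
        }
        return charsetName;
    }

    /** Decodes raw bytes using the declared charset after widening it. */
    static String decode(byte[] raw, String declaredCharset) {
        return new String(raw, Charset.forName(widen(declaredCharset)));
    }

    public static void main(String[] args) {
        // Bytes of a Traditional character as a GBK-encoded page would send them.
        byte[] raw = "\u9AD4".getBytes(Charset.forName("GBK"));
        // The page claims gb2312, but the bytes are decoded as GBK.
        System.out.println(decode(raw, "gb2312").equals("\u9AD4"));
    }
}
```

Because GBK is a superset of GB2312, bytes that really are GB2312 decode to the same text after the substitution, so widening the charset does not affect pages that contain only Simplified characters.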