htmlParser: Garbled Traditional Chinese Characters

CCStory

Category: JAVA. Tags: htmlparse, garbled text

Article link: https://blog.youkuaiyun.com/CCStory/article/details/8687648

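When htmlParser parses sites whose declared character set is GB2312, Traditional Chinese characters come out garbled. The cause is that GB2312 does not include Traditional Chinese characters. Decoding the page as GBK, a superset of GB2312 that does include them, fixes the problem.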
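
The coverage difference between the two charsets can be checked directly with the JDK. The following is a minimal sketch, not part of htmlParser: the class name `TraditionalCoverage` and the helper `canEncode` are made up for illustration, and it assumes the JDK in use ships the `GB2312` and `GBK` charsets (standard full JDK builds do).

```java
import java.nio.charset.Charset;

public class TraditionalCoverage {
    /** Returns true if the named charset has an encoding for the character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char simplified = '\u4F53';  // 体, Simplified form
        char traditional = '\u9AD4'; // 體, Traditional form of the same character

        // Ask each charset whether it can represent each character.
        System.out.println("GB2312 / simplified:  " + canEncode("GB2312", simplified));
        System.out.println("GB2312 / traditional: " + canEncode("GB2312", traditional));
        System.out.println("GBK    / traditional: " + canEncode("GBK", traditional));
    }
}
```

If only the GB2312 check for the Traditional character reports `false`, the page bytes themselves are fine and only the charset used for decoding needs to change.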

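Reading the source, I found that calling parser.setEncoding(encoding); has no effect on this problem. In Page.java, however, I found the method that decides which encoding is used, and this is where the change has to be made: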

/**
 * Get a CharacterSet name corresponding to a charset parameter.
 * @param content A text line of the form:
 * <pre>
 * text/html; charset=Shift_JIS
 * </pre>
 * which is applicable both to the HTTP header field Content-Type and
 * the meta tag http-equiv="Content-Type".
 * Note this method also handles non-compliant quoted charset directives
 * such as:
 * <pre>
 * text/html; charset="UTF-8"
 * </pre>
 * and
 * <pre>
 * text/html; charset='UTF-8'
 * </pre>
 * @return The character set name to use when reading the input stream.
 * For JDKs that have the Charset class this is qualified by passing
 * the name to findCharset() to render it into canonical form.
 * If the charset parameter is not found in the given string, the default
 * character set is returned.
 * @see #findCharset
 * @see #DEFAULT_CHARSET
 */
public String getCharset (String content)
{
    final String CHARSET_STRING = "charset";
    int index;
    String ret;

    if (null == mSource)
        ret = DEFAULT_CHARSET;
    else
        // use existing (possibly supplied) character set:
        // bug #1322686 when illegal charset specified
        ret = mSource.getEncoding ();
    if (null != content)
    {
        index = content.indexOf (CHARSET_STRING);

        if (index != -1)
        {
            content = content.substring (index +
                CHARSET_STRING.length ()).trim ();
            if (content.startsWith ("="))
            {
                content = content.substring (1).trim ();
                index = content.indexOf (";");
                if (index != -1)
                    content = content.substring (0, index);

                //remove any double quotes from around charset string
                if (content.startsWith ("\"") && content.endsWith ("\"")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                //remove any single quote from around charset string
                if (content.startsWith ("'") && content.endsWith ("'")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                ret = findCharset (content, ret);

                // Charset names are not case-sensitive;
                // that is, case is always ignored when comparing
                // charset names.
                // if (!ret.equalsIgnoreCase (content))
                // {
                //     System.out.println (
                //         "detected charset \""
                //         + content
                //         + "\", using \""
                //         + ret
                //         + "\"");
                // }
            }
        }
    }
    return (ret);
}
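
One way to apply the GBK substitution is to widen the charset name before it is used for decoding. The sketch below is standalone and is not the actual patch to Page.java: `CharsetWidening`, `widen`, and `decode` are hypothetical names. In getCharset above, a helper like `widen` could plausibly be applied to `content` just before the `findCharset (content, ret)` call, but that placement is an assumption.

```java
import java.nio.charset.Charset;

public class CharsetWidening {
    /** Hypothetical helper: substitute GBK whenever GB2312 is declared. */
    static String widen(String charsetName) {
        if (charsetName != null && charsetName.trim().equalsIgnoreCase("GB2312")) {
            return "GBK"; // GBK is a superset of GB2312, so valid GB2312 bytes still decode
        }
        return charsetName; // leave every other charset name untouched
    }

    /** Decode raw page bytes with the widened version of the declared charset. */
    static String decode(byte[] raw, String declaredCharset) {
        return new String(raw, Charset.forName(widen(declaredCharset)));
    }

    public static void main(String[] args) {
        String text = "\u7E41\u9AD4\u5B57"; // 繁體字
        // Simulate a page that sends GBK bytes but declares charset=gb2312.
        byte[] raw = text.getBytes(Charset.forName("GBK"));
        System.out.println(decode(raw, "gb2312").equals(text));
    }
}
```

Because the substitution only touches the name `GB2312`, pages that declare UTF-8, Shift_JIS, or any other charset are decoded exactly as before.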