For some web pages, calling getHeaderFields().get("Content-Type") on an HttpURLConnection correctly yields the page's character encoding, but for others it does not. Inspecting the headers the server actually returns reveals where the problem lies.
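As a minimal sketch of how the dumps below can be reproduced (the class name HeaderDump is ours, not from the original post): getHeaderFields() maps each header name to its list of values, with the HTTP status line stored under the key null, which is why the first line of each dump reads "null : [...]".

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    public static void main(String[] args) throws Exception {
        URL url = new URL(args.length > 0 ? args[0]
                : "http://it.sohu.com/20090711/n265142337.shtml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.connect();
        // getHeaderFields() returns an immutable Map<String, List<String>>;
        // the status line appears under the null key.
        Map<String, List<String>> headers = conn.getHeaderFields();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
        conn.disconnect();
    }
}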
Response headers for http://it.sohu.com/20090711/n265142337.shtml:
null : [HTTP/1.0 200 OK]
Date : [Sun, 12 Jul 2009 14:42:20 GMT]
Vary : [Accept-Encoding]
Expires : [Sun, 12 Jul 2009 14:44:20 GMT]
Last-Modified : [Sun, 12 Jul 2009 14:40:10 GMT]
Via : [1.0 33107306.44707239.41988739.sohu.com:80 (squid)]
Content-Type : [text/html]
Connection : [close]
Server : [Apache/1.3.39 (Unix) mod_gzip/1.3.26.1a]
X-Cache : [MISS from 33107306.44707239.41988739.sohu.com]
Cache-Control : [max-age=120]
The crucial line is the Content-Type header: the server's response does not include a character encoding, so the method above has no way to obtain one.
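A small helper makes this concrete (the method name charsetFromContentType is ours): when the Content-Type value carries no charset parameter, there is simply nothing to extract, and the lookup comes back empty.

import java.net.HttpURLConnection;
import java.net.URL;

public class CharsetFromHeader {
    // Returns the charset parameter of a Content-Type value such as
    // "text/html; charset=GB2312", or null when none is present,
    // which is exactly what happens with the sohu.com response above.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length()).trim();
            }
        }
        return null; // only a MIME type, e.g. "text/html"
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://it.sohu.com/20090711/n265142337.shtml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // getContentType() returns the raw Content-Type header value.
        System.out.println(charsetFromContentType(conn.getContentType())); // null here
        conn.disconnect();
    }
}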
For another page, however, the character encoding is present, as shown below:
Response headers for http://news.xinhuanet.com/world/2009-04/27/content_11267743.htm:
null : [HTTP/1.0 200 OK]
ETag : ["47f8ae1-4909-fb3f2080"]
Vary : [Accept-Encoding]
Content-Length : [4973]
Last-Modified : [Mon, 27 Apr 2009 09:29:22 GMT]
Connection : [keep-alive]
Powered-By-ChinaCache : [CNC-LY-7-3BB HIT, CNC-QD-K-33U HIT]
X-Cache : [HIT from news48.xinhuanet.com]
Server : [Apache]
Date : [Sat, 27 Jun 2009 07:22:49 GMT]
Content-Encoding : [gzip]
Via : [1.0 news48.xinhuanet.com:80 (squid/2.6.STABLE16)]
Content-Type : [text/html; charset=GB2312]
Accept-Ranges : [bytes]
In summary, the header-based approach is not guaranteed to yield a page's encoding; the most reliable method is to analyze the page content itself.
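One sketch of that content-based fallback (the class name MetaCharsetSniffer is ours, and the regex is a simplification): read the first few kilobytes of the page and look for the charset declared in its own meta tag.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Matches declarations like
    // <meta http-equiv="Content-Type" content="text/html; charset=GB2312">
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    public static String sniff(URL url) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InputStream in = url.openStream();
        try {
            byte[] chunk = new byte[4096];
            int n;
            // The <meta> declaration normally sits inside <head>, so the
            // first few kilobytes of the page are enough to find it.
            while ((n = in.read(chunk)) != -1 && buf.size() < 16 * 1024) {
                buf.write(chunk, 0, n);
            }
        } finally {
            in.close();
        }
        // ISO-8859-1 maps every byte to a character, so the ASCII-compatible
        // charset declaration survives decoding regardless of the page's
        // real encoding.
        String head = new String(buf.toByteArray(), "ISO-8859-1");
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://it.sohu.com/20090711/n265142337.shtml");
        System.out.println(sniff(url)); // e.g. GB2312, or null if undeclared
    }
}

Combining the two: try the Content-Type header first, and fall back to sniffing the page body only when the header omits the charset.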