Crawling baidu works fine, but after switching to QQ the saved page comes out garbled.
import urllib.request
url="http://www.qq.com"
req=urllib.request.Request(url)
req.add_header("User-Agent",'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')
data=urllib.request.urlopen(req).read()
fhandle=open("D:/DB/4.html",'wb')
fhandle.write(data)
fhandle.close()
Removing the User-Agent header makes no difference; the output is still garbled.
http://blog.youkuaiyun.com/apple9005/article/details/52831318
Following the key points found online:
Install: pip install chardet
Official site: http://chardet.readthedocs.io/en/latest/usage.html
import chardet
# Guess the encoding of the raw bytes fetched above
mychar=chardet.detect(data)
print(mychar)
bm=mychar['encoding']
print(bm)
# Re-encode everything as UTF-8 before saving
if bm == 'utf-8' or bm == 'UTF-8':
    data2=data.decode('utf-8','ignore').encode('utf-8')
else:
    data2=data.decode('GB2312','ignore').encode('utf-8')
fhandle=open("D:/DB/3.html",'wb')
fhandle.write(data2)
fhandle.close()
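@_@ Pages that were not garbled are still fine without it, but adding chardet actually makes them garbled.
The pages that were garbled are still not fixed.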
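One possible explanation for the chardet version turning a good page into a garbled one (a guess, not something verified here): the bytes are converted to UTF-8, but the HTML still carries its original charset declaration such as charset=gb2312, so the browser decodes the saved file with the wrong encoding. Below is a minimal sketch that transcodes the page and also rewrites the first charset declaration. The name transcode_html and its parameters are hypothetical, and the regex only handles simple declarations.

```python
import re

# Matches charset=gb2312, charset="gb2312" or charset='gb2312' (simple cases only)
_CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([\w-]+)', re.IGNORECASE)


def transcode_html(data: bytes, src_encoding: str, dst_encoding: str = "utf-8") -> bytes:
    """Decode data from src_encoding, point the first charset declaration
    at dst_encoding, and return the page re-encoded as dst_encoding."""
    text = data.decode(src_encoding, errors="replace")
    # Rewrite only the first declaration so the saved file and its
    # declared encoding agree
    text = _CHARSET_RE.sub(lambda m: m.group(1) + dst_encoding, text, count=1)
    return text.encode(dst_encoding)
```

With the variables from the code above, this could be called as data2 = transcode_html(data, bm) when bm is not None, so the detected encoding is used instead of a hard-coded GB2312.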
# Second attempt: keep GB2312 pages in GB2312 instead of converting them
if bm == 'gb2312' or bm == 'GB2312':
    data2=data.decode('GB2312','ignore').encode('GB2312')
else:
    data2=data.decode('utf-8','ignore').encode('utf-8')
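Since neither re-encoding attempt fixes the QQ page, one thing worth checking is whether the server returned a gzip-compressed body. This is an assumption; nothing above confirms it. urllib.request does not decompress responses on its own, so compressed bytes written straight to a file look like garbage in any encoding. A minimal sketch, with maybe_gunzip as a hypothetical helper name:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def maybe_gunzip(data: bytes) -> bytes:
    """Return the decompressed body if data looks gzip-compressed,
    otherwise return data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

It would be applied right after the download, for example data = maybe_gunzip(urllib.request.urlopen(req).read()), before the bytes are passed to chardet.detect or written to disk.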