On Unicode encoding conversion in Python

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/yuyezhulan/p/4010988.html

Tags:

#python

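This article describes the encoding problem that comes up when fetching web pages containing Chinese text with Python's urllib2 module, and a way to solve it: detect the page's actual encoding, then parse the page with BeautifulSoup so that the content is decoded correctly.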

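Pages fetched with urllib2.urlopen in Python can come out garbled when they contain Chinese text, because read() returns raw bytes in whatever encoding the server used. The fix is as follows: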

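import urllib2
import chardet

# `line` is assumed to hold the URL to fetch (for example, one line of a URL list)
response = urllib2.urlopen(line)
html = response.read()  # raw bytes; the encoding is not known yet

# Let chardet guess the encoding from the bytes themselves
urlcodestyle = chardet.detect(html)
# chardet may report None when it cannot decide; fall back to gb18030 in that case
encoding = urlcodestyle['encoding'] or 'gb18030'
# Decode to unicode with the detected encoding, then re-encode as UTF-8
sourcehtml = html.decode(encoding).encode('utf-8')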
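When chardet is not installed or returns no guess, one simple alternative is to try a short list of candidate encodings in order. The sketch below is not what the original code does: it swaps chardet's statistical detection for plain trial decoding, it is written for Python 3 rather than the Python 2 of the original, and the function name `decode_with_fallback` and the candidate list are assumptions made for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Order matters: strict UTF-8 rejects most non-UTF-8 input, while
    gb18030 accepts many byte sequences, so it is tried last.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly: decode anyway, replacing bad bytes.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


# Hypothetical page body, encoded the way a GB-encoded site would send it
raw_gb = "中文".encode("gb18030")
text, used = decode_with_fallback(raw_gb)
sourcehtml = text.encode("utf-8")  # same end result as in the original snippet
```

Trial decoding only proves that the bytes are valid in an encoding, not that the encoding is the right one, so it works best with a short candidate list chosen for the sites being fetched.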

2. How to use sourcehtml:

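from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# chardet always returns an 'encoding' key, but its value may be None
if urlcodestyle['encoding']:
    soup = BeautifulSoup(html, fromEncoding=urlcodestyle['encoding'])
else:
    # Fall back to gb18030, which is a superset of GBK and GB2312
    soup = BeautifulSoup(html, fromEncoding="gb18030")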

Ideally, obtain the encoding declared by the requested page itself, and then pass that value to fromEncoding.
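One way to obtain the declared encoding is to look at the Content-Type response header first and at a `<meta>` charset declaration in the HTML second. The sketch below is written for Python 3 and works on plain values so it runs without a network connection; the helper names (`charset_from_header`, `charset_from_meta`, `declared_encoding`) and the gb18030 default are assumptions for illustration, not part of the original post. In Python 3 the header value could come from `response.headers.get("Content-Type")`; with the Python 2 urllib2 used above, `response.info().getparam('charset')` is commonly used for the same purpose.

```python
import re
from email.message import Message

# Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk">
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(raw):
    """Look for a <meta> charset declaration near the top of the raw HTML bytes."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def declared_encoding(content_type, raw, default="gb18030"):
    """Prefer the HTTP header, then the <meta> tag, then a default."""
    return charset_from_header(content_type) or charset_from_meta(raw) or default
```

The value returned by `declared_encoding` can then be passed as fromEncoding, with chardet kept as a fallback for pages that declare nothing or declare the wrong encoding.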

Reposted from: https://www.cnblogs.com/yuyezhulan/p/4010988.html