Python 2.6/2.7: Web Page Encoding Problems with Requests

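This post looks at the garbled text (mojibake) you can get when scraping Chinese web pages with the Python Requests library, and offers a fix. By reading the Requests source, it shows how to identify a page's encoding correctly.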

Requests is an HTTP library released under the Apache2 license. It is written in Python and aims to be friendlier and easier to use.

Requests

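Requests is built on urllib3 and therefore inherits all of its features. Requests supports HTTP keep-alive and connection pooling, sessions with persistent cookies, file uploads, automatic detection of the response content encoding, and automatic encoding of international URLs and POST data. Modern, international, and human-friendly.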

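While using Requests recently I ran into a problem: when fetching certain Chinese web pages the text came out garbled, and printing encoding showed ISO-8859-1. Why? Reading the source, I found that the default encoding detection is quite simple. It reads the encoding straight from the Content-Type response header. If a charset parameter is present, the encoding is identified correctly. If there is no charset but the content type contains text, the encoding is assumed to be ISO-8859-1. See utils.py: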

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
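
To make the three outcomes of this rule easy to try without a network connection, here is a minimal standalone sketch. The function name encoding_from_content_type and the hand-rolled parameter parsing are my own assumptions for illustration; this is not the Requests source, which relies on cgi.parse_header as shown above.

```python
def encoding_from_content_type(content_type):
    """Mimic the header-only rule described above with a simplified parser."""
    if not content_type:
        return None
    # Split "text/html; charset=gb2312" into the media type and its parameters.
    parts = [p.strip() for p in content_type.split(';')]
    media_type, params = parts[0].lower(), parts[1:]
    for param in params:
        key, _, value = param.partition('=')
        if key.strip().lower() == 'charset' and value:
            return value.strip().strip('\'"')
    # No charset parameter: any text/* type falls back to ISO-8859-1.
    if 'text' in media_type:
        return 'ISO-8859-1'
    return None


print(encoding_from_content_type('text/html; charset=gb2312'))  # gb2312
print(encoding_from_content_type('text/html'))                  # ISO-8859-1
print(encoding_from_content_type('application/octet-stream'))   # None
```

The second call is the case that bites Chinese pages: the server sends a bare text/html, so the header alone says nothing useful about the real encoding.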

Requests does in fact provide a way to get the encoding from the content itself; it is just not used by default. See utils.py:

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
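
The sketch below applies the same three regular expressions to a few hand-written snippets, to show which declarations they pick up. The sample markup strings and the helper name declared_encodings are assumptions made up for this illustration. Note that on Python 3 these patterns are text patterns, so they must be applied to str rather than bytes.

```python
import re

# The same three patterns as in the utils.py excerpt above.
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')


def declared_encodings(markup):
    """Return every encoding name declared inside the markup itself."""
    return (CHARSET_RE.findall(markup) +
            PRAGMA_RE.findall(markup) +
            XML_RE.findall(markup))


html5 = '<html><head><meta charset="gb2312"></head></html>'
html4 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
xml = '<?xml version="1.0" encoding="GBK"?><root/>'

print(declared_encodings(html5))           # ['gb2312']
print(declared_encodings(html4))           # ['utf-8']
print(declared_encodings(xml))             # ['GBK']
print(declared_encodings('<p>hello</p>'))  # []
```

An empty list means the page declares nothing, which is why a second fallback is still needed.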

It also provides encoding detection based on chardet. See models.py:

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']

So how do we fix this? First, let's look at some examples:

>>> r = requests.get('http://cn.python-requests.org/en/latest/')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> requests.utils.get_encodings_from_content(r.content)
['utf-8']

>>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'gb2312'
>>> requests.utils.get_encodings_from_content(r.content)
['gb2312']

The rest of this post analyzes the problem and gives a solution:

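Reading the Requests source shows that text returns the body decoded to Unicode, while content returns the raw bytes. In other words, r.content is cheaper than r.text: content simply hands back the bytes, whereas text has to decode them to Unicode. Note that text is a property, not a method, and it decodes with r.encoding. It only asks chardet to work out the character set (via apparent_encoding) when r.encoding is None, that is, when no encoding could be derived from the headers at all. For a text/* response without a charset, r.encoding has already been set to ISO-8859-1, so chardet is never consulted.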
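
The relationship between content, encoding, and text can be modeled with a toy class. This is a simplified sketch under my own assumptions, not the real Requests Response class: the name SketchResponse is hypothetical, and where Requests would fall back to chardet the sketch simply falls back to UTF-8.

```python
class SketchResponse(object):
    """Toy stand-in for a response object; not the real Requests class."""

    def __init__(self, content, encoding=None):
        self.content = content    # raw bytes, exactly as received
        self.encoding = encoding  # encoding derived from the headers, or None

    @property
    def text(self):
        # Assumption: UTF-8 stands in for the chardet fallback used by Requests.
        encoding = self.encoding or 'utf-8'
        return self.content.decode(encoding, 'replace')


body = u'\u4e2d\u6587'.encode('gb2312')  # the bytes of a GB2312 page body

resp = SketchResponse(body, encoding='ISO-8859-1')
wrong = resp.text   # decoded with the header default: garbled

resp.encoding = 'gb2312'
right = resp.text   # decoded again with the correct encoding

print(wrong == right)             # False
print(right == u'\u4e2d\u6587')   # True
```

Because text is computed from content on access, changing encoding before reading text is enough; the raw bytes never need to be fetched again.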

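Chapter 16 (Internationalization) of "HTTP: The Definitive Guide" mentions that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as 'ISO-8859-1'. That is fine for English pages, but Chinese pages end up garbled!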
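
The garbling is easy to reproduce offline. The following sketch encodes two Chinese characters as GB2312, decodes them with the ISO-8859-1 default, and then reverses the mistake. The sample string is my own choice for illustration.

```python
original = u'\u4e2d\u6587'          # the two characters for "Chinese"
raw = original.encode('gb2312')     # what the server actually sends: 4 bytes

# ISO-8859-1 maps every byte to a character, so decoding never fails;
# it silently produces one wrong character per byte instead.
garbled = raw.decode('iso-8859-1')

# Because no byte was lost, the mistake can be undone by reversing it.
repaired = garbled.encode('iso-8859-1').decode('gb2312')

print(len(original), len(garbled))  # 2 4
print(repaired == original)         # True
```

The lack of an exception is what makes this default dangerous: nothing signals that the decoded text is wrong.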

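If you already know the site's character encoding before you access text, you can set r.encoding = 'xxx'. Once you specify an encoding, Requests uses it to decode the body when you read text. Alternatively, apparent_encoding gives the encoding detected from the content. It is worked out by analyzing the bytes, so it is slower. A third option is to extract the encoding from the HTML meta tag, for example: requests.utils.get_encodings_from_content(response.text)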

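import requests

# r is a Response returned by requests.get(...), as in the sessions above.
if r.encoding == 'ISO-8859-1':
    # Prefer the charset declared in the page itself (Python 2: r.content is a str).
    encodings = requests.utils.get_encodings_from_content(r.content)
    if encodings:
        r.encoding = encodings[0]
    else:
        r.encoding = r.apparent_encoding  # fall back to chardet's guess
    # Optional: normalize the cached body to UTF-8. This touches the private
    # _content attribute, so r.encoding must be updated to match.
    r._content = r.content.decode(r.encoding, 'replace').encode('utf8', 'replace')
    r.encoding = 'utf8'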
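
The same decision order can be written as a pure function over raw bytes, which is easier to test than patching a response object. This is only a sketch under stated assumptions: decode_body and META_CHARSET_RE are hypothetical names, the meta pattern is a simplified one of my own rather than the Requests regexes, only the first 4096 bytes are scanned, and UTF-8 is used as the last resort where the snippet above uses apparent_encoding.

```python
import re

# Simplified pattern (an assumption, not the Requests regexes): grab the
# charset name from the first <meta ... charset=...> tag in the raw bytes.
META_CHARSET_RE = re.compile(br'<meta[^>]*?charset=["\']*([\w-]+)', re.I)


def decode_body(raw, header_encoding=None):
    """Decode raw bytes, distrusting a missing or ISO-8859-1 header encoding."""
    encoding = header_encoding
    if encoding is None or encoding.upper() == 'ISO-8859-1':
        match = META_CHARSET_RE.search(raw[:4096])
        # Assumption: UTF-8 is the last resort here instead of chardet.
        encoding = match.group(1).decode('ascii') if match else 'utf-8'
    try:
        return raw.decode(encoding, 'replace')
    except LookupError:
        # The page declared an encoding name that Python does not know.
        return raw.decode('utf-8', 'replace')


page = (b'<html><head><meta charset="gb2312"></head><body>' +
        u'\u4e2d\u6587'.encode('gb2312') +
        b'</body></html>')

text = decode_body(page, 'ISO-8859-1')
print(u'\u4e2d\u6587' in text)  # True
```

With a real response, the call would look like decode_body(r.content, r.encoding). An explicit charset in the header is still trusted; only the ISO-8859-1 default and a missing encoding trigger the look inside the page.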