Fixing garbled Chinese text when scraping novels or image titles with Python

ppdd·~

Last modified 2024-01-13 23:01:50

First published 2021-04-03 18:42:41

Article link: https://blog.youkuaiyun.com/qq_51007474/article/details/115419132

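This article covers the garbled-text (mojibake) problem that shows up when scraping web pages with a Python crawler, and how to fix it. For garbled page text, you can either set the response encoding by hand to 'utf-8' or let response.apparent_encoding detect the encoding from the page content. For garbled Chinese in image titles, you can re-encode the string with encode('iso-8859-1') and then decode the resulting bytes as GBK. These approaches resolve the encoding problems most commonly met while scraping.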

Example:

Scrape the background images from https://pic.netbian.com/4kbeijing/

Original code (produces garbled text):

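import requests
from lxml import etree

url = 'https://pic.netbian.com/4kbeijing/'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed placeholder; the original does not show its headers
page_text = requests.get(url, headers=headers).text  # the Chinese text comes back garbled here
# Data to parse: the src attribute value and the alt attribute value
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')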

Solutions

1.

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # manually set the encoding used to decode the response
page_text = response.text

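This only works when the page really is UTF-8. If the page uses another encoding, such as GBK, the text is still garbled, so this approach is not recommended.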
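
To see why forcing UTF-8 can fail, the minimal sketch below skips the network and decodes GBK bytes as UTF-8 by hand. The sample string and variable names are made up for illustration, and errors="replace" is used on the assumption that it approximates how response.text substitutes bytes it cannot decode.

```python
# Bytes as a GBK-encoded page would send them (the sample text is hypothetical).
original = "壁纸"
raw = original.encode("gbk")

# Forcing UTF-8 onto GBK bytes: invalid sequences turn into U+FFFD.
forced_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in works.
correct = raw.decode("gbk")

print(repr(forced_utf8))
print(correct)
```

The forced result contains replacement characters, and the original bytes cannot be recovered from it, which is why guessing the wrong encoding by hand is risky.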

2.

response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding  # use the encoding detected from the page content
page_text = response.text

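Why this works:
encoding is taken from the charset field of the HTTP Content-Type response header. If the header has no charset field, requests falls back to ISO-8859-1 for text content. ISO-8859-1 cannot represent Chinese characters, and that is what causes the garbled text.
apparent_encoding instead guesses the encoding by analyzing the page content itself, so it is usually more accurate than encoding in this situation. When a page comes out garbled, assign apparent_encoding to encoding before reading response.text.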
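
The header fallback can be imitated offline with the standard library alone. This is only a sketch of the behavior described here, not the actual requests implementation: guess_encoding_from_header is a hypothetical helper, and the sample bytes stand in for the body of a GBK page.

```python
from email.message import Message


def guess_encoding_from_header(content_type: str) -> str:
    """Hypothetical helper: mimic the header-based fallback for text content."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    # No charset in the header: fall back to ISO-8859-1.
    return charset if charset else "ISO-8859-1"


raw = "风景".encode("gbk")  # body bytes of a hypothetical GBK page

# Header without charset -> ISO-8859-1 -> garbled text.
header_guess = guess_encoding_from_header("text/html")
garbled = raw.decode(header_guess)

# Header with charset -> the declared encoding is used and the text is readable.
declared = guess_encoding_from_header("text/html; charset=gbk")
readable = raw.decode(declared)

print(header_guess, repr(garbled))
print(declared, readable)
```

Decoding with ISO-8859-1 never raises an error, because every byte maps to some character. That is why the failure shows up as silent garbling rather than an exception.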

Alternatively, fix the garbled string itself:

    img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
    # General-purpose fix for garbled Chinese text
    img_name = img_name.encode('iso-8859-1').decode('gbk')

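This encodes the wrongly decoded string back to its original bytes with ISO-8859-1, then decodes those bytes as GBK, the encoding the page actually uses.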
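
The round trip can be wrapped in a small helper so that strings which were never garbled pass through unchanged. This is a minimal sketch: fix_mojibake and its default arguments are hypothetical names chosen for illustration, and the garbled input is simulated locally rather than scraped.

```python
def fix_mojibake(text: str, wrong: str = "iso-8859-1", actual: str = "gbk") -> str:
    """Hypothetical helper: undo a decode that used the wrong encoding."""
    try:
        # Recover the original bytes, then decode them with the real encoding.
        return text.encode(wrong).decode(actual)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in the assumed way; return it unchanged.
        return text


# Simulate the garbled alt text: GBK bytes decoded as ISO-8859-1.
garbled = "风景.jpg".encode("gbk").decode("iso-8859-1")

print(fix_mojibake(garbled))     # 风景.jpg
print(fix_mojibake("风景.jpg"))  # 风景.jpg (already correct, left as-is)
```

The try/except matters because calling encode('iso-8859-1') on a string that already holds real Chinese characters raises UnicodeEncodeError.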