Fixing garbled text in scraped pages by setting response.encoding = "utf-8"

Latest recommended article published 2024-12-03 19:14:53

License: CC 4.0 BY-SA

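When scraping with requests, Chinese text that is printed or saved may come out garbled. A response has two properties, text and content: text is the body decoded into a Unicode string, and content is the raw bytes. Check which character encoding the page declares through charset in its headers, then set response.encoding to match it. That resolves the garbled Chinese.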

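While scraping with requests, I found that Chinese text printed to the console or saved to a file came out as garbled characters instead of readable Chinese (I was not sure which encoding I was looking at; either way it was mojibake).

Scraping some site: response = requests.get("http://www.xxxxx.com/")

As we know, response has two properties, text and content. Both hold the response body, but they are not the same thing. The docstrings say:

The docstring for text reads: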

Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

And the docstring for content reads:

Content of the response, in bytes.

In other words, text is the body decoded into a Unicode string (str), while content is the raw, undecoded bytes.
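To make the difference concrete, here is a minimal offline sketch that uses only the standard library. The byte string `raw` stands in for `response.content`, and decoding it is roughly what `response.text` does with `response.encoding`. The ISO-8859-1 case mirrors the fallback requests applies to `text/*` responses whose headers declare no charset. The sample string and variable names are illustrative only.

```python
# Stand-in for response.content: the raw bytes a server would send for a UTF-8 page.
raw = "中文网页".encode("utf-8")

# Decoding with the wrong codec does not raise an error here, because every
# byte is a valid ISO-8859-1 character. It silently produces mojibake instead.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec the page was actually written in recovers the text.
correct = raw.decode("utf-8")

# Each of these characters takes 3 bytes in UTF-8, so the wrong decode
# yields one character per byte rather than one per Chinese character.
print(len(raw), len(garbled), len(correct))

# The mojibake is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == correct)
```

This is why setting the right encoding before reading text matters: the bytes are the same in both cases, only the interpretation changes.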

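Next, look at which encoding the page declares: check the charset in the HTTP response headers, or the charset given in the <meta> tag under the <head> tag of the page source.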
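As a rough sketch of that lookup, the two helpers below pull the declared charset out of a Content-Type header value and out of a <meta> tag. The names `charset_from_header` and `charset_from_meta` are hypothetical, and the regular expression is deliberately loose rather than a real HTML parser, so treat it as an illustration under those assumptions.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None when absent


# Loose pattern (an assumption of this sketch): matches both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_meta(html: bytes) -> Optional[str]:
    """Return the first charset declared in a <meta> tag near the top of the page."""
    match = _META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


print(charset_from_header("text/html; charset=GB2312"))
print(charset_from_header("text/html"))
print(charset_from_meta(b'<html><head><meta charset="gb2312"></head></html>'))
```

With requests, the corresponding inputs would be `response.headers.get("Content-Type", "")` and `response.content`.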

So when the HTML obtained through the text property comes out garbled, set response.encoding to the encoding the page declares before reading text, and the garbling goes away.

import requests

response = requests.get("http://www.xxxxx.com/")
response.encoding = "utf-8"  # manually set the character encoding to utf-8
print(response.text)
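Hardcoding "utf-8" only helps when the page really is UTF-8. One possible variation, sketched below, keeps a charset declared in the headers and otherwise falls back to `response.apparent_encoding`, which is the encoding requests guesses from the body. `fix_encoding` and `FakeResponse` are hypothetical names; `FakeResponse` is an offline stand-in for `requests.Response`, and its `apparent_encoding` value is assumed rather than actually detected.

```python
class FakeResponse:
    """Offline stand-in exposing the few attributes of requests.Response used below."""

    def __init__(self, content, headers, encoding, apparent_encoding):
        self.content = content
        self.headers = headers
        self.encoding = encoding                    # what was derived from the headers
        self.apparent_encoding = apparent_encoding  # what body detection would guess

    @property
    def text(self):
        # Decode the raw bytes with the current encoding, replacing bad bytes.
        return self.content.decode(self.encoding or "utf-8", errors="replace")


def fix_encoding(response):
    """Trust a charset declared in the headers; otherwise use the detected one."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower() and response.apparent_encoding:
        response.encoding = response.apparent_encoding
    return response


page = FakeResponse(
    content="你好世界".encode("gb2312"),
    headers={"Content-Type": "text/html"},  # no charset declared
    encoding="ISO-8859-1",                  # the fallback for text/* without a charset
    apparent_encoding="GB2312",             # assumed detection result for this sketch
)
before = page.text
after = fix_encoding(page).text
print(before == "你好世界", after == "你好世界")
```

Detection is only a guess and can be wrong on short or mostly ASCII pages, so a charset read from the page itself remains the more reliable choice when one is available.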


2 comments

Tony Einstein 2022.05.26
In my case response.apparent_encoding is 'ascii' while response.encoding is 'utf-8', and the scraped data still comes back garbled. This fix does not solve that.

小牛头# 2019.03.09
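One thing to watch out for is the page's own charset. If the page you are scraping declares <meta charset="gb2312">, then the following resolves it:
[code=python]
import requests
response = requests.get("http://www.xxxxx.com/")
response.encoding = "gb2312"  # manually set the character encoding to gb2312
print(response.text)
[/code]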