python chardet

Latest recommended article published on 2024-09-25 08:15:00

weixin_30932215

License: CC 4.0 BY-SA

Tags: python, database

Original link: http://www.cnblogs.com/yanxiatingyu/p/10219597.html


chardet: a character encoding detection tool

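String encoding has always been a headache, especially when dealing with non-standard third-party web pages. Python provides two data types, str (Unicode text) and bytes, and converts between them with encode() and decode(). Without knowing the encoding, however, there is no reliable way to call decode() on bytes.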
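As a small illustration of the problem, the sketch below encodes a sample string as GBK and then decodes it three ways. The sample text and the choice of GBK, UTF-8 and Latin-1 are assumptions made for the example only.

```python
text = "编码"                    # str: Unicode text
data = text.encode("gbk")        # bytes: the same text encoded as GBK

# Decoding with the correct encoding round-trips cleanly.
round_trip = data.decode("gbk")

# Decoding with a wrong multi-byte encoding may fail loudly...
try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ...or, worse, succeed silently and return garbage (mojibake).
# Latin-1 maps every byte to a character, so it never raises.
mojibake = data.decode("latin-1")

print(round_trip == text, utf8_failed, mojibake == text)
```

The silent case is the dangerous one: no exception is raised, so the wrong text can travel a long way before anyone notices.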

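To convert bytes of unknown encoding to str, the encoding first has to be "guessed". The guess works by collecting the characteristic characters of each encoding and judging the input against them, which gives a high probability of guessing right.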
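To make the idea of guessing concrete, here is a deliberately simplified sketch. It is not how chardet works: instead of statistics on characteristic characters, it only checks for a byte-order mark and then trial-decodes against a short candidate list. The function name `naive_guess` and the candidate list are hypothetical, chosen for this example.

```python
import codecs

# Byte-order marks identify an encoding unambiguously when present.
# UTF-32 is checked before UTF-16 because the UTF-32 LE mark starts
# with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def naive_guess(raw, candidates=("utf-8", "gbk", "euc_jp")):
    """Return the first plausible encoding name for raw, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # No mark found: accept the first candidate that decodes without error.
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(naive_guess("编码".encode("utf-8")))
print(naive_guess("编码".encode("gbk")))
```

The weakness of this approach is that the result depends on the order of the candidates: many byte sequences are valid in more than one legacy encoding, so a successful decode does not prove the guess is right. That is the gap a statistical detector is meant to close.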

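Of course, writing this encoding detection from scratch would cost a lot of time and effort. This is where the third-party library chardet comes in handy: it makes detecting an encoding simple.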


Installation:
    pip install chardet

Official documentation : https://chardet.readthedocs.io/en/latest/
More information       : https://pypi.org/project/chardet/
Supported encodings    : https://chardet.readthedocs.io/en/latest/supported-encodings.html
chardet module         : https://chardet.readthedocs.io/en/latest/api/modules.html

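Usage
import urllib.request
import chardet

# Download the raw bytes of the page; the encoding is not known in advance
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()

# detect() takes bytes and returns a dict describing its best guess
print(chardet.detect(rawdata))
# >> {'encoding': 'EUC-JP', 'confidence': 0.99}
# (the result depends on what the site serves at the time of the request)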

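import redis

# Connect to a Redis server on localhost with the default settings
rds = redis.Redis()
# The value means: "a string stored at some unknown time, by someone unknown, under unknown circumstances"
rds.set('user_info', '这是一串不怎么什么时候存入不知道谁存入，什么情况下的字符串')
# get() returns bytes by default, so the encoding has to be detected
user_info = rds.get('user_info')
print(chardet.detect(user_info))
# >> {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}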


{
"encoding": "the detected character encoding",
"confidence": "the detection probability, from a minimum of 0 to a maximum of 1 (i.e. 100%)"
}
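The two fields are usually used together: decode with the detected encoding only when the confidence is high enough. The sketch below shows one possible way to do that. The helper name `decode_bytes`, its `min_confidence` threshold of 0.5 and its UTF-8 fallback are assumptions for illustration, not part of chardet. The `detect` parameter exists only so the helper can be tried without chardet installed.

```python
def decode_bytes(raw, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a detector's guess, with a fallback encoding."""
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect

    result = detect(raw)
    encoding = result.get("encoding")           # may be None if nothing matched
    confidence = result.get("confidence") or 0.0

    # Fall back when the detector is unsure or found nothing.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The detector named an encoding this Python build does not know.
        return raw.decode(fallback, errors="replace")


# Usage with a stand-in detector that mimics the shape of chardet's result.
def fake_detect(raw):
    return {"encoding": "EUC-JP", "confidence": 0.99}


print(decode_bytes("日本語".encode("euc_jp"), detect=fake_detect))
```

In real use, call `decode_bytes(raw)` with no detector argument so that `chardet.detect` is used. Using `errors="replace"` trades strictness for robustness; when corrupted input should be rejected rather than patched over, the default strict mode is the better choice.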