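Detecting character encodings in Python with the chardet module

This article shows how to use the chardet library to detect the encoding of a web page, covering basic and advanced usage. Basic usage goes through the detect function; advanced usage is meant for large amounts of text, where the detector is fed incrementally and reports a result once it is confident.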

Documentation: http://chardet.feedparser.org/docs/


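One more thing before we start (this applies to Python 2; in Python 3 the corresponding types are bytes and str):


isinstance(s, str) checks whether s is an ordinary (byte) string

isinstance(s, unicode) checks whether s is a unicode string

The main text follows: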

Basic usage

The easiest way to use the Universal Encoding Detector library is with the detect function.


Example: Using the detect function

The detect function takes one argument, a non-Unicode string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.

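import urllib.request  # Python 3; in Python 2 the call was urllib.urlopen
import chardet

# Fetch the raw, undecoded bytes of the page.
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
print(chardet.detect(rawdata))
#{'encoding': 'EUC-JP', 'confidence': 0.99}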
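
The usual next step after detection is to decode the bytes. The following is a minimal sketch, assuming only that the result dictionary has the shape shown above. The helper name decode_detected, the 0.5 confidence threshold, and the UTF-8 fallback are illustrative choices, not part of chardet.

```python
def decode_detected(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dictionary.

    `result` is expected to look like {'encoding': 'EUC-JP', 'confidence': 0.99}.
    `min_confidence` and `fallback` are assumptions made for this sketch.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Fall back when no encoding was reported or the guess is too weak.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        # errors="replace" keeps stray bytes from raising UnicodeDecodeError.
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # Python has no codec registered under the reported name.
        return raw.decode(fallback, errors="replace")


sample = "こんにちは".encode("euc_jp")
text = decode_detected(sample, {"encoding": "EUC-JP", "confidence": 0.99})
print(text)
```

In real use the second argument would be the dictionary returned by chardet.detect(rawdata).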



Advanced usage

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.

Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as the chardet.detect function returns).


Example: Detecting encoding incrementally

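import urllib.request  # Python 3; in Python 2 the call was urllib.urlopen
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
# Iterate over the response line by line instead of reading it all up front.
for line in usock:
    detector.feed(line)
    if detector.done:
        break  # confident enough; no need to read the rest
detector.close()
usock.close()
print(detector.result)

#{'encoding': 'EUC-JP', 'confidence': 0.99}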
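
The loop above feeds one line at a time. For binary data or very long lines, fixed-size chunks may be a better fit. The following is a sketch that relies only on the feed / done / close / result interface described above. The function name feed_in_chunks and the 4096-byte chunk size are hypothetical choices, not part of chardet.

```python
def feed_in_chunks(stream, detector, chunk_size=4096):
    """Feed a binary stream to a detector until it is done or the data ends.

    `detector` is any object with feed(), close(), .done and .result,
    such as a UniversalDetector. `chunk_size` is an arbitrary default.
    """
    while not detector.done:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        detector.feed(chunk)
    detector.close()  # final calculations if the threshold was never reached
    return detector.result
```

With chardet installed, this could be called as feed_in_chunks(open(path, 'rb'), UniversalDetector()), where path is whatever file you want to inspect.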


If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.


Example: Detecting encodings of multiple files


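import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    # Print the name without a newline so the result lands on the same line.
    print(filename.ljust(60), end='')
    detector.reset()
    # Binary mode: the detector needs raw bytes, not decoded text.
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)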
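
If the per-file results are needed later rather than printed, the same reset / feed / close cycle can be wrapped in a function. The following is a sketch that assumes only the detector interface described above. The name detect_files is a hypothetical helper, not part of chardet.

```python
def detect_files(filenames, detector):
    """Return {filename: result} using one reusable detector object."""
    results = {}
    for filename in filenames:
        detector.reset()  # clear state left over from the previous file
        with open(filename, "rb") as f:  # raw bytes, not decoded text
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Store a copy so the entry is independent of the detector object.
        results[filename] = dict(detector.result)
    return results
```

With chardet installed, detect_files(glob.glob('*.xml'), UniversalDetector()) would cover the same files as the example above.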

