Detecting encodings in Python: Universal Encoding Detector

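This article introduces Universal Encoding Detector, a practical tool for detecting text encodings in Python. It can be used in two ways: the quick and simple detect function, and an incremental detection method suited to large amounts of text. Examples show how to detect the encoding of a single document and of multiple files.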


Detecting a file's encoding with Python

Universal Encoding Detector is a very good tool for this. Its website is: http://chardet.feedparser.org/

It is easy to use.

Usage
Basic usage

The easiest way to use the Universal Encoding Detector library is with the detect function.
Example: Using the detect function

The detect function takes one argument, a byte string (a bytes object in Python 3, not decoded Unicode text). It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.

>>> import urllib.request
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
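
The dictionary returned by detect is usually the input to a decoding step. The following is a minimal sketch of that step, not part of the library: the helper names decode_with_result and detect_and_decode, the 0.5 confidence threshold, and the UTF-8 fallback are all assumptions chosen for illustration. It also allows for the case where the detector reports no encoding at all (None).

```python
def decode_with_result(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a detect()-style result dictionary.

    min_confidence and fallback are illustrative defaults, not library values.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Fall back when nothing was detected or the guess is too weak.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec this Python build knows about.
        return raw.decode(fallback, errors="replace")


def detect_and_decode(raw):
    """Detect the encoding of raw bytes with chardet, then decode them."""
    import chardet  # third-party; imported here so the helper above has no dependency

    return decode_with_result(raw, chardet.detect(raw))
```

With the rawdata from the example above, detect_and_decode(rawdata) would return the page as a str. Using errors="replace" trades strictness for robustness; a stricter caller might prefer to let decoding errors propagate.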

Advanced usage

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.

Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as the chardet.detect function returns).
Example: Detecting encoding incrementally

import urllib.request
from chardet.universaldetector import UniversalDetector

# Feed the response line by line and stop once the detector is confident.
usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock:
    detector.feed(line)
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)

{'encoding': 'EUC-JP', 'confidence': 0.99}
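
The same feed / done / close sequence also works on fixed-size chunks, which may be preferable for binary data that has no meaningful line breaks. The sketch below is one possible arrangement under stated assumptions: detect_stream and detect_file are hypothetical helper names, and the 4096-byte chunk size is arbitrary. The detector is passed in as an argument, so any object with reset, feed, done, close and result behaves the same way.

```python
def detect_stream(stream, detector, chunk_size=4096):
    """Feed a binary stream to a UniversalDetector-style object in chunks.

    Stops reading as soon as the detector sets done to True.
    """
    detector.reset()
    while not detector.done:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # source exhausted before the detector became confident
        detector.feed(chunk)
    detector.close()  # final calculations if the threshold was never reached
    return detector.result


def detect_file(path):
    """Detect the encoding of one file with chardet's UniversalDetector."""
    from chardet.universaldetector import UniversalDetector  # third-party

    with open(path, "rb") as f:
        return detect_stream(f, UniversalDetector())
```

Because reading stops once detector.done is True, a large file is only read as far as the detector needs.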

If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.
Example: Detecting encodings of multiple files

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end=' ')
    detector.reset()
    # Read in binary mode: the detector expects raw bytes.
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)
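
When the results are needed later in the program rather than printed, the reset / feed / close cycle can collect them into a dictionary keyed by filename. This is a sketch only: detect_many and detect_xml_files are hypothetical names, the chunk size is arbitrary, and each result is copied on the assumption that the caller may otherwise end up sharing one object across files.

```python
import glob


def detect_many(paths, detector, chunk_size=4096):
    """Return {path: result} using one reusable UniversalDetector-style object."""
    results = {}
    for path in paths:
        detector.reset()  # clear state left over from the previous file
        with open(path, "rb") as f:
            # iter() with a sentinel yields chunks until read() returns b"".
            for chunk in iter(lambda: f.read(chunk_size), b""):
                detector.feed(chunk)
                if detector.done:
                    break
        detector.close()
        results[path] = dict(detector.result)  # copy before the next reset
    return results


def detect_xml_files(pattern="*.xml"):
    """Detect encodings for every file matching pattern, using chardet."""
    from chardet.universaldetector import UniversalDetector  # third-party

    return detect_many(sorted(glob.glob(pattern)), UniversalDetector())
```

Calling detect_xml_files() in a directory of XML files would return one result dictionary per file, in sorted filename order.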
