Python 处理各种编码的字符串

最新推荐文章于 2021-02-19 20:40:56 发布

转载最新推荐文章于 2021-02-19 20:40:56 发布 · 446 阅读

Python 同时被 2 个专栏收录

131 篇文章

订阅专栏

编码

79 篇文章

订阅专栏

本文介绍了一个Python库Chilkat在处理多种字符编码时的应用案例，包括UTF-8、Shift_JIS等，展示了如何进行字符串的编码与解码，并讨论了不同编码方式下字符串长度的变化。

#
 file: Unicode2.py

#
 -*- coding: utf-8 -*-

import

chilkat 

#
 The CkString object can handle any character encoding.

s1
=

chilkat.CkString()

#
 The appendEnc method allows us to append a string in any encoding.

s1.appendEnc('èéêëabc','utf-8')

#
 If you're working with different encodings, you may wish

#
 to name your string variables to reflect the encoding.

strAnsi
=

s1.getAnsi()

strUtf8
=

s1.getUtf8()

#
 Prints "7"

print

len(strAnsi)

#
 Prints "11"

print

len(strUtf8)

#
 getNumChars returns the number of characters

print

'Num Chars: ' 
+ 
str(s1.getNumChars())

#
 utf-8 chars do not have a constant number of bytes/char.

#
 A single utf-8 char is represented in 1 to 6 bytes.

print

'utf-8: ' 
+ 
str(s1.getSizeUtf8())

#
 ANSI is typically 1 byte per/char, but for some languages

#
 such as Japanese, ANSI equates to a character encoding that may

#
 not be 1 byte/char.  (Shift_JIS is the ANSI encoding for Japanese)

print

'ANSI: ' 
+ 
str(s1.getSizeAnsi())

#
 Let's create an English/Japanese string.

s2
=

chilkat.CkString()

s2.appendEnc('abc愛知県新城市の','utf-8')

#
 We can get the string in any multibyte encoding.

print

's2 num chars = ' 
+ 
str(s2.getNumChars())

strShiftJIS
=

s2.getEnc('shift_JIS')

print

'Shift-JIS num bytes = ' 
+ 
str(len(strShiftJIS))

strIso2022JP
=

s2.getEnc('iso-2022-jp')

print

'iso-2022-jp num bytes = ' 
+ 
str(len(strIso2022JP))

strEucJp
=

s2.getEnc('euc-jp')

print

'euc-jp num bytes = ' 
+ 
str(len(strEucJp))

#
 We can save the string in any encoding

s2.saveToFile('out_shift_jis.txt','shift_JIS')

s2.saveToFile('out_iso_2022_jp.txt','iso-2022-jp')

s2.saveToFile('out_utf8.txt','utf-8')

s2.saveToFile('out_euc_jp.txt','euc-jp')

#
 You may mix any number of languages in a utf-8 string

#
 because utf-8 can encode characters in all languages.

#
 (utf-8 is the multi-byte encoding of Unicode)

#

#
 An ANSI string can generally hold us-ascii + the native language.

#
 For example, Shift_JIS can represent us-ascii characters

#
 in addition to Japanese characters.

#
 For example, this is OK

strShiftJis
=

'abc123' 
+ 
s2.getEnc('shift_JIS')

#
 This is not OK:

strShiftJis2
=

'ςστυφ' 
+ 
s2.getEnc('shift_JIS')

print

"Done!"