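A few commonly used concepts (Python 2):
1. unicode and str:
unicode holds text as code points and has not been encoded; str holds bytes that are already encoded in one specific encoding, such as gbk, utf-8, or ascii. Both are subclasses of basestring. decode() turns a str into unicode, and encode() turns unicode into a str.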
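2. System encoding, source encoding, file system encoding, terminal input/output encoding
System encoding: the locale's default encoding. It is normally gbk (cp936) on Simplified Chinese Windows and utf-8 on most Linux systems. Query it with locale.getdefaultlocale() or locale.getpreferredencoding(); locale.setlocale() switches the locale of the running process. Python 2 normally derives sys.stdout.encoding from the locale when stdout is a terminal, so the locale decides how print encodes unicode objects.
Source encoding: the encoding of the .py file itself. Python 2 assumes ascii unless a declaration such as "# -*- coding: utf-8 -*-" appears on the first or second line. This is separate from Python's default encoding, which is used for implicit conversions between str and unicode and when encode() or decode() is called without an argument. Read it with sys.getdefaultencoding(). sys.setdefaultencoding() changes it, but site.py removes that function at startup, so it is reachable only after reload(sys), and changing it is generally discouraged.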
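File system encoding: sys.getfilesystemencoding(), the encoding used to convert unicode file names into the byte names the operating system expects. It says nothing about the encoding of file contents.
Terminal input encoding: sys.stdin.encoding
Terminal output encoding: sys.stdout.encoding. print encodes unicode objects with it, while str objects are written as raw bytes. Output displays correctly only when the bytes that reach the terminal match the encoding the terminal actually uses, so sys.stdout.encoding (and therefore the locale) should agree with the terminal's own setting.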
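3. For encoding conversion, work with unicode throughout the code wherever possible: decode to unicode at the point of input, and encode to the appropriate str at the point of output.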
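The rule in point 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the original notes: the helper names convert_file and demo, the temporary file names, and the choice of gbk as input and utf-8 as output are all hypothetical. It uses io.open so that it runs on both Python 2.7 and Python 3, and writes the sample text as \u escapes so the snippet does not depend on the source file's encoding.

```python
from __future__ import print_function
import io
import os
import shutil
import tempfile

# The two characters used in Example 1, written as escapes (hypothetical sample data).
TEXT = u'\u4e2d\u56fd'


def convert_file(src_path, src_encoding, dst_path, dst_encoding):
    """Read src_path as src_encoding and write it to dst_path as dst_encoding."""
    # Input boundary: io.open decodes the bytes on disk into unicode text.
    with io.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    # Everything between the boundaries handles unicode only.
    text = text.strip()
    # Output boundary: io.open encodes the unicode text when writing.
    with io.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)
    return text


def demo():
    """Create a gbk file, convert it to utf-8, return (text, output bytes)."""
    tmp_dir = tempfile.mkdtemp()
    try:
        src_path = os.path.join(tmp_dir, 'in_gbk.txt')
        dst_path = os.path.join(tmp_dir, 'out_utf8.txt')
        with open(src_path, 'wb') as f:
            f.write(TEXT.encode('gbk') + b'\n')
        text = convert_file(src_path, 'gbk', dst_path, 'utf-8')
        with open(dst_path, 'rb') as f:
            out_bytes = f.read()
        return text, out_bytes
    finally:
        shutil.rmtree(tmp_dir)


if __name__ == '__main__':
    text, out_bytes = demo()
    print(repr(text), repr(out_bytes))
```

Keeping the encoding names at the two boundaries means the middle of the program never has to know which encoding the data arrived in.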
Example 1:
#coding:utf-8 #required because this .py file is saved as utf-8
import sys
import locale
import os
import codecs
reload(sys)
print sys.getdefaultencoding() + " - sys.getdefaultencoding()"
sys.setdefaultencoding('utf8') #affects encode() called without an argument
print sys.getdefaultencoding() + " - sys.getdefaultencoding()"
print sys.stdout.encoding + " - sys.stdout.encoding:"
#sys.stdout = codecs.getwriter('utf8')(sys.stdout) #affects print
print sys.stdout.encoding + " - sys.stdout.encoding:"
u = u'中国'
print u + " - u"
a = '中国'
print a + " - a"
print a.decode('utf-8') + " - a.decode('utf-8')"
print a.decode('utf-8').encode('gbk') + " - a.decode('utf-8').encode('gbk')"
print a.decode('utf-8').encode('utf-8') + " - a.decode('utf-8').encode('utf-8')"
print a.decode('utf-8').encode() + " - a.decode('utf-8').encode()"
print (sys.stdout.encoding) + " - (sys.stdout.encoding)"
print (sys.stdout.isatty())
print (locale.getpreferredencoding())
print (sys.getfilesystemencoding())
Results:
1. Terminal: utf-8, locale: gbk
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
GBK - sys.stdout.encoding:
GBK - sys.stdout.encoding:
�й� - u
中国 - a
�й� - a.decode('utf-8')
�й� - a.decode('utf-8').encode('gbk')
中国 - a.decode('utf-8').encode('utf-8')
中国 - a.decode('utf-8').encode()
GBK - (sys.stdout.encoding)
True
GBK
utf-8
2. Terminal: utf-8, locale: utf-8
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
UTF-8 - sys.stdout.encoding:
UTF-8 - sys.stdout.encoding:
中国 - u
中国 - a
中国 - a.decode('utf-8')
�й� - a.decode('utf-8').encode('gbk')
中国 - a.decode('utf-8').encode('utf-8')
中国 - a.decode('utf-8').encode()
UTF-8 - (sys.stdout.encoding)
True
UTF-8
utf-8
3. Terminal: gbk, locale: gbk
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
GBK - sys.stdout.encoding:
GBK - sys.stdout.encoding:
中国 - u
涓???? - a
中国 - a.decode('utf-8')
中国 - a.decode('utf-8').encode('gbk')
涓???? - a.decode('utf-8').encode('utf-8')
涓???? - a.decode('utf-8').encode()
GBK - (sys.stdout.encoding)
True
GBK
utf-8
4. Terminal: gbk, locale: utf-8
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
UTF-8 - sys.stdout.encoding:
UTF-8 - sys.stdout.encoding:
涓???? - u
涓???? - a
涓???? - a.decode('utf-8')
中国 - a.decode('utf-8').encode('gbk')
涓???? - a.decode('utf-8').encode('utf-8')
涓???? - a.decode('utf-8').encode()
UTF-8 - (sys.stdout.encoding)
True
UTF-8
utf-8
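The four result sets follow one pattern, which can be reproduced without a real terminal. The sketch below is a simplified model and an assumption about what happens, not a description of any terminal's internals: the function name shown_on_terminal is hypothetical, stdout_encoding stands in for sys.stdout.encoding (taken from the locale), and terminal_encoding stands in for the encoding the terminal uses to interpret incoming bytes. A real terminal may draw undecodable bytes differently from Python's 'replace' handler, so only the cases that are certain are checked.

```python
from __future__ import print_function

# The two characters used in Example 1, written as escapes.
TEXT = u'\u4e2d\u56fd'


def shown_on_terminal(data, stdout_encoding, terminal_encoding):
    """Model what a terminal displays when `data` is printed.

    data: unicode text, or a byte string (str in Python 2, bytes in Python 3).
    """
    if isinstance(data, bytes):
        # Byte strings are written to the terminal unchanged.
        raw = data
    else:
        # Unicode text is encoded with sys.stdout.encoding before it is written.
        raw = data.encode(stdout_encoding)
    # The terminal interprets the bytes with its own encoding.
    return raw.decode(terminal_encoding, 'replace')


if __name__ == '__main__':
    utf8_bytes = TEXT.encode('utf-8')  # plays the role of `a` in Example 1
    # Case 1 (terminal utf-8, locale gbk): unicode is garbled, utf-8 bytes are fine.
    print(repr(shown_on_terminal(TEXT, 'gbk', 'utf-8')))
    print(shown_on_terminal(utf8_bytes, 'gbk', 'utf-8') == TEXT)
    # Case 3 (terminal gbk, locale gbk): unicode is fine, utf-8 bytes are garbled.
    print(shown_on_terminal(TEXT, 'gbk', 'gbk') == TEXT)
    print(shown_on_terminal(utf8_bytes, 'gbk', 'gbk') == TEXT)
```

In this model a unicode value displays correctly exactly when the locale-derived encoding equals the terminal's encoding, and a byte string displays correctly exactly when it is already in the terminal's encoding, whatever the locale says.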