Opening Files in Python 2 and Python 3: How decode and encode Differ

License: CC 4.0 BY-SA


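This article compares how decode and encode behave when opening files in Python 2 and Python 3. In Python 2, decode converts str to unicode and encode converts unicode back to str; in Python 3, decode converts bytes to str and encode converts str to bytes. The default encoding reported by sys.getdefaultencoding() is ASCII in Python 2 and UTF-8 in Python 3.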

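As a running example, we open the data file pku_training.utf8, a UTF-8 encoded Chinese text file in which words are separated by two spaces.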

1. Opening the file in Python 2:

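import sys
print(sys.getdefaultencoding())  # interpreter's default encoding

f = file(".\\pku_training.utf8")  # open the file with file() (Python 2 only)
print type(f)                     # type of f

# read() returns a byte string (str); skip its first 3 bytes, then decode to unicode
data = f.read()[3:].decode('utf-8')
f.close()
print type(data)

data = data.encode('utf-8')  # encode unicode to a UTF-8 str
print type(data)

data = data.decode('utf-8')  # decode the UTF-8 str back to unicode
print type(data)

tokens = data.split('  ')  # split into words on two spaces; the result is a list
print type(tokens)         # type of tokens
print type(tokens[1])      # type of one element of tokens

print tokens[1].encode('utf-8')  # print the str form; printing the unicode object directly fails in this run
print tokens[1]                  # raises UnicodeEncodeError in this run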

Output:

ascii
<type 'file'>
<type 'unicode'>
<type 'str'>
<type 'unicode'>
<type 'list'>
<type 'unicode'>

# print test
充满  # value of tokens[1].encode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)  # error: the unicode object could not be printed directly

In Python 2, decode() and encode() perform decoding and encoding.

Python 2 uses the unicode type as the intermediate type for all encoding conversions. That is:

       decode()             encode()
str -----------> unicode -----------> str

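Note: In the run above, printing a unicode object directly raised UnicodeEncodeError because the output stream used the ASCII codec. In such an environment, convert unicode to str with encode() before printing.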
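One way to avoid the manual decode step is io.open, which accepts an encoding argument and returns decoded text; it is available in Python 2.6+ and is the same function as the built-in open in Python 3. The following is a minimal sketch rather than the method used in this article: the helper name read_tokens and the temporary sample file (standing in for pku_training.utf8) are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_tokens(path, sep=u"  "):
    """Read a UTF-8 text file and split it on a separator (hypothetical helper).

    io.open decodes while reading, so the result is text
    (unicode in Python 2, str in Python 3) rather than raw bytes.
    """
    with io.open(path, encoding="utf-8") as f:
        return f.read().split(sep)


# Build a small sample file so the sketch is self-contained.
sample = u"迈向  充满  希望"
fd, sample_path = tempfile.mkstemp(suffix=".utf8")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

tokens = read_tokens(sample_path)
os.remove(sample_path)

# Encoding explicitly yields bytes, which any output stream can accept.
encoded = tokens[1].encode("utf-8")
```

Because the file object does the decoding, the rest of the program only handles text, and bytes appear only at the point where data is written out.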

2. Opening the file in Python 3:

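import sys
print(sys.getdefaultencoding())  # interpreter's default encoding

# encoding names the codec used to encode or decode; here it decodes while reading
f = open(".\\pku_training.utf8", encoding='utf-8')
print(type(f))

data = f.read()[3:]  # read() returns str, so this skips 3 characters, not 3 bytes
print(type(data))
f.close()

data = data.encode('utf-8')  # str -> bytes
print(type(data))

data = data.decode('utf-8')  # bytes -> str
print(type(data))

tokens = data.split('  ')  # split into words on two spaces
print(type(tokens))
print(type(tokens[1]))

print(tokens[1])                  # str can be printed
print(tokens[1].encode('utf-8'))  # bytes can be printed too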

Output:

utf-8
<class '_io.TextIOWrapper'>
<class 'str'>
<class 'bytes'>
<class 'str'>
<class 'list'>
<class 'str'>

# print test
充满
b'\xe5\x85\x85\xe6\xbb\xa1'
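The same [3:] slice means different things in the two listings: in Python 2 it drops three bytes from a byte string, while in Python 3 it drops three characters from already decoded text. The sketch below illustrates the difference in Python 3. It assumes, which the article does not state, that the three skipped bytes are a UTF-8 byte order mark (BOM); under that assumption the standard utf-8-sig codec removes the mark without any slicing. The sample text is made up for the illustration.

```python
import codecs

text = "迈向  充满"
# Assumed file contents: a UTF-8 BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + text.encode("utf-8")

# Slicing the bytes drops exactly the 3-byte BOM.
from_bytes = raw[3:].decode("utf-8")

# Decoding first turns the BOM into a single character (U+FEFF),
# so [3:] on the text removes the BOM plus two real characters.
from_text = raw.decode("utf-8")[3:]

# The utf-8-sig codec strips a leading BOM if one is present.
from_sig = raw.decode("utf-8-sig")
```

If the file really does start with a BOM, passing encoding='utf-8-sig' to open() would be a sturdier choice than a hard-coded slice, since it also works on files that have no BOM.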

In Python 3, the encoding argument names the codec used for encoding or decoding:

         decode()         encode()
bytes -------------> str -------------> bytes

Note: A Python 3 str object is similar to Python 2's unicode type. Since decoding turns bytes into text and a str is already text, str has only an encode method; calling it produces an encoded bytes object.
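The round trip and the missing methods can be checked directly in Python 3. This is a small illustrative sketch; the variable names are chosen here and do not come from the article's listing.

```python
text = "充满"                     # str: a sequence of Unicode characters
data = text.encode("utf-8")       # str -> bytes
restored = data.decode("utf-8")   # bytes -> str

# In Python 3, str has encode but no decode; bytes has decode but no encode.
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(data, "encode")

# Two characters become six bytes in UTF-8.
lengths = (len(text), len(data))
```

Keeping the two types apart this way means a mismatch surfaces as an AttributeError or TypeError at the call site instead of an implicit ASCII conversion, which is what Python 2 attempts when str and unicode are mixed.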

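Python 3's print() can print both str and bytes objects; a bytes object is shown in its b'...' literal form rather than as readable text.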
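To see what print() actually emits for each type, the sketch below captures standard output in Python 3. Capturing with contextlib.redirect_stdout is only a convenience for inspecting the result and is not part of the article's listing.

```python
import io
from contextlib import redirect_stdout

data = "充满".encode("utf-8")

buf = io.StringIO()
with redirect_stdout(buf):
    print(data)                  # prints the repr of the bytes object
    print(data.decode("utf-8"))  # prints the text itself

lines = buf.getvalue().splitlines()
```

The first captured line is the escaped b'...' form and the second is the readable text, so bytes read from a file should be decoded before being shown to a user.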