Python Encoding and File I/O Explained

Article link: https://blog.youkuaiyun.com/u010620604/article/details/48292365

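This article is still a work in progress and its accuracy is not guaranteed, so read it with care.
After reading all sorts of blog explanations, I found it better to just check the documentation.
First, this article targets Python 3, which changed a number of things relative to Python 2.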

unicode

Renames unicode to str
str

class str(object=b'', encoding='utf-8', errors='strict')

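Returns a string version of object. If neither encoding nor errors is given, str(object) returns type(object).__str__(object); if object has no __str__() method of its own, it falls back to repr(object). Otherwise, object should be bytes-like, and the result is object.decode(encoding, errors).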
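The two code paths of str() can be sketched as follows. Point and Bare are hypothetical classes introduced only for this illustration; they are not part of the documentation.

```python
class Point:
    """Hypothetical class that defines its own __str__."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return f"({self.x}, {self.y})"


class Bare:
    """Hypothetical class that only defines __repr__."""
    def __repr__(self):
        return "Bare()"


# No encoding/errors given: str() calls type(obj).__str__(obj).
text_from_str = str(Point(1, 2))

# No custom __str__: the default implementation falls back to repr().
text_from_repr = str(Bare())

# encoding given: the object must be bytes-like and is decoded.
raw = "中文".encode("utf-8")
decoded = str(raw, encoding="utf-8")   # equivalent to raw.decode("utf-8")

# bytes WITHOUT an encoding is not decoded; you get the repr-style form.
not_decoded = str(b"abc")
```

Note the last case: forgetting the encoding argument silently produces the text "b'abc'" rather than "abc", which is a common source of confusion.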
Bytes

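A bytes object is an immutable sequence of single bytes, so every element must be an integer in the range [0, 255]. Bytes literals allow only ASCII characters; values in [128, 255] must be written with escape sequences.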
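A small sketch of these properties, using arbitrary sample values chosen for illustration:

```python
data = b"abc"

# Indexing a bytes object yields an int, not a length-1 bytes object.
first = data[0]

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 65
    mutated = True
except TypeError:
    mutated = False

# Values of 128 and above cannot appear literally; use \x escapes.
# These three bytes are the UTF-8 encoding of '中'.
high = b"\xe4\xb8\xad"
high_values = list(high)

# Each element must fit in one byte: 256 is rejected.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```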
Encodings and Unicode

Strings are stored internally as sequences of code points in range 0x0-0x10FFFF. (See PEP 393 for more details about the implementation.) Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue. As with other codecs, serialising a string into a sequence of bytes is known as encoding, and recreating the string from the sequence of bytes is known as decoding.

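In memory a str exists as a sequence of code points, but it has to be converted to bytes to be stored or transmitted, which is why encode and decode are needed.
P.S. A str uses the Unicode character set and can be serialised with many different encodings. UTF-16 and UTF-32 use 2-byte and 4-byte code units, so byte order (big endian or little endian) matters, and a BOM ('Byte Order Mark') can be used to resolve it. UTF-8 uses single-byte code units, so it has no byte-order issue at all and does not need a BOM.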
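To make the byte-order point concrete, here is a minimal sketch that encodes the same two-character string in several ways. The sample string is arbitrary.

```python
import codecs

s = "A中"   # U+0041 and U+4E2D

# UTF-8: one-byte code units, no byte-order question.
utf8 = s.encode("utf-8")

# UTF-16 with an explicit byte order: same code units, bytes swapped.
utf16_be = s.encode("utf-16-be")
utf16_le = s.encode("utf-16-le")

# Plain "utf-16" prepends a BOM so a decoder can detect the order.
utf16 = s.encode("utf-16")
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with "utf-16" consumes the BOM and restores the string.
round_trip = utf16.decode("utf-16")
```

The explicit -be and -le variants write no BOM, so the reader has to know the byte order in advance.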
I/O

There are three main types of I/O: text I/O, binary I/O and raw I/O. These are generic categories, and various backing stores can be used for each of them. A concrete object belonging to any of these categories is called a file object. Other common terms are stream and file-like object.
…..
All streams are careful about the type of data you give to them. For example giving a str object to the write() method of a binary stream will raise a TypeError. So will giving a bytes object to the write() method of a text stream.
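The type strictness described in this passage can be checked with in-memory streams. try_write is a hypothetical helper written just for this sketch.

```python
import io

binary_stream = io.BytesIO()
text_stream = io.StringIO()


def try_write(stream, payload):
    """Return 'ok' if the write succeeds, 'TypeError' if it is rejected."""
    try:
        stream.write(payload)
        return "ok"
    except TypeError:
        return "TypeError"


results = {
    "bytes -> binary": try_write(binary_stream, b"data"),
    "str -> binary": try_write(binary_stream, "data"),
    "str -> text": try_write(text_stream, "data"),
    "bytes -> text": try_write(text_stream, b"data"),
}
```

Only the matching combinations succeed; Python 3 never converts between str and bytes implicitly.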
- Text I/O
  Text streams read and write str. The encoding parameter selects the encoding; if it is omitted, locale.getpreferredencoding(False) is used as the default.
- Binary I/O
  
  Binary I/O (also called buffered I/O) expects and produces bytes objects.
  
  No encoding or decoding is involved.
- Raw I/O
  
  Raw I/O (also called unbuffered I/O) is generally used as a low-level building-block for binary and text streams;
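The three categories above map onto concrete classes in the io module. The sketch below opens one temporary file in each way; the file name and contents are made up for the example.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text I/O: accepts and returns str, encodes/decodes for you.
    with open(path, "w", encoding="utf-8") as f:
        text_type = type(f)
        f.write("中文")

    # Binary (buffered) I/O: returns bytes, no decoding.
    with open(path, "rb") as f:
        binary_type = type(f)
        raw_bytes = f.read()

    # Raw (unbuffered) I/O: binary mode with buffering disabled.
    with open(path, "rb", buffering=0) as f:
        raw_type = type(f)
```

In CPython these come out as io.TextIOWrapper, io.BufferedReader and io.FileIO respectively, with the text layer wrapping a buffered layer, which in turn wraps a raw one.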
open()

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Open file and return a corresponding file object.

Returns a text stream or a binary stream depending on the mode. In text mode, if encoding is None, locale.getpreferredencoding(False) is used.
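Because the fallback encoding depends on the machine, passing encoding explicitly is the safer habit. A minimal sketch, with gbk picked arbitrarily as the example encoding and a made-up file name:

```python
import locale
import os
import tempfile

# What text-mode open() falls back to when encoding is None,
# according to the documentation quoted above. Platform dependent.
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Write with an explicit encoding.
    with open(path, "w", encoding="gbk") as f:
        f.write("中文")

    # Read back with the same encoding; the stream records it.
    with open(path, "r", encoding="gbk") as f:
        declared = f.encoding
        content = f.read()

    # Binary mode shows what actually landed on disk.
    with open(path, "rb") as f:
        on_disk = f.read()
```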
print()

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
Print objects to the text stream file….
All non-keyword arguments are converted to strings like str() does and written to the stream……

print() writes str(object) for each object to file.
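A short sketch of this behaviour, redirecting print() into an in-memory text stream. Temperature is a hypothetical class used only to show that str() is applied to each argument.

```python
import io


class Temperature:
    """Hypothetical class with a custom __str__."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __str__(self):
        return f"{self.celsius} C"


buffer = io.StringIO()

# Each positional argument goes through str(), joined by sep, ended by end.
print("reading:", Temperature(21.5), 42, sep=" ", end="\n", file=buffer)
output = buffer.getvalue()

# file must be a text stream: a binary stream rejects the str that print writes.
try:
    print("x", file=io.BytesIO())
    binary_ok = True
except TypeError:
    binary_ok = False
```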
sys.getdefaultencoding()

Return the name of the current default string encoding used by the Unicode implementation.
The default encoding is utf-8.
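This can be checked directly. Note that this value is separate from the locale-dependent default that text-mode open() uses.

```python
import sys

# In Python 3 this reports 'utf-8'.
default = sys.getdefaultencoding()

# str.encode() and bytes.decode() also default to utf-8 when called bare.
encoded_default = "中".encode()
decoded_default = b"\xe4\xb8\xad".decode()
```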
Encoding declarations

the first group of this expression names the encoding of the source code file
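Source code can be saved in many different encodings, for example gbk. Without this declaration the Python interpreter cannot know which encoding the source file uses, so it assumes utf-8. A file saved in any other encoding then easily causes errors.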
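The effect of a declaration can be sketched without creating a file, since compile() also accepts source as bytes and honours the declaration. The regular expression below is the one given in the Python language reference; the source text and the name "<gbk-demo>" are made up for the example.

```python
import re

# A tiny "source file" held in memory, stored as gbk bytes.
source_text = "# -*- coding: gbk -*-\nmessage = '中文'\n"
source_bytes = source_text.encode("gbk")

# The declaration is a comment on line 1 or 2 matching this expression;
# its first group names the encoding.
first_line = source_text.splitlines()[0]
declared = re.search(r"coding[=:]\s*([-\w.]+)", first_line).group(1)

# compile() reads the declaration and decodes the bytes accordingly.
namespace = {}
exec(compile(source_bytes, "<gbk-demo>", "exec"), namespace)
message = namespace["message"]
```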

File encoding: the encoding used to turn the text shown in a file into the bytes stored on disk.
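All of the above is really an attempt to pin down where garbled text comes from. Take reading and writing a file as the example. We open a file with open(encoding=A, mode='t') and get a text stream; to read and write correctly we have to use the right encoding. Suppose the file was actually written with encoding B: the text decodes correctly only if A agrees with B on every byte sequence in the file; otherwise the result is garbled text or a UnicodeDecodeError.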
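The A-versus-B scenario can be reproduced in a few lines. This is a sketch under assumed values: B is taken to be gbk, and A is tried as gbk, utf-8 and latin-1 in turn; the file name is made up.

```python
import os
import tempfile

text = "中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "b_encoded.txt")

    # The file is written with encoding B = gbk.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # A == B: the text comes back intact.
    with open(path, "r", encoding="gbk") as f:
        same = f.read()

    # A = utf-8: these gbk bytes are not valid UTF-8, so decoding fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        utf8_outcome = "decoded"
    except UnicodeDecodeError:
        utf8_outcome = "UnicodeDecodeError"

    # A = latin-1: every byte maps to some character, so there is no
    # error, just silently wrong (garbled) text.
    with open(path, "r", encoding="latin-1") as f:
        garbled = f.read()
```

The latin-1 case is the more dangerous one, because nothing fails: the program just carries on with four wrong characters in place of two correct ones.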
