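This post is still a work in progress and its accuracy is not guaranteed, so read it with caution.
After reading all sorts of blog explanations, I found it better to just check the documentation.
First, this post is about Python 3, which changed a number of things relative to Python 2.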
-
Renames unicode to str
-
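class str(object=b'', encoding='utf-8', errors='strict')
Returns a string version of object. If neither encoding nor errors is given, str(object) returns object.__str__(); if object has no __str__() method, it falls back to repr(object). Otherwise object should be bytes-like, and the result is object.decode(encoding, errors).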
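A minimal sketch of the two branches described above. The `Point` class is a hypothetical example that is not part of the documentation; it only exists to show the fallback to `repr()`.

```python
class Point:
    """Hypothetical class that defines __repr__ but no __str__ of its own."""

    def __repr__(self):
        return "Point(1, 2)"


text = "\u4e2d\u6587"        # two CJK characters
data = text.encode("utf-8")  # the same text as a bytes object

# Neither encoding nor errors given: the result falls back to repr().
no_str_method = str(Point())

# encoding given: the bytes-like object is decoded.
decoded = str(data, encoding="utf-8")

# bytes object without encoding: you get its repr, not the decoded text.
informal = str(data)

print(no_str_method)
print(decoded == text)
print(informal)
```

The last case is a common source of surprises: `str(data)` on a bytes object does not decode anything.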
-
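A bytes object is an immutable sequence of single bytes, so every element must be in [0, 255]. A bytes literal may only contain ASCII characters; values in [128, 255] have to be written as escape sequences.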
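A short illustration of these properties, written as a sketch rather than an exhaustive test. The variable names are chosen only for this example.

```python
# ASCII characters may appear directly in a bytes literal;
# values above 127 are written as \x escape sequences.
data = b"abc\xe4\xb8\xad"

first = data[0]      # indexing yields an int in range(256)
values = list(data)  # every element as an int

# bytes is immutable: item assignment is rejected.
try:
    data[0] = 65
    mutated = True
except TypeError:
    mutated = False

# A value outside [0, 255] cannot be stored in a bytes object.
try:
    bytes([256])
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False

print(first, values, mutated, out_of_range_accepted)
```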
-
Strings are stored internally as sequences of code points in range 0x0-0x10FFFF. (See PEP 393 for more details about the implementation.) Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue. As with other codecs, serialising a string into a sequence of bytes is known as encoding, and recreating the string from the sequence of bytes is known as decoding.
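In memory a str exists as a sequence of code points, but it has to be converted to bytes for storage, which is why encode and decode are needed.
P.S. A str, which uses the Unicode character set, can be encoded in several ways. UTF-16 and UTF-32 use 2-byte and 4-byte code units, so byte order (big endian or little endian) matters, and a BOM ('Byte Order Mark') can be used to resolve it. UTF-8 uses single-byte code units (each code point takes one to four of them), so there is no byte-order ambiguity to resolve and no BOM is needed.
-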
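The encoding and byte-order points above can be sketched as follows. The sample string is arbitrary; only standard codecs are used.

```python
import codecs

text = "A\u4e2d"  # 'A' followed by one CJK character (U+4E2D)

utf8 = text.encode("utf-8")          # 1 byte for 'A', 3 bytes for U+4E2D
utf16_le = text.encode("utf-16-le")  # each 2-byte unit stored low byte first
utf16_be = text.encode("utf-16-be")  # each 2-byte unit stored high byte first

# The generic "utf-16" codec prepends a BOM so a decoder can detect the order.
utf16 = text.encode("utf-16")
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding reverses encoding, as long as the matching codec is used.
round_trip = utf16.decode("utf-16") == text and utf8.decode("utf-8") == text

print(utf8, utf16_le, utf16_be, has_bom, round_trip)
```

The two UTF-16 variants contain the same bytes in a different order, while the UTF-8 result has only one possible layout.
-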
There are three main types of I/O: text I/O, binary I/O and raw I/O. These are generic categories, and various backing stores can be used for each of them. A concrete object belonging to any of these categories is called a file object. Other common terms are stream and file-like object.
…..
All streams are careful about the type of data you give to them. For example giving a str object to the write() method of a binary stream will raise a TypeError. So will giving a bytes object to the write() method of a text stream.
-
Text I/O
A text stream reads and writes str. The encoding parameter selects the encoding; otherwise locale.getpreferredencoding(False) is used as the default.
-
Binary I/O (also called buffered I/O) expects and produces bytes objects.
No encoding or decoding takes place.
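A small sketch of the difference between the two kinds of stream, using in-memory streams from the io module instead of real files. The helper `rejects` is a hypothetical name introduced only for this example.

```python
import io

raw = io.BytesIO()                                     # in-memory binary stream
text_stream = io.TextIOWrapper(raw, encoding="utf-8")  # text layer on top of it

text_stream.write("\u4e2d")  # str goes in and is encoded on the way down
text_stream.flush()
stored = raw.getvalue()      # the UTF-8 bytes that reached the binary layer


def rejects(stream, value):
    """Return True if stream.write(value) raises TypeError."""
    try:
        stream.write(value)
    except TypeError:
        return True
    return False


text_rejects_bytes = rejects(text_stream, b"abc")
binary_rejects_str = rejects(io.BytesIO(), "abc")

print(stored, text_rejects_bytes, binary_rejects_str)
```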
-
Raw I/O (also called unbuffered I/O) is generally used as a low-level building-block for binary and text streams;
-
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Open file and return a corresponding file object. Depending on the mode, the result is a text stream or a binary stream. In text mode, if encoding is None, locale.getpreferredencoding(False) is used.
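As a rough sketch, the same file can be read back in text mode and in binary mode. The temporary file path is created only for this demonstration, and the fallback encoding shown is the one named in the documentation quoted above; its value depends on the machine.

```python
import locale
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # throwaway demo file

# Text mode with an explicit encoding: str in, str out.
with open(path, "w", encoding="utf-8") as f:
    f.write("\u4e2d\u6587")

with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: the raw bytes that were stored on disk.
with open(path, "rb") as f:
    as_bytes = f.read()

# What text mode falls back to when encoding is left as None.
fallback = locale.getpreferredencoding(False)

print(len(as_text), as_bytes, fallback)
```

Passing encoding explicitly keeps the result independent of the machine's locale settings.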
-
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
Print objects to the text stream file….
All non-keyword arguments are converted to strings like str() does and written to the stream…… In other words, print() writes str(object) to file.
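A minimal sketch of this behaviour. `Temperature` is a hypothetical class used only to show that print() relies on `__str__`, and an in-memory text stream stands in for sys.stdout.

```python
import io


class Temperature:
    """Hypothetical class with different __str__ and __repr__ results."""

    def __str__(self):
        return "21.5 C"

    def __repr__(self):
        return "Temperature(21.5)"


buffer = io.StringIO()  # in-memory text stream used in place of sys.stdout

# Each positional argument is converted with str() and joined with sep.
print(Temperature(), 42, sep=" | ", end="\n", file=buffer)

written = buffer.getvalue()
print(written, end="")
```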
-
sys.getdefaultencoding()
Return the name of the current default string encoding used by the Unicode implementation.
The default encoding is utf-8.
-
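The description above matches the documentation of sys.getdefaultencoding(). A quick check, offered as a sketch, that also shows str.encode() and bytes.decode() using UTF-8 when no encoding is passed:

```python
import sys

default = sys.getdefaultencoding()

text = "\u4e2d"

# encode() and decode() default to UTF-8 when called without arguments.
same_bytes = text.encode() == text.encode("utf-8")
same_text = text.encode("utf-8").decode() == text

print(default, same_bytes, same_text)
```

Note that this default is separate from the locale-dependent encoding that open() falls back to in text mode.
-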
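If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file
Source code can be saved in any of several encodings, gbk for example. Without this declaration the Python interpreter has no way of knowing which encoding the source uses, so it assumes utf-8. A file saved in some other encoding is then likely to cause errors.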
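One way to observe this rule without writing files to disk is tokenize.detect_encoding() from the standard library, which looks for a BOM or a coding comment in the first two lines and otherwise reports utf-8. The two source snippets below are hypothetical and kept ASCII-only so that both are valid under either encoding.

```python
import io
import tokenize

# Hypothetical source files, held as bytes the way they would sit on disk.
declared_source = b"# -*- coding: gbk -*-\nname = 'demo'\n"
plain_source = b"name = 'demo'\n"

# detect_encoding() takes a readline callable and returns
# (encoding_name, lines_it_had_to_read).
declared, _ = tokenize.detect_encoding(io.BytesIO(declared_source).readline)
default, _ = tokenize.detect_encoding(io.BytesIO(plain_source).readline)

print(declared, default)
```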
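File encoding: the encoding used to encode the text shown in a file into the bytes that are stored on disk.
All of the above is really meant to pin down where garbled text comes from. Take reading and writing a file as an example. We open a file with open(encoding=A, mode='t') and get a text stream; to read and write correctly, we have to use the right encoding. Suppose the file was saved with encoding B,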