python - encoding

License: CC 4.0 BY-SA

Tags: #python #unicode

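This article explores how Python prints Unicode characters, explains the differences between encodings, character sets, and code pages, and describes the UTF-8 encoding scheme and the intuition behind it. It also clarifies what “ANSI” encoding means and compares ASCII, Unicode, and UTF-8.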

Why does Python print unicode characters when the default encoding is ASCII?

Terminologies

Character set

A term that is best avoided, because it is used with conflicting meanings.[1]

A “character set” is just what it says: a properly-specified list of distinct characters.
A “character set” in HTTP (and MIME) parlance is the same as a character encoding (but not the same as a coded character set, or CCS).

Encoding

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.
UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set.
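To make the distinction concrete, here is a minimal sketch in Python 3 (the variable names are purely illustrative). The character has a single code point in the Unicode character set, but each encoding maps that code point to a different byte sequence:

```python
# One character, one code point in the Unicode character set.
text = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(text)   # 0xE9, independent of any encoding

# Different encodings map the same code point to different bytes.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
utf16_bytes = text.encode("utf-16-be")

print(hex(code_point))   # 0xe9
print(utf8_bytes)        # b'\xc3\xa9'
print(latin1_bytes)      # b'\xe9'
print(utf16_bytes)       # b'\x00\xe9'
```

Decoding reverses the mapping, but only if the same encoding is used in both directions.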

Code page

A code page is a table of values that describes the character set used for encoding a particular set of glyphs.[2]
Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language.[3]
Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s.
In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
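The effect of code pages can be sketched with Python's built-in codecs. The three Windows code pages below were chosen only for illustration: they agree on every byte below 128, but the same byte above 127 decodes to a different character in each.

```python
raw_high = b"\xe9"   # a single byte above 127
raw_low = b"A"       # a single byte below 128

code_pages = ("cp1252", "cp1251", "cp1253")  # Western European, Cyrillic, Greek

# The high byte means something different in every code page.
decoded_high = {cp: raw_high.decode(cp) for cp in code_pages}
# The low byte is the same everywhere (plain ASCII).
decoded_low = {cp: raw_low.decode(cp) for cp in code_pages}

for cp in code_pages:
    # Print code points rather than the characters themselves, so the
    # output does not depend on the terminal's own encoding.
    print(cp, f"U+{ord(decoded_high[cp]):04X}", f"U+{ord(decoded_low[cp]):04X}")
```

Here the byte 0xE9 comes out as é (U+00E9) under cp1252, й (U+0439) under cp1251, and ι (U+03B9) under cp1253, while the byte for “A” is U+0041 in all three.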

ANSI

I had long misunderstood what “ANSI encoding” actually means.

The name “ANSI” is a misnomer, since it doesn’t correspond to any actual ANSI standard, but the name has stuck.[4]
There is no single fixed ANSI encoding; there are many of them. When people say “ANSI” they usually mean “the default locale/code page for my system”, which in .NET is obtained via Encoding.Default. It is often Windows-1252 but can be a different code page in other locales.[5]
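In Python, a rough counterpart of that system default can be inspected with the standard library. This is only a sketch: the values differ from machine to machine, and settings such as Python's UTF-8 mode can change them, so no particular output is shown.

```python
import locale
import sys

# The encoding Python considers the locale default. On a Windows machine
# this is typically the "ANSI" code page, for example cp1252.
preferred = locale.getpreferredencoding(False)

# The encoding used when printing to the terminal. It can be None if
# sys.stdout has been replaced by an object without an encoding.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("locale default:", preferred)
print("stdout encoding:", stdout_encoding)
```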

UTF-8

The intuition behind UTF-8’s coding scheme.[6]

The basic rules are this:

1. If a byte starts with a 0 bit, it’s a single byte value less than 128.
2. If it starts with 11, it’s the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
3. If it starts with 10, it’s a continuation byte.
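These rules translate directly into bit masks. The function below is a minimal sketch (the name `classify_utf8_byte` and its return labels are made up for this example) that only inspects the leading bits of one byte; it does not validate a whole sequence.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte (0..255) by its leading bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx: value below 128
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: first of two bytes
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: first of three bytes
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: first of four bytes
    return "invalid"           # 11111xxx never appears in UTF-8


# "a" is one byte; the euro sign U+20AC is three bytes (E2 82 AC).
data = "a\u20ac".encode("utf-8")
kinds = [classify_utf8_byte(b) for b in data]
print(kinds)  # ['single', 'lead-3', 'continuation', 'continuation']
```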

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point: just search backwards until you find a byte that does not begin with the bits 10.
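The backward search can be sketched as follows. The helper name `code_point_start` is hypothetical, and the sketch assumes the input is valid UTF-8:

```python
def code_point_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point containing data[index]."""
    # Step backwards while the current byte looks like 10xxxxxx.
    while index > 0 and (data[index] & 0b1100_0000) == 0b1000_0000:
        index -= 1
    return index


# Bytes: 61 | E2 82 AC | 62  ("a", euro sign, "b")
data = "a\u20acb".encode("utf-8")
print(code_point_start(data, 3))  # 1: index 3 is the last byte of the euro sign
print(code_point_start(data, 4))  # 4: "b" is a single-byte code point
```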

Similarly, it can also be used for a UTF-8 strlen, one that counts code points rather than bytes, by only counting non-10xxxxxx bytes.
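A minimal sketch of such a length function, assuming the input is valid UTF-8 (`utf8_len` is an illustrative name, not a standard function). Note that it counts code points, which is not always the same as the number of characters a user sees:

```python
def utf8_len(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0b1100_0000) != 0b1000_0000)


text = "h\u00e9llo \u20ac"   # 7 code points: h, e-acute, l, l, o, space, euro sign
data = text.encode("utf-8")

print(len(data))       # 10 bytes: e-acute takes 2, the euro sign takes 3
print(utf8_len(data))  # 7
```

In everyday Python code, `len(data.decode("utf-8"))` gives the same number; the byte-counting version only illustrates why the 10xxxxxx marker is convenient.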