How UTF-8 Encoding Works

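The gist of how UTF-8 encoding works:
In UTF-8, each character is encoded as a sequence of 1 to 6 octets.
In a single-octet sequence, the highest bit is set to 0 and the remaining 7 bits hold the character value (this covers ASCII).
In a sequence of N octets (N > 1), the first octet has its N highest bits set to 1, followed by one bit set to 0; the remaining bits of that octet hold part of the character value. Each of the following N-1 octets has its highest bit set to 1 and the next bit set to 0, leaving 6 bits in each octet for the character value.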
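
The rule above can be sketched as a small encoder. This is only an illustration of the 1-to-6-octet scheme as described here, not a production codec; the function name `utf8_encode` is made up for this example, and it does not apply the restrictions of later UTF-8 revisions.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one character value with the 1-to-6-octet scheme (illustrative sketch)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("character value does not fit in 31 bits")
    if code_point < 0x80:
        # Single octet: highest bit 0, seven bits of character value.
        return bytes([code_point])
    # Find the smallest N whose payload can hold the value:
    # (7 - N) bits in the first octet plus 6 bits in each of the N-1 others.
    n = 2
    while code_point >= 1 << ((7 - n) + 6 * (n - 1)):
        n += 1
    # Continuation octets (10xxxxxx), filled from the least significant bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # First octet: N one-bits, one zero bit, then the remaining value bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | code_point
    return bytes([lead] + tail[::-1])


if __name__ == "__main__":
    # For values that Python itself can encode, the result should agree
    # with the built-in codec.
    for ch in "Aé中":
        print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

For values up to U+10FFFF (excluding surrogates) the sketch should agree with Python's built-in `str.encode("utf-8")`; the 5- and 6-octet forms have no built-in counterpart to compare against.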

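One octet: 7 payload bits, covering character values 0 to 127.
Two octets: 5 bits remain in the first octet and 6 in the second, 11 bits in total, covering 128 to 2047 (2048-1).
Three octets: 4 bits remain in the first octet and 6 in each of the second and third, 16 bits in total, covering 2048 to 65535 (65536-1).
Longer sequences follow the same pattern.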

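At the maximum of 6 octets, 1+5*6=31 bits are available for the character value, covering values up to 2147483647 (2147483648-1). Note that the 6-octet form comes from RFC 2044; the current definition, RFC 3629, limits UTF-8 to 4 octets and code points up to U+10FFFF.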
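
The bit counts above can be checked with a few lines of Python. This is just a worked calculation; the helper name `payload_bits` is invented for this example.

```python
def payload_bits(n: int) -> int:
    """Bits of character value carried by a sequence of n octets (1 <= n <= 6)."""
    if not 1 <= n <= 6:
        raise ValueError("n must be between 1 and 6")
    if n == 1:
        return 7  # 0xxxxxxx
    # First octet keeps 8 - (n + 1) bits; each of the other n-1 octets keeps 6.
    return (7 - n) + 6 * (n - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        bits = payload_bits(n)
        print(f"{n} octet(s): {bits} bits, largest value {(1 << bits) - 1:#x}")
```

The largest values printed line up with the upper bounds of the ranges in the RFC 2044 table (7F, 7FF, FFFF, 1FFFFF, 3FFFFFF, 7FFFFFFF).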
===================================================================================
Excerpt: RFC 2044 - UTF-8
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   The only octet of a "sequence" of one has the higher-order bit set to
   0, the remaining 7 bits being used to encode the character value. In
   a sequence of n octets, n>1, the initial octet has the n higher-order
   bits set to 1, followed by a bit set to 0. The remaining bit(s) of
   that octet contain bits from the value of the character to be
   encoded. The following octet(s) all have the higher-order bit set to
   1 and the following bit set to 0, leaving 6 bits in each to contain
   bits from the character to be encoded.

   The table below summarizes the format of these different octet types.
   The letter x indicates bits available for encoding bits of the UCS-4
   character value.

   UCS-4 range (hex.)           UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
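
Reading the table in the other direction gives a decoder: count the leading 1 bits of the first octet to get the sequence length, then collect the x bits. The sketch below assumes the RFC 2044 layout shown in the table; `utf8_decode_one` is a name made up for this example, and it deliberately skips checks that a strict decoder would need (for instance, rejecting overlong forms).

```python
from typing import Tuple


def utf8_decode_one(data: bytes) -> Tuple[int, int]:
    """Decode the first character in data; return (value, octets_consumed)."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return lead, 1  # 0xxxxxxx
    # Count the leading 1 bits of the first octet: that is the length n.
    n = 0
    while n < 8 and lead & (0x80 >> n):
        n += 1
    if n == 1 or n > 6:
        # 10xxxxxx is a continuation octet; 0xFE and 0xFF are never valid.
        raise ValueError("invalid first octet")
    if len(data) < n:
        raise ValueError("truncated sequence")
    # Keep only the x bits of the first octet.
    value = lead & (0x7F >> n)
    for octet in data[1:n]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("expected a 10xxxxxx continuation octet")
        value = (value << 6) | (octet & 0x3F)
    return value, n


if __name__ == "__main__":
    value, used = utf8_decode_one("中文".encode("utf-8"))
    print(hex(value), used)
```

Decoding a whole string is then a loop that calls this function and advances by the returned octet count.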