A First Look at the Unicode Character Set

License: CC 4.0 BY-SA

The first version of UCS includes 34203 different characters. Of these, 21204 are ideographic characters used in Chinese, Japanese and Korean, and 6656 are Korean Hangul syllabograms. To guarantee that the coding space will not fill up in the future (2 octets give only 65536 different character positions), a 4-octet form of UCS (UCS-4) is also defined.

The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. An octet representing an ISO/IEC 8859-1 character is easily transformed to its UCS representation by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859, and these are also in row 0. An overview of the content of all rows is found in the annex.

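Row 0 of UCS-2 (and of UTF-16, which extends it) is identical to ISO/IEC 8859-1, and its first 128 characters are the ASCII characters. Converting a WE8ISO8859P1 character to UCS-2 therefore only requires putting a zero byte in front of it (in big-endian byte order). This is also why the fix in my previous post, about garbled text with the OraOLEDB driver against an Oracle database that uses a Western character set, works.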
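As a minimal sketch of that conversion, the snippet below widens ISO/IEC 8859-1 bytes to big-endian UCS-2 by prefixing each byte with 0x00 and checks the result against Python's built-in `utf-16-be` codec. The function name `latin1_to_ucs2_be` and the sample string are illustrative assumptions, not part of any Oracle or OraOLEDB API.

```python
def latin1_to_ucs2_be(data: bytes) -> bytes:
    """Widen ISO/IEC 8859-1 bytes to big-endian UCS-2 (hypothetical helper)."""
    out = bytearray()
    for b in data:
        out.append(0x00)  # row number: row 0 holds all ISO/IEC 8859-1 characters
        out.append(b)     # cell number: the original octet, unchanged
    return bytes(out)


sample = "café".encode("latin-1")
widened = latin1_to_ucs2_be(sample)

# The hand-built bytes match what the standard codec produces.
print(widened == "café".encode("utf-16-be"))
```

On a little-endian platform such as Windows the zero byte comes second, so the same idea applies with the two bytes swapped.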

The first 128 characters of UTF-8 are also the ASCII characters.

UTF-16 has two forms: characters in the BMP (Basic Multilingual Plane) take two bytes, while supplementary characters take four bytes.

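At the time of writing, the latest Unicode version is 5.0. New versions only add characters on top of the existing repertoire and never change the encoding of characters already assigned. Besides UTF-8 there are UTF-16 and UTF-32: Windows uses UTF-16, and some Unix systems use UTF-32.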

UTF-8 encoding scheme:

   UCS-4 range (hex.)           UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
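
The table can be read as an algorithm: pick the row by the code point's range, then fill the x positions with its bits, highest bits first. The sketch below does this for the first four rows only, on the assumption that the input is a Unicode code point up to U+10FFFF (the limit of current UTF-8). The function name `utf8_encode` is illustrative.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the bit patterns from the table (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against the built-in codec for one character per row.
for ch in ["A", "é", "中", "😀"]:
    print(ch, utf8_encode(ord(ch)) == ch.encode("utf-8"))
```
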
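A single-byte character has 0 as its first bit. If the top two bits are 1 followed by a 0, the character consists of two bytes, and so on: the number of leading 1 bits in the first byte gives the length of the sequence, and every following byte starts with 10. ASCII characters are encoded identically in UTF-8, and Chinese characters usually need 3 bytes (4 bytes for those outside the BMP). The 5- and 6-byte forms come from the original UCS-4 range; the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so valid sequences are at most 4 bytes long. Because lead bytes and continuation bytes use disjoint bit patterns, it is easy to find the next character boundary when a byte is corrupted or lost.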
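
A rough sketch of that resynchronization, assuming the damage is a lost lead byte: skip every byte of the form 10xxxxxx until a byte that can start a character appears. The helper name `next_boundary` is hypothetical.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:  # 10xxxxxx
        pos += 1
    return pos


encoded = "汉字abc".encode("utf-8")
damaged = encoded[1:]               # the lead byte of the first character is lost

start = next_boundary(damaged, 0)   # skip the two orphaned continuation bytes
print(damaged[start:].decode("utf-8"))  # prints: 字abc
```

Only the damaged character is lost; everything after it decodes normally.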

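UTF-16 works in a similar way:
in a four-byte UTF-16 sequence, the first two bytes (the high surrogate) are in the range 0xD800-0xDBFF and the last two bytes (the low surrogate) are in the range 0xDC00-0xDFFF.
Because these two ranges overlap neither each other nor any BMP character, a character boundary can still be identified when a 16-bit code unit is lost. Note that this holds at the code-unit level: losing a single byte shifts the alignment of every code unit after it.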
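
The surrogate ranges can be illustrated with a short sketch. It splits a supplementary code point into its two 16-bit code units and classifies a code unit by range, which is what lets a decoder tell where a character starts. The names `to_surrogates` and `classify_unit` are illustrative.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low) code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("BMP characters are stored as a single code unit")
    v = cp - 0x10000                    # 20 bits remain
    high = 0xD800 | (v >> 10)           # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 | (v & 0x3FF)          # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def classify_unit(unit: int) -> str:
    """Tell from the value alone what role a 16-bit code unit plays."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"


high, low = to_surrogates(0x1F600)      # U+1F600 lies outside the BMP
print(hex(high), hex(low))              # prints: 0xd83d 0xde00
print(classify_unit(high), "/", classify_unit(low), "/", classify_unit(ord("中")))
```
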

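Variable-length encodings such as GBK are more troublesome: the value ranges of lead bytes and trail bytes overlap, so a single wrong or missing byte can turn a long run of the following text into garbage.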
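
The difference can be seen in a small experiment, sketched here with Python's built-in `gbk` and `utf-8` codecs. The helper `drop_first_byte` and the sample string are assumptions for illustration; undecodable bytes are replaced with U+FFFD rather than raising.

```python
def drop_first_byte(text: str, encoding: str) -> str:
    """Encode text, lose the first byte, and decode what is left."""
    damaged = text.encode(encoding)[1:]
    return damaged.decode(encoding, errors="replace")


text = "中文编码"

# UTF-8: only the damaged character is lost, the rest survives.
utf8_result = drop_first_byte(text, "utf-8")
print(repr(utf8_result))

# GBK: every following byte pair is shifted by one, so the lead/trail
# roles are swapped and none of the original characters come back.
gbk_result = drop_first_byte(text, "gbk")
print(repr(gbk_result))
```

In GBK the decoder has no way to tell a trail byte from a lead byte by value alone, so the misalignment continues until a single-byte character happens to restore it.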