The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859, and these are also in row 0. An overview of the content of all rows is found in the annex.
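The first row of UCS-2 (and of UTF-16) is identical to ISO/IEC 8859-1, and its first 128 characters are the ASCII characters. Converting a WE8ISO8859P1 character to UCS-2 therefore only requires putting a 0 byte in front of it. This is also why the workaround in my previous post, about garbled text from the OraOLEDB driver against an Oracle database with a Western character set, works.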
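The zero-octet rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `latin1_to_ucs2_be` is made up here, and big-endian byte order (row octet first, as in the quoted text) is an assumption.

```python
def latin1_to_ucs2_be(data: bytes) -> bytes:
    """Turn ISO/IEC 8859-1 bytes into 2-octet UCS values (big-endian).

    Hypothetical helper: each input byte is prefixed with a zero octet,
    i.e. row 0 followed by the cell number.
    """
    out = bytearray()
    for b in data:
        out.append(0x00)  # row number: always row 0 for ISO/IEC 8859-1
        out.append(b)     # cell number: the original byte, unchanged
    return bytes(out)


if __name__ == "__main__":
    raw = "café".encode("latin-1")
    # Print the widened bytes as hex so the inserted zero octets are visible.
    print(latin1_to_ucs2_be(raw).hex(" "))
```

The result should agree with what Python's own `latin-1` and `utf-16-be` codecs produce for the same text, which makes it easy to check.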
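The first 128 characters of UTF-8 are also the ASCII characters.
UTF-16 has two forms: characters in the BMP (Basic Multilingual Plane) take only two bytes, while supplementary characters take four bytes.
The latest Unicode version at the time of writing is 5.0. New versions only add characters on top of the existing ones and never change the codes already assigned. The fixed-width-unit encodings are UTF-16 and UTF-32: Windows uses UTF-16, and some Unix systems use UTF-32.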
UTF-8 encoding scheme:
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
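A single-byte character starts with a 0 bit. A lead byte that starts with two 1 bits followed by a 0 begins a two-byte character, three 1 bits followed by a 0 begin a three-byte character, and so on; every continuation byte starts with 10. ASCII characters are encoded in UTF-8 exactly as in ASCII.
Chinese characters usually take three bytes, or four for those outside the BMP. If a byte is corrupted or lost, the boundary of the affected character is easy to find again.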
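The table and the boundary rule can be tried out with a small Python sketch. Both function names (`utf8_encode`, `next_boundary`) are invented for this example. The encoder follows the six-row UCS-4 table literally; note that current UTF-8 as defined in RFC 3629 stops at U+10FFFF, so only the first four rows are used in practice.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the six-row UCS-4 table (illustrative)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        return bytes([cp])  # 0xxxxxxx: identical to ASCII
    # (upper limit of the range, lead-byte marker, number of continuation bytes)
    rows = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, lead, n in rows:
        if cp <= limit:
            first = lead | (cp >> (6 * n))  # top bits go into the lead byte
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes([first] + tail)
    raise AssertionError("unreachable")


def next_boundary(buf: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte (10xxxxxx)."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


if __name__ == "__main__":
    data = b"".join(utf8_encode(ord(ch)) for ch in "中文")
    # Starting in the middle of the first character, skip to the next boundary.
    start = next_boundary(data, 1)
    print(data[start:].decode("utf-8"))
```

Because continuation bytes are always recognisable by their `10` prefix, a decoder that lands in the middle of a character only has to skip forward until the prefix changes.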
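UTF-16 works in a similar way:
In a four-byte UTF-16 character, the first two bytes (the high surrogate) lie in 0xD800-0xDBFF and the last two bytes (the low surrogate) lie in 0xDC00-0xDFFF.
Because these two ranges overlap neither each other nor the codes of ordinary BMP characters, the character boundary can still be determined when a 16-bit unit is lost. A single lost byte is harder to recover from, since it shifts the alignment of every following unit.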
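As a rough Python sketch of the surrogate ranges just described (the names `to_surrogate_pair` and `classify_unit` are made up for illustration), a supplementary code point is split into a high and a low 16-bit unit, and any single unit can be classified by its range alone:

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                 # 20 bits remain after removing the offset
    high = 0xD800 + (v >> 10)        # top 10 bits    -> 0xD800-0xDBFF
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> 0xDC00-0xDFFF
    return high, low


def classify_unit(unit: int) -> str:
    """Tell what role a 16-bit unit plays, using only its value."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20000)
    print(f"{high:04X} {low:04X}")
    print(classify_unit(low), "/", classify_unit(0x4E2D))
```

Since a unit's role is visible from its value, a reader that starts on a low surrogate knows it is in the middle of a character and can step to the next unit.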
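A variable-length encoding such as GBK is more troublesome: a single wrong byte can turn a long run of the following text into garbage.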
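The difference can be observed with a short Python experiment. In GBK the second byte of a character can fall in the same value range as a first byte, so a decoder that loses one byte may pair up the wrong bytes from then on. The helper name `drop_first_byte` and the sample string are chosen only for this illustration.

```python
def drop_first_byte(text: str, encoding: str) -> str:
    """Encode text, lose its first byte, and decode whatever is left.

    Undecodable bytes are shown as U+FFFD instead of raising an error.
    """
    damaged = text.encode(encoding)[1:]
    return damaged.decode(encoding, errors="replace")


if __name__ == "__main__":
    sample = "中文编码"
    # UTF-8: only the first character is damaged; the rest decodes normally.
    print("utf-8:", drop_first_byte(sample, "utf-8"))
    # GBK: the byte pairs are shifted, so the following characters change too.
    print("gbk:  ", drop_first_byte(sample, "gbk"))
```

With this sample, the UTF-8 result still ends in the last three original characters, while none of the original characters survive in the GBK result.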



