Original article; please credit the source when reposting:
http://blog.youkuaiyun.com/palm_civet/archive/2010/11/20/6023857.aspx
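This is written from a Wikipedia article. I compared the UTF-8 check against the GBK character set: few GBK-encoded Chinese characters also pass as valid UTF-8, so the check basically meets the requirement.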
GBK:
range | byte 1 | byte 2 | code points | characters in GB 18030 | characters in GBK 1.0 | characters in Codepage 936 | characters in GB 2312 |
---|---|---|---|---|---|---|---|
Level GBK/1 | A1–A9 | A1–FE | 846 | 728 | 717 | 702 | 682 |
Level GBK/2 | B0–F7 | A1–FE | 6,768 | 6,763 | 6,763 | 6,763 | 6,763 |
Level GBK/3 | 81–A0 | 40–FE except 7F | 6,080 | 6,080 | 6,080 | 6,080 | |
Level GBK/4 | AA–FE | 40–A0 except 7F | 8,160 | 8,160 | 8,160 | 8,080 | |
Level GBK/5 | A8–A9 | 40–A0 except 7F | 192 | 166 | 166 | 166 | |
user-defined | AA–AF | A1–FE | 564 | | | | |
user-defined | F8–FE | A1–FE | 658 | | | | |
user-defined | A1–A7 | 40–A0 except 7F | 672 | | | | |
total: | | | 23,940 | 21,897 | 21,886 | 21,791 | 7,445 |
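The byte ranges in the GBK table can be turned into a small lookup. The following is only a sketch in Python: the names `gbk_region` and `count_regions` are made up for illustration and are not taken from the implementation in this article. It classifies a two-byte pair by table row and recounts the "code points" column as a sanity check.

```python
def gbk_region(b1, b2):
    """Return the table row a (byte 1, byte 2) pair falls in, or None.

    Hypothetical helper: it only mirrors the byte ranges of the table and
    says nothing about whether a character is actually assigned there.
    """
    high_trail = 0xA1 <= b2 <= 0xFE                  # byte 2 in A1-FE
    low_trail = 0x40 <= b2 <= 0xA0 and b2 != 0x7F    # byte 2 in 40-A0 except 7F

    if 0x81 <= b1 <= 0xA0:
        # GBK/3 accepts byte 2 in 40-FE except 7F, i.e. either trail range.
        return "GBK/3" if (high_trail or low_trail) else None
    if high_trail:
        if 0xA1 <= b1 <= 0xA9:
            return "GBK/1"
        if 0xB0 <= b1 <= 0xF7:
            return "GBK/2"
        if 0xAA <= b1 <= 0xAF or 0xF8 <= b1 <= 0xFE:
            return "user-defined"
    if low_trail:
        if 0xAA <= b1 <= 0xFE:
            return "GBK/4"
        if 0xA8 <= b1 <= 0xA9:
            return "GBK/5"
        if 0xA1 <= b1 <= 0xA7:
            return "user-defined"
    return None


def count_regions():
    """Count the code points per region over all 65,536 byte pairs."""
    counts = {}
    for b1 in range(256):
        for b2 in range(256):
            region = gbk_region(b1, b2)
            if region is not None:
                counts[region] = counts.get(region, 0) + 1
    return counts
```

Calling `count_regions()` gives 846, 6,768, 6,080, 8,160 and 192 code points for levels GBK/1 to GBK/5 and 1,894 for the three user-defined rows combined (564 + 658 + 672), which adds up to the 23,940 total in the table.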
UTF-8:
Binary | Hex | Decimal | Interpretation |
---|---|---|---|
00000000-01111111 | 00-7F | 0-127 | Single-byte encoding (compatible with US-ASCII) |
10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of 2-byte sequence, but would encode a code point ≤ 7F |
11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence (including invalid code points 110000 thru 13FFFF) |
11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for code points ≥ 140000 |
11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
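To show how the UTF-8 table can drive a detection check, here is a minimal Python sketch. The function names `classify_utf8_byte` and `looks_like_utf8` are hypothetical and are not the article's own implementation. It checks only the structure given in the table (lead byte followed by the right number of continuation bytes). It does not reject overlong 3-byte and 4-byte forms, surrogates, or code points above 10FFFF that start with F4.

```python
def classify_utf8_byte(b):
    """Map one byte value (0-255) to its row in the UTF-8 table."""
    if b <= 0x7F:
        return "ascii"          # 00-7F: single-byte encoding
    if b <= 0xBF:
        return "continuation"   # 80-BF: second, third, or fourth byte
    if b <= 0xC1:
        return "invalid"        # C0-C1: overlong 2-byte start
    if b <= 0xDF:
        return "lead2"          # C2-DF: start of 2-byte sequence
    if b <= 0xEF:
        return "lead3"          # E0-EF: start of 3-byte sequence
    if b <= 0xF4:
        return "lead4"          # F0-F4: start of 4-byte sequence
    return "invalid"            # F5-FF: restricted or undefined


# Number of continuation bytes expected after each kind of lead byte.
_CONTINUATIONS = {"lead2": 1, "lead3": 2, "lead4": 3}


def looks_like_utf8(data):
    """Structural check: True if every byte fits the table's sequence shapes."""
    pending = 0  # continuation bytes still expected
    for b in data:
        kind = classify_utf8_byte(b)
        if pending:
            if kind != "continuation":
                return False
            pending -= 1
        elif kind in _CONTINUATIONS:
            pending = _CONTINUATIONS[kind]
        elif kind != "ascii":
            return False  # stray continuation byte or invalid lead byte
    return pending == 0   # reject input that ends mid-sequence
```

For example, `looks_like_utf8("中文".encode("gbk"))` is `False`, because the GBK bytes D6 D0 CE C4 start with a 2-byte lead (D6) followed by D0, which is not a continuation byte. A collision does happen for a pair such as C2 A1: it lies in the GBK/2 range (B0–F7, A1–FE) and is also well-formed UTF-8 (U+00A1), so `looks_like_utf8(b"\xc2\xa1")` is `True`.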
Implementation code