Original article; please credit the source when reposting:
http://blog.youkuaiyun.com/palm_civet/archive/2010/11/20/6023857.aspx
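This post is based on a Wikipedia article, and the result was compared against the GBK character set. Few GBK-encoded Chinese characters are mistaken for UTF-8 by the check, so it is adequate for most purposes.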
GBK:
| Range | Byte 1 | Byte 2 | Code points | Characters (GB 18030) | Characters (GBK 1.0) | Characters (Codepage 936) | Characters (GB 2312) |
|---|---|---|---|---|---|---|---|
| Level GBK/1 | A1–A9 | A1–FE | 846 | 728 | 717 | 702 | 682 |
| Level GBK/2 | B0–F7 | A1–FE | 6,768 | 6,763 | 6,763 | 6,763 | |
| Level GBK/3 | 81–A0 | 40–FE except 7F | 6,080 | 6,080 | 6,080 | | |
| Level GBK/4 | AA–FE | 40–A0 except 7F | 8,160 | 8,160 | 8,080 | | |
| Level GBK/5 | A8–A9 | 40–A0 except 7F | 192 | 166 | 166 | | |
| User-defined | AA–AF | A1–FE | 564 | | | | |
| User-defined | F8–FE | A1–FE | 658 | | | | |
| User-defined | A1–A7 | 40–A0 except 7F | 672 | | | | |
| Total | | | 23,940 | 21,897 | 21,886 | 21,791 | 7,445 |
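The byte ranges in the GBK table can be written down as data and used to look up which row a two-byte pair belongs to. The following is only a sketch: the names `GBK_RANGES` and `gbk_region` are made up for illustration, and the ranges are copied from the table rather than from any library.

```python
# Rows of the GBK table above: (label, byte-1 range, byte-2 range), inclusive.
# GBK_RANGES and gbk_region are hypothetical names used only in this sketch.
GBK_RANGES = [
    ("GBK/1", (0xA1, 0xA9), (0xA1, 0xFE)),
    ("GBK/2", (0xB0, 0xF7), (0xA1, 0xFE)),
    ("GBK/3", (0x81, 0xA0), (0x40, 0xFE)),
    ("GBK/4", (0xAA, 0xFE), (0x40, 0xA0)),
    ("GBK/5", (0xA8, 0xA9), (0x40, 0xA0)),
    ("user-defined", (0xAA, 0xAF), (0xA1, 0xFE)),
    ("user-defined", (0xF8, 0xFE), (0xA1, 0xFE)),
    ("user-defined", (0xA1, 0xA7), (0x40, 0xA0)),
]


def gbk_region(b1, b2):
    """Return the table row that the pair (b1, b2) falls in, or None."""
    if b2 == 0x7F:  # 7F is excluded as a second byte in every row
        return None
    for label, (lo1, hi1), (lo2, hi2) in GBK_RANGES:
        if lo1 <= b1 <= hi1 and lo2 <= b2 <= hi2:
            return label
    return None


# Counting every pair that lands in some row reproduces the code-point total.
total = sum(
    1
    for b1 in range(0x81, 0xFF)
    for b2 in range(0x40, 0xFF)
    if gbk_region(b1, b2) is not None
)
print(total)  # 23940
```

The count agrees with the "Code points" total in the table, which is a quick sanity check that the ranges were transcribed correctly.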
UTF-8:
| Binary | Hex | Decimal | Interpretation |
|---|---|---|---|
| 00000000-01111111 | 00-7F | 0-127 | Single-byte encoding (compatible with US-ASCII) |
| 10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
| 11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of 2-byte sequence, but would encode a code point ≤ 7F |
| 11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
| 11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
| 11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence (including invalid code points 110000 through 13FFFF) |
| 11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for code points ≥ 140000 |
| 11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
| 11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
| 11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
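As a rough illustration, the UTF-8 table maps directly onto a chain of range comparisons on a single byte. The function name `utf8_byte_kind` and the category strings are invented for this sketch. It classifies one byte in isolation and does not check whether a whole sequence is well formed.

```python
def utf8_byte_kind(b):
    """Classify one byte value (0-255) using the ranges in the table above."""
    if b <= 0x7F:
        return "ascii"          # 00-7F: single-byte encoding
    if b <= 0xBF:
        return "continuation"   # 80-BF: 2nd, 3rd or 4th byte of a sequence
    if b <= 0xC1:
        return "overlong-lead"  # C0-C1: would encode a code point <= 7F
    if b <= 0xDF:
        return "lead-2"         # C2-DF: start of a 2-byte sequence
    if b <= 0xEF:
        return "lead-3"         # E0-EF: start of a 3-byte sequence
    if b <= 0xF4:
        return "lead-4"         # F0-F4: start of a 4-byte sequence
    if b <= 0xFD:
        return "restricted"     # F5-FD: disallowed by RFC 3629
    return "invalid"            # FE-FF: never defined


# "中" is U+4E2D, encoded in UTF-8 as E4 B8 AD.
kinds = [utf8_byte_kind(b) for b in "中".encode("utf-8")]
print(kinds)  # ['lead-3', 'continuation', 'continuation']
```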
Implementation
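The sketch below shows one way such a check could look. It is not the author's original code, and the function name `is_utf8` is an assumption. It follows the lead-byte ranges from the UTF-8 table and adds the stricter second-byte limits of RFC 3629 (no overlong forms, no surrogates, nothing above 10FFFF), which go slightly beyond what the table alone states.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # single-byte (ASCII)
            i += 1
            continue
        lo, hi = 0x80, 0xBF                # allowed range for the 2nd byte
        if 0xC2 <= b <= 0xDF:              # 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:            # 3-byte sequence
            need = 2
            if b == 0xE0:
                lo = 0xA0                  # reject overlong 3-byte forms
            elif b == 0xED:
                hi = 0x9F                  # reject UTF-16 surrogates
        elif 0xF0 <= b <= 0xF4:            # 4-byte sequence
            need = 3
            if b == 0xF0:
                lo = 0x90                  # reject overlong 4-byte forms
            elif b == 0xF4:
                hi = 0x8F                  # reject code points above 10FFFF
        else:                              # 80-C1 or F5-FF cannot start a sequence
            return False
        if i + need >= n:                  # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):       # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


print(is_utf8("联通".encode("utf-8")))  # True
print(is_utf8("联通".encode("gbk")))    # False: the GBK bytes start with C1
print(is_utf8(bytes([0xC2, 0xA1])))     # True, although the pair is also in GBK/2
```

The last call shows the kind of conflict the post refers to. A GBK/2 pair passes this check on its own only when byte 1 is in C2–DF and byte 2 is in A1–BF, which is 930 of the 6,768 GBK/2 code points. Longer GBK text is misdetected only if every character in it happens to line up this way.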
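This article describes how to check whether a byte sequence conforms to the UTF-8 encoding specification. By comparing where GBK and UTF-8 conflict, it arrives at a workable implementation of UTF-8 detection.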




