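On Windows, transcoding is done with two functions: MultiByteToWideChar and WideCharToMultiByte.
Once you have UTF-16 text, converting it to a native multibyte encoding requires a codepage parameter, which is what maps the UTF-16 code units onto that encoding.
The trouble is that we don't know which codepage to use (and it gets worse when the UTF-16 text mixes several languages).
So you have to look at the Unicode Subset Bitfields to work out the codepage. The table below is what I found on MSDN.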
Bit | Unicode | Description
0 | 0020 - 007e | Basic Latin
1 | 00a0 - 00ff | Latin-1 Supplement
2 | 0100 - 017f | Latin Extended-A
3 | 0180 - 024f | Latin Extended-B
4 | 0250 - 02af | IPA Extensions
5 | 02b0 - 02ff | Spacing Modifier Letters
6 | 0300 - 036f | Combining Diacritical Marks
7 | 0370 - 03ff | Basic Greek
8 | | Reserved
9 | 0400 - 04ff | Cyrillic
10 | 0530 - 058f | Armenian
11 | 0590 - 05ff | Basic Hebrew
12 | | Reserved
13 | 0600 - 06ff | Basic Arabic
14 | | Reserved
15 | 0900 - 097f | Devanagari
16 | 0980 - 09ff | Bengali
17 | 0a00 - 0a7f | Gurmukhi
18 | 0a80 - 0aff | Gujarati
19 | 0b00 - 0b7f | Oriya
20 | 0b80 - 0bff | Tamil
21 | 0c00 - 0c7f | Telugu
22 | 0c80 - 0cff | Kannada
23 | 0d00 - 0d7f | Malayalam
24 | 0e00 - 0e7f | Thai
25 | 0e80 - 0eff | Lao
26 | 10a0 - 10ff | Basic Georgian
27 | | Reserved
28 | 1100 - 11ff | Hangul Jamo
29 | 1e00 - 1eff | Latin Extended Additional
30 | 1f00 - 1fff | Greek Extended
31 | 2000 - 206f | General Punctuation
32 | 2070 - 209f | Subscripts and Superscripts
33 | 20a0 - 20cf | Currency Symbols
34 | 20d0 - 20ff | Combining Diacritical Marks for Symbols
35 | 2100 - 214f | Letter-like Symbols
36 | 2150 - 218f | Number Forms
37 | 2190 - 21ff | Arrows
38 | 2200 - 22ff | Mathematical Operators
39 | 2300 - 23ff | Miscellaneous Technical
40 | 2400 - 243f | Control Pictures
41 | 2440 - 245f | Optical Character Recognition
42 | 2460 - 24ff | Enclosed Alphanumerics
43 | 2500 - 257f | Box Drawing
44 | 2580 - 259f | Block Elements
45 | 25a0 - 25ff | Geometric Shapes
46 | 2600 - 26ff | Miscellaneous Symbols
47 | 2700 - 27bf | Dingbats
48 | 3000 - 303f | Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49 | 3040 - 309f | Hiragana
50 | 30a0 - 30ff | Katakana
51 | 3100 - 312f | Bopomofo
52 | 3130 - 318f | Hangul Compatibility Jamo
53 | 3190 - 319f | CJK Miscellaneous
54 | 3200 - 32ff | Enclosed CJK Letters and Months
55 | 3300 - 33ff | CJK Compatibility
56 | ac00 - d7a3 | Hangul
57 | d800 - dfff | Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font.
58 | | Reserved
59 | 4e00 - 9fff | CJK Unified Ideographs
60 | e000 - f8ff | Private Use Area
61 | f900 - faff | CJK Compatibility Ideographs
62 | fb00 - fb4f | Alphabetic Presentation Forms
63 | fb50 - fdff | Arabic Presentation Forms-A
64 | fe20 - fe2f | Combining Half Marks
65 | fe30 - fe4f | CJK Compatibility Forms
66 | fe50 - fe6f | Small Form Variants
67 | fe70 - fefe | Arabic Presentation Forms-B
68 | ff00 - ffef | Halfwidth and Fullwidth Forms
69 | fff0 - fffd | Specials
70 | 0f00 - 0fcf | Tibetan
71 | 0700 - 074f | Syriac
72 | 0780 - 07bf | Thaana
73 | 0d80 - 0dff | Sinhala
74 | 1000 - 109f | Myanmar
75 | 1200 - 12bf | Ethiopic
76 | 13a0 - 13ff | Cherokee
77 | 1400 - 14df | Canadian Aboriginal Syllabics
78 | 1680 - 169f | Ogham
79 | 16a0 - 16ff | Runic
80 | 1780 - 17ff | Khmer
81 | 1800 - 18af | Mongolian
82 | 2800 - 28ff | Braille
83 | a000 - a48c | Yi
84-122 | | Reserved
123 | | Windows 2000/XP: Layout progress: horizontal from right to left
124 | | Windows 2000/XP: Layout progress: vertical before horizontal
125 | | Windows 2000/XP: Layout progress: vertical bottom to top
126 | | Reserved; must be 0
127 | | Reserved; must be 1
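To make the bit numbers in this table concrete, here is a minimal sketch of testing one bit of the 128-bit bitfield. It assumes the bitfield arrives as four 32-bit words with bit 0 in the lowest bit of the first word, which is how I understand the fsUsb member of the Win32 FONTSIGNATURE structure to be laid out. The helper name usb_bit_is_set is made up for illustration and is not part of any API.

```c
#include <stdint.h>

/* Hypothetical helper, not a Win32 function.
 * Returns 1 if bit `bit` (0..127) is set in a 128-bit Unicode subset
 * bitfield stored as four 32-bit words, 0 otherwise.
 * Assumption: bits 0-31 live in usb[0], bits 32-63 in usb[1], and so on. */
int usb_bit_is_set(const uint32_t usb[4], unsigned bit)
{
    if (bit > 127)
        return 0;                       /* out of range: treat as not set */
    return (int)((usb[bit / 32] >> (bit % 32)) & 1u);
}
```

For example, bit 59 (CJK Unified Ideographs) is bit 27 of the second word, so a bitfield of {0, 0x08000000, 0, 0} reports it as set.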
The next table lists the main codepages.
ANSI Code-Page Identifiers
Identifier | Meaning
874 | Thai
932 | Japanese
936 | Chinese (PRC, Singapore)
949 | Korean
950 | Chinese (Taiwan; Hong Kong SAR, PRC)
1200 | Unicode (BMP of ISO 10646)
1250 | Windows 3.1 Eastern European
1251 | Windows 3.1 Cyrillic
1252 | Windows 3.1 Latin 1 (US, Western Europe)
1253 | Windows 3.1 Greek
1254 | Windows 3.1 Turkish
1255 | Hebrew
1256 | Arabic
1257 | Baltic
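You can enumerate the available codepages with EnumSystemCodePages.
It all looks rather complicated, and the two tables are not easy to match up.
Working from a few references and my own experiments, I put together part of the mapping: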
Unicode subrange | Description | Codepage
0x0000-0x007F | Basic Latin | 0 (CP_ACP)
0x0080-0x00FF | Latin-1 Supplement | 1252
0x0100-0x017F | Latin Extended-A | 1250
0x0180-0x024F | Latin Extended-B | ???
... | ... | ...
0x0370-0x03FF | Basic Greek | 1253
0x0E00-0x0E7F | Thai | 874
0x0590-0x05FF | Basic Hebrew | 1255
0x0600-0x06FF | Basic Arabic | 1256
That is about as far as I could get.
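The partial mapping above can be written down as a small lookup. This is only a sketch of my table, not a complete or authoritative rule: the function name codepage_for_utf16_unit is invented for illustration, ranges I did not work out return -1, and the Latin-1 and Arabic ranges are taken as 0x0080-0x00FF and 0x0600-0x06FF to agree with the Unicode block boundaries.

```c
#include <stdint.h>

/* Hypothetical helper: returns a candidate Windows codepage for one
 * UTF-16 code unit, following the partial, experimentally derived table.
 * Returns 0 (CP_ACP) for Basic Latin and -1 when the range is not covered. */
int codepage_for_utf16_unit(uint16_t u)
{
    if (u <= 0x007F) return 0;                   /* Basic Latin -> CP_ACP */
    if (u <= 0x00FF) return 1252;                /* Latin-1 Supplement */
    if (u <= 0x017F) return 1250;                /* Latin Extended-A */
    if (u >= 0x0370 && u <= 0x03FF) return 1253; /* Basic Greek */
    if (u >= 0x0590 && u <= 0x05FF) return 1255; /* Basic Hebrew */
    if (u >= 0x0600 && u <= 0x06FF) return 1256; /* Basic Arabic */
    if (u >= 0x0E00 && u <= 0x0E7F) return 874;  /* Thai */
    return -1;                                   /* not covered by the table */
}
```

The returned value would then be passed as the codepage argument when converting that run of text. A string that mixes scripts has to be split into runs first, which is exactly the awkward part.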
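Transcoding on Linux
For transcoding on Linux I found the iconv family of functions, which is simple enough to use.
First open a conversion descriptor: iconv_t iconv_open(const char *tocode, const char *fromcode);
Then convert: size_t iconv(iconv_t cd,
char **inbuf, size_t *inbytesleft,
char **outbuf, size_t *outbytesleft);
Finally close it: int iconv_close(iconv_t cd);
Pay particular attention to the arguments of iconv_open: the target code comes first and the source code second, and the two are easy to mix up.
The codes here are much like codepages on Windows.
Run the command iconv --list to show every code it supports.
The simple rule is to put CP in front of the Windows codepage number,
for example codepage 1250 becomes the code "CP1250".
Simple as it is, my carelessness (I passed the two codes in the wrong order) still cost me time.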
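The open, convert, close sequence can be sketched as one function. This is a minimal example under some assumptions: the helper name convert_bytes is invented, the whole input is converted in a single call, the output buffer is assumed to be large enough, and stateful encodings (which would need an extra flushing call to iconv with a NULL input) are not handled.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts inlen bytes of `in` from `fromcode` to
 * `tocode`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t convert_bytes(const char *tocode, const char *fromcode,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    /* Target code first, source code second: the order that is easy to get wrong. */
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;          /* iconv wants char **, the bytes are only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;           /* invalid input or output buffer too small */
    return outcap - outleft;
}
```

Calling convert_bytes("CP1250", "UTF-8", ...) converts UTF-8 into codepage 1250. Swapping the first two arguments converts in the opposite direction, which is the mistake described above.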
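There is one more problem that I have not found an answer to.
When you call iconv, inbuf is a char*, and multibyte data is normally stored in it in the order it is written, high byte first.
For example, the GB2312 code of "尽" is 0xBEA1, and in inbuf it is stored as
inbuf[0] = 0xBE; inbuf[1] = 0xA1;
But when the encoding is UTF-16, the order in the inbuf passed to iconv is reversed, low byte first.
For example, the UTF-16 code of "尽" is 0x5C3D, and in inbuf it is stored as
inbuf[0] = 0x3D; inbuf[1] = 0x5C;
I can't work out why. Big-endian and little-endian are supposed to be a property of the CPU, so what is going on here? For now I just pick out the Unicode part and handle it separately.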
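One likely explanation, offered tentatively: GB2312 is defined as a sequence of bytes, so it has no byte order at all, while UTF-16 is defined in 16-bit units whose two bytes can be written in either order. With iconv the order can be fixed through the code name, "UTF-16LE" or "UTF-16BE", regardless of the CPU. A plain "UTF-16" typically relies on a byte order mark or an implementation default. The sketch below assumes an iconv that knows the names "GB2312", "UTF-16LE" and "UTF-16BE" (glibc does). The helper name utf16_to_gb2312 is invented.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts one 2-byte UTF-16 code unit, serialized
 * in the byte order named by `fromcode`, to a 2-byte GB2312 character.
 * Returns 0 on success, -1 on failure. */
int utf16_to_gb2312(const char *fromcode,
                    const unsigned char in[2], unsigned char out[2])
{
    iconv_t cd = iconv_open("GB2312", fromcode);   /* target first, source second */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;
    char *outp = (char *)out;
    size_t inleft = 2;
    size_t outleft = 2;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    /* Success only if both input bytes were consumed and both output bytes written. */
    return (rc == (size_t)-1 || inleft != 0 || outleft != 0) ? -1 : 0;
}
```

With "UTF-16LE" the input bytes 0x3D 0x5C are read as 0x5C3D, and with "UTF-16BE" the bytes 0x5C 0x3D are read as the same value. Either way the output should be 0xBE 0xA1, so the byte order is a property of the chosen code name rather than something iconv infers from the data.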