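On Windows, transcoding is done with two functions, MultiByteToWideChar and WideCharToMultiByte.
Once you have UTF-16 text, converting it to a native multibyte encoding requires a codepage parameter, which is what maps the UTF-16 code units to that encoding.
The trouble is that we don't know which code page to use (and it gets worse when the UTF-16 text mixes several languages).
So you have to look at the Unicode Subset Bitfields to determine the code page. The table below is what I found on MSDN.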
| Bit | Unicode | Description |
| --- | --- | --- |
| 0 | 0020 - 007e | Basic Latin |
| 1 | 00a0 - 00ff | Latin-1 Supplement |
| 2 | 0100 - 017f | Latin Extended-A |
| 3 | 0180 - 024f | Latin Extended-B |
| 4 | 0250 - 02af | IPA Extensions |
| 5 | 02b0 - 02ff | Spacing Modifier Letters |
| 6 | 0300 - 036f | Combining Diacritical Marks |
| 7 | 0370 - 03ff | Basic Greek |
| 8 | | Reserved |
| 9 | 0400 - 04ff | Cyrillic |
| 10 | 0530 - 058f | Armenian |
| 11 | 0590 - 05ff | Basic Hebrew |
| 12 | | Reserved |
| 13 | 0600 - 06ff | Basic Arabic |
| 14 | | Reserved |
| 15 | 0900 - 097f | Devanagari |
| 16 | 0980 - 09ff | Bengali |
| 17 | 0a00 - 0a7f | Gurmukhi |
| 18 | 0a80 - 0aff | Gujarati |
| 19 | 0b00 - 0b7f | Oriya |
| 20 | 0b80 - 0bff | Tamil |
| 21 | 0c00 - 0c7f | Telugu |
| 22 | 0c80 - 0cff | Kannada |
| 23 | 0d00 - 0d7f | Malayalam |
| 24 | 0e00 - 0e7f | Thai |
| 25 | 0e80 - 0eff | Lao |
| 26 | 10a0 - 10ff | Basic Georgian |
| 27 | | Reserved |
| 28 | 1100 - 11ff | Hangul Jamo |
| 29 | 1e00 - 1eff | Latin Extended Additional |
| 30 | 1f00 - 1fff | Greek Extended |
| 31 | 2000 - 206f | General Punctuation |
| 32 | 2070 - 209f | Subscripts and Superscripts |
| 33 | 20a0 - 20cf | Currency Symbols |
| 34 | 20d0 - 20ff | Combining Diacritical Marks for Symbols |
| 35 | 2100 - 214f | Letter-like Symbols |
| 36 | 2150 - 218f | Number Forms |
| 37 | 2190 - 21ff | Arrows |
| 38 | 2200 - 22ff | Mathematical Operators |
| 39 | 2300 - 23ff | Miscellaneous Technical |
| 40 | 2400 - 243f | Control Pictures |
| 41 | 2440 - 245f | Optical Character Recognition |
| 42 | 2460 - 24ff | Enclosed Alphanumerics |
| 43 | 2500 - 257f | Box Drawing |
| 44 | 2580 - 259f | Block Elements |
| 45 | 25a0 - 25ff | Geometric Shapes |
| 46 | 2600 - 26ff | Miscellaneous Symbols |
| 47 | 2700 - 27bf | Dingbats |
| 48 | 3000 - 303f | Chinese, Japanese, and Korean (CJK) Symbols and Punctuation |
| 49 | 3040 - 309f | Hiragana |
| 50 | 30a0 - 30ff | Katakana |
| 51 | 3100 - 312f | Bopomofo |
| 52 | 3130 - 318f | Hangul Compatibility Jamo |
| 53 | 3190 - 319f | CJK Miscellaneous |
| 54 | 3200 - 32ff | Enclosed CJK Letters and Months |
| 55 | 3300 - 33ff | CJK Compatibility |
| 56 | ac00 - d7a3 | Hangul |
| 57 | d800 - dfff | Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font. |
| 58 | | Reserved |
| 59 | 4e00 - 9fff | CJK Unified Ideographs |
| 60 | e000 - f8ff | Private Use Area |
| 61 | f900 - faff | CJK Compatibility Ideographs |
| 62 | fb00 - fb4f | Alphabetic Presentation Forms |
| 63 | fb50 - fdff | Arabic Presentation Forms-A |
| 64 | fe20 - fe2f | Combining Half Marks |
| 65 | fe30 - fe4f | CJK Compatibility Forms |
| 66 | fe50 - fe6f | Small Form Variants |
| 67 | fe70 - fefe | Arabic Presentation Forms-B |
| 68 | ff00 - ffef | Halfwidth and Fullwidth Forms |
| 69 | fff0 - fffd | Specials |
| 70 | 0f00 - 0fcf | Tibetan |
| 71 | 0700 - 074f | Syriac |
| 72 | 0780 - 07bf | Thaana |
| 73 | 0d80 - 0dff | Sinhala |
| 74 | 1000 - 109f | Myanmar |
| 75 | 1200 - 12bf | Ethiopic |
| 76 | 13a0 - 13ff | Cherokee |
| 77 | 1400 - 14df | Canadian Aboriginal Syllabics |
| 78 | 1680 - 169f | Ogham |
| 79 | 16a0 - 16ff | Runic |
| 80 | 1780 - 17ff | Khmer |
| 81 | 1800 - 18af | Mongolian |
| 82 | 2800 - 28ff | Braille |
| 83 | a000 - a48c | Yi |
| 84-122 | | Reserved |
| 123 | | Windows 2000/XP: Layout progress: horizontal from right to left |
| 124 | | Windows 2000/XP: Layout progress: vertical before horizontal |
| 125 | | Windows 2000/XP: Layout progress: vertical bottom to top |
| 126 | | Reserved; must be 0 |
| 127 | | Reserved; must be 1 |
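To make the bit numbers in this table concrete, here is a minimal sketch of testing one bit of the 128-bit field. It assumes the field is held as four 32-bit words, as in the fsUsb member of the Windows FONTSIGNATURE structure, with bit 0 being the lowest bit of the first word. The function name usb_bit_is_set is made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: usb[] mirrors the four 32-bit words of the 128-bit
 * Unicode subset bitfield, with bit 0 stored in the least
 * significant bit of usb[0] and bit 127 in the top bit of usb[3]. */
bool usb_bit_is_set(const uint32_t usb[4], unsigned bit)
{
    assert(usb != NULL);
    if (bit > 127)
        return false;                       /* outside the 128-bit field */
    return ((usb[bit / 32] >> (bit % 32)) & 1u) != 0;
}
```

For example, bit 59 (CJK Unified Ideographs) lives in the second word, at position 27.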
The next table lists the main code pages.
ANSI Code-Page Identifiers

| Identifier | Meaning |
| --- | --- |
| 874 | Thai |
| 932 | Japanese |
| 936 | Chinese (PRC, Singapore) |
| 949 | Korean |
| 950 | Chinese (Taiwan; Hong Kong SAR, PRC) |
| 1200 | Unicode (BMP of ISO 10646) |
| 1250 | Windows 3.1 Eastern European |
| 1251 | Windows 3.1 Cyrillic |
| 1252 | Windows 3.1 Latin 1 (US, Western Europe) |
| 1253 | Windows 3.1 Greek |
| 1254 | Windows 3.1 Turkish |
| 1255 | Hebrew |
| 1256 | Arabic |
| 1257 | Baltic |
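You can enumerate the installed code pages with EnumSystemCodePages.
It all looks complicated, and the two tables seem hard to match up.
Working from a few references and my own experiments, I put together part of the mapping: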
| Unicode subrange | Description | Codepage |
| --- | --- | --- |
| 0x0000-0x007F | Basic Latin | 0 (CP_ACP) |
| 0x0080-0x00FF | Latin-1 Supplement | 1252 |
| 0x0100-0x017F | Latin Extended-A | 1250 |
| 0x0180-0x024F | Latin Extended-B | ??? |
| | | |
| 0x0370-0x03FF | Basic Greek | 1253 |
| 0x0E00-0x0E7F | Thai | 874 |
| 0x0590-0x05FF | Basic Hebrew | 1255 |
| 0x0600-0x06FF | Basic Arabic | 1256 |
That is about as far as I could get.
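The experimental mapping can be written down as a small lookup table. This is only a sketch: the struct and function names (cp_range, lookup_codepage) are invented for the example, the range boundaries follow the block boundaries in the MSDN table, and the result is only as reliable as the experiments behind the table. A range-level match does not guarantee that every character in the range exists in that code page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the experimental table: a UTF-16 range and the
 * Windows code page that covered it in the author's tests. */
struct cp_range {
    uint16_t first;
    uint16_t last;
    unsigned codepage;
};

static const struct cp_range cp_ranges[] = {
    { 0x0000, 0x007F, 0    },   /* Basic Latin: 0 is CP_ACP */
    { 0x0080, 0x00FF, 1252 },   /* Latin-1 Supplement */
    { 0x0100, 0x017F, 1250 },   /* Latin Extended-A */
    { 0x0370, 0x03FF, 1253 },   /* Basic Greek */
    { 0x0590, 0x05FF, 1255 },   /* Basic Hebrew */
    { 0x0600, 0x06FF, 1256 },   /* Basic Arabic */
    { 0x0E00, 0x0E7F, 874  },   /* Thai */
};

/* Hypothetical helper: stores the code page for one UTF-16 code unit
 * in *codepage and returns true, or returns false when the unit falls
 * in a range the table does not cover (e.g. Latin Extended-B). */
bool lookup_codepage(uint16_t unit, unsigned *codepage)
{
    assert(codepage != NULL);
    for (size_t i = 0; i < sizeof cp_ranges / sizeof cp_ranges[0]; i++) {
        if (unit >= cp_ranges[i].first && unit <= cp_ranges[i].last) {
            *codepage = cp_ranges[i].codepage;
            return true;
        }
    }
    return false;
}
```

For mixed-language text, one option is to call this per code unit and split the string wherever the returned code page changes.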
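Transcoding on Linux
For transcoding on Linux I found the iconv family of functions, which turned out to be simple to use.
First open a conversion descriptor: iconv_t iconv_open(const char *tocode, const char *fromcode);
Then convert: size_t iconv(iconv_t cd,
char **inbuf, size_t *inbytesleft,
char **outbuf, size_t *outbytesleft);
Finally close it: int iconv_close(iconv_t cd);
Pay particular attention to the arguments of iconv_open: the destination code comes first and the source code second, and the two are easy to mix up.
The codes here are a lot like code pages on Windows.
The command iconv --list prints every code it supports.
The simple approach is to put CP in front of the Windows code page number,
for example, code page 1250 becomes the code "CP1250".
Simple as it is, my carelessness (I passed the two codes in the wrong order) still cost me a fair amount of time.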
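The open, convert, close sequence above can be wrapped in one function. The following is a minimal sketch, not a complete implementation: the helper names codepage_to_iconv_name and convert are invented here, the whole input is converted in one call, and any failure (unknown code name, invalid input, output buffer too small) is reported the same way.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds the iconv name for a Windows code page
 * number, e.g. 1250 -> "CP1250". Returns 0 on success, -1 if buf is
 * too small. */
int codepage_to_iconv_name(unsigned codepage, char *buf, size_t buflen)
{
    assert(buf != NULL);
    int n = snprintf(buf, buflen, "CP%u", codepage);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}

/* Hypothetical helper: converts inlen bytes from `fromcode` to `tocode`.
 * Returns the number of bytes written to out, or (size_t)-1 on error.
 * The parameter order mirrors iconv_open: destination first. */
size_t convert(const char *tocode, const char *fromcode,
               const char *in, size_t inlen,
               char *out, size_t outlen)
{
    assert(tocode && fromcode && in && out);

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* unknown or unsupported code name */

    char *inptr = (char *)in;           /* iconv's prototype takes a non-const char ** */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outlen;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;              /* invalid input or out too small */
    return outlen - outleft;
}
```

A call such as convert("UTF-8", "CP936", src, srclen, dst, sizeof dst) converts from CP936 to UTF-8. Swapping the first two arguments reverses the direction, which is exactly the mistake described above.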
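There is one more problem that I have not found the answer to yet.
When calling iconv, inbuf is normally a char*, and the bytes in it are stored high byte first.
For example, "尽" is 0xBEA1 in GB2312, so inbuf should hold
inbuf[0] = 0xBE;inbuf[1] = 0xA1;
But when the encoding is UTF-16, the bytes that go into iconv's inbuf are in the opposite order, low byte first.
For example, "尽" is 0x5C3D in UTF-16, and it ended up in inbuf as
inbuf[0] = 0x3D;inbuf[1] = 0x5C;
I can't work out why. Big endian versus little endian is supposed to be a CPU matter, so what is going on here? All I could do was single out the Unicode case.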
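One possible explanation, offered as a sketch rather than a settled answer: GB2312 is defined as a sequence of bytes, so 0xBE 0xA1 always appear in that order, while UTF-16 is defined in terms of 16-bit code units, and the order of the two bytes within each unit depends on the byte order in use. iconv accepts the names UTF-16LE and UTF-16BE to choose that order explicitly. The helpers below (hypothetical names) write a code unit in a fixed order regardless of the CPU, which makes the two layouts easy to compare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serializes one UTF-16 code unit as two bytes,
 * low byte first (the layout iconv calls UTF-16LE). */
void put_u16le(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}

/* Hypothetical helper: the same code unit, high byte first
 * (the layout iconv calls UTF-16BE). */
void put_u16be(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}
```

With 0x5C3D as input, put_u16le produces the 0x3D, 0x5C order observed above, and put_u16be produces 0x5C, 0x3D.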