Notes on Chinese Encodings

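This article looks at the differences between UTF-8 and GBK and how they behave on the Simplified Chinese version of Windows XP. Through hands-on examples, the author shows how Chinese characters are stored under each encoding and discusses how to handle these encodings correctly when compiling C++.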


I was confused about this for years. Recently I collected my thoughts, and now I am clear about Unicode. My company uses the Simplified Chinese version of WinXP, which uses GBK (i.e. CP_936) as its Chinese character encoding. So when I write .cpp files containing Chinese chars in CP_UTF8 (i.e. 65001), those files need a BOM (i.e. EF BB BF). Otherwise the cl.exe compiler does not seem able to recognize which Code_Page the .cpp files use, even though chrome.exe and even notepad.exe easily recognize the Code_Page of UTF-8 files without a BOM.

So, to compile your .cpp files correctly under the Simplified Chinese version of WinXP, either add a BOM to your UTF-8 .cpp files or save the files as GBK.
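As a rough sketch of the BOM rule above, the helpers below check whether a buffer of file bytes starts with EF BB BF and prepend the BOM when it is missing. The names kUtf8Bom, has_utf8_bom and add_utf8_bom are made up for this illustration, and the buffer is assumed to hold the raw bytes of a source file that was read in binary mode.

```cpp
#include <string>

// The three bytes of the UTF-8 byte order mark (EF BB BF).
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Returns true if the buffer starts with the UTF-8 BOM.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns the buffer with a BOM prepended, unless one is already present.
std::string add_utf8_bom(const std::string& bytes)
{
    return has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
}
```

Calling add_utf8_bom() twice is harmless, because the second call sees the BOM and returns the buffer unchanged.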

Learning from practice is a great idea. At first I wrote my .cpp files in VIM; it displayed Chinese correctly, and I thought "wow, VIM's great!!" (of course, I knew little about UTF-8 then). However, displaying a Chinese character deserves more thought:

We type glyphs into TXT files with an IME or whatever else, but the data saved to the HDD depends on the Code_Page we "chose" (a sudden thought: did we really choose it, or were we forced to use some Code_Page? No.). Either way, a certain Code_Page is used to save the data. For example, "你好" is saved as follows:

--------------------------------------------------------------

Encoding            Bytes on disk               Notes

UTF-8 with BOM      EFBB BFE4 BDA0 E5A5 BD0A    The first 3 bytes (EF BB BF) are the BOM; the last byte, 0A, is LF (\n).

UTF-8 without BOM   E4BD A0E5 A5BD 0A           The two Chinese glyphs are saved as E4BD A0E5 A5BD, 6 bytes.

GBK                 C4E3 BAC3 0A                Difference: the two glyphs are saved as C4E3 BAC3, 4 bytes.

ANSI (ASCII)        N/A                         7-bit ASCII has no Chinese glyphs, so it cannot store them. (On Windows, "ANSI" usually means the system Code_Page, which here is GBK.)

---------------------------------------------------------------

When compiling, cl does not know exactly which Code_Page we used, so consider:

if we save as UTF-8 without BOM, "你好" is saved as 6 bytes, exactly as stated above;

if we save as UTF-8 with BOM, "你好" is saved as 6 bytes after a 3-byte BOM (9 bytes in total);

if we save as GBK, "你好" is saved as 4 bytes.
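The byte counts above can be checked with a small hex dump. This is only a sketch: to_hex, kUtf8 and kGbk are names invented here, and the two encodings of "你好" are spelled out as hex escapes so that the result does not depend on how the example file itself is saved.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats each byte of the buffer as two uppercase hex digits, separated by spaces.
std::string to_hex(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// "你好" written as explicit bytes in each encoding.
const std::string kUtf8 = "\xE4\xBD\xA0\xE5\xA5\xBD";  // UTF-8, 6 bytes
const std::string kGbk  = "\xC4\xE3\xBA\xC3";          // GBK, 4 bytes
```

With these definitions, to_hex(kUtf8) gives "E4 BD A0 E5 A5 BD" and to_hex(kGbk) gives "C4 E3 BA C3".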

 

It is interesting to test with the following code:

#include <fstream>
#include <string>

int main()
{
    // Narrow literal: its bytes depend on how cl.exe reads this source file.
    std::string str("你好");
    std::ofstream out("1.txt");
    out<<str;
    out.close();
    return 0;
}

I got the following results (again, WinXP Simplified Chinese):

save option      hex result         1.txt Code_Page
GBK              C4E3 BAC3          GBK
UTF-8 with BOM   C4E3 BAC3          GBK
UTF-8 no BOM     E4BD A0E5 A5BD     UTF-8

According to the preceding discussion, "你好" in a .cpp file tends to end up GBK-encoded when the file is compiled. After some tests, I found that "你好" initializes str as char data, which is MultiByte. When cl.exe learns the original encoding from the BOM, it converts E4BD A0E5 A5BD to C4E3 BAC3 without being asked. That's OK. Without a BOM, cl.exe assumes the file is already GBK and copies the bytes unchanged, which is why the third row still holds the UTF-8 bytes.
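One way to sidestep this silent conversion, shown here only as a sketch and not as something the tests above covered, is to write the bytes of a narrow literal as hex escapes. Hex escapes give the byte values directly, so the compiler has nothing to convert. The helpers write_bytes and read_bytes and the constant kNiHaoUtf8 are made-up names for this illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Writes the bytes to a file in binary mode, so nothing is translated on the way out.
bool write_bytes(const std::string& path, const std::string& bytes)
{
    std::ofstream out(path.c_str(), std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out.good();
}

// Reads the whole file back as raw bytes.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// "你好" as UTF-8 bytes; the same 6 bytes whether this file is saved as GBK or UTF-8.
const char kNiHaoUtf8[] = "\xE4\xBD\xA0\xE5\xA5\xBD";
```

Writing kNiHaoUtf8 with write_bytes() and reading it back should give the same 6 bytes, E4BD A0E5 A5BD, under every save option.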

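See another example:

#include <fstream>
#include <string>

// to_utf8() is listed in the appendix.
int main(int argc,char* argv[])
{
    // Wide literal: cl.exe decodes the source bytes into wide chars at compile time.
    std::wstring wsz(L"你好");
    std::string str = to_utf8(wsz);
    std::ofstream outfile("1.txt");
    outfile<<str;
    outfile.close();
    return 0;
}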

save option      hex result                 glyphs
GBK              E4BD A0E5 A5BD             你好
UTF-8 with BOM   E4BD A0E5 A5BD             你好
UTF-8 no BOM     E6B5 A3E7 8AB2 E382 BD     浣犲ソ

The third test fails because, without a BOM, the compiler does not know which Code_Page the source file uses. It assumes the literal behind wsz is GBK, which is of course wrong! The bytes E4BD A0E5 A5BD are therefore decoded as the three GBK characters "浣犲ソ", and to_utf8() faithfully encodes those wrong characters as UTF-8. to_utf8() itself is not at fault: it only sees wide chars and cannot tell that they were decoded with the wrong Code_Page. To avoid this kind of failure, simply add a BOM to the UTF-8 file or change the Code_Page of the file to GBK.
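The wrong decoding in the third row can be imitated with a toy decoder. This is a simplified sketch, not what cl.exe actually runs: gbk_pair_to_code_point and decode_as_gbk are hypothetical names, the lookup only knows the three byte pairs from this example (with code points derived from the glyphs in the third row), and every non-ASCII byte is assumed to start a two-byte GBK character.

```cpp
#include <cstddef>
#include <string>

// Tiny stand-in for a real GBK table: only the three pairs this example needs.
char32_t gbk_pair_to_code_point(unsigned char lead, unsigned char trail)
{
    if (lead == 0xE4 && trail == 0xBD) return 0x6D63;  // 浣
    if (lead == 0xA0 && trail == 0xE5) return 0x72B2;  // 犲
    if (lead == 0xA5 && trail == 0xBD) return 0x30BD;  // ソ
    return 0xFFFD;                                     // not in this toy table
}

// Decodes bytes as if they were GBK: ASCII takes one byte, anything else a lead/trail pair.
std::u32string decode_as_gbk(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {                       // plain ASCII
            out += static_cast<char32_t>(lead);
            i += 1;
            continue;
        }
        if (i + 1 >= bytes.size()) {             // lead byte with no trail byte
            out += static_cast<char32_t>(0xFFFD);
            break;
        }
        unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
        out += gbk_pair_to_code_point(lead, trail);
        i += 2;
    }
    return out;
}
```

Feeding it the 6 UTF-8 bytes of "你好" groups them as E4BD, A0E5, A5BD and yields three code points instead of two, which is exactly the "浣犲ソ" that to_utf8() then encodes.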

Before the next example, a few more terms:

MultiByte: std::string or char*

WideChar: std::wstring or wchar_t*

MultiByteToWideChar: takes 6 parameters, converts std::string (char*) to std::wstring (wchar_t*)

WideCharToMultiByte: takes 8 parameters, converts std::wstring (wchar_t*) to std::string (char*)

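Appendix: to_utf8()

#include <windows.h>
#include <string>

// Converts nLength wide chars starting at lpwstrSrc to a UTF-8 encoded std::string.
std::string to_utf8(const wchar_t *lpwstrSrc, int nLength)
{
    // First call: ask how many bytes the UTF-8 result needs.
    int nSize = ::WideCharToMultiByte(
            CP_UTF8,
            0,
            lpwstrSrc,
            nLength,
            NULL,
            0,
            NULL,
            NULL
            );
    if(nSize == 0)
        return "";
    std::string strDest;
    strDest.resize(nSize);
    // Second call: convert into the buffer.
    ::WideCharToMultiByte(
            CP_UTF8,
            0,
            lpwstrSrc,
            nLength,
            &strDest[0],
            nSize,
            NULL,
            NULL
            );
    return strDest;
}

// Convenience overload, so that the call to_utf8(wsz) in the example compiles.
std::string to_utf8(const std::wstring &wstr)
{
    return to_utf8(wstr.c_str(), static_cast<int>(wstr.size()));
}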

In the initialization:

std::wstring wsz(L"你好");

"你好" is converted to wide chars as if by MultiByteToWideChar with the CP_ACP parameter (when the file has no BOM).

Note that converting Chinese chars held in a char* to another Code_Page takes 2 steps:

1. convert the chars to WideChar with MultiByteToWideChar, using the CP_ACP parameter;

2. convert the WideChar result to the target Code_Page with WideCharToMultiByte.

 

Below are some notes I gathered from thinking and searching:

Unicode is a standard that assigns every character in the world a code point in the range 0x000000~0x10FFFF. That is 21 bits, so a code point fits in 3 bytes, and the range can technically represent 17 * 65536 = 1,114,112 code points. However, storing every character at a fixed width makes files too large to transfer, so different Transformation Formats came up, such as UTF-8, UTF-16 and UTF-32. Among these, UTF-8 is the most popular "TF" on the internet; its smaller size earned it that place. As originally designed, UTF-8 could represent 0x00000000~0x7FFFFFFF (format: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx, 6 bytes carrying 31 bits), which is more than Unicode needs. Current UTF-8 (RFC 3629) is limited to 4 bytes, which is enough to cover 0x10FFFF. See Unicode for more information.
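To make the UTF-8 bit layout concrete, here is a minimal encoder sketch. It follows the current rules (at most 4 bytes, code points up to 0x10FFFF, no surrogates) rather than the old 6-byte form, and the name encode_utf8 is just an illustration, not a library function.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above 0x10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                              // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, "你" is U+4F60, which falls in the 3-byte range and comes out as E4 BD A0, the same bytes seen in the UTF-8 files earlier.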

GBK and the other non-Unicode Code_Pages cannot represent every character in the world, which sometimes causes problems: if we want to write different scripts (German and Chinese, say) in one txt file, we have to switch Code_Pages frequently. Unicode does this job much more neatly; however, the larger size of the Unicode family is always a minus.

Tips when using VIM:

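:set fenc=utf-8    (or fenc=gbk, fenc=cp936) sets the encoding the file is saved in

:set bomb          writes a BOM; :set nobomb omits it; :set bomb? shows the current setting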

 


Reposted from: https://www.cnblogs.com/sansna/p/5287820.html
