Differences between string and wstring, and how to use them

At the core of wstring is wchar_t. string and wstring are both containers and differ little by themselves; the difference comes from the elements stored in the container: char vs. wchar_t.

char: 8 bits; can only directly store ASCII.

wchar_t: can store Unicode code points.

In UTF-8, the length of a character is not fixed; it is variable-length.

A string/wstring is responsible only for storage, not for reading; reading requires a dedicated facility that recognizes which bytes make up one character (a sketch follows below).

wchar_t: fixed length, 2 or 4 bytes depending on the platform.
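
As a minimal sketch of that last point, the snippet below decodes a UTF-8 char buffer into wchar_t using the standard mbstowcs function. It assumes a UTF-8 locale is available; the locale name "en_US.UTF-8" is an assumption and may differ on your system.

#include <cstdlib>
#include <clocale>
#include <iostream>

int main()
{
    // The container only stores bytes; a decoder that knows the encoding
    // (here, the locale-driven mbstowcs) turns them into characters.
    // Assumption: the "en_US.UTF-8" locale exists on this system.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char utf8[] = "olé";                    // 4 bytes + '\0' in UTF-8
    wchar_t wide[8] = {};
    std::size_t n = std::mbstowcs(wide, utf8, 8); // decode bytes -> wide chars

    std::cout << "bytes stored  : " << sizeof(utf8) - 1 << "\n"; // 4
    std::cout << "chars decoded : " << n << "\n";                // 3
}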

string? wstring?

  1. std::string is a basic_string templated on a char.
  2. std::wstring is a basic_string templated on a wchar_t.

char vs. wchar_t

  1. char is supposed to hold a character, usually an 8-bit character.
  2. wchar_t is supposed to hold a wide character, and then, things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to Unicode.
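
As a side note and a minimal sketch (assuming a pre-C++20 compiler, where u8 literals are plain char arrays): C++11 did add character types that are tied to a fixed Unicode encoding, char16_t (UTF-16) and char32_t (UTF-32), alongside the u8/u/U string literal prefixes:

#include <iostream>

int main()
{
    // u8"" is UTF-8, u"" is UTF-16 (char16_t), U"" is UTF-32 (char32_t).
    // Pre-C++20, u8 literals are plain char arrays.
    const char     u8text[]  = u8"olé"; // 4 bytes + terminator
    const char16_t u16text[] = u"olé";  // 3 UTF-16 units + terminator
    const char32_t u32text[] = U"olé";  // 3 UTF-32 units + terminator

    std::cout << "sizeof(u8text)  : " << sizeof(u8text)  << "\n"; // 5
    std::cout << "sizeof(u16text) : " << sizeof(u16text) << "\n"; // 8
    std::cout << "sizeof(u32text) : " << sizeof(u32text) << "\n"; // 16
}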

On Linux?

Let’s take a Linux OS: My Ubuntu system is already Unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char text[] = "olé";

    std::cout << "sizeof(char)    : " << sizeof(char) << "\n";
    std::cout << "text            : " << text << "\n";
    std::cout << "sizeof(text)    : " << sizeof(text) << "\n";
    std::cout << "strlen(text)    : " << strlen(text) << "\n";

    std::cout << "text(ordinals)  :";

    for (size_t i = 0, iMax = strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }

    std::cout << "\n\n";

    // - - -

    const wchar_t wtext[] = L"olé";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext           : " << wtext << "\n"; // <- error
    std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext           : " << wtext << "\n";

    std::cout << "sizeof(wtext)   : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext)   : " << wcslen(wtext) << "\n";

    std::cout << "wtext(ordinals) :";

    for (size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
    {
        unsigned short wc = static_cast<unsigned short>(wtext[i]);
        std::cout << " " << static_cast<unsigned int>(wc);
    }

    std::cout << "\n\n";
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol�
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see that the "olé" text in char is actually made of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with a char on Linux, you usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have four characters, not three. So you should be cautious when truncating or otherwise slicing Unicode strings, because some combinations of bytes are forbidden in UTF-8 and a naive cut can produce an invalid sequence.
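
A minimal sketch of safe truncation (the helper name truncateUtf8 is my own): in UTF-8, every continuation byte matches the bit pattern 10xxxxxx, so you can back up to a character boundary before cutting:

#include <cstddef>
#include <string>

// Trim a UTF-8 string to at most maxBytes without splitting a
// multi-byte sequence. Continuation bytes satisfy (byte & 0xC0) == 0x80.
std::string truncateUtf8(const std::string& s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes)
        return s;

    std::size_t cut = maxBytes;
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut; // back out of the middle of a multi-byte character
    return s.substr(0, cut);
}

// Example: truncateUtf8("olé", 3) returns "ol". Byte 3 is the second byte
// of 'é', so the cut moves back to byte 2 instead of producing invalid UTF-8.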

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with different charsets/codepages produced all around the world, before the advent of Unicode.

So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, i.e. Unicode encoded in 2-byte units (or, at the very least, UCS-2, which lacks surrogate pairs and thus cannot represent characters outside the BMP (>= 64K)).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_ts). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT…). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
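
A minimal, Windows-only sketch of the conversion APIs just mentioned (the helper name utf8ToWide is my own; error handling is omitted):

#include <string>
#include <windows.h>

// Convert a UTF-8 char string to a UTF-16 wchar_t string via Win32.
std::wstring utf8ToWide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call computes the required length, second call converts.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}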

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, other than that a UTF-8 or UTF-16 text will always use the same amount of memory as a UTF-32 text or less (and usually less).

If there is a memory issue, then you should know that for most Western languages, a UTF-8 text will use less memory than the same text in UTF-16.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs: Klingon? Elvish?), while UTF-8 will use from 1 to 4 bytes per character.

See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
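
A minimal sketch of that comparison (again assuming a pre-C++20 compiler, so the u8 literal is char-based):

#include <iostream>
#include <string>

int main()
{
    // u"" literals are UTF-16 by definition; u8"" literals are UTF-8.
    std::string    utf8  = u8"olé";
    std::u16string utf16 = u"olé";

    std::cout << "UTF-8  bytes : " << utf8.size()  * sizeof(char)     << "\n"; // 4
    std::cout << "UTF-16 bytes : " << utf16.size() * sizeof(char16_t) << "\n"; // 6
}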

Conclusion

  1. When should I use std::wstring over std::string?

    On Linux? Almost never.

    On Windows? Almost always.

    On cross-platform code? Depends on your toolkit… (unless you use a toolkit/framework saying otherwise).
  2. Can std::string hold the entire ASCII character set, including special characters?

    Notice: a std::string is suitable for holding a 'binary' buffer, where a std::wstring is not (see the sketch after this list)!

    On Linux? Yes.

    On Windows? Only the special characters available for the current locale of the Windows user.

    Edit (after a comment from **Johann Gerell**): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. A char from 0 to 127 will be held correctly.
    3. A char from 128 to 255 will have a signification depending on your encoding (Unicode, non-Unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
  3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC-based compilers that are ported to Windows. It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

  4. What exactly is a wide character?

    In C/C++, it's a character type, written wchar_t, which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending…).
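
Regarding the 'binary buffer' note in point 2, a minimal sketch: constructed with an explicit length, a std::string happily holds arbitrary bytes, including embedded zeros:

#include <iostream>
#include <string>

int main()
{
    // Arbitrary bytes, including an embedded '\0'.
    const char raw[] = { 'a', '\0', '\xFF', 'b' };
    std::string buffer(raw, sizeof(raw)); // explicit length, no strlen()

    std::cout << buffer.size() << "\n"; // 4: the embedded zero is kept
}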
