Why does wide file-stream in C++ narrow written data by default?

This article looks at why C++ standard library wide-character file streams convert wchar_t to ordinary char by default, touching on character sets, locale settings, and the design history of the standard library.


From: http://stackoverflow.com/questions/1509277/why-does-wide-file-stream-in-c-narrow-written-data-by-default


Honestly, I just don't get the following design decision in C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost, and there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are!

Also, are we gonna get real unicode streams with C++0x or am I missing something here?


 
Good question. I hope you can dig up an answer. Personally I'm leaning towards the "IOStreams is just a badly designed library" theory... ;) It probably doesn't help that Unicode wasn't exactly well established when the library was designed. They might have thought that serializing to/from plain chars was the most portable approach. –  jalf  Oct 2 '09 at 13:40
 
@jalf Thanks. I am not really proficient with streams but this question bothers me a lot :D –  AraK  Oct 2 '09 at 13:49

5 Answers

Accepted answer

The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

The main points:

  • IO is done in terms of char.
  • It is the job of the locale to determine how wide characters are serialized.
  • The default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set).
  • There is an environment-determined locale named "".

So to get anything, you have to set the locale.

Take this simple program,

#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;                    // U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS
    std::locale::global(std::locale(""));  // use the environment-determined locale
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}

It uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask for the "C" locale, I get

$ env LC_ALL=C ./a.out
Output failed

The locale was unable to handle the wide character, and we get notified of the problem because the IO failed. If I ask for a UTF-8 locale instead, I get

$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003

(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.
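
For reference, c3 bf is indeed the UTF-8 encoding of U+00FF: the code point needs a two-byte sequence of the form 110xxxxx 10xxxxxx, and filling in its eight bits 11111111 gives 11000011 10111111, i.e. c3 bf; the trailing 0a is the newline written by std::endl.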

 
I bet the output failed because it was expecting another character. And the second one is not what I would expect, unless it is completely ignoring the high bits of the wchar_t. What happens if you output c = 0xABCD? Is it encoding the CD into UTF-8 and ignoring the AB, or is the whole thing encoded? What happens when the UTF-8 character is three bytes long? –  Loki Astari  Oct 2 '09 at 16:19
 
Also I get different results. C: (ff 0a) en_US.utf8: (std::runtime_error[locale::facet::_S_create_c_locale name not valid]) –  Loki Astari  Oct 2 '09 at 16:22
 
I don't understand why C3 BF isn't the encoding of 0x00FF you were expecting. And for 0xABCD it gives EA AF 8D, which is what I expected. What I didn't expect is that it allowed 0xDCBA (it is a surrogate and not a valid code point) and other invalid code points. –  AProgrammer  Oct 2 '09 at 16:22
Locale names are not standardized..., so you'd need to find out whether you have a UTF-8 locale, what its name is, and how to set it (under POSIX, running locale -a in the shell gives you a list). I don't have time now to find out what the constraints on the "C" locale are for wide characters -- I guess it is implementation defined. –  AProgrammer  Oct 2 '09 at 16:28
 
OK. So it is converting UCS-2 (internal fixed-width format) into UTF-8 (external multibyte format). That makes some sense. Note: UCS-2 does not support surrogate pairs (it just encodes them as code points, so there is no loss of information when transforming between UTF-16 and UCS-2). –  Loki Astari  Oct 2 '09 at 17:22

A very partial answer for the first question: a file is a sequence of bytes, so when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, and this is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.

Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
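
As a rough sketch of that "blind" approach (an illustration only; the file name is arbitrary, and the resulting byte layout depends on sizeof(wchar_t) and on the platform's endianness), one could dump the in-memory representation through a narrow binary stream:

#include <fstream>
#include <string>

int main()
{
    std::wstring s = L"Hello StackOverflow!";
    std::ofstream file("raw.bin", std::ios::binary);

    // Write the wchar_t array byte for byte, with no conversion at all.
    file.write(reinterpret_cast<const char*>(s.data()),
               static_cast<std::streamsize>(s.size() * sizeof(wchar_t)));
}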

Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By analogy, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the Basic Multilingual Plane), while the Linux implementation I know about decided to stick to ASCII.
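
A small sketch of what "simply trim" means (hypothetical values, assuming wchar_t holds a BMP code point): only the low 8 bits survive, which happens to coincide with ISO-8859-1 for U+0000 through U+00FF and silently mangles everything above that:

#include <cassert>

int main()
{
    wchar_t in_range = 0x00FF;   // U+00FF, within ISO-8859-1's range
    wchar_t outside  = 0x0416;   // U+0416, CYRILLIC CAPITAL LETTER ZHE

    // "Trimming" keeps only the low byte of each wide character.
    assert(static_cast<unsigned char>(static_cast<char>(in_range & 0xFF)) == 0xFF);
    assert(static_cast<unsigned char>(static_cast<char>(outside  & 0xFF)) == 0x16);
}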

For your second question:

Also, are we gonna get real unicode streams with C++0x or am I missing something here?

In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:

The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters.

In the [locale.stdcvt] section, we find:

For the facet codecvt_utf8: — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

For the facet codecvt_utf16: — The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

For the facet codecvt_utf8_utf16: — The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.

So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
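
For what it's worth, here is a minimal sketch of driving one of those facets, written with the names that eventually shipped in C++11 (the <codecvt> header and std::wstring_convert; the file name and string are placeholders):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wstring wide = L"Hello StackOverflow!";

    // codecvt_utf8<wchar_t> converts between UCS2/UCS4 (depending on the
    // width of wchar_t) and UTF-8, as described in [locale.stdcvt].
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    std::string utf8 = converter.to_bytes(wide);

    std::ofstream file("test_utf8.txt", std::ios::binary);
    file << utf8;
}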

 
@Éric Thanks. Finally we are getting real Unicode streams :) –  AraK  Oct 2 '09 at 13:57
 
@Éric I meant that streams are Unicode aware, as C++0x is. I'm still looking for a rationale answer about the main question. –  AraK  Oct 2 '09 at 14:07

I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).

Check out the most recent C++0x draft (N2960).
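
A small sketch of those additions, compiled as C++11 (placeholder strings):

#include <string>

int main()
{
    // New character types and their string-literal prefixes.
    std::u16string utf16 = u"Hello!";   // char16_t string, UTF-16 encoded
    std::u32string utf32 = U"Hello!";   // char32_t string, UTF-32 encoded

    // "Hello!" is plain ASCII, so both hold one code unit per character.
    return (utf16.size() == 6 && utf32.size() == 6) ? 0 : 1;
}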


For your first question, this is my guess.

The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it is assumed that:

  • Inside your program, you should use a (fixed-width) wide-character encoding.
  • Only external storage should use (variable-width) multibyte encodings.

I believe that is the reason for the existence of the two template specializations of std::codecvt: one that maps between char types (maybe you're simply working with ASCII), and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding, you should do it byte by byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.

Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character, ASCII-encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide ASCII character to an ordinary (in this case, one-char) ASCII character. (The conversion exists and is straightforward.)

By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions, as sketched below. Then, you would have your file with wide characters.
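
A minimal sketch of such a facet (the class name is made up, and whether the wide filebuf really copies the wchar_t bytes through unchanged when the facet reports always_noconv() is implementation-dependent):

#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet claiming that no conversion is needed between the
// internal wchar_t representation and the external byte sequence.
class raw_wchar_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    virtual bool do_always_noconv() const noexcept { return true; }
};

int main()
{
    std::wofstream file("wide_raw.dat", std::ios::binary);
    file.imbue(std::locale(file.getloc(), new raw_wchar_codecvt)); // the locale owns the facet
    file << L"Hello StackOverflow!";
}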

 
Your premises will probably not hold. UTF-16 is multibyte. Most people consider UTF-32 too wasteful for character data (I don't), so we will end up using UTF-16 and having all the extra code to handle the special corner case of surrogate pairs. –  Loki Astari  Oct 2 '09 at 16:08
 
@Martin: UTF-8 and UTF-16 are all multibyte. I didn't say they were fixed-width. I don't understand exactly what you're saying. –  ltcmelo  Oct 2 '09 at 19:28

Check this out: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

You can alter the default behavior by setting a wide char buffer, using pubsetbuf. Once you do that, the output will be wchar_t and not char.

In other words, for your example you will have:

wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); //this is the BOM flag, UTF16 needs this, but mirosoft's UNICODE doesn't, so you can skip this line, if any.
file << someString; // the output file will consist of unicode characters! without the call to pubsetbuf, the out file will be ANSI (current regional settings)

