C Reference Manual Reading Notes: 003 Multibyte and Wide Characters

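Wide Characters and Multibyte Characters
These notes cover the wide characters and wide strings that Standard C introduced to accommodate non-English character sets, explain how multibyte characters are used in the source and execution character sets, and list the restrictions Standard C places on multibyte characters.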

    To accommodate non-English alphabets that may contain a large number of characters, Standard C introduces wide characters and wide strings. To represent wide characters and wide strings in the external, byte-oriented world, it also introduces the concept of multibyte characters.

 

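    Wide Characters and Strings. A wide character is the binary representation of an element of an extended character set. It has the integer type wchar_t, which is declared in the header file stddef.h (and also in wchar.h). Standard C does not specify the encoding of the extended character set, except that the null wide character must have the value zero and that a value WEOF must exist that does not correspond to any member of the extended character set. WEOF has type wint_t and is defined in wchar.h; it is often -1, but the standard does not require that value.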
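    As a small illustration of the paragraph above, the following sketch walks a wide string until it reaches the null wide character and checks a value against WEOF. The function names wide_length and is_weof are made up for these notes; wide_length does the same job as the standard wcslen.

```c
#include <assert.h>
#include <stddef.h>  /* size_t, wchar_t */
#include <wchar.h>   /* wint_t, WEOF */

/* Count the elements of a wide string, stopping at the null wide
   character (value zero). Hypothetical helper, equivalent to wcslen. */
size_t wide_length(const wchar_t *ws)
{
    assert(ws != NULL);
    size_t n = 0;
    while (ws[n] != L'\0')
        n++;
    return n;
}

/* WEOF is a wint_t value that matches no member of the extended
   character set, so it can act as an end-of-input marker. */
int is_weof(wint_t c)
{
    return c == WEOF;
}
```

    Called as wide_length(L"abc"), the function returns 3. The array behind L"abc" holds four wchar_t elements, the last being the null wide character.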

 

    Multibyte Characters. A multibyte character is the representation of a wide character in either the source or the execution character set (the two may use different encodings). A multibyte string is a normal C string whose characters can be interpreted as a series of multibyte characters. The form of multibyte characters and the mapping between multibyte and wide characters are implementation-defined. This mapping is performed for wide-character and wide-string constants at compile time, and the standard library provides functions that perform it at run time. A multibyte encoding can be state-dependent or state-independent.
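    The run-time mapping mentioned above can be sketched with the standard function mbstowcs. The wrapper to_wide is a hypothetical helper written for these notes, not something from the book. The conversion follows the LC_CTYPE category of the current locale, so a real program would normally call setlocale(LC_ALL, "") first; in the default "C" locale only the basic characters can be relied on.

```c
#include <assert.h>
#include <stdlib.h>  /* mbstowcs */
#include <wchar.h>   /* wchar_t */

/* Convert the multibyte string mbs into dst, which has room for dst_len
   wide characters including the terminating null wide character.
   Returns the number of wide characters stored (not counting the
   terminator), or (size_t)-1 if the input is invalid or does not fit. */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *mbs)
{
    assert(dst != NULL && mbs != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    size_t n = mbstowcs(dst, mbs, dst_len);
    if (n == (size_t)-1)
        return n;              /* invalid multibyte sequence */
    if (n == dst_len)
        return (size_t)-1;     /* buffer filled, no room for the terminator */
    return n;
}
```

    With a buffer of 8 elements, to_wide(buf, 8, "abc") stores three wide characters plus the terminator and returns 3. The opposite direction is handled by wcstombs.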

 

   Standard C places some restrictions on multibyte characters:

     (1).  All characters from the standard character set must be present in the encoding.

     (2). In the initial shift state, all single-byte characters from the standard character set retain their normal interpretation and do not affect the shift state.

     (3). A byte containing all zeros is taken to be the null character regardless of shift state. No multibyte character can use a byte containing all zeros as its second or subsequent byte.

   Together, these rules ensure that multibyte sequences can be processed as normal C strings (for example, they will not contain embedded null characters) and that a C string without special multibyte codes will have the expected interpretation as a multibyte sequence.
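   One consequence of these rules is that strlen still finds the end of a multibyte string, although it counts bytes rather than characters. The sketch below counts characters with the standard function mbrlen, starting from the initial shift state. The name mb_char_count is invented for these notes, and the result depends on the current LC_CTYPE locale.

```c
#include <assert.h>
#include <string.h>  /* strlen, memset */
#include <wchar.h>   /* mbrlen, mbstate_t */

/* Count the multibyte characters in s under the current locale.
   Returns (size_t)-1 if s contains an invalid or incomplete sequence. */
size_t mb_char_count(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* zeroed state = initial shift state */

    /* strlen is safe here: no multibyte character contains a zero byte. */
    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        size_t len = mbrlen(s, remaining, &state);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;         /* invalid or truncated sequence */
        if (len == 0)
            break;                     /* null character reached */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

   For a string made only of single-byte characters, such as "hello", the count equals strlen. For text with multi-byte characters in a matching locale, the count is smaller than the byte length.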

 

    Source and Execution Use of Multibyte Characters. Multibyte characters may appear in comments, identifiers, preprocessor header names, string constants, and character constants. Multibyte characters in the physical representation of the source are recognized and translated to the source character set before any lexical analysis, preprocessing, or even splicing of continuation lines. During processing, characters appearing in string and character constants are translated to the execution character set before they are interpreted as multibyte sequences.

 

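This error message:

```
'gbk' codec can't encode character '\x9c' in position 0: illegal multibyte sequence
```

means that you are trying to **save or output data with the GBK encoding when it contains characters GBK does not support**. Here the data contains the character `\x9c` (U+009C), which has no valid GBK encoding.

---

### 🔍 Cause of the error

- **Default encoding**: on Windows with a Chinese locale, Python defaults to `gbk` when reading and writing text files.
- **The data may come from UTF-8 or binary content**: for example, data obtained from the network, logs, or an NVMe device rather than plain Chinese text.
- **`\x9c` is not a text character**: it may be part of compressed data or a binary structure, or an invalid character produced by mojibake.

---

### ✅ Solutions

#### Method 1: specify the correct encoding (for text)

If you know your text is UTF-8, specify the encoding explicitly when opening the file:

```python
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(your_data)
```

#### Method 2: ignore illegal characters

If non-text characters are mixed into the data, use the `errors` parameter to skip whatever cannot be encoded:

```python
with open('output.txt', 'w', encoding='utf-8', errors='ignore') as f:
    f.write(your_data)
```

You can also use `'replace'` to substitute `?` for them:

```python
with open('output.txt', 'w', encoding='utf-8', errors='replace') as f:
    f.write(your_data)
```

#### Method 3: write in binary mode (for non-text data)

If you are writing a raw byte stream (such as NVMe data read from a device), use binary mode:

```python
with open('output.bin', 'wb') as f:
    f.write(your_bytes_data)
```

---

### 🧪 Comparison

| Write mode | Example code | Notes |
|----------|-----------|------|
| UTF-8 text | `open('f.txt', 'w', encoding='utf-8')` | Recommended for general text handling |
| Ignore errors | `encoding='utf-8', errors='ignore'` | Drops illegal characters |
| Replace illegal characters | `errors='replace'` | Illegal characters are written as `?` |
| Binary | `open('f.bin', 'wb')` | For writing byte data |

---

### 💡 Tips

- If you are not sure how the data is encoded, print it first:

```python
print(repr(your_data))  # show the escape sequences in the string
```

- A field read from a device, such as `eui64`, is byte data in its own right and should not be written directly to a text file. Instead:
  - convert it to a hexadecimal string before writing, or
  - save it in binary mode.

---

### ✅ Summary

| Situation | Recommended approach |
|------|----------|
| Data is text (UTF-8) | Use `encoding='utf-8'` |
| Data contains illegal characters | Add `errors='ignore'` or `'replace'` |
| Data is binary | Write a `.bin` file in `'wb'` mode |
| Encoding is unknown | Print `repr(data)` to inspect the content |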